Study design
This was a retrospective observational study. All data and databases used in this study were statistically deidentified, and all study procedures were compliant with the United States Health Insurance Portability and Accountability Act. The study therefore did not require Institutional Review Board approval or informed consent.
Study sample
The sample population was drawn from Optum’s Integrated Clinical + Claims Database, which combines adjudicated medical and pharmacy claims with EHRs from the Optum Research Clinical Database. The latter database currently has > 101 million unique patients from ~ 60 provider delivery organizations in the United States and Puerto Rico, with an average of 45 months of observed data per patient. The integrated database includes health plan enrollment data; clinical information, including medications prescribed and administered; lab results, vital signs, and body measurements; diagnoses and procedures; and information derived from provider notes using proprietary natural language processing (NLP) methods.
The data used in this study were from 01 January 2016 through 31 March 2019. Inclusion and exclusion criteria were as described in Bali et al.18. Briefly, eligible participants were enrolled in a national commercial or Medicare Advantage medical and pharmacy plan in the Optum Research Clinical Database between January 2016 and March 2017, with their earliest enrollment date set as the index date. Continuous enrollment for at least 24 months after and including the index date was required for inclusion in the data set, and eligible participants were 18–85 years of age as of their index year. Enrollees were excluded from the study if their EHRs included evidence of a pharmacy fill or written medication order for an ACE inhibitor, a class of blood pressure medication that can cause chronic cough. Eligible participants were divided into training/validation (75%, n = 245,161) and test (25%, n = 82,262) data sets. The positive class (i.e., individuals with chronic cough) comprised 3,651 individuals (1.49%) in the training set and 1,167 (1.42%) in the test set; the negative class (i.e., individuals without chronic cough) comprised 241,510 (98.5%) and 81,095 (98.6%), respectively.
Development of 3 claims-based chronic cough identification models
The gold standard positive class for the development of the claims-based algorithm was generated by implementing an NLP algorithm we developed previously using data available from EHRs28. This algorithm has also been replicated in the Kaiser Permanente Southern California Research Data Warehouse and in the same integrated database used in the current study27. The algorithm defines chronic cough as the presence of at least 3 cough encounters in EHRs within a 120-day period, with at least 56 days between the first and last cough encounters; encounters could be any combination from 3 sources of information: NLP-identified mentions of ‘tussis’ or any inflection of the word ‘cough’ in free-text provider notes, occurrences of the cough-specific diagnosis code ICD-10 R05, or written medication orders for benzonatate or dextromethorphan. The negative class comprised all individuals not identified as part of the gold standard positive class.
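To make the temporal rule concrete, the following minimal Python sketch checks whether a patient’s dated cough encounters satisfy the 3-encounter, 120-day, 56-day criteria. The function name, the flat list-of-dates input, and the deduplication of same-day encounters are simplifying assumptions of this sketch, not the published implementation.

```python
from datetime import date

def meets_chronic_cough_rule(encounter_dates, min_encounters=3,
                             window_days=120, min_span_days=56):
    """Hypothetical check: any `min_encounters` cough encounters within
    `window_days`, with >= `min_span_days` between the first and last."""
    dates = sorted(set(encounter_dates))  # assumes one encounter per day
    for i in range(len(dates)):
        for j in range(i + min_encounters - 1, len(dates)):
            span = (dates[j] - dates[i]).days
            if span > window_days:
                break  # later encounters only widen the span further
            if span >= min_span_days:
                return True
    return False

# Encounters spanning 101 days (<= 120 and >= 56) satisfy the rule.
print(meets_chronic_cough_rule(
    [date(2016, 1, 1), date(2016, 3, 1), date(2016, 4, 10)]))  # True
```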
Three claims-based classification models were constructed using supervised machine learning methods to identify individuals with chronic cough from the sample population. The models were based on Extreme Gradient Boosting (XGBoost, a decision tree ensemble), logistic regression, and neural network approaches; see below for methodological details specific to each model.
All models were constructed using diagnosis, procedure, prescription, provider specialty, and patient demographic information from the individuals in the sample. Diagnosis, procedure, provider specialty, and prescription features consisted of count values for each patient, representing the total number of occurrences (i.e., doctor visits, prescriptions filled, diagnoses received, etc.) of the same code over the 24-month observation period. Demographic (age, regional location, and gender) and insurance type information was represented as indicators. In addition, the International Classification of Diseases Procedure Coding System (ICD-10-PCS) was used to map all procedures to higher-level (3-digit) procedure groupings using the Clinical Classifications Software developed by the Healthcare Cost and Utilization Project36. Generic Product Identifier groupings were used for National Drug Codes; the first 8 digits of the Generic Product Identifier determined the generic drug name.
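As an illustration of this feature construction, the sketch below builds per-patient code counts and demographic indicators with pandas. The column names and toy records are hypothetical, and the mapping of raw codes to CCS procedure groupings and 8-digit Generic Product Identifiers is assumed to have been applied upstream.

```python
import pandas as pd

# Hypothetical long-format claims extract: one row per coded event.
claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "code":       ["R05", "R05", "GPI8_12345678", "R05", "I10"],
})

# Count features: occurrences of each code per patient over the
# 24-month observation period, as described in the text.
counts = (claims.groupby(["patient_id", "code"]).size()
                .unstack(fill_value=0))

# Demographic and insurance-type information as one-hot indicators.
demo = pd.DataFrame({"patient_id": [1, 2],
                     "gender": ["F", "M"],
                     "region": ["MW", "SO"]}).set_index("patient_id")
indicators = pd.get_dummies(demo, columns=["gender", "region"])

features = counts.join(indicators, how="left")
print(features)
```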
Feature selection was undertaken to reduce the number of independent variables used to identify chronic cough by removing irrelevant or less relevant features that negatively affected model performance. Feature importance was based on information gain for the XGBoost model and on the output of the LASSO procedure for the logistic regression model. Features were created using data from the index date (first chronic cough date for the positive class or earliest enrollment date for the negative class) through the next 24 months. A minimum code prevalence cutoff of 0.04 was used for all models. Because the data set comprised predominantly categorical variables, relevant classification features were identified by setting odds ratio thresholds: features were retained when their confidence intervals fell entirely below 0.8 or entirely above 1.2. Features not associated with chronic cough have odds ratios close to 1.0 and were thereby effectively removed37. Implementing this process across all 3 machine learning models reduced the overall number of independent features by 90%. Additional feature selection processes unique to each model were also used; details are provided below.
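A minimal sketch of this odds-ratio screen is shown below. The 0.8/1.2 thresholds come from the text; the Wald-style confidence interval with a 0.5 continuity correction is an assumption about how the interval might be computed.

```python
import numpy as np

def odds_ratio_ci(feature, outcome, z=1.96):
    """95% CI for the odds ratio of a binary feature vs. a binary outcome,
    via the log-OR normal approximation with a 0.5 continuity correction."""
    a = np.sum((feature == 1) & (outcome == 1)) + 0.5
    b = np.sum((feature == 1) & (outcome == 0)) + 0.5
    c = np.sum((feature == 0) & (outcome == 1)) + 0.5
    d = np.sum((feature == 0) & (outcome == 0)) + 0.5
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return np.exp(log_or - z * se), np.exp(log_or + z * se)

def keep_feature(feature, outcome, lo=0.8, hi=1.2):
    """Retain a feature only if its OR confidence interval lies entirely
    below `lo` or entirely above `hi` (i.e., clearly away from 1.0)."""
    ci_lo, ci_hi = odds_ratio_ci(feature, outcome)
    return ci_hi < lo or ci_lo > hi

# An unrelated random feature almost surely straddles 1.0 -> dropped.
rng = np.random.default_rng(0)
print(keep_feature(rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)))
```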
Nested three-fold cross-validation (resampling to determine model performance on a held-out data set) was performed to select features and hyperparameters for all models38. Model performance during cross-validation was assessed using average precision, an estimate of the area under the precision-recall curve (PR-AUC). Final model coefficients and hyperparameters were fit on the entire training set, and performance was evaluated on the test set (25% of the data). We also assessed the performance of the 3 models in subpopulations from the test set comprising individuals with a diagnosis of cough, females, and individuals ≥ 65 years of age.
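The sketch below illustrates nested three-fold cross-validation with average precision as the selection metric, using scikit-learn, synthetic data at roughly the study’s class imbalance, and a logistic regression estimator as a stand-in; none of this is the study’s actual tooling or search grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data with ~1.5% positive prevalence, echoing the study sample.
X, y = make_classification(n_samples=2000, weights=[0.985], random_state=0)

# Inner loop: 3-fold hyperparameter search scored by average precision.
inner = GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                     param_grid={"C": [0.1, 1, 10]},
                     scoring="average_precision", cv=3)

# Outer loop: 3-fold estimate of held-out performance (nested CV).
scores = cross_val_score(inner, X, y, scoring="average_precision", cv=3)
print(scores.mean())
```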
Different binary classification methods were developed and applied to determine which algorithm best identified chronic cough. All 3 models incorporated cross-validation and hyperparameter tuning to optimize performance. For all models, a probability threshold of 0.5 was initially used to define accurate classification of chronic cough and was then adjusted to maximize precision and specificity. The final models used the following probability thresholds: XGBoost, 0.70; logistic regression, 0.30; neural network, 0.85. Secondary analyses stratified and evaluated model performance in the test data set among individuals with cough diagnoses (defined as any instance of an ICD-10 R05 code in the individual’s history), female participants, and participants ≥ 65 years of age. A probability threshold of 0.70 was used for the XGBoost model in the cough diagnosis subgroup due to the small sample size.
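The effect of raising the probability threshold can be seen in the toy example below: moving the cutoff from the default 0.5 toward the tuned values trades recall for precision. The probabilities and labels are fabricated purely for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def classify_at_threshold(probs, threshold):
    """Convert predicted probabilities to class labels at a chosen cutoff."""
    return (np.asarray(probs) >= threshold).astype(int)

probs = np.array([0.9, 0.65, 0.6, 0.2])   # hypothetical model outputs
y_true = np.array([1, 0, 1, 0])           # hypothetical true labels

for t in (0.5, 0.7):
    y_pred = classify_at_threshold(probs, t)
    # At 0.5: precision 0.67, recall 1.0; at 0.7: precision 1.0, recall 0.5.
    print(t, precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```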
XGBoost-based model
XGBoost is an ensemble-based decision tree algorithm that uses boosting and gradient descent to assign and adjust the weights on each tree, minimizing loss39. Ensemble methods fit an initial tree-based algorithm to the data, then build a second version of the model to improve upon (boost) the classification of the first. The process is iterated, descending the loss gradient, until the classification error (mean squared error) no longer decreases with each subsequent boost.
A Bayesian search method was used in tandem with threefold cross-validation to tune the model’s hyperparameters, modifying the constraints on constructing each tree during cross-validation. While optimizing for average precision, we limited the maximum depth of each tree, the minimum weight needed for each child node, and the minimum loss reduction needed to progress further in a leaf node; we also tuned the balance between positive and negative weights and the subsampling ratio for each tree and training instance40. The final optimized hyperparameter values selected after tuning and cross-validation are listed in Supplemental Table 1. For this model, diagnosis, procedure, provider specialty, and prescription features were represented as count values for each patient, corresponding to the total number of occurrences of the same code during the study period; demographic data (age, regional location, binary gender) and insurance type information were represented as indicators.
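The study does not name its Bayesian search tooling; the sketch below uses scikit-optimize’s BayesSearchCV as one possible implementation, with search spaces mirroring the hyperparameters named above (maximum depth, minimum child weight, minimum loss reduction [gamma], class-balance weight, and subsampling ratios). The ranges and synthetic data are illustrative, not those used in the study.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.985], random_state=0)

search = BayesSearchCV(
    XGBClassifier(eval_metric="logloss"),
    {
        "max_depth": Integer(2, 10),           # maximum tree depth
        "min_child_weight": Integer(1, 20),    # minimum child-node weight
        "gamma": Real(0.0, 5.0),               # minimum loss reduction to split
        "scale_pos_weight": Real(1.0, 100.0),  # positive/negative balance
        "subsample": Real(0.5, 1.0),           # row subsampling per tree
        "colsample_bytree": Real(0.5, 1.0),    # feature subsampling per tree
    },
    n_iter=25, scoring="average_precision", cv=3, random_state=0)

search.fit(X, y)
print(search.best_params_)
```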
Logistic regression model
For the logistic regression model, all features were converted to flags, so that counts became flags for 1 or more, 2 or more, and 3 or more occurrences. We constructed a logistic regression model with L1 regularization (i.e., least absolute shrinkage and selection operator [Lasso] regression)41. This technique adds a regularization term to the equation as a hyperparameter that acts as a penalty to avoid overfitting. As a result, the most important features within the model retain larger final coefficients, while less important features have coefficients shrunk to zero. This is a particular strength in a data set with sparse data or a large number of features. In general, a penalty hyperparameter that is too small provides essentially no regularization and yields an unpenalized logistic regression model; in such cases, the model will be overfit. Conversely, a penalty that is too large will add too much weight and can lead to an underfit model.
Grid search was applied to the model along with K-fold cross-validation to tune the class weight and regularization strength (C) while optimizing for average precision. The K-fold cross-validation process (K = 3) was performed 3 times, with the model fit on the training folds and internally validated on a different randomly selected held-out subsample each time. Three values were tested for the C parameter (0.1, 1, and 10); the optimized value was 0.1. The class weight parameter was calibrated between a balanced weighting and its default assignment; the tuned model used the default method, in which positive and negative classes had equal weighting within the cost function of the algorithm42. All independent variables passed into the logistic regression model were represented as binary indicators of 0 or 1, and each diagnosis or procedure code was assigned one of 3 categories based on how frequently the code occurred for each patient (≥ 1, ≥ 2, or ≥ 3).
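A compact scikit-learn version of this setup might look as follows; the synthetic data and the binarization shortcut stand in for the ≥ 1/≥ 2/≥ 3 count flags described above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.985], random_state=0)
flags = (X > 0).astype(int)  # stand-in for the >=1/>=2/>=3 count flags

grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    {"C": [0.1, 1, 10],                    # values tested in the study
     "class_weight": [None, "balanced"]},  # default vs. balanced weighting
    scoring="average_precision", cv=3)
grid.fit(flags, y)
print(grid.best_params_)  # the study's tuned values: C = 0.1, default weights
```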
Neural network (deep learning)-based model
Neural networks, also known as deep learning models, comprise a series of algorithms developed to recognize patterns in data that identify an outcome (in this case, the presence of chronic cough). Neural networks mimic the function of the human brain via layers of interconnected neurons. Simple neural networks comprise input, hidden, and output layers. We developed a neural network using TensorFlow software43, with hyperparameters tuned using KerasTuner44. The model was a multi-layered neural network in which each hidden layer contained 100 neurons and used the scaled exponential linear unit (SELU) activation function45. The output layer used a sigmoid function to produce a probability conveying the likelihood of a given patient being classified as positive for chronic cough. The model used an L2 penalty as a regularization technique, which adds a penalty equal to the square of the magnitude of the coefficients, to avoid overfitting46. Unlike L1/Lasso methods, L2 regularization does not drive coefficients to zero and therefore does not reduce data dimensionality. We used LAMB as the optimizer and binary cross-entropy as the loss function47,48.
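A minimal Keras sketch of the final architecture is shown here: one hidden layer of 100 SELU units with L2 regularization and dropout (per the tuning results described below) and a sigmoid output. The input width, learning rate, and the use of the tensorflow_addons LAMB implementation are assumptions of this sketch.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # provides one implementation of LAMB

n_features = 500  # illustrative input width
l2 = tf.keras.regularizers.l2(1e-4)  # optimized L2 penalty from the text

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(100, activation="selu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.3),  # tuned dropout rate
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(chronic cough)
])

model.compile(optimizer=tfa.optimizers.LAMB(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(curve="PR", name="pr_auc")])
model.summary()
```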
Using the KerasTuner framework, we implemented a Bayesian optimization method to tune the number of hidden layers in the model, the dropout rate, and the magnitude of the L2 regularization penalty, creating a model optimized for minimal loss. During the hyperparameter optimization process we tested L2 penalty rates of 0.01, 0.001, and 0.0001; the optimized value was 0.0001. We also tuned the overall depth of the neural network between 1 and 5 hidden layers, with each layer separately tuned to a dropout rate between 0 and 0.3; the optimized model had a single hidden layer with a dropout rate of 0.3 (the optimized hyperparameters and additional dropout rates for more hidden layers are listed in Supplemental Table 1).
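In KerasTuner terms, the search described above might be expressed as follows; the input width, trial budget, and the stand-in Adam optimizer (the study used LAMB) are assumptions of this sketch.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    """Search space mirroring the text: 1-5 hidden layers, per-layer dropout
    in [0, 0.3], and an L2 penalty of 1e-2, 1e-3, or 1e-4."""
    l2 = tf.keras.regularizers.l2(hp.Choice("l2", [1e-2, 1e-3, 1e-4]))
    model = tf.keras.Sequential([tf.keras.Input(shape=(500,))])
    for i in range(hp.Int("n_layers", 1, 5)):
        model.add(tf.keras.layers.Dense(100, activation="selu",
                                        kernel_regularizer=l2))
        model.add(tf.keras.layers.Dropout(
            hp.Float(f"dropout_{i}", 0.0, 0.3, step=0.1)))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    # Adam used here to keep the sketch dependency-free; the study used LAMB.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_loss",
                                max_trials=20, overwrite=True)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val)) runs the search.
```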
Statistical analysis
All study variables were analyzed descriptively. Means and SDs were calculated for continuous variables. Statistical analyses were performed using SAS software (SAS Institute, Cary, NC), R, Python, and Spark.
Ethical approval and consent to participate
All data and databases were accessed in compliance with US patient confidentiality requirements, including the Health Insurance Portability and Accountability Act of 1996 regulations (HIPAA). All databases used are de-identified and fall under the US Department of Health and Human Services Policy for the Protection of Human Subjects, and therefore informed consent and Institutional Review Board approval were not required.