Large language models streamline automated machine learning for clinical studies

Ethics statement

The methods were performed in accordance with relevant guidelines and regulations and approved by the ethical committee of the Medical Faculty of RWTH Aachen University for this retrospective study (Reference No. EK 028/19).

Patient cohorts

The patient datasets were retrieved from public repositories as indicated in the original studies on metastatic disease prediction17, esophageal cancer screening20, hereditary hearing loss24, and cardiac amyloidosis25.

In the included Endocrinologic Oncology study17, cross-sectional data from Germany, Poland, the US, and the Netherlands were used to assess the ability of the dopamine metabolite methoxytyramine to identify metastatic disease in patients with pheochromocytoma or paraganglioma. To this end, ten features were available.

The included Esophageal Cancer study20 from China was centered on endoscopic screening and included multiple data sources from questionnaires to endoscopy data, i.e., cytologic and epidemiologic data.

The included Hereditary Hearing Loss study24 contained genetic sequencing data to diagnose this condition in a Chinese cohort. Individuals were categorized based on hearing loss severity and variations in three genes (GJB2, SLC26A4, MT-RNR1).

The included Cardiac Amyloidosis study25 utilized electronic health records to identify patients with cardiac amyloidosis from a dataset spanning 2008-2019, sourced from IQVIA, Inc., focusing on heart failure and amyloidosis. While the original study used external datasets for validation, these were inaccessible. Therefore, our analysis adhered to the original study’s internal validation strategy: 80% as the training set and 20% for testing, resulting in 1712 individuals for training and 430 for testing. For further information on the individual datasets, the reader is referred to Table 1 or the original studies.

Experimental design

We extracted the original training and test datasets from each clinical trial. All datasets were available in tabular format, albeit in various file formats such as comma-separated values or Excel (Microsoft Corporation). Prompting ChatGPT ADA required no modification of the data format, no specific data pre-processing or engineering, and no selection of a particular ML method. GPT-4 (ref. 13), the current state-of-the-art version of ChatGPT, was accessed online (https://chat.openai.com/) following the activation of the Advanced Data Analysis feature. Initially, we operated the August 3 (2023) version; during the project, we transitioned to the September 25 version. A new chat session was started for each trial to exclude memory retention bias.

In the first phase, ChatGPT ADA was sequentially prompted by (i) providing a brief description of the study’s background, objectives, and dataset availability, (ii) asking for developing, refining, and executing the optimal ML model based on the individual study’s framework and design, and (iii) producing patient-specific predictions (classification probabilities) without revealing the ground truth. The same training and test datasets as in the original studies were used. We deliberately refrained from offering specific ML-related guidance when ChatGPT sought advice on improving prediction accuracy. Instead, ChatGPT ADA was tasked with (i) autonomously choosing the most suitable and precise ML model for the given dataset and (ii) generating predictions for the test data. Figure 2 provides an exemplary interaction with the model.

Using the provided ground-truth test set labels, we calculated the performance metrics for ChatGPT ADA’s results in Python (v3.9) with open-source libraries such as NumPy, SciPy, scikit-learn, and pandas.

The performance metrics were compared against those published in the original studies (“benchmark publication”). In some clinical trials, the clinical care specialists’ performance was also reported, and these metrics were included for comparison. Notably, inputting and analyzing each dataset through ChatGPT ADA took less than five minutes. Detailed transcripts of the interactions with ChatGPT ADA for every dataset are presented in Supplementary Notes 1–4.

Data pre-processing and ML model development

In the second phase, a seasoned data scientist re-implemented and optimized the best-performing ML model of the original studies in Python (v3.9) with open-source libraries such as NumPy, SciPy, scikit-learn, and pandas, using the same training datasets as outlined above (“benchmark validatory re-implementation”). This re-implementation and optimization was necessary because individual patient predictions were unavailable in the original studies, precluding head-to-head model comparisons and detailed statistical analyses. More specifically, the data scientist optimized the data pre-processing and the ML model in close adherence to the original studies, complemented by his expertise and experience, while aiming for peak accuracy.

The following provides trial-specific details on the data pre-processing and the conceptualization of the specific ML models.

Metastatic disease [endocrinologic oncology]

Re-implemented (validatory) ML model: The training set contained 30 missing values, while the test set contained 15 missing values. Median values from the training set were used to impute the missing values in both datasets. Ten distinct feature vectors were constructed from the dataset variables. The feature vectors were partially categorical and partially numerical. The categorical features were: (1) previous history of pheochromocytoma or paraganglioma (yes/no), (2) adrenal/extra-adrenal location of primary tumor (adrenal/extra-adrenal), (3) presence of Succinate Dehydrogenase Complex Iron-Sulfur Subunit B (SDHB) (yes/no/not tested), (4) tumor category of primary tumor (solitary, bilateral, multifocal), and (5) sex (female/male). The numerical features were: (1) age at diagnosis of first tumor [years], (2) spherical volume of primary tumor [cm³], (3) plasma concentration of metanephrine (MN) [pg/ml], (4) plasma concentration of normetanephrine (NMN) [pg/ml], and (5) plasma concentration of methoxytyramine (MTY) [pg/ml]. Categorical data were translated into numerical integer values, e.g., female (0) and male (1) for sex. An Adaptive Boosting (AdaBoost)18 ensemble tree classifier was employed and optimized using a 10-fold cross-validation grid search. This optimization led to the selection of a maximum depth of 2 for individual decision trees, a count of 200 trees, and a learning rate of 0.01. Stagewise additive modeling was chosen, utilizing a multiclass exponential loss function.
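The pre-processing and tuning steps described for the re-implemented model can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the data scientist's actual code: the arrays, the random seeds, the `roc_auc` scoring choice, and the small parameter grid are all assumptions, and the stagewise additive modeling variant is left at the scikit-learn default because the name of that option differs between library versions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for ten integer-coded/numerical features (not study data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 10))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(60, 10))

# Knock out a few entries to mimic missing values (positions are arbitrary).
X_train[rng.integers(0, 120, 30), rng.integers(0, 10, 30)] = np.nan
X_test[rng.integers(0, 60, 15), rng.integers(0, 10, 15)] = np.nan

# The imputer learns medians on the training data only; the same medians
# are then reused when the fitted pipeline is applied to the test data.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), random_state=0)),
])

# Hypothetical grid; the text reports 200 trees and a learning rate of 0.01
# as the selected values but does not list the candidates that were searched.
param_grid = {"clf__n_estimators": [50, 200], "clf__learning_rate": [0.01, 0.1]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

test_probabilities = search.predict_proba(X_test)[:, 1]
print(search.best_params_)
```

Wrapping the imputer in the pipeline keeps the cross-validation folds from leaking validation-fold medians into training, which is one reasonable way to honor the "training-set medians only" rule.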

ChatGPT ADA-crafted ML model: A check for missing data mirrored the findings above, leading the model to resort to a median imputation strategy. Numerical data were standardized using standard scaling, while categorical data were converted to integer values. The selected classification technique was a Gradient Boosting Machine (GBM)19 with parameters set as follows: maximum tree depth: 3, number of trees: 100, minimum samples per leaf: 1, minimum samples for split: 2, and learning rate: 0.1. The logarithmic loss function was the chosen evaluation metric, with the quality of splits being evaluated using the Friedman mean squared error53. No validation dataset was incorporated, and the model was not subjected to any specific regularization techniques.
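For orientation, the configuration reported for the ChatGPT ADA-crafted model corresponds closely to scikit-learn's gradient boosting defaults. The sketch below assumes scikit-learn's `GradientBoostingClassifier` was the implementation (the text names only a GBM) and uses synthetic data with a hypothetical column layout; the logarithmic loss and the Friedman mean squared error split criterion are the library defaults and are therefore not set explicitly.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: columns 0-4 numerical, columns 5-9 integer-coded categorical.
rng = np.random.default_rng(1)
X = np.hstack([
    rng.normal(size=(200, 5)),
    rng.integers(0, 3, size=(200, 5)).astype(float),
])
y = (X[:, 0] + 0.5 * X[:, 5] > 0.5).astype(int)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

numeric_columns = [0, 1, 2, 3, 4]  # hypothetical positions of numerical features

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    # Standard-scale the numerical columns; pass the integer codes through.
    ("scale", ColumnTransformer(
        [("numeric", StandardScaler(), numeric_columns)],
        remainder="passthrough",
    )),
    # Parameters as reported in the text; loss and split criterion stay at
    # the library defaults (log loss, Friedman MSE).
    ("gbm", GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_samples_leaf=1,
        min_samples_split=2,
        random_state=0,
    )),
])
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]
```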

Esophageal cancer [gastrointestinal oncology]

Re-implemented (validatory) ML model: The training dataset included 147 feature vectors, whereas the test dataset included 169. A comprehensive list of the feature vectors can be found in the literature20. Excess feature vectors in the test set were excluded to maintain consistency, aligning it with the training dataset. Consequently, neither the training nor the test datasets contained missing values. Categorical data were mapped to numerical integer values. Imbalanced dataset distributions were addressed by conferring inverse frequency weights upon the data. In line with the original study, the data scientist selected the Light Gradient Boosting Machine (LightGBM)21 with the gradient boosting decision tree algorithm. The configuration for the classifier was as follows: an unspecified maximum tree depth, 300 trees, a cap of 31 leaves per tree, and a 0.1 learning rate. The logarithmic loss function served as the evaluation metric. The model integrated both L1 and L2 regularization techniques.
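Inverse frequency weighting, as mentioned above, assigns each sample a weight inversely proportional to the size of its class. The following sketch shows one common formulation, scikit-learn's "balanced" heuristic, on a toy label vector; whether the re-implementation used exactly this normalization is an assumption, and the LightGBM classifier itself is not reproduced here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy, imbalanced labels: eight negatives and two positives (not study data).
y_train = np.array([0] * 8 + [1] * 2)

# "balanced" weights: n_samples / (n_classes * class_count) for each sample.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

# The same quantity computed by hand, to make the formula explicit.
class_counts = np.bincount(y_train)
manual_weights = len(y_train) / (len(class_counts) * class_counts[y_train])

print(weights[0], weights[-1])  # weight of a negative, weight of a positive
```

With these weights, both classes contribute the same total weight to the loss. Such a vector is typically passed to a classifier through a `sample_weight` argument at fitting time.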

ChatGPT ADA-crafted ML model: The pre-processing mirrored the approach above, identifying a class imbalance. The selected classifier was the GBM with parameters including a maximum tree depth of 3, 100 trees, minimum samples per leaf of 1, minimum samples for a split of 2, and a learning rate of 0.1. The model’s performance was assessed using the logarithmic loss function, with the quality of tree splits evaluated using the Friedman mean squared error. No validation dataset was incorporated, and the model was not subjected to any specific regularization techniques.

Hereditary hearing loss [otolaryngology]

Re-implemented (validatory) ML model: The training and test sets included 144 feature vectors, i.e., sequence variants at 144 sites in three genes24. The values of the training set were numerical, i.e., 0 (individual has no copies of the altered allele [98.2% of the values]), 1 (individual has one copy of the altered allele [1.6%]), and 2 (individual has two copies of the altered allele [0.2%]), while only one value was missing. The values of the test set were numerical, too, with a similar distribution: 0 (98.3%), 1 (1.5%), and 2 (0.2%), while no values were missing. Missing data points were addressed by imputing the median of the training data. All feature vectors were then subject to MinMax scaling. A Support Vector Machine23 was the best-performing classifier per the original study, configured with the Radial Basis Function kernel, gamma set to 1, and enabled shrinking. Model optimization leveraged a 5-fold stratified cross-validation using grid search. The regularization cost parameter was defined at 100.
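A rough sketch of this pipeline on synthetic genotype-like data is given below. The data, the seeds, the scoring choice, and the candidate grid are assumptions made for illustration; only the selected values (RBF kernel, gamma of 1, cost parameter of 100, shrinking enabled, 5-fold stratified cross-validation) come from the text. Setting `probability=True` is also an assumption, added so that classification probabilities can be produced.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic allele counts (0, 1, 2) at 144 sites, mostly zeros (not study data).
rng = np.random.default_rng(2)
allele_probs = [0.982, 0.016, 0.002]
X_train = rng.choice([0.0, 1.0, 2.0], size=(200, 144), p=allele_probs)
X_test = rng.choice([0.0, 1.0, 2.0], size=(50, 144), p=allele_probs)
# Hypothetical label: any altered allele among the first 20 sites.
y_train = (X_train[:, :20].sum(axis=1) > 0).astype(int)
X_train[0, 3] = np.nan  # a single missing value, as in the training set

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # training-set medians
    ("scale", MinMaxScaler()),
    ("svm", SVC(kernel="rbf", shrinking=True, probability=True, random_state=0)),
])

# Hypothetical candidate grid that contains the reported values.
param_grid = {"svm__C": [1, 10, 100], "svm__gamma": [0.1, 1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

test_probabilities = search.predict_proba(X_test)[:, 1]
print(search.best_params_)
```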

ChatGPT ADA-crafted ML model: The pre-processing was closely aligned with the methodology above, with one notable exception: Missing data was addressed by zero-imputation. The classifier chosen was the Random Forest (RF)22, with the following framework parameters: no explicitly defined maximum depth for individual trees, tree count of 100, minimum samples per leaf of 1, and minimum samples per split of 2. At each split, the features considered were the square root of the total features available. 5-fold cross-validation was employed without the use of a grid search. Regularization was achieved by averaging predictions across multiple trees. Bootstrapping was chosen to create diverse datasets for training each decision tree in the forest.

Cardiac amyloidosis [cardiology]

Re-implemented (validatory) ML model: The dataset comprised 1874 numerical (0 or 1, indicating the presence or absence) feature vectors25. There was no value missing in the dataset. The feature vectors underwent standard scaling for normalization. The classifier chosen was the RF, with the following parameters: maximum depth for individual trees of 20, total tree number of 200, minimum samples per leaf of 2, and minimum samples per split of 5. For each tree split, the square root of the total features determined the number of features considered. A 5-fold cross-validation was combined with a grid search for optimization. Regularization was effectuated by averaging the predictions over multiple trees. The model did not utilize bootstrapping.

ChatGPT ADA-crafted ML model: As there was no missing value in the dataset and the values were binary, the data underwent no scaling or standardization. The selected classifier was the RF. Parameters for the model were as follows: an unspecified maximum depth for individual trees, a tree count of 1000, minimum samples per leaf of 1, and minimum samples per split of 2. For each tree split, the features considered were the square root of the total feature count. The model was validated using 5-fold cross-validation without grid search. Regularization was achieved by averaging predictions across several trees, and the model utilized bootstrapping22,54.
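To make the difference between the two Random Forest configurations concrete, the sketch below instantiates both with the parameters reported above and scores them by 5-fold cross-validation. It assumes scikit-learn's `RandomForestClassifier`, uses a small synthetic binary feature matrix, omits the standard scaling and grid search of the re-implemented model (only its selected values appear), and picks AUROC as the cross-validation score for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary presence/absence features (not study data).
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 30))
y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)

models = {
    # Re-implemented (validatory) model: depth-limited trees, no bootstrapping.
    "reimplemented": RandomForestClassifier(
        n_estimators=200, max_depth=20, min_samples_leaf=2,
        min_samples_split=5, max_features="sqrt", bootstrap=False,
        random_state=0,
    ),
    # ChatGPT ADA-crafted model: unrestricted depth, 1000 bootstrapped trees.
    "chatgpt_ada": RandomForestClassifier(
        n_estimators=1000, max_depth=None, min_samples_leaf=1,
        min_samples_split=2, max_features="sqrt", bootstrap=True,
        random_state=0,
    ),
}

cv_auroc = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    cv_auroc[name] = scores.mean()
    print(f"{name}: mean 5-fold AUROC = {cv_auroc[name]:.3f}")
```

Without bootstrapping, the trees of the first forest differ only through the random feature subsets drawn at each split, whereas the second forest additionally trains every tree on a resampled copy of the data.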

Because ChatGPT ADA provides all intermediary Python code during data pre-processing and ML model development and execution, we meticulously analyzed the code for accuracy, consistency, and validity.

Explainability analysis

We used SHapley Additive exPlanations (SHAP)26 to analyze feature contributions to the model’s predictions. ChatGPT ADA was tasked with autonomously performing a SHAP analysis restricted to the top 10 features. To ensure accuracy, the seasoned data scientist (S.T.A., with five years of experience) reviewed the Python code provided by ChatGPT ADA and re-implemented the procedure in Python using the SHAP library26 with TreeExplainer55 to confirm the model’s outputs.

Reproducibility analysis

We evaluated the consistency of the tool’s responses using separate chat sessions (to avoid memory retention bias), yet the same datasets, instructions, and prompts on three consecutive days. The model consistently reported the same responses and qualitative and quantitative findings.

Statistical analysis and performance evaluation

The quantitative performance evaluation was performed using Python (v3.9) and its open-source libraries, such as NumPy and SciPy. Unless noted otherwise, performance metrics are presented as mean, standard deviation, and 95% confidence interval (CI) values.

Using the published ground-truth labels from the original studies as reference (“benchmark publication”), we calculated a range of performance metrics based on ChatGPT ADA’s predictions of the test set labels: AUROC, accuracy, F1-score, sensitivity, and specificity. These performance metrics are presented alongside those reported in the original studies, if available (Table 2).
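As an illustration of how these five metrics can be derived from per-patient probabilities, consider the sketch below. The function names and the toy labels and scores are hypothetical; the decision threshold is picked by maximizing Youden's J (sensitivity + specificity − 1), in line with the criterion this section names for the thresholded metrics.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def youden_threshold(y_true, y_score):
    """Return the score threshold that maximizes J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
    return thresholds[np.argmax(tpr - fpr)]


def summarize(y_true, y_score):
    """Compute AUROC plus thresholded metrics at the Youden threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    threshold = youden_threshold(y_true, y_score)
    y_pred = (y_score >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "threshold": float(threshold),
    }


# Toy example (not study data): five negatives followed by five positives.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.15, 0.6, 0.65, 0.8, 0.9]
metrics = summarize(labels, scores)
print(metrics)
```

In this toy case the Youden threshold is 0.6, which yields four true positives, four true negatives, one false positive, and one false negative.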

Once the per-patient predictions were available following the re-implementation and optimization of the select ML models (“benchmark validatory re-implementation”), we calculated the performance metrics outlined above using the ground-truth labels for the re-implemented (validatory) ML models and their ChatGPT ADA-based counterparts. We adopted bootstrapping54 with replacement and 1000 redraws on the test sets to ascertain the statistical spread (in terms of means, standard deviations, and 95% confidence intervals), and to determine if the metrics were significantly different. We adjusted for multiple comparisons based on the false discovery rate, setting the family-wise alpha threshold at 0.05. Notably, the comparative evaluation of the performance metrics was conducted in a paired manner. Bootstrapping was applied to both models. The threshold for calculating the F1-score, sensitivity, and specificity was chosen based on Youden’s criterion56.
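A paired bootstrap comparison of this kind might look as follows. The sketch is restricted to AUROC and rests on several assumptions that the text does not specify: the function names, the percentile-based confidence intervals, the way the two-sided p-value is derived from the bootstrap distribution of the paired difference, and the use of the Benjamini-Hochberg procedure for the false discovery rate adjustment. The data are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def paired_bootstrap_auroc(y_true, score_a, score_b, n_redraws=1000, seed=0):
    """Resample patients with replacement; score both models on each redraw."""
    y_true, score_a, score_b = map(np.asarray, (y_true, score_a, score_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    auc_a, auc_b = [], []
    while len(auc_a) < n_redraws:
        idx = rng.integers(0, n, size=n)  # same indices for both models (paired)
        if np.unique(y_true[idx]).size < 2:
            continue  # AUROC is undefined if a redraw contains one class only
        auc_a.append(roc_auc_score(y_true[idx], score_a[idx]))
        auc_b.append(roc_auc_score(y_true[idx], score_b[idx]))
    auc_a, auc_b = np.array(auc_a), np.array(auc_b)
    diff = auc_a - auc_b
    # Assumed definition: two-sided p-value from the sign of the differences.
    p_value = min(1.0, 2 * min(np.mean(diff <= 0), np.mean(diff >= 0)))

    def spread(values):
        low, high = np.percentile(values, [2.5, 97.5])
        return {"mean": values.mean(), "sd": values.std(ddof=1), "ci95": (low, high)}

    return spread(auc_a), spread(auc_b), p_value


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values (Benjamini-Hochberg step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Synthetic test set: model A is less noisy than model B (not study data).
rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=200)
scores_a = y + rng.normal(0, 0.8, size=200)
scores_b = y + rng.normal(0, 1.5, size=200)

stats_a, stats_b, p = paired_bootstrap_auroc(y, scores_a, scores_b)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(stats_a, stats_b, p, adjusted)
```

Drawing one index vector per redraw and applying it to both models is what makes the comparison paired: each bootstrap difference is computed on exactly the same resampled patients.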

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
