
Decoding Dataset Bias – Machine Learning Unveils Rifts in Human Risky Choices


Researchers from the Technical University of Darmstadt and the Hessian Center for Artificial Intelligence in Germany have examined the intricate relationship between datasets and models in the study of human risky choices.

Their findings, published in the journal Nature Human Behaviour, reveal a dataset bias: nuanced but systematic differences in decision behavior between participants in online and laboratory experiments. Employing machine learning techniques, the researchers not only identify the bias but also propose a novel hybrid model to bridge the gap created by increased decision noise in online datasets.

The interplay between decision datasets and ML models

Understanding the interplay between decision datasets and machine learning (ML) models is crucial to unraveling the complexities of human decision-making. The German team systematically examined this relationship using three distinct datasets: Choice Prediction Competition 2015 (CPC15), Choice Prediction Competition 2018 (CPC18), and Choices13k.

These datasets represent a spectrum of choices made by participants in both controlled laboratory environments and large-scale online experiments. The team trained various ML models, including classical methods and neural network architectures, on these datasets to examine performance variations and biases.

Delving deeper, the study found that models trained on the Choices13k dataset, collected in large-scale online experiments, generalized poorly to the smaller laboratory datasets (CPC15 and CPC18). Likewise, models trained on CPC15 did not transfer their predictive power to the Choices13k dataset, revealing a systematic dataset bias.

This bias pointed to notable differences in choice behaviors between participants engaged in laboratory experiments and those participating online. These findings underscore the importance of recognizing and addressing dataset bias, especially when dealing with diverse contexts and sources of data.
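
The transfer test at the heart of this analysis is straightforward to reproduce in outline. The sketch below trains a regressor on one dataset's gamble features and evaluates it on every dataset; the file names, column names, and choice of model are illustrative assumptions, not the authors' exact pipeline. A large jump in error when the train and test datasets differ is the signature of the bias described above.

```python
# A minimal sketch of the cross-dataset transfer test. File names and
# column names are hypothetical; the study's actual pipeline and models differ.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def load_gambles(path):
    """Load gamble features X and the observed proportion of
    participants choosing option B (a value between 0 and 1)."""
    df = pd.read_csv(path)
    y = df["choice_rate_B"].to_numpy()           # hypothetical target column
    X = df.drop(columns=["choice_rate_B"]).to_numpy()
    return X, y

datasets = {
    "CPC15": load_gambles("cpc15.csv"),          # hypothetical file names
    "Choices13k": load_gambles("choices13k.csv"),
}

# Train on each dataset and evaluate on all of them; a large gap between
# within-dataset and cross-dataset error signals dataset bias.
for train_name, (X_tr, y_tr) in datasets.items():
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in datasets.items():
        mse = mean_squared_error(y_te, model.predict(X_te))
        print(f"train={train_name:11s} test={test_name:11s} MSE={mse:.4f}")
```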

Unraveling the dataset bias

The revelation of dataset bias prompts further investigation into its roots and implications. The study found that models trained on the Choices13k dataset were reluctant to predict extreme choice proportions (values near 0 or 1), indicating a distinctive decision-making pattern among online participants compared to their laboratory counterparts.

To dissect the source of this bias, the researchers analyzed which gamble features were predictive of the difference in choice behavior between datasets. Using techniques such as linear regression and SHapley Additive exPlanations (SHAP), they quantified the importance of each feature. Surprisingly, features from the psychology and behavioral economics literature, such as stochastic dominance, probability of winning, and the difference in expected value, played a pivotal role in driving the bias.

These features all capture the degree to which one gamble is expected to yield a higher payoff than another, underscoring the complexity of human decision-making. Importantly, the study highlighted that choice behavior in the Choices13k dataset was less sensitive to these features than in the CPC15 dataset, suggesting that online participants showed more noise or indifference in their decision-making. This nuanced understanding of dataset bias and its roots sets the stage for strategies to mitigate its impact and refine predictive models in diverse decision-making contexts.
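
As a rough illustration of the attribution step, the sketch below fits a tree-based model to predict the between-dataset difference in choice proportions from gamble features, then ranks each feature by its mean absolute SHAP value. The feature names and the placeholder data are invented for the example; only the general technique (SHAP applied to a fitted model) mirrors the paper.

```python
# Illustrative SHAP feature-attribution sketch; feature names and data
# are placeholders, not the study's actual gambles.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = ["prob_win_B", "expected_value_diff", "stochastic_dominance"]
X = pd.DataFrame(rng.random((500, 3)), columns=features)  # placeholder features
y_diff = rng.random(500)  # placeholder target: per-gamble difference in
                          # choice rate between the two datasets

model = GradientBoostingRegressor(random_state=0).fit(X, y_diff)

# TreeExplainer distributes each prediction of the between-dataset gap
# across the input features; mean |SHAP| ranks overall importance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(features, mean_abs), key=lambda t: -t[1]):
    print(f"{name:22s} mean |SHAP| = {score:.4f}")
```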

Analyzing features and proposing a hybrid model

With a comprehensive understanding of dataset bias and its implications, the researchers proposed a novel solution: a hybrid model. The model addresses the increased decision noise observed in online datasets by combining a probabilistic generative model with a neural network trained on the CPC15 dataset. The generative model assumes that a proportion of participants in the online experiment were guessing at random, while the remaining participants followed the decision patterns learned from the laboratory dataset, as sketched below.
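
A minimal sketch of that mixture idea follows, assuming a single guessing proportion eps fitted by least squares; the function names and synthetic data are illustrative, not the authors' implementation. Random guessers choose either gamble with probability 0.5, so the hybrid prediction shrinks the lab-trained network's output toward 0.5.

```python
# Hybrid-model sketch: a fraction eps of online participants guess at
# random (choice rate 0.5); the rest follow the lab-trained model.
# Names and synthetic data are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

def hybrid_prediction(p_lab, eps):
    """Mix the lab-trained network's predicted choice rate p_lab
    with pure guessing: eps * 0.5 + (1 - eps) * p_lab."""
    return eps * 0.5 + (1.0 - eps) * p_lab

def fit_guessing_rate(p_lab, y_online):
    """Find eps in [0, 1] minimizing squared error against the
    observed online choice proportions y_online."""
    loss = lambda eps: np.mean((hybrid_prediction(p_lab, eps) - y_online) ** 2)
    return minimize_scalar(loss, bounds=(0.0, 1.0), method="bounded").x

# Synthetic demo: p_lab stands in for the CPC15-trained network's output
# on each online gamble, y_online for observed Choices13k proportions.
rng = np.random.default_rng(0)
p_lab = rng.random(1000)
y_online = np.clip(hybrid_prediction(p_lab, 0.35) + rng.normal(0, 0.02, 1000), 0, 1)

eps_hat = fit_guessing_rate(p_lab, y_online)
print(f"estimated guessing proportion eps = {eps_hat:.3f}")  # ~0.35
```

Because guessing pulls every predicted proportion toward 0.5, this mixture naturally reproduces the dampened, less extreme choice rates observed in the online data.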

Integrating this hybrid model significantly improved prediction accuracy and narrowed the gap left by a traditional neural network trained solely on the CPC15 dataset. This approach not only offered a practical remedy for the dataset bias but also highlighted the importance of accounting for the unique characteristics of online datasets when developing accurate and robust predictive models of human decision-making.

The research showcased the intricate relationship between ML models and human decision datasets, emphasizing the presence and impact of dataset bias. The study highlighted the challenges of relying solely on large-scale online datasets when developing general theories of human decision-making.

It underscored the necessity of a balanced approach that combines ML techniques, data analysis, and theory-driven reasoning to navigate the complexities of human risky choices. The research also opens avenues for future exploration: how can ML models be refined and validated to account for the variability and noise inherent in online data, paving the way for a more robust understanding of human decision-making across contexts and experimental settings? The quest for answers continues, urging researchers to integrate theoretical and analytical frameworks to unravel human decision-making in an increasingly digital age.

Source: https://www.cryptopolitan.com/dataset-bias-machine-learning-risky-choices/


