How LinkedIn Uses Machine Learning to Address Content-related Threats and Abuse

To help detect and remove content that violates its standard policies, LinkedIn has been using its AutoML framework, which trains classifiers and experiments with multiple model architectures in parallel, explain LinkedIn engineers Shubham Agarwal and Rishi Gupta.

We use AutoML to continuously re-train our existing models, decreasing the time required from months to a matter of days, and to reduce the time needed to develop new baseline models. This enables us to take a proactive stance against emerging and adversarial threats.

A key challenge in content moderation is that enforcement must be tuned continuously to counter new strategies devised to circumvent it. It must also adapt to contextual changes. These include data drift, i.e., inherent changes in the content posted on the platform as conversations evolve; global events, which tend to surface in discussions and trigger diverse viewpoints, frequently riddled with misinformation; and adversarial threats, which include fraudulent and deceptive practices such as creating fake profiles and running scams.
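Data drift of the kind described above can be monitored with a simple distribution comparison. The following is a minimal sketch, not LinkedIn's method: it assumes that a numeric signal is logged per post (for example, a classifier score) and uses the Population Stability Index, a common drift statistic, to compare a reference window against recent traffic. The function name, the bin count, and the 0.25 alert threshold (a frequently quoted rule of thumb) are illustrative assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of a numeric signal; larger values mean more drift."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    # Bin edges are derived from the reference window only.
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference signal: the reference lands in one bin

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clipped to the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps avoids log(0) when a bin is empty in one of the samples.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical usage: scores from last month versus scores from this week.
last_month = [i / 100 for i in range(100)]
this_week = [score + 0.3 for score in last_month]
if population_stability_index(last_month, this_week) > 0.25:
    print("signal drifted: schedule retraining")
```

A check like this could serve as one trigger for retraining, alongside a fixed schedule.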

To address all of those challenges, LinkedIn uses an approach aimed at “proactive detection”, which requires a process of continuously adapting and evolving its ML models and systems. AutoML, short for Automated Machine Learning, is a tool LinkedIn created internally to improve machine learning performance by continuously retraining models on new data, correcting their errors, including false negatives and false positives, and fine-tuning their parameters.

Leveraging AutoML, we transformed what used to be a lengthy and intricate process into one which is both streamlined and efficient. […] After implementing AutoML, we saw the average time required for developing new baseline models and continuously re-training existing ones shrink from two months to less than a week.

Using AutoML, LinkedIn engineers automated the process of data preparation and feature transformation, including noise reduction, dimensionality reduction, and feature engineering, with the aim of creating a high-quality dataset for classifier training.
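As a rough illustration of what an automated preparation stage can look like, the sketch below chains three simple steps. It is not taken from LinkedIn's framework: every function name is hypothetical, label-conflict filtering stands in for noise reduction, and dropping near-constant columns stands in for dimensionality reduction, which in practice would use richer techniques.

```python
import statistics
from typing import Dict, List, Sequence, Set, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, 0/1 label)


def drop_conflicting_rows(rows: List[Row]) -> List[Row]:
    """Noise reduction: deduplicate rows and drop vectors with contradictory labels."""
    labels: Dict[tuple, Set[int]] = {}
    for features, label in rows:
        labels.setdefault(tuple(features), set()).add(label)
    return [(list(f), next(iter(seen))) for f, seen in labels.items() if len(seen) == 1]


def informative_columns(rows: List[Row], min_variance: float = 1e-8) -> List[int]:
    """Crude dimensionality reduction: keep only columns that actually vary."""
    columns = list(zip(*(features for features, _ in rows)))
    return [i for i, col in enumerate(columns) if statistics.pvariance(col) > min_variance]


def standardize(rows: List[Row], keep: List[int]) -> List[Row]:
    """Feature transformation: rescale kept columns to zero mean, unit variance."""
    columns = list(zip(*(features for features, _ in rows)))
    stats = {i: (statistics.fmean(columns[i]), statistics.pstdev(columns[i])) for i in keep}
    return [
        ([(features[i] - stats[i][0]) / stats[i][1] for i in keep], label)
        for features, label in rows
    ]


def prepare(rows: List[Row]) -> Tuple[List[Row], List[int]]:
    """Run the three steps in order and report which columns survived."""
    clean = drop_conflicting_rows(rows)
    keep = informative_columns(clean)
    return standardize(clean, keep), keep
```

Returning the surviving column indices lets the same selection be reapplied to production traffic at inference time.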

In a second phase, AutoML experiments with different classifier architectures by searching over a range of hyperparameters and optimization approaches and comparing the performance of the resulting models based on a set of specified evaluation metrics.
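The search phase can be pictured as a loop over model families and hyperparameter combinations, with each candidate scored on held-out data. The sketch below is an assumption-laden toy, not LinkedIn's implementation: the two model families (k-nearest neighbours and nearest centroid), the grid values, and the choice of F1 as the evaluation metric are all illustrative, and the candidates run sequentially rather than in parallel. It also assumes both classes appear in the training data.

```python
import itertools
import math
from typing import Callable, Dict, List, NamedTuple, Sequence, Tuple

Example = Tuple[Sequence[float], int]      # (feature vector, 0/1 label)
Model = Callable[[Sequence[float]], int]


def train_knn(train: List[Example], k: int) -> Model:
    def predict(x: Sequence[float]) -> int:
        nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
        # Majority vote among the k closest training examples.
        return int(2 * sum(label for _, label in nearest) > len(nearest))
    return predict


def train_centroid(train: List[Example], margin: float) -> Model:
    def centroid(label: int) -> List[float]:
        points = [f for f, y in train if y == label]
        return [sum(col) / len(points) for col in zip(*points)]

    neg, pos = centroid(0), centroid(1)

    def predict(x: Sequence[float]) -> int:
        # A positive margin makes the model more reluctant to flag content.
        return int(math.dist(x, pos) + margin < math.dist(x, neg))
    return predict


def f1_score(model: Model, data: List[Example]) -> float:
    tp = sum(1 for x, y in data if y == 1 and model(x) == 1)
    fp = sum(1 for x, y in data if y == 0 and model(x) == 1)
    fn = sum(1 for x, y in data if y == 1 and model(x) == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


class Result(NamedTuple):
    family: str
    params: Dict[str, float]
    f1: float


# Hypothetical search space: trainer function plus a grid per family.
SEARCH_SPACE = {
    "knn": (train_knn, {"k": [1, 3]}),
    "centroid": (train_centroid, {"margin": [0.0, 0.2]}),
}


def search(train: List[Example], validation: List[Example]) -> List[Result]:
    """Train every configuration and return a leaderboard, best F1 first."""
    results = []
    for family, (trainer, grid) in SEARCH_SPACE.items():
        names = list(grid)
        for values in itertools.product(*(grid[name] for name in names)):
            params = dict(zip(names, values))
            model = trainer(train, **params)
            results.append(Result(family, params, f1_score(model, validation)))
    return sorted(results, key=lambda r: r.f1, reverse=True)
```

An exhaustive grid is the simplest strategy to show; random or Bayesian search would slot into the same loop by changing how the parameter combinations are generated.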

Finally, AutoML automates the deployment process by making the newly trained model available to production servers.
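One way to picture automated deployment is a registry in which a pointer marks the model currently served in production. The sketch below is purely hypothetical: the file-based registry, the `production.json` pointer, and the rule that a candidate must beat the live model's metric before being promoted are assumptions for illustration, not details reported by the LinkedIn engineers.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

POINTER_NAME = "production.json"  # hypothetical pointer file read by serving hosts


def current_production(registry: Path) -> Optional[dict]:
    """Return the record of the live model, or None if nothing is deployed yet."""
    pointer = registry / POINTER_NAME
    return json.loads(pointer.read_text()) if pointer.exists() else None


def promote_if_better(
    registry: Path, version: str, metric: float, min_gain: float = 0.0
) -> bool:
    """Promote a candidate only when it beats the live model by more than min_gain."""
    live = current_production(registry)
    if live is not None and metric <= live["metric"] + min_gain:
        return False

    # Write to a temporary file first, then swap it in with os.replace so that
    # readers never observe a half-written pointer.
    fd, tmp_name = tempfile.mkstemp(dir=str(registry), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"version": version, "metric": metric}, handle)
    os.replace(tmp_name, registry / POINTER_NAME)
    return True
```

Keeping the promotion decision in a single function makes it easy to add further guards later, such as a maximum allowed false-positive rate.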

According to Agarwal and Gupta, there are still a few areas where their tool needs to mature, specifically to improve speed and efficiency and to enable its adoption on a larger scale, which will eventually increase its computing power requirements. Another promising area, they say, is using generative AI to improve the quality of datasets, both to reduce labeling noise and to generate synthetic data for model training.

While not all organizations operate at LinkedIn's scale or have the resources to create their own ML automation tools, the approach described by Agarwal and Gupta may be replicated at a smaller scale to relieve machine learning engineers of the most repetitive tasks associated with retraining existing models.
