Generative AI is taking the tech world – and the broader world – by storm, but relatively little word has come out of the major supercomputer centers amid the influx of generative models. Now, Finnish IT center CSC has announced that its massive EuroHPC supercomputer, LUMI, powered a research project that produced the “largest Finnish language model ever.”
LUMI is still early in its lifespan, having just hit the Top500 list last May, where it placed third (a position it retained this November). The research team was able to complete the model – called TurkuNLP – because they had been granted special access as one of around thirty pilot projects for LUMI’s GPU partition.
The GPT-3-style model has 13 billion parameters and was trained entirely on Finnish-language text. This makes TurkuNLP the largest Finnish language model to date, following in the footsteps of smaller Finnish language models that the group also developed during their pilot project. Further, the team “taught” Finnish to the 176 billion-parameter BLOOM model.
As a pilot project, the team did encounter some speed bumps, eventually achieving around 75-80% of the performance they expected on LUMI, which they called “probably a good number.” But interestingly, the team was restricted not by computational power, but by the amount of Finnish-language data available for training. “All in all, there is simply not enough Finnish available in digital format for us to train the model for more than 100 billion parameters using Finnish alone,” explained Filip Ginter, a professor at the University of Turku (the model’s namesake), in an interview with CSC’s Anni Jakobsson. He joked: “Finns talk little and there are not that many of them.”
Ensuring prosocial behavior in generative AI models is an increasing concern, with some early models going rogue in off-putting ways or reflecting biases and prejudices present in their training data. This was top of mind for the researchers, who sought to ameliorate these issues through data labeling.
“We trained our language model with very high-quality data that meets EU requirements,” explained Sampo Pyysalo, a researcher at the University of Turku. “By classifying different text types, we have a better-than-average understanding of what kind of data the model has read, and we were able to eliminate the most toxic and problematic texts from the model. For example, compared to previous models, we were able to cut the model’s spontaneous swearing in half.”
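Pyysalo’s description amounts to classifier-based corpus filtering: label every document with a text type, then keep only the categories you trust. The short Python sketch below is purely illustrative – the labels, allow-list, and sample documents are invented for this article, not taken from the team’s actual pipeline:

```python
from typing import Iterable, Tuple

# Illustrative only: assume an upstream classifier has already tagged each
# document with a predicted text type ("news", "encyclopedia", "toxic", ...).
ALLOWED_TYPES = {"news", "encyclopedia", "literature", "forum"}

def filter_by_text_type(labeled_docs: Iterable[Tuple[str, str]]) -> list[str]:
    """Keep documents whose predicted text type is on the allow-list,
    dropping anything the classifier flagged as toxic or problematic."""
    return [doc for doc, label in labeled_docs if label in ALLOWED_TYPES]

corpus = [
    ("Turun yliopisto perustettiin vuonna 1920.", "encyclopedia"),
    ("(an abusive forum post)", "toxic"),
]
print(filter_by_text_type(corpus))  # only the encyclopedia sentence survives
```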
“Data preprocessing is a very important part of the training of language models,” added Ginter. “We eliminated hate speech from the data entered into language models and also deleted any personal data, such as personal identity codes, phone numbers as well as physical and electronic addresses. This way, we control what the language model learns and what it generates when used.”
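Ginter’s point about stripping personal identity codes, phone numbers, and addresses is, at its core, pattern-based scrubbing. The sketch below is a heavily simplified illustration – the regular expressions are toy approximations (real pipelines rely on far more robust detectors), and the sample sentence is invented:

```python
import re

# Simplified, illustrative patterns -- real PII detection needs to be far more robust.
PII_PATTERNS = {
    # Finnish personal identity code, e.g. 131052-308T (simplified century markers)
    "hetu": re.compile(r"\b\d{6}[-+A]\d{3}[0-9A-Y]\b"),
    # Finnish-style phone numbers (very rough)
    "phone": re.compile(r"(\+358|0)\s?\d(?:[\s-]?\d){5,9}"),
    # Email addresses
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with type-specific placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text

print(scrub_pii("Call 040 123 4567 or email matti@example.fi, ID 131052-308T."))
# -> "Call <PHONE> or email <EMAIL>, ID <HETU>."
```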
The open model has now been published on the internet, and the team will continue its work in the LUMI Extreme Scale project, aided by a grant of two million GPU-hours.
To learn more about this research, read the reporting from CSC’s Anni Jakobsson here.