
Forward Learning of Large Language Models by Consumer Devices



Since its introduction in 2017 [1], the Transformer architecture has revolutionized the field of Natural Language Processing (NLP), achieving unprecedented accuracy on a plethora of complex tasks such as translation [2], question answering [3], sentiment analysis [4,5], text classification [6], textual entailment [7] and summarization [8]. With the staggering growth of IoT devices, deploying such powerful models at the edge has become more challenging than ever. At the same time, it is known from theory [9] that the performance of a Transformer-based language model scales as a power law with respect to the number of parameters. For this reason, recent years have seen a trend towards ever-larger model footprints, culminating in the notorious 175-billion-parameter GPT-3 [10]. The resulting high memory requirements and heavy computational workloads are incompatible with microcontrollers (MCUs) and sensor-embedded assets, which are severely resource-constrained. To mitigate this issue, a broad line of research [11] has focused on reducing the computational complexity and memory usage of inference through well-known techniques such as pruning [12,13], quantization [12,14] and knowledge distillation [15].

However, enabling only inference on the device is not enough. The performance of AI models deteriorates as time passes since the last training cycle, a phenomenon known as concept drift [16], which mandates updating the model's parameters from time to time. The prominent field of on-device learning (ODL) [17] allows machine learning (ML) models deployed on edge devices to adapt to the continuously changing statistics of the data collected by the sensors by training the model's parameters directly on the device.

This paper explores the application of two novel learning methods, PEPITA and MEMPEPITA, to Transformer-based Large Language Models (LLMs) for ODL on edge devices. It quantitatively analyzes the reductions in computational complexity and memory usage that these methods offer over traditional backpropagation (BP), especially in resource-constrained environments. The memory and complexity figures of BP, PEPITA and MEMPEPITA are evaluated and reported for state-of-the-art models such as GPT-3 Small, DistilBERT and AlexaTM. The suitability of these training methods for real-world edge devices is also investigated, taking into account their memory and processing-power constraints.

This work is organized as follows: Section 2 states the research question addressed in the subsequent sections; Section 3 reports the background literature on Transformers and Large Language Models to date; Section 4 reviews relevant related work; Section 5 proposes the application of the PEPITA [18] training procedure to the Transformer architecture; Section 6 builds on the method proposed in Section 5, quantitatively analyzes different learning algorithms on DistilBERT [19], GPT-3 Small [10] and AlexaTM [20], and reports the results of the analysis; Section 7 discusses and compares the application of BP [21], PEPITA and MEMPEPITA [22] to Large Language Models on the basis of the Section 6 results; Section 8 provides total RAM usage and latency estimations based on candidate microprocessors and comments on the applicability to consumer edge devices; Section 9 concludes the paper and proposes further developments.
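For context, the parameter-count scaling referred to above is usually summarised by a power law of the form below. The constants are those reported in Kaplan et al.'s scaling-law study and are quoted here purely as an illustration of the trend; they are not results of this paper, and [9] may report a different parameterisation.

```latex
% Illustrative form of the parameter-count scaling law:
% test loss L as a function of the number of non-embedding parameters N.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}
```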
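As an illustration of the inference-time compression mentioned above, the snippet below sketches symmetric per-tensor int8 post-training quantization in NumPy. It is a generic textbook example rather than the specific schemes of [12,14]; the function names and the toy 4x4 tensor are assumptions made for the example.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 post-training quantization (illustrative only)."""
    scale = np.max(np.abs(w)) / 127.0       # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy usage: quantize a random weight tensor and check the reconstruction error.
w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, s))))  # worst-case error is on the order of scale/2
```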
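Since PEPITA is central to what follows, here is a minimal NumPy sketch of its two-forward-pass update on a toy two-layer fully connected network, following the general error-driven input modulation idea of [18]. The layer sizes, learning rate and variable names are illustrative assumptions; the formulation for Transformer blocks is developed in Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 64, 128, 10

# Toy two-layer network: W1 and W2 are trained, F is a fixed random
# projection that feeds the output error back onto the input.
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))
F  = rng.normal(0.0, 0.1, (n_in, n_out))
lr = 0.01

def relu(z):
    return np.maximum(z, 0.0)

def pepita_step(x, y_target):
    """One PEPITA-style update: two forward passes, no backward pass."""
    global W1, W2
    # 1) Standard forward pass on the clean input
    h1 = relu(W1 @ x)
    y_hat = W2 @ h1
    err = y_hat - y_target
    # 2) Second forward pass on the error-modulated input
    x_mod = x + F @ err
    h1_mod = relu(W1 @ x_mod)
    # 3) Local, layer-wise updates from the difference of the two passes
    W1 -= lr * np.outer(h1 - h1_mod, x_mod)
    W2 -= lr * np.outer(err, h1_mod)
    return float(np.mean(err ** 2))

# Example usage on random data with a one-hot target
x = rng.normal(size=n_in)
y = np.eye(n_out)[3]
print(pepita_step(x, y))
```

Note that even in this forward-only scheme the first-pass activations are kept until the modulated pass reaches the same layer; MEMPEPITA [22] is reported to trade an additional forward pass for a reduction of this activation memory, which is the trade-off the later sections quantify.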


