Apple researchers have documented (pdf) a new method for allowing Large Language Models (LLMs) to run on-device, with a unique approach to overcoming RAM limitations on mobile devices. The full version of an LLM like OpenAI's GPT-4 has around 1.7 trillion parameters and requires powerful servers to handle the processing. However, Google's new Gemini AI – which it claims can beat GPT-4 – comes in a 'Nano' flavor for smartphones and uses quantization techniques to cut the model down to either 1.8 billion or 3.6 billion parameters. One of these Gemini Nano variants is currently running on Google's Pixel 8 Pro smartphones.
Qualcomm claims that its new Snapdragon 8 Gen 3 SoC can support generative AI LLMs of up to 10 billion parameters – considerably more than what Google has running on the Pixel 8 series, but still a far cry from the 1.7 trillion parameters needed to make GPT-4 function as impressively as it does. Quantization, which makes LLMs easier for mobile SoCs to process, also means they lose some accuracy and effectiveness. As such, anything that increases the size of the model that can be shoehorned onto a mobile device should improve the performance of the on-device LLM.
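To give a rough sense of the trade-off quantization involves (this is a minimal illustrative sketch, not Google's or Qualcomm's actual pipeline, and the tensor sizes and function names are made up for the example), here is what simple symmetric 8-bit weight quantization looks like: storage drops to a quarter of the fp32 footprint, but every weight picks up a small rounding error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(weights).max() / 127.0            # map the largest weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Toy "layer" of weights standing in for a slice of an LLM
weights = np.random.randn(4096, 4096).astype(np.float32)

q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

print(f"Storage: {weights.nbytes / 1e6:.0f} MB fp32 -> {q.nbytes / 1e6:.0f} MB int8")
print(f"Mean absolute rounding error: {np.abs(weights - recovered).mean():.5f}")
```

Those per-weight rounding errors are what accumulate into the loss of accuracy mentioned above, which is why vendors try to keep bit-widths as high as the hardware allows.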
In order for smartphones to handle generative AI tasks on-device, the RAM requirements are also considerable. An LLM quantized to 8 bits per parameter with 7 billion parameters (like Meta's Llama 2, which is supported by the Snapdragon 8 Gen 3) would require a smartphone with at least 7GB of RAM just to hold the weights. The iPhone 15 Pro series features 8GB of RAM, which suggests an LLM of Llama 2's size would be at the upper limit of what current iPhones could support. Apple's researchers have found a way around this onboard RAM limit.
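The back-of-the-envelope maths behind that 7GB figure is straightforward; the short snippet below is an illustrative calculation (not any vendor's sizing tool) showing how the weight footprint scales with parameter count and bit-width.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the model weights, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: ~{weight_footprint_gb(7e9, bits):.1f} GB")

# 7B parameters at 16-bit: ~14.0 GB
# 7B parameters at 8-bit:  ~7.0 GB
# 7B parameters at 4-bit:  ~3.5 GB
```

Note that this covers the weights alone; activations, the KV cache and the operating system all compete for the same pool of RAM, which is why an 8GB phone is already tight for a 7-billion-parameter model.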
In a research paper titled “LLM in a flash: Efficient Large Language Model Inference with Limited Memory,” Apple’s generative AI researchers describe a method of using an iPhone’s flash storage to supplement the device’s onboard system RAM. Flash storage bandwidth is not in the same league as LPDDR5/LPDDR5X mobile RAM, but Apple’s researchers work around this inherent limitation with a combination of “windowing” (where the AI model reuses data it has already loaded from flash storage rather than reading it again) and “row-column bundling” (which groups the LLM’s data so it can be read from flash in larger contiguous chunks, speeding up reads).
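The paper's actual implementation is tuned for Apple silicon, but the windowing idea can be sketched in a few lines: keep only a limited window of recently used weight rows in RAM and pull anything else from flash on demand, with each miss costing one contiguous read (the spirit of row-column bundling is to make those reads as large and contiguous as possible). The cache class, file name and matrix sizes below are invented for illustration and are not taken from Apple's code.

```python
import numpy as np
from collections import OrderedDict

# Illustrative only: one layer's weights stored on "flash" (disk) as a memory-mapped file.
ROWS, COLS = 8192, 4096
flash_weights = np.memmap("layer_weights.bin", dtype=np.float16,
                          mode="w+", shape=(ROWS, COLS))

class WindowedRowCache:
    """Keep the most recently used weight rows in RAM (the 'window');
    rows that fall out of the window are re-read from flash when needed."""

    def __init__(self, flash: np.memmap, window_rows: int):
        self.flash = flash
        self.window_rows = window_rows
        self.cache: OrderedDict[int, np.ndarray] = OrderedDict()

    def get_rows(self, row_ids: list[int]) -> np.ndarray:
        out = []
        for r in row_ids:
            if r in self.cache:                    # reuse data already sitting in RAM
                self.cache.move_to_end(r)
            else:                                  # one contiguous flash read per missing row
                self.cache[r] = np.array(self.flash[r])
                if len(self.cache) > self.window_rows:
                    self.cache.popitem(last=False) # evict the least recently used row
            out.append(self.cache[r])
        return np.stack(out)

cache = WindowedRowCache(flash_weights, window_rows=1024)
active = cache.get_rows([3, 3, 42, 4095])  # repeated rows hit RAM, new ones hit flash
print(active.shape)  # (4, 4096)
```

The more a model's activity overlaps from one token to the next, the more of these lookups are served from RAM rather than flash, which is how the technique hides the storage's lower bandwidth.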
Of course, we are yet to see an LLM from Apple, although rumors suggest a smarter, LLM-based version of Siri is set to debut as part of iOS 18 and run on-device on the next-generation iPhone 16 Pro models. When it arrives, there is a good chance Apple will use this RAM-extension method to deliver a model with as many parameters as it can effectively run on-device. With Samsung upping its generative AI game for the launch of the Galaxy S24 series next month, 2024 is shaping up as the year generative AI becomes commonplace on smartphones too.
I have been writing about consumer technology for the past ten years, previously with the former MacNN and Electronista, and with Notebookcheck since 2017. My first computer was an Apple ][c, which sparked a passion for Apple and for technology in general. Over the past decade I’ve become increasingly platform agnostic and love to get my hands on and explore as much technology as I can. Whether it is Windows, Mac, iOS, Android, Linux, Nintendo, Xbox, or PlayStation, each has plenty to offer and has given me great joy exploring them all. I was drawn to writing about tech because I love learning about the latest devices and sharing whatever insights my experience can bring to the site and its readership.