Part 1 of this blog series introduced Generative AI tuning and inference concepts. In part 2, we took an open-source large language model and tuned it for a specific use case on the latest Intel Xeon processors. In this final part 3, we look at using the latest Intel Xeon processors for Generative AI inference with a real-world use case.
Inference with Xeon for Generative AI:
Intel Xeon processors offer a cost-effective, scalable, and versatile solution for LLM inference, democratizing access to powerful generative models and unlocking their potential across end-user applications and industries.
Why CPUs for Inference?
While GPUs have long dominated AI training due to their parallel processing prowess, CPUs offer distinct advantages for inference:
- Cost-effectiveness: CPUs are generally more affordable and readily available than high-end GPUs, making them accessible to a wider range of developers and researchers.
- Scalability: CPU-based systems are easily scalable, allowing you to adapt your infrastructure to handle growing model sizes and computational demands.
- Versatility: CPUs excel at diverse tasks beyond just AI, making them valuable for general-purpose computing alongside inference workloads.
- New Instructions: Intel’s advancements like AMX and DL Boost provide hardware-accelerated support for specific AI operations, significantly boosting CPU performance for inference (a minimal bfloat16 sketch follows this list).
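To illustrate the last point, below is a minimal sketch, assuming Intel Extension for PyTorch (IPEX) and Hugging Face transformers are installed, of how bfloat16 inference can be enabled so that AMX is used on 4th Gen Xeon. The model ID and prompt are placeholders, not the exact configuration used in this blog.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # example model; substitute your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies CPU-specific kernel and graph optimizations; with
# dtype=torch.bfloat16, eligible operators can use AMX on 4th Gen Xeon.
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "Explain the benefits of CPU inference."  # placeholder query
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```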
Optimizing for CPUs:
Unlocking the full potential of CPUs for generative AI inference requires careful optimization:
- Model Quantization: Reducing model precision from 32-bit to 8-bit can significantly shrink model size and accelerate inference with minimal impact on accuracy (see the quantization sketch after this list).
- Knowledge Distillation: Transferring knowledge from a larger, pre-trained model to a smaller CPU-compatible model can maintain performance while reducing resource requirements.
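As a concrete illustration of the quantization point, here is a minimal sketch using stock PyTorch post-training dynamic quantization. A small placeholder model is used to keep the example light; for Falcon-7B in practice, a dedicated tool such as Intel Neural Compressor would typically be used.

```python
import torch
from transformers import AutoModelForCausalLM

# Load an FP32 causal LM (small placeholder model, not the blog's Falcon-7B).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

# Post-training dynamic quantization: nn.Linear weights become INT8, and
# activations are quantized on the fly at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is smaller on disk and typically faster on CPU.
torch.save(quantized_model.state_dict(), "opt125m_int8_dynamic.pt")
```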
Real-World Applications:
The power of CPUs for generative AI is already being harnessed in various fields:
- Drug Discovery: Researchers are using CPU-powered systems to generate novel drug candidates, accelerating the search for life-saving treatments.
- Materials Science: CPUs are being used to design new materials with desired properties, leading to breakthroughs in fields like energy and aerospace.
- Creative Content Generation: Artists and writers are exploring the potential of CPUs to generate original content, from poems and stories to music and paintings.
Inference for Falcon 7B with Amazon EC2 c7i instances:
The compute requirements for Falcon-7B inference were determined through a sizing exercise. The metric used was the end-to-end latency of the inference run, with a goal of less than 25 seconds per chat response.
| Category | Attribute | c7i |
|---|---|---|
| Run Info | Benchmark | Inference Falcon 7-Billion Parameter Model |
| | Date | Nov 10-24, 2023 |
| | Test by | Intel |
| CSP and VM Config | Cloud | AWS |
| | Region | us-east-1 |
| | Instance Type | c7i.8xlarge |
| | CPU(s) | 16 cores |
| | Microarchitecture | AWS Nitro |
| | Instance Cost | 1.428 USD per hour |
| | Number of Instances or VMs (if cluster) | 1 |
| | Iterations and result choice (median, average, min, max) | |
| Memory | Memory | 64 GB |
| | DIMM Config | |
| | Memory Capacity / Instance | |
| Network Info | Network BW / Instance | 12.5 Gbps |
| | NIC Summary | |
| Storage Info | Storage: NW or Direct Att / Instance | SSD GP2 |
| | Drive Summary | 1 volume, 70 GB |
Table 4: Compute Infrastructure for Falcon-7B Inference
The model that was tuned in the earlier phase was deployed with the compute infrastructure shown in Table 4. The software components used in the inference are shown in Table 5.
| Category | Attribute | c7i |
|---|---|---|
| Run Info | Benchmark | Inference using fine-tuned Falcon 7-B Model |
| | Dates | Nov 10-24, 2023 |
| | Test by | Intel |
| Software | Workload | Generative AI Inference |
| Workload Specific Details | Command Line | `# Inference using fine-tuned Falcon 7B model: python vmw_peft-tuned-inference.py --checkpoints /mnt/data/llm/aws_best_model_dist1_aws/checkpoint-1100/ --max_length 200 --top_k 10` |
Table 5: Workload details for inference
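For illustration, the following is a hypothetical sketch of what a script like vmw_peft-tuned-inference.py might do: load the base Falcon-7B model, apply the PEFT (LoRA) adapter from the fine-tuning checkpoint, and generate a response. The argument names mirror the command line in Table 5, but the code is an assumption, not the actual script used in the benchmark.

```python
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoints", required=True, help="PEFT adapter checkpoint directory")
parser.add_argument("--max_length", type=int, default=200)
parser.add_argument("--top_k", type=int, default=10)
args = parser.parse_args()

base_id = "tiiuae/falcon-7b"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Apply the tuned LoRA weights on top of the frozen base model.
model = PeftModel.from_pretrained(base, args.checkpoints)
model.eval()

prompt = "Explain the benefits of CPU inference."  # placeholder query
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=args.max_length,
        do_sample=True,
        top_k=args.top_k,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```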
Inference Results:
The effectiveness of the Falcon-7B model for inference was tested before and after tuning. With the untuned Falcon-7B model, the chatbot produced short answers to user queries with very little detail, as shown in Figure 5.
Figure 5: Falcon-7B chatbot responses pre-tuning
The tuned model was then deployed, and the chatbot was tested with questions similar to those asked before.
Figure 6: Falcon-7B chatbot responses post tuning
The responses from the tuned Falcon-7B chatbot show a more comprehensive understanding of the topics queried than those of the untuned model. All inference queries met the application's response-time SLA.
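As a simple illustration of how the response-time SLA can be checked, the sketch below times a single generate call. The 25-second threshold comes from the sizing goal stated earlier; the model and tokenizer are assumed to be loaded as in the previous sketches.

```python
import time

SLA_SECONDS = 25.0  # per-response latency goal from the sizing exercise

def timed_generate(model, tokenizer, prompt, **gen_kwargs):
    # Measure wall-clock latency of a single end-to-end generation.
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    output = model.generate(**inputs, **gen_kwargs)
    latency = time.perf_counter() - start
    return tokenizer.decode(output[0], skip_special_tokens=True), latency

reply, latency = timed_generate(model, tokenizer, "Summarize this topic.", max_new_tokens=200)
print(f"latency={latency:.1f}s, within SLA: {latency < SLA_SECONDS}")
```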
Conclusion:
Intel and AWS customers can use Xeon processors to tune small to medium-sized LLMs for their specific use cases. The Falcon-7B large language model was tuned on 4th Generation Intel Xeon based Amazon EC2 C7i instances in a few hours, an acceptable time for tuning. This shows that enterprise customers can effectively take open-source LLMs like Falcon-7B and tune them for domain-specific use cases on Intel Xeon based cloud infrastructure.
Many of these Generative AI applications are deployed at the edge, where available compute is limited. A reasonably sized instance that can be made available at the edge, such as the c7i.8xlarge with 16 cores and 64 GB of RAM on the latest Xeon hardware, combined with software optimizations, was able to meet the inference SLA for an LLM like Falcon-7B.
Through this blog series, we have demonstrated that Intel Xeon based cloud instances can be used effectively for tuning and inference of publicly available LLMs such as Falcon-7B.
References:
[1] https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors-product-brief.html : With the most built-in accelerators of any CPU on the market, Intel® Xeon® Scalable processors offer the most choice and flexibility in cloud selection with smooth application portability.
[2] https://aws.amazon.com/ec2/instance-types/c7i/ : Amazon Elastic Compute Cloud (Amazon EC2) C7i instances are next-generation compute-optimized instances powered by custom 4th Generation Intel Xeon Scalable processors (code named Sapphire Rapids) and feature a 2:1 ratio of memory to vCPU. EC2 instances powered by these custom processors, available only on AWS, offer the best performance among comparable Intel processors in the cloud, up to 15% better performance than Intel processors utilized by other cloud providers.
[3] https://huggingface.co/tiiuae/falcon-7b : Falcon-7B is a 7B-parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.
[4] https://huggingface.co/datasets/timdettmers/openassistant-guanaco : The Guanaco dataset is a subset of the Open Assistant dataset. This subset contains only the highest-rated paths in the conversation tree, with a total of 9,846 samples.
[5] https://www.intel.com/content/www/us/en/developer/articles/technical/fine-tune-falcon-llm-with-hugging-face-oneapi.html : Fine-tuning Falcon-7B with Hugging Face and Intel oneAPI.
[6] https://www.youtube.com/watch?v=JNMVulH7fCo : Video showing the techniques used.