
Tuning and Inference Strategies on 4th Generation Intel Xeon Processors for Optimal Results


Part 1 of this blog series introduced Generative AI tuning and inference concepts. Part 2 showed how to take an open-source large language model and tune it for a specific use case on the latest Intel Xeon processors. In this final part 3, we look at leveraging the latest Intel Xeon processors for Generative AI inference with a real-world use case.

Inference with Xeon for Generative AI:

Intel Xeon offers a cost-effective, scalable, and versatile solution for LLM inference. It democratizes access to powerful generative models and unlocks their potential across a wide range of end-user applications and industries.

Why CPUs for Inference?

While GPUs have long dominated AI training due to their parallel processing prowess, CPUs offer distinct advantages for inference:

  • Cost-effectiveness: CPUs are generally more affordable and readily available than high-end GPUs, making them accessible to a wider range of developers and researchers.
  • Scalability: CPU-based systems are easily scalable, allowing you to adapt your infrastructure to handle growing model sizes and computational demands.
  • Versatility: CPUs excel at diverse tasks beyond just AI, making them valuable for general-purpose computing alongside inference workloads.
  • New Instructions: Intel advancements such as Advanced Matrix Extensions (AMX) and DL Boost provide hardware-accelerated support for specific AI operations, significantly boosting CPU performance for inference; a quick way to check for these instruction sets is sketched below.
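
Before relying on these accelerators, it is worth confirming that the instance actually exposes them. The snippet below is a minimal sketch for Linux hosts that inspects /proc/cpuinfo for the relevant feature flags (the flag names shown, such as amx_tile, amx_bf16, amx_int8, and avx512_vnni, are the ones recent Linux kernels report for 4th Gen Xeon; what is visible can vary by kernel and virtualization layer):

```python
# Minimal sketch: check /proc/cpuinfo for AI-related instruction-set flags on Linux.
# Flag availability can vary by kernel version and VM/hypervisor configuration.
FEATURES = ["amx_tile", "amx_bf16", "amx_int8", "avx512_vnni"]

with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in FEATURES:
    print(f"{feature}: {'available' if feature in flags else 'not reported'}")
```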

Optimizing for CPUs:

Unlocking the full potential of CPUs for generative AI inference requires careful optimization:

  • Model Quantization: Reducing model precision from 32-bit floating point to 8-bit integer can significantly shrink model size and accelerate inference, typically with minimal loss in accuracy. A sketch of this idea follows this list.
  • Knowledge Distillation: Transferring knowledge from a larger, pre-trained model to a smaller CPU-friendly model can preserve much of its performance while reducing resource requirements.
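
As an illustration of the quantization idea, the sketch below applies PyTorch dynamic INT8 quantization to the linear layers of a toy model. This is a generic example rather than the exact recipe used in this exercise; production LLM quantization on Xeon typically goes through tooling such as Intel Extension for PyTorch or Intel Neural Compressor.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; real LLMs contain many such Linear layers.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)
model.eval()

# Dynamic quantization: weights are stored as INT8 and activations are quantized
# on the fly at inference time. Only nn.Linear modules are converted here.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.no_grad():
    out = quantized(x)

# The quantized layers hold INT8 weights, roughly a quarter of the FP32 footprint.
print(quantized[0])  # DynamicQuantizedLinear(in_features=4096, out_features=4096, ...)
```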

Real-World Applications:

The power of CPUs for generative AI is already being harnessed in various fields:

  • Drug Discovery: Researchers are using CPU-powered systems to generate novel drug candidates, accelerating the search for life-saving treatments.
  • Materials Science: CPUs are being used to design new materials with desired properties, leading to breakthroughs in fields like energy and aerospace.
  • Creative Content Generation: Artists and writers are exploring the potential of CPUs to generate original content, from poems and stories to music and paintings.

Inference for Falcon-7B with Amazon EC2 c7i instances:

The compute requirements for inference with the Falcon-7B model were analyzed through a sizing exercise. The metric used was the latency of the inference run, with a goal of less than 25 seconds for a chat response.
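
A straightforward way to evaluate a deployment against such a latency target is to time each end-to-end generation call. The helper below is a generic sketch; the function name timed_generate and the generate_fn callback are illustrative and not part of the original test harness.

```python
import time

SLA_SECONDS = 25.0  # target chat-response latency from the sizing exercise

def timed_generate(generate_fn, prompt):
    """Run one chat-style generation and report its latency against the SLA.

    generate_fn is any callable that takes a prompt string and returns the
    model's response, e.g. a thin wrapper around model.generate().
    """
    start = time.perf_counter()
    response = generate_fn(prompt)
    latency = time.perf_counter() - start
    status = "met" if latency < SLA_SECONDS else "missed"
    print(f"latency: {latency:.1f}s (SLA {status})")
    return response, latency
```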

 

| Category | Attribute | c7i |
| --- | --- | --- |
| Run Info | Benchmark | Inference of the Falcon 7-billion-parameter model with Hugging Face Accelerate, PyTorch 2.0.1, and Intel Extension for PyTorch 2.0.100 |
| | Date | Nov 10-24, 2023 |
| | Test by | Intel |
| CSP and VM Config | Cloud | AWS |
| | Region | us-east-1 |
| | Instance Type | c7i.8xlarge |
| | CPU(s) | 16 cores |
| | Microarchitecture | AWS Nitro |
| | Instance Cost | 1.428 USD per hour |
| | Number of Instances or VMs (if cluster) | |
| | Iterations and result choice (median, average, min, max) | |
| Memory | Memory | 64 GB |
| | DIMM Config | |
| | Memory Capacity / Instance | |
| Network Info | Network BW / Instance | 12.5 Gbps |
| | NIC Summary | |
| Storage Info | Storage: NW or Direct Att / Instance | SSD GP2 |
| | Drive Summary | 1 volume, 70 GB |

 

Table 4: Compute Infrastructure for Falcon-7B Inference

The model that was tuned in the earlier phase was deployed with the compute infrastructure shown in Table 4.  The software components used in the inference are shown in Table 5.

 

| Category | Attribute | c7i |
| --- | --- | --- |
| Run Info | Benchmark | Inference using the fine-tuned Falcon-7B model with Hugging Face Accelerate, PyTorch 2.0.1, and Intel Extension for PyTorch 2.0.100 |
| | Dates | Nov 10-24, 2023 |
| | Test by | Intel |
| Software | Workload | Generative AI Fine Tuning |
| Workload Specific Details | Command Line | python vmw_peft-tuned-inference.py --checkpoints /mnt/data/llm/aws_best_model_dist1_aws/checkpoint-1100/ --max_length 200 --top_k 10 |

 

Table 5: Workload details for inference
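
The vmw_peft-tuned-inference.py script referenced above is not listed in the post. As a rough illustration only, the sketch below shows what an inference script along these lines could look like, assuming the checkpoint directory contains a PEFT (LoRA) adapter trained on top of the tiiuae/falcon-7b base model; the script actually used in this exercise may differ.

```python
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import intel_extension_for_pytorch as ipex

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoints", required=True, help="PEFT adapter checkpoint dir")
parser.add_argument("--max_length", type=int, default=200)
parser.add_argument("--top_k", type=int, default=10)
args = parser.parse_args()

base = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Attach the fine-tuned PEFT (LoRA) adapter produced in the tuning phase.
model = PeftModel.from_pretrained(model, args.checkpoints)
model.eval()

# Apply Intel Extension for PyTorch optimizations for Xeon (bfloat16 path).
model = ipex.optimize(model, dtype=torch.bfloat16)

# Illustrative prompt; a real chatbot would feed user queries here.
prompt = "What is the best way to get started with generative AI?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=args.max_length,
        do_sample=True,
        top_k=args.top_k,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```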

Inference Results:

The effectiveness of the Falcon-7B model for inference was tested before and after tuning. With the untuned Falcon-7B model, the chatbot produced short answers to user queries with very little detail, as shown in Figure 5.

 


Figure 5: Falcon-7B chatbot responses pre-tuning

The tuned model was then deployed, and the chatbot was tested with questions similar to those asked before.

 


Figure 6: Falcon-7B chatbot responses post tuning

The responses from the tuned Falcon-7B chatbot show a more comprehensive understanding of the topics queried compared to the untuned model. All inference queries met the application's response-time SLA.

Conclusion:

Intel and AWS customers can use Xeon processors to tune small to medium-sized LLMs for their specific use cases. The Falcon-7B large language model was tuned on 4th Generation Intel Xeon based Amazon EC2 C7i instances in a few hours, which is a reasonable time for tuning. This shows that enterprise customers can effectively leverage open-source LLMs like Falcon-7B and tune them for their domain-specific use cases on Intel Xeon based cloud infrastructure.

Many Generative AI applications are deployed at the edge, where the amount of available compute is limited. A reasonably sized instance that can be made available at the edge, such as the c7i.8xlarge with 16 cores and 64 GB RAM, running the latest Xeon hardware along with software optimizations, was able to meet the inference SLA for an LLM like Falcon-7B.

Through this blog series we have demonstrated the efficacy of Intel Xeon based cloud instances for tuning and inference of publicly available LLMs such as Falcon-7B.



