3. Convert the model to GGUF FP16 format:
4. Quantize the model to 4 bits (using the Q4_K_M method):
5. Run the quantized model (example commands for these three steps are sketched below):
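A minimal sketch of these three steps with llama.cpp; script and binary names vary between llama.cpp releases, and the model paths here are placeholders:

```bash
# 3. Convert the Hugging Face model directory to GGUF FP16
#    (the converter is convert_hf_to_gguf.py in recent llama.cpp releases, convert.py in older ones)
python convert_hf_to_gguf.py ./models/my-model --outtype f16 --outfile ./models/my-model-f16.gguf

# 4. Quantize the FP16 GGUF file to 4 bits with the Q4_K_M method
./llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q4_K_M.gguf Q4_K_M

# 5. Run the quantized model interactively
./llama-cli -m ./models/my-model-Q4_K_M.gguf -p "What is quantization?" -n 128
```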
Apart from some loss of quality, quantization can also affect inference speed: depending on the quantization method and the hardware it runs on, a quantized model may be slower than expected, so benchmark before settling on a format.
3. Start the server:
When “model loaded” appears at the end of the terminal output, you can POST requests to the server with curl from another terminal window (see the sketch below).
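As a minimal sketch, assuming the llama.cpp HTTP server built from the same repository and its default port (binary name and endpoint may differ for other servers):

```bash
# Start the server with the quantized model (the binary is named ./server in older llama.cpp builds)
./llama-server -m ./models/my-model-Q4_K_M.gguf --port 8080

# From another terminal, POST a completion request
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is quantization?", "n_predict": 64}'
```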
2. Run the command in a terminal:
4. To customize the prompt, first pull the model:
5. Create the model (a sketch of the pull/create/run flow follows this list):
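A sketch of the Ollama flow, using llama2 and a custom model name purely as examples (any model from the Ollama library works the same way):

```bash
# Pull the base model
ollama pull llama2

# Customize the prompt in a Modelfile
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant that answers questions about LLM deployment."
EOF

# Create the customized model and run it
ollama create my-assistant -f ./Modelfile
ollama run my-assistant
```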
GPT4All has a server mode. Check the official documentation for more info.
GPT4All server mode in Settings
Major cloud providers
AWS
Amazon SageMaker — A comprehensive platform for machine learning and managing the entire life-cycle of LLMs. It supports custom model development, deployment and scaling with access to pre-trained models.
AWS Lambda — Serverless compute for workloads that can scale down to zero, or use Step Functions state machines to orchestrate event-driven pipelines.
AWS Elastic Kubernetes Service (EKS) — A managed Kubernetes service to orchestrate containers for the microservices of your LLM-based applications.
Azure
Azure Machine Learning — Offers tools for deploying LLMs such as model management, endpoints, batch scoring, and managed inference, letting you set up scalable, managed infrastructure for real-time or batch inference.
Azure Kubernetes Service (AKS) — A managed Kubernetes service from Microsoft that runs your LLM in containerized form for you.
Azure Functions — Serverless functions to deploy your LLM for lightweight, event-driven interactions.
Google Cloud
Vertex AI — Provides purpose-built MLOps tools for data scientists and ML engineers to automate, standardize, and manage ML projects. Some of the functionalities include model management, managed inference, batch inference, and custom containers.
Cloud Run and Cloud Functions — Serverless platforms to deploy LLMs as lightweight, event-driven applications, ideal for smaller models or microservices.
Note: all of them provide NVIDIA GPUs at on-demand prices. Your new or personal account may not qualify for high-grade GPUs (thanks to bitcoin miners)!
Deploy LLMs from HuggingFace on a SageMaker endpoint
The easiest way to quickly deploy a HuggingFace model
If you have a new model you want to test quickly, go to its model card page on HuggingFace:
Click on the “Deploy” button
Typically SageMaker will be listed as an option; click on it
Copy the boilerplate code and paste it into either SageMaker Studio or your notebook
Deploy HuggingFace model on SageMaker endpoint
Now you must change a few of the variables (a sketch of the adjusted code follows this list), like:
Add your AWS account's SageMaker execution role. If you are running it from SageMaker Studio, then sagemaker.get_execution_role() will suffice.
Adjust some of the model configurations (which will be part of the endpoint configuration), for example the instance type, number of GPUs (if supported), and instance count.
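A sketch of what the adjusted boilerplate typically looks like; the model ID, container versions, and instance type below are illustrative, so use the values generated on the model card:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # or paste your execution role ARN

hub = {
    "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # example model from the Hub
    "HF_TASK": "text-generation",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.37",   # pick a version combination supported by the
    pytorch_version="2.1",         # HuggingFace Deep Learning Containers
    py_version="py310",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict({"inputs": "What is quantization?"}))
```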
This will deploy a dedicated endpoint in your SageMaker domain.
To invoke any SageMaker endpoint you will need an environment with boto3 installed (except when using database triggers like the AWS Aurora PostgreSQL-to-SageMaker integration).
Invoke SageMaker endpoint
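A minimal boto3 sketch; the endpoint name is hypothetical and the payload schema depends on the container serving your model:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-llm-endpoint",        # replace with your endpoint name
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "What is quantization?",
        "parameters": {"max_new_tokens": 64},
    }),
)
print(json.loads(response["Body"].read()))
```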
SageMaker JumpStart
SageMaker JumpStart model deployment code
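A sketch of deploying a JumpStart model with the SageMaker Python SDK; the model ID and instance type are examples, so browse the JumpStart catalog for current IDs:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Example JumpStart model id; gated models additionally require accept_eula=True on deploy()
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict({"inputs": "What is SageMaker JumpStart?"}))
```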
SageMaker deployment of LLMs that you have pretrained or fine-tuned
To deploy custom LLMs, first check whether the libraries the model uses are available in the built-in framework containers, and use script mode to pass your custom scripts. If your LLM uses custom packages, then use Bring Your Own Container (BYOC) mode with the SageMaker inference toolkit to create a custom inference container.
Check out more SageMaker examples: https://github.com/aws/amazon-sagemaker-examples/tree/main
Benefits of using containers
In the world of Service-Oriented Architecture (SOA), containers are a blessing. Orchestrating a large number of containers is a challenge, but a containerized service has numerous benefits compared with an app running on virtual machines.
Large Language Models have higher memory requirements than a classical web service, so we have to understand these requirements before containerizing LLMs or LLM-based endpoints. Barring a small number of cases, such as when your generative model fits perfectly on one server and only one server is ever needed, containerizing your LLM is advisable for production use cases.
Scalability and infrastructure optimization — fine-grained dynamic and elastic provisioning of resources (CPU, GPU, memory, persistent volumes), dynamic scaling and maximized component/resource density to make best use of infrastructure resources.
Operational consistency and component portability — automation of build and deployment, reducing the range of skillsets required to operate many different environments. Images can be built and run on any container platform, giving portability across nodes, environments, and clouds, and letting you rely on open containerization standards such as Docker and Kubernetes.
Service resiliency — rapid restart, the ability to implement clean reinstatement, safe independent deployment that removes the risk of destabilizing existing components, and fine-grained roll-out using rolling upgrades, canary releases, and A/B testing.
GPU and containers
You can use your dedicated GPU or cloud GPU with containers. If you have a laptop then check the GPU memory and model size before containerizing your app.
Containers, whether run with Docker or with a different runtime like containerd or CRI-O, use the NVIDIA Container Toolkit, which installs the NVIDIA Container Runtime (nvidia-container-runtime) on the host machine.
Image source: https://developer.nvidia.com/blog/nvidia-docker-gpu-server-application-deployment-made-easy/
For containerd runtime, NVIDIA Container Runtime is configured as an OCI-compliant runtime and uses NVIDIA CUDA, NVML drivers at the lowest level via NVIDIA Container Runtime Hook (`nvidia-container-runtime-hook`), with the flow through the various components as shown in the following diagram:
Source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/arch-overview.html
After installing the NVIDIA Container Toolkit, you can run a sample container to test the NVIDIA GPU driver. Official documentation to run a sample workload: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
My system is an Alienware 15 (2014) with a discrete GPU (NVIDIA GeForce 960M with 3GB GDDR5 memory) and 8GB of DDR3L 1600 MHz RAM. After running the sample container, I got the following response:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
smi — System Management Interface
To use custom containers with a GPU at runtime, you pass the --gpus parameter to the docker run command. For example:
docker run --gpus all tensorflow/tensorflow:latest-gpu
To utilise GPU during build time, you have two options:
1. Modify the daemon.json file inside the /etc/docker directory and change the default runtime to nvidia.
{ "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
2. Or, use one of the nvidia/cuda images. Following is a sample Dockerfile that I have taken from here; it uses nvidia/cuda:11.4.0-base-ubuntu20.04 as the base image to check PyTorch GPU support from inside the container.
```dockerfile
FROM nvidia/cuda:11.4.0-base-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive

# Install python
RUN apt-get update && \
    apt-get install -y \
    git \
    python3-pip \
    python3-dev \
    python3-opencv \
    libglib2.0-0

# Install PyTorch and torchvision
RUN pip3 install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu111/torch_stable.html

WORKDIR /app

# COPY necessary files for inference

ENTRYPOINT [ "python3" ]
```
After building the image, verify PyTorch installation:
docker exec -it <Container name> /bin/bash
From inside the container run the commands one-by-one:
```
python3
import torch
torch.cuda.current_device()
torch.cuda.get_device_name(0)  # Change to your desired GPU if your machine has multiple
```
By using one of the nvidia/cuda base images you can execute LLM inference on many platforms, like:
LLM with GPU on Docker container locally
GPU on EC2
GPU on AWS Fargate
GPU on Kubernetes (https://thenewstack.io/install-a-nvidia-gpu-operator-on-rke2-kubernetes-cluster/ )
Another example of a sample Dockerfile with an nvidia/cuda base image, from here:
```dockerfile
FROM --platform=amd64 nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu22.04 as base

ARG MAX_JOBS

WORKDIR /workspace

RUN apt update && \
    apt install -y python3-pip python3-packaging \
    git ninja-build && \
    pip3 install -U pip

# Tweak this list to reduce build time
# https://developer.nvidia.com/cuda-gpus
ENV TORCH_CUDA_ARCH_LIST "7.0;7.2;7.5;8.0;8.6;8.9;9.0"

# We have to manually install Torch otherwise apex & xformers won't build
RUN pip3 install "torch>=2.0.0"

# This build is slow but NVIDIA does not provide binaries. Increase MAX_JOBS as needed.
RUN git clone https://github.com/NVIDIA/apex && \
    cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 && \
    sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d' setup.py && \
    python3 setup.py install --cpp_ext --cuda_ext

RUN pip3 install "xformers==0.0.22" "transformers==4.34.0" "vllm==0.2.0" "fschat[model_worker]==0.2.30"

# COPY YOUR MODEL FILES AND SCRIPTS
# COPY

# SET ENTRYPOINT TO SERVER
# ENTRYPOINT
```
Using Ollama
In an earlier section we saw how to start the Ollama server and execute curl commands via the command line. You can write the Ollama installation and server execution commands in a Dockerfile to use Ollama inside a container.
You can also use the official Ollama Docker image, which is available on Docker Hub. Make sure to install the NVIDIA Container Toolkit to use the GPU (see the sketch below).
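For example, pulling and running the official image with GPU access; the model name is just an example:

```bash
# Start the Ollama server container with GPU access (requires the NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull and run a model inside the running container
docker exec -it ollama ollama run llama2
```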
AWS Inferentia
The ml.inf2 family of instances is designed for deep learning and generative model inference. AWS claims these instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.
AWS Trainium can be used for training generative models, while Inferentia is used for inference.
You can use ml.inf2 instances to deploy SageMaker JumpStart models, or any LLM deployed on a SageMaker endpoint.
```python
from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-2-13b-f"
model = JumpStartModel(
    model_id=model_id,
    env={
        "OPTION_DTYPE": "fp16",
        "OPTION_N_POSITIONS": "4096",
        "OPTION_TENSOR_PARALLEL_DEGREE": "12",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
    },
    instance_type="ml.inf2.24xlarge",
)
pretrained_predictor = model.deploy(accept_eula=True)

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}
response = pretrained_predictor.predict(payload)
```
Apple Neural Engine
Apple's Neural Engine (ANE) is the marketing name for a group of specialized cores functioning as a neural processing unit (NPU) dedicated to the acceleration of artificial intelligence operations and machine learning tasks. Source
Source: Apple 2020
The ANE isn’t the only NPU out there. Besides the Neural Engine, the most famous NPU is Google’s TPU (or Tensor Processing Unit).
Source: https://apple.fandom.com/wiki/Neural_Engine
To do inference with the ANE you will have to install the ane-transformers package from pip (and then pray that it works, because Apple hasn't updated it in the last 2 years).
GitHub repo of Apple's ml-ane-transformers.
Initialize baseline model
```python
import transformers

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
baseline_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name,
    return_dict=False,
    torchscript=True,
).eval()
```
Initialize the mathematically equivalent but optimized model, and restore its parameters from the baseline model:
```python
from ane_transformers.huggingface import distilbert as ane_distilbert

optimized_model = ane_distilbert.DistilBertForSequenceClassification(
    baseline_model.config).eval()
optimized_model.load_state_dict(baseline_model.state_dict())
```
Create sample inputs for the model:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
tokenized = tokenizer(
    ["Sample input text to trace the model"],
    return_tensors="pt",
    max_length=128,  # token sequence length
    padding="max_length",
)

import torch
traced_optimized_model = torch.jit.trace(
    optimized_model,
    (tokenized["input_ids"], tokenized["attention_mask"])
)
```
Use coremltools to generate the Core ML model package file and save it:
```python
import coremltools as ct
import numpy as np

ane_mlpackage_obj = ct.convert(
    traced_optimized_model,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(
            f"input_{name}",
            shape=tensor.shape,
            dtype=np.int32,
        ) for name, tensor in tokenized.items()
    ],
    compute_units=ct.ComputeUnit.ALL,
)
out_path = "HuggingFace_ane_transformers_distilbert_seqLen128_batchSize1.mlpackage"
ane_mlpackage_obj.save(out_path)
```
See the installation and troubleshooting instructions in the official GitHub repo.
Others
https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/flexpod_c480m5l_aiml_design.html
Different types of edge devices
There are different types of edge computing; here we will discuss the Internet of Things (IoT) edge. Some of the common IoT devices include:
Mobile devices
Connected cameras
Retail Kiosks
Sensors
Smart devices like smart parking meters
Cars and other similar products
TensorFlow Lite
For mobile devices with On-Device Machine Learning (ODML) capabilities, or even edge devices like the Raspberry Pi, you can convert your existing LLM to a .tflite (TensorFlow Lite) model and run inference in mobile apps. TensorFlow Lite is a mobile library for deploying models on mobile phones, microcontrollers, and other edge devices.
Conceptual architecture for TensorFlow Lite. Image source: https://github.com/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb
The high level developer workflow for using TensorFlow Lite is: first convert a TensorFlow model to the more compact TensorFlow Lite format using the TensorFlow Lite converter , and then use the TensorFlow Lite interpreter , which is highly optimized for mobile devices, to run the converted model. During the conversion process, you can also leverage several techniques, such as quantization, to further optimize the model and accelerate inference.
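As a small sketch of that workflow, assuming an exported SavedModel directory (the path is a placeholder) and default dynamic-range quantization:

```python
import tensorflow as tf

# Convert an exported TensorFlow SavedModel to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")

# Apply default (dynamic-range) quantization during conversion
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the .tflite file that the TensorFlow Lite interpreter loads on-device
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```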
Image source: https://github.com/tensorflow/codelabs/blob/main/KerasNLP/io2023_workshop.ipynb
Copy this official Google Colab to play with the GPT2CausalLM model and TensorFlow Lite.
See more TensorFlow Lite examples (for iOS, Android, and Raspberry Pi): https://www.tensorflow.org/lite/examples
SageMaker Neo
Amazon SageMaker Neo enables developers to optimize machine learning models for inference on SageMaker in the cloud and supported devices at the edge.
Steps to optimize ML models with SageMaker Neo:
Build and train an ML model using any of the frameworks SageMaker Neo supports.
Or upload an existing model's artefacts to an S3 bucket.
Use SageMaker Neo to create an optimized deployment package for the ML model framework and target hardware, such as EC2 instances and edge devices. This is the only additional task compared to the usual ML deployment process.
Deploy the optimized ML model generated by SageMaker Neo on the target cloud or edge infrastructure (a compilation sketch follows this list).
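A sketch of the compilation step using the SageMaker Python SDK; the framework, input shape, target instance family, and S3 paths are illustrative and depend on your model:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

# Wrap existing model artifacts already uploaded to S3 (paths are placeholders)
pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role=role,
    entry_point="inference.py",
    framework_version="1.13",
    py_version="py39",
)

# Ask SageMaker Neo to compile the model for the target hardware
compiled_model = pytorch_model.compile(
    target_instance_family="ml_c5",
    input_shape={"input_ids": [1, 128], "attention_mask": [1, 128]},
    output_path="s3://my-bucket/neo-compiled/",
    role=role,
    framework="pytorch",
    framework_version="1.13",
)

predictor = compiled_model.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")
```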
Example of model compilation for some of the edge devices using SageMaker Neo: https://github.com/neo-ai/neo-ai-dlr/tree/main/sagemaker-neo-notebooks/edge
Deploy LLM with SageMaker Neo. Source: https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf
ONNX
ONNX is a community project, a format built to represent machine learning models. ONNX defines a common set of operators — the building blocks of machine learning and deep learning models — and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.
If you have a model in one of the ONNX supported frameworks, which include all major ML frameworks, then it can be optimized to maximize performance across hardware using one of the supported accelerators like Mace, NVIDIA, Optimum, Qualcomm, Synopsys, TensorFlow, Windows, Vespa, and more.
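For example, a sketch of exporting a HuggingFace model to ONNX with Optimum and running it on ONNX Runtime; the model ID and task are illustrative:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("ONNX Runtime makes inference portable.", return_tensors="pt")
outputs = ort_model(**inputs)
print(outputs.logits)
```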
Using other tools
If your edge device has an OS kernel and supports containers, then people have successfully run models like Code Llama with llama.cpp for generative model inference.
https://github.com/ggerganov/whisper.cpp
If the edge device has its own developer kit like NVIDIA IGX Orin , then see the official documentation for edge deployment.
Your model pipeline will vary depending on the architecture. For a RAG architecture, you will want to update your vector storage with new knowledge bases or updated articles. Updating the embeddings of only the changed articles is a better choice than re-embedding the whole corpus every time any article is updated, as in the sketch below.
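A minimal sketch of that incremental update, assuming a hypothetical vector_store client with an upsert method and an embed function of your choice:

```python
import hashlib

def sync_articles(articles, stored_hashes, embed, vector_store):
    """Re-embed and upsert only the articles whose content changed.

    articles:      dict of article_id -> article text
    stored_hashes: dict of article_id -> content hash from the last sync
    embed:         callable that returns an embedding vector for a text
    vector_store:  client exposing upsert(id, vector, metadata) (hypothetical interface)
    """
    for article_id, text in articles.items():
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(article_id) == content_hash:
            continue  # unchanged article, skip re-embedding
        vector_store.upsert(article_id, embed(text), {"hash": content_hash})
        stored_hashes[article_id] = content_hash
```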
In a continuous-pretraining architecture, the foundation model is continuously pretrained on new data. To keep the model from degrading due to bad data, you need a robust pipeline with data checks, endpoint drift detection, and rule/model-based evaluations.
In an architecture with a fine-tuned generative model, you can add rule-based checks that are triggered by pre-commit hooks every time developers commit code changes.
We discussed rule based and model based evaluations in Evaluating in CI/CD section of Evaluating LLMs chapter.
Fine-tuning Pipeline
In a classical SageMaker model pipeline we use ScriptProcessor to execute custom scripts that rely on custom libraries. We also use it when we have to install a bunch of packages ourselves in the container and host the container image in our ECR.
SageMaker already has public images with packages installed to support training, processing, and hosting deep learning models. These images include common packages like PyTorch, TensorFlow, transformers, HuggingFace libraries, and many more.
HuggingFace Processor:
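A sketch of a processing step using the HuggingFace processor from the SageMaker SDK; the container versions, instance type, script name, and S3 paths are illustrative:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()

hf_processor = HuggingFaceProcessor(
    role=role,
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    transformers_version="4.28",   # pick a combination supported by the
    pytorch_version="2.0",         # HuggingFace Deep Learning Containers
    py_version="py310",
    base_job_name="llm-data-processing",
)

hf_processor.run(
    code="preprocess.py",   # your custom processing script
    inputs=[ProcessingInput(source="s3://my-bucket/raw",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed")],
)
```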
In a production pipeline your LLM, or any other model, will typically do more than just return predictions. For that, SageMaker provides different toolkits. Remember that we used the inference and training toolkits for our custom images in the classical-model example; there is also the SageMaker HuggingFace Inference Toolkit.
HuggingFace inference toolkit:
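A sketch of a custom inference.py using the handler functions the HuggingFace Inference Toolkit looks for; the pipeline task and parameter handling here are illustrative:

```python
# inference.py
from transformers import pipeline

def model_fn(model_dir):
    # Load the model artifacts that SageMaker unpacked into model_dir
    return pipeline("text-generation", model=model_dir)

def predict_fn(data, model):
    # data is the deserialized request payload
    prompt = data.pop("inputs", "")
    parameters = data.pop("parameters", {})
    return model(prompt, **parameters)
```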
Capturing endpoint statistics and processing them is paramount to check for model degradation, scaling, and continuous improvement of service.
Your DevOps team will typically have a dashboard with all the hot data related to the operational metrics of the infrastructure: metrics like network bandwidth, CPU usage across all nodes, RAM usage, response time, number of nodes up/down, number of containers, and more.
An MLOps dashboard will usually have feature distributions, KL divergence, prediction value distribution, embedding distribution, memory usage, number of endpoints, and other model-related metrics like recall, F1, AUC, RMSE, MAP, BLEU, etc.
For an LLM endpoint, you will have the relevant MLOps metrics plus a few LLM-specific sub-metrics.
Time to first token (TTFT): This is how quickly users start seeing the model’s output after entering their query.
Time per output token (TPOT): Time to generate an output token for each user that is querying the system.
Based on the above metrics:
Latency = TTFT + TPOT × (number of output tokens generated)
For example, with a TTFT of 0.5 seconds, a TPOT of 50 ms, and 200 generated tokens, latency ≈ 0.5 + 0.05 × 200 = 10.5 seconds.
Ways to capture endpoint statistics
For applications where latency is not crucial, you can add the inference output along with endpoint metrics to persistent storage before returning the inference from your endpoint. If you are using serverless infrastructure like AWS Lambda, you can extend the inference Lambda so that it also writes its output to an RDS database, a key-value store like DynamoDB, or object storage like S3, as in the sketch below.
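A sketch of that pattern; run_inference and the table name are hypothetical placeholders for your existing inference call and metrics table:

```python
import json
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("llm-endpoint-metrics")   # hypothetical DynamoDB table

def lambda_handler(event, context):
    payload = json.loads(event["body"])
    start = time.time()
    result = run_inference(payload["inputs"])    # your existing inference call (hypothetical)
    latency_ms = int((time.time() - start) * 1000)

    # Persist the output and basic metrics before returning the response
    table.put_item(Item={
        "request_id": context.aws_request_id,
        "timestamp": int(start),
        "prompt": payload["inputs"],
        "completion": result,
        "latency_ms": latency_ms,
    })
    return {"statusCode": 200, "body": json.dumps({"completion": result, "latency_ms": latency_ms})}
```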
If calculating endpoint metrics within the endpoint code is not feasible, then simply store the raw outputs and process them in batches later on.
For low-latency applications, adding logic that appends the outputs to persistent storage before returning the final predictions is not feasible. In such cases you can log/print the predictions and then process the logs asynchronously. If you are using a log aggregator like Loki, you can calculate the endpoint statistics after the logs are indexed.
To decouple endpoint metric calculation from your main inference logic, you can use a data stream. The inference code logs the outputs, another service indexes the logs and adds them to a data stream, and you then either process the logs in the stream or deliver them to persistent storage and process them in batches.
Apache Kafka or AWS Kinesis Data Streams can be used for the data streams. Apache Flink is my favourite stream-processing tool. If you use AWS and Lambda for inference, you can stream CloudWatch logs to Kinesis in a few clicks. Once you have the logs in Kinesis, you can either use a stream processor like Flink or add a stream consumer to calculate the endpoint metrics.
Cloud provider endpoints
Major cloud providers like Google Cloud, AWS, and Azure provide a pre-defined set of endpoint metrics out of the box. The metrics include latency, model initialisation time, 4XX errors, 5XX errors, invocations per instance, CPU utilisation, memory usage, disk usage, and other general metrics. These are all good operational metrics and are used for activities like auto-scaling your endpoint and health determination (i.e. HA and scalability).
The cloud providers also give an option to store the input and output data of the endpoint to persistent storage like S3.
SageMaker endpoint configuration Data Capture option
Utilise this option if you can process the logs in batches and don't require hot data on metrics. You can also add triggers so that when new data arrives in S3 it is processed immediately to calculate your endpoint metrics. I recommend estimating the cost of this whole pipeline before committing to this approach.
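If you enable data capture from the SDK rather than the console, a sketch looks like the following; the sampling percentage and S3 path are illustrative, and model refers to a SageMaker Model object such as the HuggingFaceModel created earlier:

```python
from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,                       # capture every request/response pair
    destination_s3_uri="s3://my-bucket/endpoint-data-capture/",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    data_capture_config=data_capture_config,
)
```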