Deploy and Run LLMs at the Edge

Learn how LLMs can be used at the edge to generate insights for real-world use cases today.

Generative AI and large language models (LLMs) are in the spotlight today, with applications like ChatGPT capturing the attention of tech aficionados and novices alike. But many who have been following the wave of AI innovation over the course of the last decade have their focus trained on a new frontier: AI at the edge. Gartner research estimates that by 2025, 75% of enterprise data will be generated at the edge. To uncover the insights hidden in troves of data being collected at the edge, organizations need AI and ML at the edge, and this includes Generative AI and LLMs.

In this blog, we’ll explore different definitions of “the edge,” and understand the factors driving AI/ML to the edge. We’ll examine why the trends of LLMs and edge computing are intersecting now, and how teams can take advantage of their combined power today. We’ll also demonstrate how LLMs can be used in an edge environment to generate insights for a real-world use case today.

Defining the Edge

Edge devices are more than single board computers (SBCs) – in fact, the edge is a heterogeneous concept. A common misconception is that the edge is just Raspberry Pis, tablets, cellphones, or wearables. We think of the edge as a spectrum, with different categories of edge devices that have correspondingly different power and network dependencies. The far edge is characterized by little network connectivity and low power consumption, whereas the cloud is hungry for both.

  • Cloud Edge – offers computing capabilities that you would find in a cloud service provider, e.g., MEC.
  • Compute Edge – functions as a localized micro-data center that includes a limited range of the resources and services you would find in the cloud, e.g., an edge server racked or placed near other devices or sensors.
  • Device Edge – much smaller compute and processing capabilities, e.g., NVIDIA Jetson modules, Raspberry Pi, Intel NUC.  
  • Sensor Edge – comprises IoT sensors and devices that gather data, e.g., a camera, and interact directly with the cloud, compute, or device edge.
  • Far Edge – e.g., a microprocessor on board a robotic arm.  
The Cloud to Edge Spectrum

Why Run LLMs at the Edge?


One of the main trends driving LLMs to the edge is that teams need data processed at the point of collection rather than transferred back and forth between the cloud, data center, and edge nodes for analysis. First, bandwidth may be limited or network access unavailable, especially in remote environments or in systems cut off from outside networks; running LLMs at the edge reduces this dependence on connectivity and the impact of network outages. In some scenarios you also need real-time, low-latency results, and processing data at the edge avoids the round trip. Similarly, it can be incredibly costly to move data between an edge location and the cloud, so processing at the point of collection can yield significant cost savings.

According to the Tirias Research GenAI Forecast and TCO Model, if 20% of GenAI processing workloads could be offloaded from data centers by 2028 using on-device and hybrid processing, the cost of data center infrastructure and operations for GenAI processing would decline by $15 billion.

Finally, some enterprise organizations are so concerned about the privacy and security risks of leaking internal data through cloud-hosted LLMs that they prefer to deploy and run their LLMs locally.

Considerations for Running LLMs at the Edge

When optimizing and preparing LLMs for production deployment, there are a few considerations for deploying LLMs anywhere, but especially at the edge. First, you’ll likely need to reduce model size with quantization or pruning. You can also accelerate model inference using TensorRT, OpenVINO, or ONNX Runtime, and you should take advantage of accelerated hardware where possible. Other techniques to consider include knowledge distillation and model specialization. Next, your device configuration will also shape how you approach model optimization and how you distribute workloads between the cloud and the edge.
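As a concrete illustration of quantization, the snippet below is a minimal sketch of loading a Code Llama checkpoint in 4-bit precision using Hugging Face transformers and bitsandbytes. The model ID, quantization settings, and prompt are assumptions for illustration, not the configuration used in the demo described later.

```python
# Minimal sketch: load an LLM with 4-bit quantization for resource-constrained hardware.
# Assumes a CUDA-capable device and the transformers + bitsandbytes packages;
# the model ID and parameters below are illustrative, not prescriptive.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-7b-Instruct-hf"  # assumed model for illustration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NF4 quantization format
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16 to preserve quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on the available GPU/CPU
)

prompt = "Write a SQL query that returns the daily average well pressure."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```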

Some factors to consider when allocating resources (a simple capability check, like the sketch after this list, can help answer these):

  • Is there a GPU onboard to accelerate inference?
  • How much RAM is available? How much vRAM?
  • What CPU count do you have at your disposal?
  • Will you be running other software on the device or collection of devices?
  • How will you handle device(s) security?
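
The snippet below is a minimal sketch of such a capability probe you might run on a candidate device before deployment; it assumes psutil is installed and treats PyTorch as optional for GPU detection. The thresholds you act on would depend on your model and workload.

```python
# Minimal sketch: probe a device's resources before deciding where to run inference.
# Assumes psutil is installed; PyTorch is optional and only used to detect a GPU.
import os
import psutil

ram_gb = psutil.virtual_memory().total / 1e9
cpu_count = os.cpu_count()
print(f"CPU cores: {cpu_count}, system RAM: {ram_gb:.1f} GB")

try:
    import torch
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, vRAM: {props.total_memory / 1e9:.1f} GB")
    else:
        print("No CUDA GPU detected; plan for CPU-only inference.")
except ImportError:
    print("PyTorch not installed; GPU detection skipped.")
```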

When deciding how to distribute your workloads between the cloud and the edge, there are two stages to consider: training workloads and inference workloads. Within each, the processing can be shared in two ways: on-device or across a network of distributed edge nodes. Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing, from IEEE, breaks down six different paradigms you can use to distribute workloads between the cloud and the edge to suit your use case.

Distributing workloads between cloud and edge


Consider a voice assistant used in a vehicle, which is representative of Level 4. In this case, the model was likely first trained in the cloud, then deployed out to a fleet of vehicles. When the model needs fine-tuning and maintenance, the data will most likely be offloaded from the device back to the cloud for retraining, with the best versions then distributed to vehicles in the fleet. On the other hand, any kind of at-home security or smart system that collects private and sensitive data will fall under Level 3, where fine-tuning must be done on-device and none of the data is sent back to the cloud. The demo below shows a Level 2 example, where the model is trained in the cloud but inference is carried out at the network edge.

LLMs at the Edge: Using Code Llama at a Remote Oil Field

So how could LLMs be used in edge locations today? Consider a geologist working in a remote oil field who is responsible for building and analyzing 3D models of oil fields to determine production capacity and the impact on profitability. In this demo, we walk through how Code Llama, Chassisml.io, and Modzy could be used to build a dashboard that geologists could use to analyze well data in real time in a remote, network-restricted environment, allowing for LLM insights generated at the edge.
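While the full demo relies on Chassisml.io and Modzy for packaging and serving the model, the snippet below is a hedged sketch of the core inference step: running a quantized Code Llama model locally with llama-cpp-python on a network-restricted edge device. The GGUF file path, context size, and prompt are assumptions for illustration.

```python
# Minimal sketch: local Code Llama inference on an edge device with llama-cpp-python.
# Assumes a quantized GGUF model file is already on the device; the path, context
# size, and prompt below are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-7b-instruct.Q4_K_M.gguf",  # assumed local quantized model
    n_ctx=2048,         # context window sized for short analytical prompts
    n_gpu_layers=-1,    # offload all layers to a GPU if present, otherwise run on CPU
)

prompt = (
    "Given hourly pressure and flow-rate readings from a well sensor, "
    "write a Python function that flags readings more than 3 standard "
    "deviations from the trailing 24-hour mean."
)

result = llm(prompt, max_tokens=256, temperature=0.2)
print(result["choices"][0]["text"])
```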

About Modzy

Modzy is a leading provider of enterprise and edge AI software for industrial, commercial, and public sector applications. Modzy’s software platform brings the power of advanced analytics and machine learning to the edge, enabling a new class of AI-enabled solutions for monitoring and diagnostics, predictive maintenance, and safety and security use cases. Modzy’s software is ideally suited for OEMs, systems integrators, and end customers in manufacturing, pharmaceuticals, telecommunications, energy and utilities, infrastructure, retail, as well as smart cities and buildings.