As teams integrate ML/AI models into production systems at scale, they’re increasingly running into a new obstacle: high GPU costs. GPUs are used for both model training and production inference, but it’s tough to find savings or efficiencies in training. Training is costly because it’s time-intensive, but fortunately it’s likely not happening every day. This blog focuses on optimizations that can generate cost savings when you use GPUs to run inference in production. The first part provides general recommendations for using GPUs more efficiently, while the second walks through steps you can take to optimize GPU usage with commonly used architectures.

Hardware Recommendations

1) Try using a CPU first

The first recommendation is simple: if you don’t have to use a GPU, don’t. While GPUs are often used for running Deep Neural Networks (DNNs), plenty of other model types run just fine on CPUs. CPUs can also be a great alternative if model latency isn’t a critical metric for you. Some newer CPU architectures, like Intel’s Cascade Lake, can deliver good performance even for DNNs. The main point is that you have other hardware options that might work for your use case.

2) Try building your own GPU server

If you’re running your models regularly or constantly and are getting high utilization, try building your own GPU server. Cloud service providers mark up all of the infrastructure and instances they sell. For example, an AWS g4dn.8xlarge instance is essentially a single NVIDIA T4 GPU on a server with 32 vCPUs and 128 GiB of RAM. Run that instance on AWS 24/7 for a year and your bill will be around $19K. A physical server with comparable specifications costs only about $15K to buy outright, which yields considerable savings, although the purchase price doesn’t cover power, hosting, or maintenance. The other thing to watch out for is the color of money: a server is a Capital Expense (CapEx), whereas cloud hosting is an Operating Expense (OpEx). Once you start scaling and running for years, though, you can generate significant savings for your organization.
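To make the arithmetic concrete, here is a minimal sketch that estimates how many months of 24/7 cloud usage it takes to match the purchase price of a server. It uses only the two figures quoted above. The function name is illustrative, and the calculation deliberately ignores power, hosting, and maintenance, so treat the result as a lower bound rather than a forecast.

```python
def breakeven_months(server_price: float, cloud_cost_per_year: float) -> float:
    """Months of 24/7 cloud usage whose cost equals the server's purchase price.

    Simplifying assumption: the server has no running costs of its own.
    """
    if cloud_cost_per_year <= 0:
        raise ValueError("cloud_cost_per_year must be positive")
    cloud_cost_per_month = cloud_cost_per_year / 12
    return server_price / cloud_cost_per_month


# Figures taken from the example above.
CLOUD_COST_PER_YEAR = 19_000
SERVER_PRICE = 15_000

months = breakeven_months(SERVER_PRICE, CLOUD_COST_PER_YEAR)
print(f"Break-even after about {months:.1f} months of constant use")
```

With these inputs the break-even point lands at roughly nine and a half months, which is why the savings only show up if you really are running close to constantly.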

3) Make sure you’re using GPUs as efficiently as you can

It’s no secret that GPUs do parallel processing really well and can execute multiple functions at the same time. Processing multiple inputs simultaneously in batch through a neural network can yield substantial cost savings if you’re smart about it. To take advantage of this, adjust your batch sizes to process as many inputs through a model at a time as your hardware will allow, instead of just submitting one input at a time. While this tip might require some initial experimentation to get right, you can generate significant cost savings by efficiently running the GPU, then shutting it down once you’re finished.
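As a rough illustration of batching, the sketch below splits a list of inputs into fixed-size batches and sends each batch through a model in a single call. The names `batched`, `run_in_batches`, and `fake_model` are made up for this example, and `fake_model` is only a stand-in for a real GPU forward pass. The right `batch_size` is something you would find by experimenting with your own model and hardware.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(items: Sequence, model_fn: Callable[[Sequence], List],
                   batch_size: int) -> List:
    """Call `model_fn` once per batch instead of once per input."""
    outputs: List = []
    for batch in batched(items, batch_size):
        outputs.extend(model_fn(batch))
    return outputs


def fake_model(batch: Sequence) -> List:
    # Placeholder for a real model: doubles every input in the batch.
    return [x * 2 for x in batch]


results = run_in_batches(list(range(10)), fake_model, batch_size=4)
print(results)
```

Here ten inputs go through the model in three calls instead of ten. On a real GPU, the larger the batch your memory allows, the fewer calls you pay for.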

4) Take advantage of reserved instances

Cloud service providers offer reserved GPU instances, with pricing that depends on the region and the type of instance you purchase. If you pay upfront for a one-year commitment, you can save anywhere from 40-70%, which works out to be a pretty good chunk of savings. This option isn’t necessarily best for R&D and prototyping, where your usage is harder to estimate. But if you know how much infrastructure production requires, or you know you’ll be deploying models or new versions of models over the course of a year, you might consider paying the upfront cost. The major downside is that you’re committed: you pay for these instances whether or not you use them, which can be risky if your needs change.
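One way to reason about that risk is to ask how busy an instance has to be before a reservation beats paying on demand. The sketch below is a simplified model and not any provider’s pricing formula: it assumes you pay for every hour of a reserved instance, that on-demand cost scales directly with the hours you actually run, and it normalizes a full year of on-demand usage to a cost of 1.0. The function names are illustrative.

```python
def breakeven_utilization(discount: float) -> float:
    """Fraction of the year you must run for a reservation to break even.

    `discount` is the fractional saving versus on-demand, e.g. 0.4 for 40%.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def reserved_is_cheaper(utilization: float, discount: float) -> bool:
    """Compare a fully paid reservation against on-demand at a given utilization."""
    reserved_cost = 1.0 - discount   # paid regardless of usage
    on_demand_cost = utilization     # paid only for hours actually used
    return reserved_cost < on_demand_cost


for discount in (0.4, 0.7):
    needed = breakeven_utilization(discount)
    print(f"{discount:.0%} discount: worth it above {needed:.0%} utilization")
```

Under these assumptions, a 40% discount only pays off if the instance is busy more than about 60% of the time, while a 70% discount pays off above roughly 30%.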

5) For cloud users: take advantage of spot instances

Like reserved instances, spot instances can be high risk/high reward, in this case because both their price and their availability vary. If you can architect your platform to take advantage of these instances, you can save up to 90% on the costs associated with running batch jobs, which is no small feat. At the same time, you can’t always guarantee you’ll get the infrastructure you want at the speed that you want, so this is a better option for batch than for real-time processing.
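Architecting for spot instances usually means making batch jobs safe to interrupt. One common approach, sketched below, is to record progress after each item so that a restarted job resumes where it left off instead of starting over. The function name and the checkpoint file layout are assumptions made for this example. A production version would also want atomic writes and checkpoint storage that outlives the instance.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def process_with_checkpoint(items: Sequence, process_fn: Callable,
                            checkpoint_path: str) -> int:
    """Process items in order, resuming from the last recorded position.

    Returns the number of items processed during this run.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for index in range(done, len(items)):
        process_fn(items[index])
        # Record progress only after the item has been fully processed.
        with open(checkpoint_path, "w") as f:
            json.dump({"done": index + 1}, f)
    return len(items) - done


# Demo: a second run over the same data finds nothing left to do.
checkpoint = os.path.join(tempfile.mkdtemp(), "progress.json")
processed = []
first_run = process_with_checkpoint(["a", "b", "c"], processed.append, checkpoint)
second_run = process_with_checkpoint(["a", "b", "c"], processed.append, checkpoint)
```

If the instance disappears partway through, at most the item that was in flight is repeated when the job restarts.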

Architectural Recommendations

Now that we’ve gone over all the different hardware options at your disposal, let’s dive into different architectural options you might consider when running ML inferences in production.

1) Batch Processing with a Single Model

The simplest architecture you could implement involves hosting a model service on a GPU in the cloud. With some Python code and an AWS account, you can quickly turn a model into its own service by wrapping it in FastAPI, a Python web framework for building APIs. From there, you can submit batch data to the model service for processing, and then manually shut down the GPU. To create efficiencies, refer to point three above and pack the model full of data so that GPU utilization is as high as possible. Another important point: you have to remember to shut down the GPU to avoid racking up unnecessary GPU costs.

The next option for batch processing with a single model involves queuing. In this architecture, your data resides somewhere like an S3 bucket or a file server, and you submit batch inference jobs that reference it to a queue. A small watcher service monitors the queue and dynamically spins up a piece of hardware when new jobs arrive. The hardware typically lives with the cloud provider of your choice, which is also where you’d host the model service. Same as before, you’d wrap your model in FastAPI. Once the queue is empty, the watcher service shuts down the GPU automatically, which gets you a scale-to-zero boost. While this setup involves a little more work, planning, and testing, it’s automatic, so you can’t forget to shut down the GPU.
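The watcher logic can be sketched in a few lines. In the example below, `start_gpu`, `stop_gpu`, and `run_inference` are hypothetical callbacks standing in for your cloud provider’s API and your model service, and an in-memory `queue.Queue` stands in for a real job queue. This is an outline of the control flow and not a production service.

```python
import queue
from typing import Any, Callable, List


def drain_queue(jobs: "queue.Queue", start_gpu: Callable[[], None],
                stop_gpu: Callable[[], None],
                run_inference: Callable[[Any], Any]) -> List[Any]:
    """Start the GPU only if there is work, drain the queue, then shut down."""
    results: List[Any] = []
    if jobs.empty():
        return results  # nothing to do, so the GPU is never started

    start_gpu()
    try:
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                break  # queue drained
            results.append(run_inference(job))
    finally:
        # Scale to zero even if a job raised an exception.
        stop_gpu()
    return results


# Demo with stand-in callbacks that only record what happened.
events: List[str] = []
pending: "queue.Queue" = queue.Queue()
for job_id in (1, 2, 3):
    pending.put(job_id)

outputs = drain_queue(
    pending,
    start_gpu=lambda: events.append("start"),
    stop_gpu=lambda: events.append("stop"),
    run_inference=lambda job: job * 10,
)
```

A real watcher would call something like this on a schedule or in response to a queue notification. Putting the shutdown in `finally` is the piece that replaces remembering to do it by hand.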

2) Processing real-time or streaming data with a single model

To support real-time or streaming processing, you’ll need a base amount of GPU hardware running 24/7 to ensure speedy results. Spinning up hardware takes time. That’s rarely a concern for batch processing, where there aren’t the same SLA requirements for how fast models return results. For real-time or streaming workloads, where some level of hardware must always be on, your cost savings come from architecting a solution that can scale up and down to accommodate peaks and troughs of activity. It might not yield the same cost savings as batch processing, but every bit helps.

The key here is to figure out how to optimize the extra GPU capacity you might need. You can use an autoscaler like KEDA, whose Prometheus scaler can add GPU-backed capacity when metrics show a burst of activity, or serverless inference via KServe, which runs on Kubernetes services such as Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), and Amazon Elastic Kubernetes Service (EKS) and can bring GPU-backed model servers up quickly. Additionally, there are optimizations you can identify by analyzing the demand on your service once it’s live and running. For example, you might observe that you’re consistently overprovisioning, meaning you’re giving your service more hardware than it needs to meet your requirements. You might identify patterns of activity over time tied to seasonality, nights, or weekends, and use these patterns to scale hardware up or down as needed. By using your knowledge of how your organization operates, you can make smart decisions about how to optimize your GPUs for more efficient usage.
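The core decision an autoscaler makes can be illustrated with a small calculation: given the current request rate and how much traffic one GPU-backed replica can handle, how many replicas should be running? The sketch below is a simplified version of that idea and not the algorithm of any particular autoscaler. The parameter names, and the assumption that capacity per replica is a fixed known number, are illustrative.

```python
import math


def desired_replicas(requests_per_second: float, capacity_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed for the observed load, clamped to a configured range."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    needed = math.ceil(requests_per_second / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# One always-on replica for latency, up to eight during bursts.
for load in (0, 120, 1000):
    replicas = desired_replicas(load, capacity_per_replica=50,
                                min_replicas=1, max_replicas=8)
    print(f"{load} req/s -> {replicas} replica(s)")
```

The minimum is your always-on baseline for real-time traffic, and the maximum caps what a burst can cost you. Comparing observed load against the capacity you actually provision is also a quick way to spot overprovisioning.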

3) Processing batch and real-time with multiple models

This is the most complicated architecture we’ll discuss, and rather than providing step-by-step instructions for how to build a system, let’s review some concepts that will be useful in creating a system that can support these different paradigms.

Containers & container orchestration

Putting your machine learning models into containers and scaling up/down with an orchestrator like Kubernetes is a good idea when it comes to running multiple models. Using Kubernetes or Docker Swarm for scaling and managing infrastructure creates efficiencies when running multiple models because trying to scale an individual queue for each model tied to one piece of hardware quickly becomes a tedious process. By using a container orchestrator, not only can you organize and manage your hardware in a central place, but you can also use the same hardware to support running multiple models at a time.

Job queueing

Job queuing ensures that you’re able to process data no matter how many GPUs might be available at any given time. It also makes sure that you’re packing GPUs with available data 24/7, and as soon as the queue is drained, the model can be shut down, freeing up the GPU for another model. One of the challenges with using GPUs is that they can typically only run one model at a time, so you have to shut down models once they’re done to allow the next to run. NVIDIA introduced a feature with the A100 GPU, called Multi-Instance GPU (MIG), that supports splitting a card. An A100 can be partitioned into as many as seven isolated GPU instances, which means you can have up to seven models running simultaneously and interact with them all independently. Obviously, this is an expensive proposition to begin with, but if you’re providing support for many data scientists in an on-premises scenario, it might be more efficient to do this.
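As a rough sketch of how a shared job queue keeps a fixed number of GPU slots busy, the example below starts one worker thread per slot, whether that slot is a whole GPU or one of the seven partitions of a split card. Each worker pulls the next job as soon as it finishes the previous one and exits when the queue is drained. The function `run_jobs_on_slots` and the `run_inference` callback are hypothetical names for this illustration, and the threads only simulate GPU work.

```python
import queue
import threading
from typing import Any, Callable, List, Sequence


def run_jobs_on_slots(jobs: Sequence[Any], num_slots: int,
                      run_inference: Callable[[int, Any], Any]) -> List[Any]:
    """Process all jobs using `num_slots` workers that share one queue."""
    job_queue: "queue.Queue" = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results: List[Any] = []
    results_lock = threading.Lock()

    def worker(slot_id: int) -> None:
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # queue drained: this slot is free for another model
            output = run_inference(slot_id, job)
            with results_lock:
                results.append(output)

    threads = [threading.Thread(target=worker, args=(slot,))
               for slot in range(num_slots)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Demo: 20 jobs shared across 7 slots. Completion order is not guaranteed.
outputs = run_jobs_on_slots(list(range(20)), num_slots=7,
                            run_inference=lambda slot, job: job * 2)
```

Because every slot pulls from the same queue, the work gets done whether one slot or seven are available. Fewer slots just means the queue takes longer to drain.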

Event-based architecture

Adopting an event-based architecture allows you to execute all activities asynchronously so that your GPU is free to pick up another inference job as soon as space becomes available. In this scenario, your GPU is still always running, but it has less idle time because it’s packed full of data. When setting up your container orchestrator, you can define multiple types of GPU nodes within Kubernetes and choose the smallest piece of hardware that your model can run on, so that you’re right-sizing your hardware for the job.
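Right-sizing can be as simple as picking the cheapest node type that still fits the model. The sketch below assumes GPU memory is the only constraint, which is a simplification. The node names, memory sizes, and hourly costs are made-up values for illustration and not real instance types or prices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeType:
    name: str
    gpu_memory_gb: int
    hourly_cost: float


def smallest_fitting_node(required_gb: float,
                          node_types: List[NodeType]) -> Optional[NodeType]:
    """Return the cheapest node type with enough GPU memory, or None."""
    candidates = [n for n in node_types if n.gpu_memory_gb >= required_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.hourly_cost)


# Hypothetical node types: illustrative numbers only.
NODE_TYPES = [
    NodeType("gpu-small", gpu_memory_gb=16, hourly_cost=0.50),
    NodeType("gpu-medium", gpu_memory_gb=24, hourly_cost=1.00),
    NodeType("gpu-large", gpu_memory_gb=80, hourly_cost=4.00),
]

choice = smallest_fitting_node(required_gb=20, node_types=NODE_TYPES)
print(choice.name if choice else "no node type is large enough")
```

In this example a model that needs 20 GB lands on the medium node instead of the large one. In practice you would also weigh throughput and latency, not just whether the model fits.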


Wrangling GPU costs when running ML/AI models in production is no small undertaking. It can be especially difficult to try and manage your hardware efficiently and cost-effectively when you’re running multiple models. As mentioned earlier, with batch jobs, there is more you can do to manage costs, whereas for real-time/streaming, you have no choice but to always have dedicated hardware running to meet the speedy response times needed.

Unfortunately, even with all of the work you might do to optimize hardware consumption, you can still make mistakes and miss out on the GPU cost savings for production AI. A better alternative might be a commercial model serving platform that will require more upfront costs, but ultimately pays for itself in GPU savings. A commercial model serving platform can result in 7X GPU cost savings, making it a pretty attractive alternative. To ensure that you’re choosing a solution that will yield those high GPU savings you’re after, look for a solution that:
• Supports both GPU and CPU processing
• Supports infrastructure auto-scaling, which allows you to define the minimum/maximum number of copies of a model running at any given time
• Supports scaling to zero and will shut down all pieces of hardware once processing is complete
• Allows you to set custom infrastructure definitions to right-size your hardware
• Provides general infrastructure orchestration

For more information on commercial model serving platforms, see this resource from the AI Infrastructure Alliance.