K8s GPU Partitioning with MIG for Inference Throughput Optimization
The Underutilization Problem in Production ML Inference
In any mature MLOps platform built on Kubernetes, the cost-to-performance ratio of the GPU cluster is a constant focus. While large-scale training jobs can effectively saturate an entire NVIDIA A100 or H100 GPU, latency-sensitive inference workloads often present a starkly different utilization profile. A typical inference service, even one serving a moderately complex model like a distilled BERT variant, might only utilize 10-20% of the SMs (Streaming Multiprocessors) and a fraction of the available VRAM on a high-end GPU.
Assigning a full GPU to such a service via the standard Kubernetes resource request (nvidia.com/gpu: 1) is profoundly inefficient. It leads to stranded assets, inflated cloud bills, and an artificially low ceiling on workload density. While solutions like NVIDIA Triton Inference Server's dynamic batching can improve per-GPU throughput, they don't solve the fundamental issue of single-tenancy at the hardware level. The core challenge is to securely and efficiently share a single physical GPU among multiple, independent inference pods.
This is where NVIDIA's Multi-Instance GPU (MIG) technology becomes critical. Unlike software-based time-slicing or CUDA MPS (Multi-Process Service), MIG provides true hardware-level partitioning of a GPU's resources. Each MIG instance has its own dedicated SMs, L2 cache, and memory controllers, offering predictable performance and fault isolation—essential for multi-tenant production environments. This article details the advanced operational patterns for implementing and managing MIG in a production Kubernetes cluster to maximize inference workload density.
Prerequisite: Understanding MIG Architecture
Before diving into Kubernetes integration, a precise understanding of MIG's components is necessary. MIG divides a single GPU into up to seven independent GPU Instances (GIs). Each GI has its own memory, cache, and compute cores. GIs can be further partitioned into one or more Compute Instances (CIs). For the Kubernetes device plugin, the schedulable resource is the MIG device backed by a GI, exposed with a specific profile name such as 1g.5gb (one compute slice and 5GB of memory).
For an A100 40GB, the available profiles are:
| MIG Profile | Max Instances per GPU | VRAM per GI | Compute Slices | Memory Slices |
|---|---|---|---|---|
| 7g.40gb (full GPU) | 1 | 40GB | 7 | 8 |
| 4g.20gb | 1 | 20GB | 4 | 4 |
| 3g.20gb | 2 | 20GB | 3 | 4 |
| 2g.10gb | 3 | 10GB | 2 | 2 |
| 1g.5gb | 7 | 5GB | 1 | 1 |
It's crucial to note that not all combinations are valid. You cannot, for example, add a 1g.5gb instance alongside two 3g.20gb instances: the two 3g.20gb instances consume all eight memory slices, so the one remaining compute slice has no memory slice left to pair with. The partitioning must respect the underlying hardware layout. This constraint has significant implications for our Kubernetes scheduling strategy.
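Before committing to a layout, you can ask the driver which profiles and placements it actually supports on your GPU. A quick check (output omitted; exact columns vary by driver version):
# List the GPU instance profiles supported on this GPU
$ sudo nvidia-smi mig -lgip
# List the valid placements (start slice and size) for those profiles
$ sudo nvidia-smi mig -lgipp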
Step 1: Enabling and Configuring MIG on Worker Nodes
This is an operational task performed on the GPU worker nodes themselves, often managed via configuration management (Ansible, Puppet) or baked into a custom machine image (AMI).
First, you must disable any running GPU processes and then enable MIG mode. This requires root privileges.
# Reset GPU 0 first (this will fail if any processes are still using the GPU)
$ sudo nvidia-smi -i 0 --gpu-reset
# Enable MIG mode on GPU 0
$ sudo nvidia-smi -i 0 -mig 1
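On some platforms the mode change only takes effect after the GPU has been reset or the node rebooted, so it is worth confirming before moving on. A quick check using standard nvidia-smi query fields:
# Confirm that MIG mode is actually enabled (current vs. pending state)
$ nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending --format=csv
# mig.mode.current, mig.mode.pending
# Enabled, Enabled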
Once enabled, you can list the available profiles and create your desired GI configuration. Let's create a heterogeneous mix on a single A100: one 3g.20gb instance and two 2g.10gb instances. This is a common pattern for supporting a mix of medium and small models.
# List possible MIG device profiles
$ nvidia-smi mig -lgip
# Destroy any existing Compute Instances first, then the GPU Instances
$ sudo nvidia-smi mig -dci
$ sudo nvidia-smi mig -dgi
# Create the desired layout. The -cgi flag takes a comma-separated list of
# profile IDs, and -C also creates a default Compute Instance on each GI.
# On an A100, profile ID 9 corresponds to 3g.20gb and ID 14 to 2g.10gb; always
# confirm the IDs against the -lgip output for your driver version.
$ sudo nvidia-smi mig -cgi 9,14,14 -C
# Verify the created GPU Instances
$ nvidia-smi mig -lgi
#+---------------------------------------------------------------------------+
#| GPU instances: |
#| GPU GI-ID CI-ID MIG-Device |
#| Name |
#|===========================================================================|
#| 0 1 0 MIG 3g.20gb |
#| NVIDIA A100-SXM4-40GB |
#+---------------------------------------------------------------------------+
#| 0 2 0 MIG 2g.10gb |
#| NVIDIA A100-SXM4-40GB |
#+---------------------------------------------------------------------------+
#| 0 3 0 MIG 2g.10gb |
#| NVIDIA A100-SXM4-40GB |
#+---------------------------------------------------------------------------+
This manual configuration is suitable for static environments. In dynamic production clusters, this process should be automated, typically by the MIG Manager component of the NVIDIA GPU Operator, a custom Kubernetes Operator, or a node lifecycle hook that configures MIG based on node labels or other metadata.
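As a rough illustration, here is a minimal sketch of a node-local script that applies a hard-coded layout selected by a hypothetical MIG_LAYOUT environment variable set by the provisioning system (NVIDIA's MIG Manager does the same job more robustly, driven by the nvidia.com/mig.config node label). Run it on the node, as root, with no GPU workloads active:
#!/usr/bin/env bash
# Sketch only: apply one of a few hard-coded MIG layouts on GPU 0,
# chosen by the (hypothetical) MIG_LAYOUT variable.
set -euo pipefail

case "${MIG_LAYOUT:-none}" in
  medium-mix)
    nvidia-smi -i 0 -mig 1
    nvidia-smi mig -dci || true      # remove existing Compute Instances, if any
    nvidia-smi mig -dgi || true      # then remove existing GPU Instances
    nvidia-smi mig -cgi 9,14,14 -C   # 1x 3g.20gb + 2x 2g.10gb (IDs from -lgip)
    ;;
  small-only)
    nvidia-smi -i 0 -mig 1
    nvidia-smi mig -dci || true
    nvidia-smi mig -dgi || true
    nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C   # 7x 1g.5gb
    ;;
  *)
    echo "MIG_LAYOUT=${MIG_LAYOUT:-none}: leaving GPU configuration untouched"
    ;;
esac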
Step 2: Configuring the NVIDIA Device Plugin for MIG Awareness
The NVIDIA Device Plugin for Kubernetes is the bridge between the kubelet and the GPU driver. It's responsible for discovering GPUs and their MIG profiles and advertising them to the Kubernetes scheduler as allocatable resources. The configuration is managed via a ConfigMap.
The key parameter is the migStrategy. It can be none, single, or mixed.
- none: The plugin ignores MIG devices and only advertises full GPUs. Useless for our purposes.
- single: The plugin advertises MIG devices under the generic nvidia.com/gpu resource name, but only if every GPU on the node is MIG-enabled with an identical partitioning. This is too restrictive for heterogeneous nodes.
- mixed: The plugin advertises both full GPUs (on GPUs without MIG enabled) and every available MIG device under a profile-specific resource name such as nvidia.com/mig-2g.10gb. This is the most flexible and powerful strategy for production.

Here is a sample Helm values override for the nvidia-device-plugin chart that enables the mixed strategy:
# nvidia-device-plugin-values.yaml
# Recent chart versions take a full plugin config file via config.map; older
# versions accept a simple top-level `migStrategy` value instead. Adjust the
# keys to the plugin version you deploy.
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "mixed"        # <- CRITICAL SETTING
        failOnInitError: true
        nvidiaDriverRoot: "/run/nvidia/driver"
        plugin:
          deviceListStrategy: envvar
          deviceIDStrategy: uuid
Deploy or upgrade the plugin with these values.
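A minimal sketch of the upgrade, assuming the plugin's Helm repository has been added under the name nvdp (its canonical location is https://nvidia.github.io/k8s-device-plugin); adjust the release name and namespace to your environment:
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
$ helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace kube-system \
    --values nvidia-device-plugin-values.yaml
Once applied, you can inspect a MIG-enabled node to see the newly advertised resources: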
$ kubectl describe node <your-gpu-node-name>
# Expected output (Capacity/Allocatable sections, abridged). Note that under the
# mixed strategy a MIG-enabled GPU is no longer advertised as a full nvidia.com/gpu:
# Capacity:
#   ...
#   nvidia.com/gpu:          0
#   nvidia.com/mig-2g.10gb:  2
#   nvidia.com/mig-3g.20gb:  1
#   ...
# Allocatable:
#   ...
#   nvidia.com/gpu:          0
#   nvidia.com/mig-2g.10gb:  2
#   nvidia.com/mig-3g.20gb:  1
The Kubernetes scheduler is now aware of these fine-grained resources and can place pods that explicitly request them.
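For a quick cluster-wide view of what every node advertises, something like the following works (assumes jq is installed; the node names in the sample output are placeholders):
$ kubectl get nodes -o json | jq -r '
    .items[]
    | .metadata.name as $node
    | .status.allocatable
    | to_entries[]
    | select(.key | startswith("nvidia.com/"))
    | "\($node)\t\(.key)\t\(.value)"'
# gpu-node-a   nvidia.com/mig-2g.10gb   2
# gpu-node-a   nvidia.com/mig-3g.20gb   1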
Step 3: Pod Scheduling with Specific MIG Profiles
With the node and device plugin configured, developers can now request specific MIG profiles in their pod specs. This allows for precise resource allocation tailored to the model's requirements.
Here is a deployment manifest for an inference service that requires a 2g.10gb slice:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-small-model
  labels:
    app: triton-small
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-small
  template:
    metadata:
      labels:
        app: triton-small
    spec:
      containers:
      - name: triton-server
        image: nvcr.io/nvidia/tritonserver:23.10-py3
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            # Requesting a specific MIG profile
            nvidia.com/mig-2g.10gb: 1
          requests:
            nvidia.com/mig-2g.10gb: 1
The Kubernetes scheduler will now only place these pods on nodes that have an available nvidia.com/mig-2g.10gb resource. On our example node, it could schedule two such pods. If a third replica were requested, it would remain in a Pending state until a suitable resource becomes available.
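A useful sanity check once a replica is Running is to confirm that the container sees exactly one MIG device and nothing else. Roughly (UUIDs elided; the output shape varies slightly across driver versions):
$ kubectl exec deploy/triton-inference-small-model -- nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
#   MIG 2g.10gb     Device  0: (UUID: MIG-...)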
Advanced Pattern 1: Managing Heterogeneous MIG Profiles with Taints and Tolerations
In a large cluster, it's common to have different node pools with different MIG configurations to optimize for various model sizes. For example:
- inference-pool-small: Nodes partitioned into many 1g.5gb instances.
- inference-pool-medium: Nodes partitioned into a mix of 2g.10gb and 3g.20gb instances.
- training-pool: Nodes with MIG disabled, offering full 7g.40gb GPUs.

Simply relying on resource requests can lead to scheduling inefficiencies. A better approach is to use node taints and pod tolerations to enforce workload placement.
First, taint the nodes in each pool upon creation:
# Taint nodes in the small-model pool
$ kubectl taint nodes -l cloud.google.com/gke-nodepool=inference-pool-small mig-profile=small:NoSchedule
# Taint nodes in the medium-model pool
$ kubectl taint nodes -l cloud.google.com/gke-nodepool=inference-pool-medium mig-profile=medium:NoSchedule
Now, pods that need to run in these pools must explicitly tolerate these taints. This prevents, for example, a pod requesting a 3g.20gb slice from even being considered for scheduling on a node in the inference-pool-small.
Here is a pod manifest for a medium-sized model that targets the appropriate pool:
apiVersion: v1
kind: Pod
metadata:
  name: inference-medium-model-pod
spec:
  containers:
  - name: my-container
    image: my-inference-app:1.2
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
      requests:
        nvidia.com/mig-3g.20gb: 1
  tolerations:
  - key: "mig-profile"
    operator: "Equal"
    value: "medium"
    effect: "NoSchedule"
This combination of taints, tolerations, and specific MIG resource requests provides a robust, multi-layered scheduling strategy that ensures workloads land on the most appropriate hardware, improving cluster efficiency and preventing scheduling conflicts.
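One caveat: tolerations only permit scheduling onto tainted nodes, they do not attract pods to them. To guarantee that the pod above actually lands in the medium pool rather than any other node it happens to fit on, pair the toleration with a node selector on the pool label. The snippet below is the relevant addition to the pod spec; the label key matches the GKE node-pool label used in the taint commands, so substitute your provider's equivalent:
  # Added to the pod spec alongside the tolerations above
  nodeSelector:
    cloud.google.com/gke-nodepool: inference-pool-medium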
Advanced Pattern 2: Enforcing Multi-Tenant Quotas with `ResourceQuota`
In a multi-tenant environment where different teams share a GPU cluster, it's essential to enforce fair resource allocation. Kubernetes ResourceQuota objects can be used to limit the consumption of specific MIG profiles on a per-namespace basis.
Imagine you have two teams, team-alpha and team-beta.
- team-alpha works with smaller, experimental models and should be limited to 1g.5gb devices.
- team-beta runs larger production models and needs access to 3g.20gb devices.

We can create ResourceQuota objects to enforce these policies.
# quota-team-alpha.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-alpha
  namespace: team-alpha
spec:
  hard:
    # Extended resources such as MIG devices can only be quota'd via the
    # requests.* form (their requests and limits are always equal anyway).
    requests.nvidia.com/mig-1g.5gb: "4"   # Team Alpha can request up to 4 small slices
    requests.nvidia.com/mig-3g.20gb: "0"  # Team Alpha cannot request medium slices
---
# quota-team-beta.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota-beta
  namespace: team-beta
spec:
  hard:
    requests.nvidia.com/mig-1g.5gb: "2"   # Team Beta gets a few small slices for utility tasks
    requests.nvidia.com/mig-3g.20gb: "2"  # Team Beta can request 2 medium slices
Applying these manifests will prevent users in team-alpha from creating pods that request 3g.20gb GPUs. If they try, the API server will reject the request with a quota violation error. This is a powerful governance mechanism for managing expensive and scarce GPU resources in a shared environment.
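Quota consumption can be audited at any time, which is handy when a team asks why their pod was rejected (output abridged):
$ kubectl describe resourcequota gpu-quota-alpha -n team-alpha
# Name:                             gpu-quota-alpha
# Namespace:                        team-alpha
# Resource                          Used  Hard
# --------                          ----  ----
# requests.nvidia.com/mig-1g.5gb    2     4
# requests.nvidia.com/mig-3g.20gb   0     0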
Performance Considerations and Edge Cases
While MIG is a powerful tool, it's not without its complexities and trade-offs.
1. Memory Bandwidth and Host Interface Contention:
MIG assigns each instance its own L2 cache slices, memory controllers, and DRAM bandwidth, so on-device memory bandwidth is largely isolated; the isolation is strong but not absolute. What all instances still share is the host interface (PCIe), so heavy host-to-device transfer traffic from one pod (e.g. a high-resolution computer vision pipeline streaming frames in) can add latency jitter for its neighbors. Monitor utilization and transfer bandwidth using DCGM (Data Center GPU Manager) and the dcgm-exporter for Prometheus. If contention is observed, you may need to use pod anti-affinity rules to spread the heaviest workloads across different nodes.
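A sketch of what that can look like in a pod template, assuming the bandwidth-heavy deployments share a hypothetical workload-class: high-bandwidth label (note that standard scheduling can only spread at node granularity, not across individual GPUs inside one node):
      # Pod template fragment: prefer not to co-locate with other pods
      # labeled workload-class=high-bandwidth on the same node.
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  workload-class: high-bandwidth
              topologyKey: kubernetes.io/hostname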
2. The Overhead of MIG Reconfiguration:
Changing the MIG profile on a GPU is a destructive operation. It cannot be done while any process is using the GPU. In Kubernetes, this means the node must be cordoned and drained of all GPU-using pods. The MIG profile is then changed manually or via an automated script, and the node is uncordoned. This process introduces temporary capacity reduction and can be disruptive. Production-grade automation for MIG reconfiguration should use rolling strategies across node pools to maintain overall cluster capacity during the process.
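In practice, the sequence for one node looks roughly like this (assuming direct access to the node for the nvidia-smi steps; the device-plugin pod label below is the one used by the Helm chart and may differ in your installation):
$ kubectl cordon <gpu-node>
$ kubectl drain <gpu-node> --ignore-daemonsets --delete-emptydir-data
# On the node itself, rebuild the partitions:
$ sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
$ sudo nvidia-smi mig -cgi <new-profile-ids> -C
# Restart the device plugin on that node so it re-advertises the new resources,
# then return the node to service:
$ kubectl delete pod -n kube-system \
    -l app.kubernetes.io/name=nvidia-device-plugin \
    --field-selector spec.nodeName=<gpu-node>
$ kubectl uncordon <gpu-node>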
3. The "Stranded Slices" Problem:
Consider a node with one 3g.20gb and two 2g.10gb slices. If two pods request the 2g.10gb slices, the 3g.20gb slice remains. If the incoming workload consists only of pods requesting 2g.10gb, this larger slice will be stranded and underutilized. This is a classic bin-packing problem. Solving it requires a sophisticated cluster autoscaler and scheduler that are aware of the MIG profiles. In some cases, it's more efficient to create homogeneous node pools (e.g., a pool where all nodes are partitioned into seven 1g.5gb slices) to simplify scheduling and avoid fragmentation, even if it seems less flexible.
4. Debugging Pending Pods:
A common issue is a pod stuck in the Pending state. When using MIG, the reason is often a mismatch between the requested profile and the available resources. Use kubectl describe pod to investigate. The Events section will typically show a FailedScheduling event with a message like 0/5 nodes are available: 5 Insufficient nvidia.com/mig-2g.10gb. This immediately tells you that no node in the cluster has the specific slice your pod needs, pointing you to a problem with either the node's MIG configuration or the pod's resource request.
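Two commands resolve most of these cases: one to read the scheduling events, one to compare against what each node actually advertises:
$ kubectl describe pod <pending-pod-name>            # check the Events section
$ kubectl describe nodes | grep -E 'Name:|nvidia.com/mig'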
Conclusion: MIG as a Strategic MLOps Tool
Effectively implementing NVIDIA MIG in Kubernetes is a significant step up from basic GPU management. It transforms GPUs from monolithic, single-tenant resources into fine-grained, shareable assets, dramatically improving workload density and cost-efficiency for inference services.
However, success requires moving beyond simple pod spec changes. A production-ready strategy involves a deep integration with Kubernetes scheduling primitives like taints and tolerations, robust multi-tenancy controls via ResourceQuota, and a keen awareness of the performance and operational edge cases. By mastering these advanced patterns, infrastructure and MLOps teams can build highly efficient, scalable, and economically viable ML inference platforms.