CI/CD for LLMs: GPU Layer Caching with Kaniko & Truss for Inference

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Senior Engineer's Dilemma: Why Your LLM CI/CD is So Slow

If you've moved beyond notebooks and are operationalizing Large Language Models (LLMs), you've inevitably hit a wall: your CI/CD pipeline. The familiar, snappy sub-minute builds for your stateless Go microservice have been replaced by agonizing 30-minute-plus slogs. Every minor code change in your inference server triggers a full re-download of the 10GB NVIDIA CUDA base image and a lengthy re-installation of PyTorch, Transformers, and a dozen other heavy dependencies. The resulting container image is a monstrous 25GB artifact that takes an eternity to push to your registry and even longer for your Kubernetes cluster to pull.

This isn't just an inconvenience; it's a critical bottleneck that stifles iteration speed, inflates infrastructure costs, and makes rollbacks a high-stakes affair. The root cause is a fundamental mismatch: CI/CD practices designed for small, stateless applications are being misapplied to large, stateful, hardware-accelerated models.

This article dissects this problem and provides a production-grade, prescriptive solution. We will not be covering the basics. We assume you understand Docker, Kubernetes, and the fundamentals of CI/CD. Instead, we will focus on an advanced, integrated strategy combining three key technologies:

  • Truss: A framework for packaging ML models that standardizes the serving layer, making it CI/CD-friendly and abstracting away boilerplate Flask/FastAPI code.
  • Kaniko: A tool for building container images from a Dockerfile inside a container or Kubernetes cluster, without relying on a Docker daemon. This is essential for modern, secure, and scalable CI environments.
  • Advanced CI/CD Patterns: Specifically, multi-stage Dockerfiles optimized for GPU layer caching and the strategic decoupling of model weights from the application container image.

By the end, you'll have a complete blueprint for a CI/CD pipeline that can build and deploy changes to your LLM inference code in under two minutes, not twenty.


    The Anatomy of a Bloated LLM Service

    Let's start by building the 'naive' implementation to establish a baseline. We'll use the popular mistralai/Mistral-7B-Instruct-v0.1 model. Our stack will be Truss for packaging and a simple Dockerfile.

    1. Packaging with Truss

    Truss provides a clean, declarative way to define our model server. Here’s the file structure:

    text
    /mistral_7b_truss
    ├── model
    │   ├── __init__.py
    │   └── model.py
    └── config.yaml

    config.yaml

    This file declaratively defines the environment.

    yaml
    # config.yaml
    
    apply_library_patches: false
    model_name: Mistral7B-Instruct
    environment_variables: {}
    requirements:
      - torch==2.1.0
      - transformers==4.35.2
      - accelerate==0.25.0
      - bitsandbytes==0.41.2
      - sentencepiece==0.1.99
    system_packages:
      - "git"
    python_version: "py311"
    resources:
      cpu: "3"
      memory: "14Gi"
      use_gpu: true
      accelerator: "A10G"
    secrets: {}

    model/model.py

    This contains our inference logic. For this baseline, we'll package the model weights directly with the application.

    python
    # model/model.py
    
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
    
        def load(self):
            # In the naive approach, the model is loaded from the local filesystem
            # as it was baked into the container image during the build process.
            # This requires pre-downloading the model before the 'docker build' command.
            model_name = "mistralai/Mistral-7B-Instruct-v0.1"
    
            self._tokenizer = AutoTokenizer.from_pretrained(model_name)
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                trust_remote_code=True,
            )
    
        def predict(self, model_input):
            prompt = model_input.pop("prompt")
            max_new_tokens = model_input.pop("max_new_tokens", 256)
    
            messages = [
                {"role": "user", "content": prompt}
            ]
    
            encodeds = self._tokenizer.apply_chat_template(messages, return_tensors="pt")
            model_inputs = encodeds.to("cuda")
    
            generated_ids = self._model.generate(
                model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True
            )
            decoded = self._tokenizer.batch_decode(generated_ids)
            return {"output": decoded[0]}
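
    A quick local smoke test before containerizing anything helps catch packaging mistakes early. The sketch below is hypothetical (it assumes a local GPU with enough memory and network access to the Hugging Face Hub) and simply drives the Model class above directly:

    python
    # smoke_test.py -- hypothetical local check, run from the mistral_7b_truss directory
    from model.model import Model

    if __name__ == "__main__":
        m = Model()
        m.load()  # downloads and loads Mistral-7B onto the local GPU
        out = m.predict({"prompt": "Explain Docker layer caching in one sentence.",
                         "max_new_tokens": 64})
        print(out["output"])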

    2. The Naive Dockerfile

    To build this, we first need to download the model weights locally; we can use a simple Python script for that. (We'll also assume the requirements listed in config.yaml have been mirrored into mistral_7b_truss/requirements.txt so the Dockerfile can install them directly.)

    python
    # download.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"
    SAVE_DIR = "./mistral_7b_truss/model/model-weights"
    
    if __name__ == "__main__":
        AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True).save_pretrained(SAVE_DIR)
        # Save in bfloat16 so the on-disk weights stay around 14GB (fp32 would roughly double that)
        AutoModelForCausalLM.from_pretrained(
            MODEL_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
        ).save_pretrained(SAVE_DIR)
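
    Once download.py has run, a quick sanity check confirms the weights landed where the Dockerfile expects them. A small, hypothetical helper:

    python
    # check_weights.py -- hypothetical helper to verify the download before building
    import os
    
    WEIGHTS_DIR = "./mistral_7b_truss/model/model-weights"
    
    total_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(WEIGHTS_DIR)
        for f in files
    )
    print(f"{total_bytes / 1e9:.1f} GB in {WEIGHTS_DIR}")  # expect roughly 14 GB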

    Our model directory now contains ~14GB of weights. Now, the naive Dockerfile:

    dockerfile
    # Dockerfile.naive
    
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    
    # Set up non-root user
    RUN apt-get update && apt-get install -y --no-install-recommends \
        sudo \
        git \
        && useradd --create-home --shell /bin/bash appuser \
        && adduser appuser sudo \
        && echo 'appuser ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
    
    USER appuser
    WORKDIR /home/appuser
    
    # Install Python
    ENV PYTHONDONTWRITEBYTECODE 1
    ENV PYTHONUNBUFFERED 1
    RUN sudo apt-get update && sudo apt-get install -y python3.11 python3-pip
    
    # Install dependencies
    COPY --chown=appuser:appuser mistral_7b_truss/requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy model code AND weights
    COPY --chown=appuser:appuser mistral_7b_truss/ .
    
    # Set the entrypoint to run the Truss model server
    CMD ["python3", "-m", "truss.server"]
    

    3. The Painful CI/CD Run

    Imagine a simple CI/CD job (e.g., in GitLab CI):

    yaml
    # .gitlab-ci.yml (Naive)
    
    build_llm_service:
      stage: build
      image: docker:20.10.16
      services:
        - docker:20.10.16-dind
      script:
        - echo "Downloading model weights..."
        - pip install torch transformers sentencepiece  # download.py needs the full model stack just to fetch the weights
        - python download.py # This step alone is slow and network-intensive
        - echo "Logging into container registry..."
        - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
        - echo "Building and pushing image..."
        - docker build -f Dockerfile.naive -t $CI_REGISTRY_IMAGE:latest .
        - docker push $CI_REGISTRY_IMAGE:latest

    Analysis of Failure:

    * Build Time: The first run takes ~25-40 minutes. Pulling the CUDA base image, installing dependencies, and sending 14GB of model weights through the Docker build context to the daemon is excruciatingly slow.

    * Cache Invalidation: Now, change one line in model/model.py. The COPY mistral_7b_truss/ . layer is invalidated. Because the weights are in the same directory, Docker has to re-process that entire 14GB+ layer. Even worse, if you change requirements.txt, every subsequent layer is rebuilt, including the massive COPY.

    * Image Size: The final image is ~28GB (11GB CUDA base + 2GB dependencies + 14GB weights + 1GB OS packages). Pushing this to a registry is slow and costly.

    * Deployment Time: A Kubernetes node pulling this 28GB image can take 5-10 minutes, depending on network bandwidth. This makes autoscaling and rollouts painfully slow.

    This is untenable for a production system.


    The Optimized Strategy: Caching, Decoupling, and Daemonless Builds

    Our advanced strategy attacks each bottleneck systematically.

    Part 1: The Multi-Stage Dockerfile for Aggressive Caching

    We will restructure our Dockerfile to create distinct, cacheable layers for components that change at different frequencies. The OS and CUDA drivers change almost never. Python dependencies change infrequently. Our application code changes frequently.

    dockerfile
    # Dockerfile.optimized
    
    # Stage 1: 'base'. The most stable layer. Only rebuilds if the CUDA version changes.
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 as base
    
    RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        python3.11 \
        python3-pip \
        && rm -rf /var/lib/apt/lists/*
    
    # Stage 2: 'builder'. Caches Python dependencies.
    # This stage only rebuilds if requirements.txt changes.
    FROM base as builder
    
    WORKDIR /app
    
    # Copy only the requirements file to leverage cache
    COPY mistral_7b_truss/requirements.txt .
    
    # Install dependencies
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Stage 3: 'final'. The application layer.
    # This is the only stage that rebuilds on a code change.
    FROM base as final
    
    WORKDIR /app
    
    # Copy installed packages from the 'builder' stage
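    # NOTE: depending on how Python and pip are installed in the base image, packages may land in
    # dist-packages (Debian/Ubuntu) rather than site-packages -- verify the path in the builder stage.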
    COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
    COPY --from=builder /usr/local/bin /usr/local/bin
    
    # Copy our application code. Crucially, we are NOT copying model weights.
    COPY mistral_7b_truss/model ./model
    COPY mistral_7b_truss/config.yaml .
    
    # Setup non-root user for security; pre-create the runtime weights directory so it is writable
    RUN useradd --create-home appuser \
        && mkdir -p /app/model-weights \
        && chown -R appuser:appuser /app
    USER appuser
    
    WORKDIR /app
    
    # Set the entrypoint to run the Truss model server
    CMD ["python3", "-m", "truss.server"]
    

    Why This Works:

    * base Stage: The massive CUDA image and OS packages are isolated. This layer will be cached indefinitely unless you change the FROM line.

    * builder Stage: By copying only requirements.txt, we ensure that the expensive pip install command only re-runs when the dependencies actually change. A code change in model.py will not invalidate this layer's cache.

    * final Stage: This stage assembles the final image. It copies the pre-installed dependencies from builder and then copies our application code. When model.py changes, only the COPY instruction in this final stage and subsequent layers are re-run. The build becomes nearly instantaneous.

    Notice the most important change: we are no longer copying the model weights into the image. This is the core of our decoupling strategy.

    Part 2: Decoupling Model Weights

    Baking weights into the image is an anti-pattern. Instead, treat the model as data, separate from the application code. We'll store the weights in a cloud object store (like GCS or S3) and download them at runtime.

    Step 1: Upload Weights to an Artifact Store

    First, run the download.py script locally, then upload the resulting model-weights directory to a bucket, e.g., gs://my-llm-models/mistral-7b-instruct-v0.1/.
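
    If you prefer to script the upload rather than use gsutil, a minimal sketch with the google-cloud-storage client (same bucket and prefix as above; assumes application-default GCP credentials) looks like this:

    python
    # upload_weights.py -- minimal sketch, not part of the Truss package
    import os
    from google.cloud import storage
    
    LOCAL_DIR = "./mistral_7b_truss/model/model-weights"
    BUCKET = "my-llm-models"
    PREFIX = "mistral-7b-instruct-v0.1"
    
    client = storage.Client()
    bucket = client.bucket(BUCKET)
    
    for root, _, files in os.walk(LOCAL_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            blob_name = f"{PREFIX}/{os.path.relpath(local_path, LOCAL_DIR)}"
            bucket.blob(blob_name).upload_from_filename(local_path)
            print(f"Uploaded {local_path} -> gs://{BUCKET}/{blob_name}")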

    Step 2: Modify model.py to Load from the Cloud

    We update our load() method to fetch the model on startup. We'll use an environment variable to specify the model path.

    python
    # model/model.py (Optimized)
    
    import os
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from google.cloud import storage # Or boto3 for S3
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
            self._model_dir = "/app/model-weights"
            self._gcs_bucket_name = os.environ.get("GCS_BUCKET_NAME")
            self._gcs_model_path = os.environ.get("GCS_MODEL_PATH")
    
        def _download_from_gcs(self):
            if os.path.exists(self._model_dir) and os.listdir(self._model_dir):
                print("Model weights already exist, skipping download.")
                return
    
            storage_client = storage.Client()
            bucket = storage_client.bucket(self._gcs_bucket_name)
            blobs = bucket.list_blobs(prefix=self._gcs_model_path)
    
            os.makedirs(self._model_dir, exist_ok=True)
    
            for blob in blobs:
                if blob.name.endswith('/'):
                    continue
                
                destination_file_name = os.path.join(
                    self._model_dir,
                    os.path.relpath(blob.name, self._gcs_model_path)
                )
                os.makedirs(os.path.dirname(destination_file_name), exist_ok=True)
                blob.download_to_filename(destination_file_name)
                print(f"Downloaded {blob.name} to {destination_file_name}")
    
        def load(self):
            # Download weights from GCS before loading into GPU memory
            self._download_from_gcs()
    
            self._tokenizer = AutoTokenizer.from_pretrained(self._model_dir)
            self._model = AutoModelForCausalLM.from_pretrained(
                self._model_dir,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                trust_remote_code=True,
            )
    
        def predict(self, model_input):
            # ... (predict logic remains the same) ...
            prompt = model_input.pop("prompt")
            max_new_tokens = model_input.pop("max_new_tokens", 256)
    
            messages = [
                {"role": "user", "content": prompt}
            ]
    
            encodeds = self._tokenizer.apply_chat_template(messages, return_tensors="pt")
            model_inputs = encodeds.to("cuda")
    
            generated_ids = self._model.generate(
                model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True
            )
            decoded = self._tokenizer.batch_decode(generated_ids)
            return {"output": decoded[0]}

    We also need to add google-cloud-storage to our requirements.txt.
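
    To exercise the new startup path end to end, you can drive the class directly with the two environment variables set. This is a hypothetical check (run it inside the container, or adjust the hardcoded /app/model-weights directory), not part of the Truss server itself:

    python
    # Hypothetical check of the GCS-backed load path (requires GCS credentials and a GPU)
    import os
    from model.model import Model
    
    os.environ["GCS_BUCKET_NAME"] = "my-llm-models"
    os.environ["GCS_MODEL_PATH"] = "mistral-7b-instruct-v0.1"
    
    m = Model()
    m.load()  # first call downloads to /app/model-weights; later calls find the files and skip the download
    print(m.predict({"prompt": "ping", "max_new_tokens": 8}))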

    Now our container image is lightweight. It contains only the code and its dependencies, not the massive model weights. The final image size drops from 28GB to ~13GB.

    Part 3: Integrating Kaniko for Daemonless, In-Cluster Builds

    Using docker-in-docker in CI/CD is a security risk and can be complex to manage. Kaniko is the modern standard for building images within a Kubernetes-native CI/CD system (like GitLab Runners on Kubernetes or self-hosted GitHub Actions runners).

    Kaniko builds an image layer by layer within a container, pushing each layer to the registry as it's completed. Its killer feature for our use case is the remote cache.

    Here is the complete, optimized .gitlab-ci.yml:

    yaml
    # .gitlab-ci.yml (Optimized)
    
    variables:
      # The remote repository Kaniko will use to store cached layers
      KANIKO_CACHE_REPO: "$CI_REGISTRY_IMAGE/cache"
    
    build_llm_service_optimized:
      stage: build
      # Use the official Kaniko debug image which includes a shell
      image: gcr.io/kaniko-project/executor:v1.9.0-debug
      script:
        - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
        - /kaniko/executor
            --context $CI_PROJECT_DIR
            --dockerfile $CI_PROJECT_DIR/Dockerfile.optimized
            --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
            --destination $CI_REGISTRY_IMAGE:latest
            --cache=true
            --cache-repo=$KANIKO_CACHE_REPO

    Dissecting the Kaniko Command:

    * --context: The build context directory.

    * --dockerfile: Path to our optimized Dockerfile.

    * --destination: The final image tag(s).

    * --cache=true: Enables caching.

    * --cache-repo=$KANIKO_CACHE_REPO: This is the magic. Kaniko will push the cached layers (like our stable base and builder stages) to this separate repository in our container registry. On subsequent runs, even on a completely new CI runner pod, Kaniko will first check this remote repository for existing layers before building them from scratch. This gives us lightning-fast builds across our entire CI fleet.

    Performance Gains:

    * First Run: Still takes ~15-20 minutes to build and push all the layers to the cache repo.

    * Subsequent Runs (Code Change): The CI job now takes ~1-2 minutes. Kaniko pulls the cached base and builder layers from the remote cache, rebuilds only the small final stage, and pushes the new manifest and the few changed layers. The difference is staggering.

    * Subsequent Runs (Dependency Change): The job takes ~5-7 minutes. The base layer is pulled from cache, but the builder stage is rebuilt and its new layers are pushed to the cache repo.

    Part 4: Edge Cases - Mitigating Cold Start Latency with an `initContainer`

    We've solved the build and image size problem, but we've introduced a new one: cold start latency. When a new pod starts, our load() method now has to download 14GB of model weights from GCS/S3, which can take several minutes. During this time, the pod is not ready to serve traffic, which is disastrous for autoscaling.

    We can solve this elegantly using a Kubernetes initContainer.

    An initContainer runs to completion before the main application container starts. We'll use it to pre-load the model weights into a shared volume.

    yaml
    # kubernetes-deployment.yaml
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b-inference
      template:
        metadata:
          labels:
            app: mistral-7b-inference
        spec:
          # A volume that both containers can access
          volumes:
            - name: model-weights-volume
              emptyDir: {}
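              # emptyDir is pod-scoped, so each new pod downloads the weights once at startup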
          
          # This container runs first and downloads the model
          initContainers:
            - name: model-downloader
              image: google/cloud-sdk:slim # A small image with gcloud tools
              command: ["/bin/sh", "-c"]
              args:
                - >
                  gsutil -m cp -r gs://my-llm-models/mistral-7b-instruct-v0.1/* /weights/
              # Workload Identity (or a mounted service-account key) is assumed for GCS access
              volumeMounts:
                - name: model-weights-volume
                  mountPath: /weights
    
          # This is our main application container
          containers:
            - name: mistral-truss-server
              image: your-registry/your-llm-repo:latest
              ports:
                - containerPort: 8080
              env:
                # GCS env vars are no longer needed here; the initContainer stages the files locally.
                # This tells the app where to find them (see the model.py sketch after the walkthrough below).
                - name: MODEL_WEIGHTS_DIR
                  value: "/app/model-weights"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: model-weights-volume
                  mountPath: /app/model-weights # Mount the weights into the expected path
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4 # Or your GPU type

    How this Pattern Works:

  • When the pod is scheduled, Kubernetes provisions an emptyDir volume.
  • The model-downloader initContainer starts. It uses the gsutil tool to efficiently download the model weights into the shared /weights directory (the mounted volume).
  • Only after the initContainer exits successfully does the main mistral-truss-server container start.
  • The shared volume is mounted at /app/model-weights inside the main container. Our load() method now finds the weights already on the local filesystem and skips the download, loading them directly into the GPU.
  • The Kubernetes readiness probe will only start checking the main container after the initContainer has finished. This means the pod will only be marked as 'Ready' and added to the service's load balancer pool once the weights are downloaded and the model is loaded into memory. This completely solves the cold start problem from a traffic-serving perspective.
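
    The deployment also exports a MODEL_WEIGHTS_DIR environment variable. The model.py shown earlier hardcodes the same path, so the pattern works as-is, but if you want the manifest to drive the location explicitly, a small adjustment (a sketch, not the generated Truss code) is enough:

    python
    # model/model.py (excerpt) -- read the weights directory from the environment,
    # falling back to the previously hardcoded path
    import os
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
            self._model_dir = os.environ.get("MODEL_WEIGHTS_DIR", "/app/model-weights")
            self._gcs_bucket_name = os.environ.get("GCS_BUCKET_NAME")
            self._gcs_model_path = os.environ.get("GCS_MODEL_PATH")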

    Final Performance Comparison

    Let's quantify the results of our efforts:

    Metric                        | Naive Approach                           | Optimized Approach
    CI Build Time (Code Change)   | 25-40 minutes                            | 1-2 minutes
    CI Build Time (Deps Change)   | 25-40 minutes                            | 5-7 minutes
    Final Image Size              | ~28 GB                                   | ~13 GB (and model-version independent)
    Image Pull Time on Node       | Very slow (5-10 mins)                    | Fast (1-2 mins)
    Pod Cold Start Latency        | N/A (weights baked in)                   | High without initContainer, low with initContainer
    Flexibility                   | Low (code & model tied)                  | High (update model by changing env var, no rebuild)
    CI Security                   | Requires Docker-in-Docker (less secure)  | Daemonless with Kaniko (more secure)

    Conclusion: Production-Grade MLOps is About Process, Not Just Code

    We have systematically transformed a slow, brittle, and unmaintainable LLM deployment process into a fast, reliable, and scalable one. The key was not a magical new tool, but the application of advanced software engineering and DevOps principles to the unique challenges of machine learning artifacts.

    The key takeaways for senior engineers are:

  • Isolate Volatility: Structure your Dockerfile around the rate of change of its components. Isolate the OS, system dependencies, language dependencies, and application code into separate, cacheable stages.
  • Decouple Code and Data: Never bake large, stateful artifacts like model weights into your application image. Treat them as data, store them in a dedicated artifact store, and fetch them at runtime.
  • Embrace Daemonless Builds: For secure and scalable CI in Kubernetes environments, move away from Docker-in-Docker and adopt tools like Kaniko. Leverage remote caching to share build artifacts across your entire CI infrastructure.
  • Manage Runtime State Intelligently: Acknowledge the operational consequences of your design choices. If decoupling weights introduces cold-start latency, use platform-native features like initContainers to mitigate it at the infrastructure level.

    By implementing this strategy, you move from a state of friction and delay to one of high-velocity iteration, enabling your teams to deploy, test, and improve your LLM services at the speed modern development demands.
