CI/CD for LLMs: GPU Layer Caching with Kaniko & Truss for Inference

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Senior Engineer's Dilemma: Why Your LLM CI/CD is So Slow

If you've moved beyond notebooks and are operationalizing Large Language Models (LLMs), you've inevitably hit a wall: your CI/CD pipeline. The familiar, snappy sub-minute builds for your stateless Go microservice have been replaced by agonizing 30-minute-plus slogs. Every minor code change in your inference server triggers a full re-download of the 10GB NVIDIA CUDA base image and a lengthy re-installation of PyTorch, Transformers, and a dozen other heavy dependencies. The resulting container image is a monstrous 25GB artifact that takes an eternity to push to your registry and even longer for your Kubernetes cluster to pull.

This isn't just an inconvenience; it's a critical bottleneck that stifles iteration speed, inflates infrastructure costs, and makes rollbacks a high-stakes affair. The root cause is a fundamental mismatch: CI/CD practices designed for small, stateless applications are being misapplied to large, stateful, hardware-accelerated models.

This article dissects this problem and provides a production-grade, prescriptive solution. We will not be covering the basics. We assume you understand Docker, Kubernetes, and the fundamentals of CI/CD. Instead, we will focus on an advanced, integrated strategy combining three key technologies:

  • Truss: A framework for packaging ML models that standardizes the serving layer, making it CI/CD-friendly and abstracting away boilerplate Flask/FastAPI code.
  • Kaniko: A tool for building container images from a Dockerfile inside a container or Kubernetes cluster, without relying on a Docker daemon. This is essential for modern, secure, and scalable CI environments.
  • Advanced CI/CD Patterns: Specifically, multi-stage Dockerfiles optimized for GPU layer caching and the strategic decoupling of model weights from the application container image.

By the end, you'll have a complete blueprint for a CI/CD pipeline that can build and deploy changes to your LLM inference code in under two minutes, not twenty.


    The Anatomy of a Bloated LLM Service

    Let's start by building the 'naive' implementation to establish a baseline. We'll use the popular mistralai/Mistral-7B-Instruct-v0.1 model. Our stack will be Truss for packaging and a simple Dockerfile.

    1. Packaging with Truss

    Truss provides a clean, declarative way to define our model server. Here’s the file structure:

    text
    /mistral_7b_truss
    ├── model
    │   ├── __init__.py
    │   └── model.py
    └── config.yaml

    config.yaml

    This file declaratively defines the environment.

    yaml
    # config.yaml
    
    apply_library_patches: false
    model_name: Mistral7B-Instruct
    environment_variables: {}
    requirements:
      - torch==2.1.0
      - transformers==4.35.2
      - accelerate==0.25.0
      - bitsandbytes==0.41.2
      - sentencepiece==0.1.99
    system_packages:
      - "git"
    python_version: "py311"
    resources:
      cpu: "3"
      memory: "14Gi"
      use_gpu: true
      accelerator: "A10G"
    secrets: {}

    model/model.py

    This contains our inference logic. For this baseline, we'll package the model weights directly with the application.

    python
    # model/model.py
    
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
    
        def load(self):
            # In the naive approach, the model is loaded from the local filesystem
            # as it was baked into the container image during the build process.
            # This requires pre-downloading the model before the 'docker build' command.
            model_name = "mistralai/Mistral-7B-Instruct-v0.1"
    
            self._tokenizer = AutoTokenizer.from_pretrained(model_name)
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                trust_remote_code=True,
            )
    
        def predict(self, model_input):
            prompt = model_input.pop("prompt")
            max_new_tokens = model_input.pop("max_new_tokens", 256)
    
            messages = [
                {"role": "user", "content": prompt}
            ]
    
            encodeds = self._tokenizer.apply_chat_template(messages, return_tensors="pt")
            model_inputs = encodeds.to("cuda")
    
            generated_ids = self._model.generate(
                model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True
            )
            decoded = self._tokenizer.batch_decode(generated_ids)
            return {"output": decoded[0]}
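
    A quick local smoke test before containerizing anything helps catch packaging mistakes early. The sketch below is hypothetical (it assumes a local GPU with enough memory and network access to the Hugging Face Hub) and simply drives the Model class above directly:

    python
    # smoke_test.py -- hypothetical local check, run from the mistral_7b_truss directory
    from model.model import Model

    if __name__ == "__main__":
        m = Model()
        m.load()  # downloads and loads Mistral-7B onto the local GPU
        out = m.predict({"prompt": "Explain Docker layer caching in one sentence.",
                         "max_new_tokens": 64})
        print(out["output"])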

    2. The Naive Dockerfile

    To build this, we first need to download the model weights locally; we can use a simple Python script for that. (We'll also assume the requirements listed in config.yaml have been mirrored into mistral_7b_truss/requirements.txt so the Dockerfile can install them directly.)

    python
    # download.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"
    SAVE_DIR = "./mistral_7b_truss/model/model-weights"
    
    if __name__ == "__main__":
        AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True).save_pretrained(SAVE_DIR)
        # Save in bfloat16 so the on-disk weights stay around 14GB (fp32 would roughly double that)
        AutoModelForCausalLM.from_pretrained(
            MODEL_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
        ).save_pretrained(SAVE_DIR)
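
    Once download.py has run, a quick sanity check confirms the weights landed where the Dockerfile expects them. A small, hypothetical helper:

    python
    # check_weights.py -- hypothetical helper to verify the download before building
    import os
    
    WEIGHTS_DIR = "./mistral_7b_truss/model/model-weights"
    
    total_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(WEIGHTS_DIR)
        for f in files
    )
    print(f"{total_bytes / 1e9:.1f} GB in {WEIGHTS_DIR}")  # expect roughly 14 GB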

    Our model directory now contains ~14GB of weights. Now, the naive Dockerfile:

    dockerfile
    # Dockerfile.naive
    
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    
    # Set up non-root user
    RUN apt-get update && apt-get install -y --no-install-recommends \
        sudo \
        git \
        && useradd --create-home --shell /bin/bash appuser \
        && adduser appuser sudo \
        && echo 'appuser ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
    
    USER appuser
    WORKDIR /home/appuser
    
    # Install Python
    ENV PYTHONDONTWRITEBYTECODE 1
    ENV PYTHONUNBUFFERED 1
    RUN sudo apt-get update && sudo apt-get install -y python3.11 python3-pip
    
    # Install dependencies
    COPY --chown=appuser:appuser mistral_7b_truss/requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy model code AND weights
    COPY --chown=appuser:appuser mistral_7b_truss/ .
    
    # Set the entrypoint to run the Truss model server
    CMD ["python3", "-m", "truss.server"]
    

    3. The Painful CI/CD Run

    Imagine a simple CI/CD job (e.g., in GitLab CI):

    yaml
    # .gitlab-ci.yml (Naive)
    
    build_llm_service:
      stage: build
      image: docker:20.10.16
      services:
        - docker:20.10.16-dind
      script:
        - echo "Downloading model weights..."
        - pip install torch transformers sentencepiece  # download.py needs the full model stack just to fetch the weights
        - python download.py # This step alone is slow and network-intensive
        - echo "Logging into container registry..."
        - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
        - echo "Building and pushing image..."
        - docker build -f Dockerfile.naive -t $CI_REGISTRY_IMAGE:latest .
        - docker push $CI_REGISTRY_IMAGE:latest

    Analysis of Failure:

    * Build Time: The first run takes ~25-40 minutes. Pulling the CUDA base image, installing dependencies, and sending 14GB of model weights through the Docker build context to the daemon is excruciatingly slow.

    * Cache Invalidation: Now, change one line in model/model.py. The COPY mistral_7b_truss/ . layer is invalidated. Because the weights are in the same directory, Docker has to re-process that entire 14GB+ layer. Even worse, if you change requirements.txt, every subsequent layer is rebuilt, including the massive COPY.

    * Image Size: The final image is ~28GB (11GB CUDA base + 2GB dependencies + 14GB weights + 1GB OS packages). Pushing this to a registry is slow and costly.

    * Deployment Time: A Kubernetes node pulling this 28GB image can take 5-10 minutes, depending on network bandwidth. This makes autoscaling and rollouts painfully slow.

    This is untenable for a production system.


    The Optimized Strategy: Caching, Decoupling, and Daemonless Builds

    Our advanced strategy attacks each bottleneck systematically.

    Part 1: The Multi-Stage Dockerfile for Aggressive Caching

    We will restructure our Dockerfile to create distinct, cacheable layers for components that change at different frequencies. The OS and CUDA drivers change almost never. Python dependencies change infrequently. Our application code changes frequently.

    dockerfile
    # Dockerfile.optimized
    
    # Stage 1: 'base'. The most stable layer. Only rebuilds if the CUDA version changes.
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 as base
    
    RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        python3.11 \
        python3-pip \
        && rm -rf /var/lib/apt/lists/*
    
    # Stage 2: 'builder'. Caches Python dependencies.
    # This stage only rebuilds if requirements.txt changes.
    FROM base as builder
    
    WORKDIR /app
    
    # Copy only the requirements file to leverage cache
    COPY mistral_7b_truss/requirements.txt .
    
    # Install dependencies
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Stage 3: 'final'. The application layer.
    # This is the only stage that rebuilds on a code change.
    FROM base as final
    
    WORKDIR /app
    
    # Copy installed packages from the 'builder' stage
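    # NOTE: depending on how Python and pip are installed in the base image, packages may land in
    # dist-packages (Debian/Ubuntu) rather than site-packages -- verify the path in the builder stage.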
    COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
    COPY --from=builder /usr/local/bin /usr/local/bin
    
    # Copy our application code. Crucially, we are NOT copying model weights.
    COPY mistral_7b_truss/model ./model
    COPY mistral_7b_truss/config.yaml .
    
    # Setup non-root user for security; pre-create the runtime weights directory so it is writable
    RUN useradd --create-home appuser \
        && mkdir -p /app/model-weights \
        && chown -R appuser:appuser /app
    USER appuser
    
    WORKDIR /app
    
    # Set the entrypoint to run the Truss model server
    CMD ["python3", "-m", "truss.server"]
    

    Why This Works:

    * base Stage: The massive CUDA image and OS packages are isolated. This layer will be cached indefinitely unless you change the FROM line.

    * builder Stage: By copying only requirements.txt, we ensure that the expensive pip install command only re-runs when the dependencies actually change. A code change in model.py will not invalidate this layer's cache.

    * final Stage: This stage assembles the final image. It copies the pre-installed dependencies from builder and then copies our application code. When model.py changes, only the COPY instruction in this final stage and subsequent layers are re-run. The build becomes nearly instantaneous.

    Notice the most important change: we are no longer copying the model weights into the image. This is the core of our decoupling strategy.

    Part 2: Decoupling Model Weights

    Baking weights into the image is an anti-pattern. Instead, treat the model as data, separate from the application code. We'll store the weights in a cloud object store (like GCS or S3) and download them at runtime.

    Step 1: Upload Weights to an Artifact Store

    First, run the download.py script locally, then upload the resulting model-weights directory to a bucket, e.g., gs://my-llm-models/mistral-7b-instruct-v0.1/.
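
    If you prefer to script the upload rather than use gsutil, a minimal sketch with the google-cloud-storage client (same bucket and prefix as above; assumes application-default GCP credentials) looks like this:

    python
    # upload_weights.py -- minimal sketch, not part of the Truss package
    import os
    from google.cloud import storage
    
    LOCAL_DIR = "./mistral_7b_truss/model/model-weights"
    BUCKET = "my-llm-models"
    PREFIX = "mistral-7b-instruct-v0.1"
    
    client = storage.Client()
    bucket = client.bucket(BUCKET)
    
    for root, _, files in os.walk(LOCAL_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            blob_name = f"{PREFIX}/{os.path.relpath(local_path, LOCAL_DIR)}"
            bucket.blob(blob_name).upload_from_filename(local_path)
            print(f"Uploaded {local_path} -> gs://{BUCKET}/{blob_name}")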

    Step 2: Modify model.py to Load from the Cloud

    We update our load() method to fetch the model on startup. We'll use an environment variable to specify the model path.

    python
    # model/model.py (Optimized)
    
    import os
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from google.cloud import storage # Or boto3 for S3
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
            self._model_dir = "/app/model-weights"
            self._gcs_bucket_name = os.environ.get("GCS_BUCKET_NAME")
            self._gcs_model_path = os.environ.get("GCS_MODEL_PATH")
    
        def _download_from_gcs(self):
            if os.path.exists(self._model_dir) and os.listdir(self._model_dir):
                print("Model weights already exist, skipping download.")
                return
    
            storage_client = storage.Client()
            bucket = storage_client.bucket(self._gcs_bucket_name)
            blobs = bucket.list_blobs(prefix=self._gcs_model_path)
    
            os.makedirs(self._model_dir, exist_ok=True)
    
            for blob in blobs:
                if blob.name.endswith('/'):
                    continue
                
                destination_file_name = os.path.join(
                    self._model_dir,
                    os.path.relpath(blob.name, self._gcs_model_path)
                )
                os.makedirs(os.path.dirname(destination_file_name), exist_ok=True)
                blob.download_to_filename(destination_file_name)
                print(f"Downloaded {blob.name} to {destination_file_name}")
    
        def load(self):
            # Download weights from GCS before loading into GPU memory
            self._download_from_gcs()
    
            self._tokenizer = AutoTokenizer.from_pretrained(self._model_dir)
            self._model = AutoModelForCausalLM.from_pretrained(
                self._model_dir,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                trust_remote_code=True,
            )
    
        def predict(self, model_input):
            # ... (predict logic remains the same) ...
            prompt = model_input.pop("prompt")
            max_new_tokens = model_input.pop("max_new_tokens", 256)
    
            messages = [
                {"role": "user", "content": prompt}
            ]
    
            encodeds = self._tokenizer.apply_chat_template(messages, return_tensors="pt")
            model_inputs = encodeds.to("cuda")
    
            generated_ids = self._model.generate(
                model_inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True
            )
            decoded = self._tokenizer.batch_decode(generated_ids)
            return {"output": decoded[0]}

    We also need to add google-cloud-storage to our requirements.txt.
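
    To exercise the new startup path end to end, you can drive the class directly with the two environment variables set. This is a hypothetical check (run it inside the container, or adjust the hardcoded /app/model-weights directory), not part of the Truss server itself:

    python
    # Hypothetical check of the GCS-backed load path (requires GCS credentials and a GPU)
    import os
    from model.model import Model
    
    os.environ["GCS_BUCKET_NAME"] = "my-llm-models"
    os.environ["GCS_MODEL_PATH"] = "mistral-7b-instruct-v0.1"
    
    m = Model()
    m.load()  # first call downloads to /app/model-weights; later calls find the files and skip the download
    print(m.predict({"prompt": "ping", "max_new_tokens": 8}))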

    Now our container image is lightweight. It contains only the code and its dependencies, not the massive model weights. The final image size drops from 28GB to ~13GB.

    Part 3: Integrating Kaniko for Daemonless, In-Cluster Builds

    Using docker-in-docker in CI/CD is a security risk and can be complex to manage. Kaniko is the modern standard for building images within a Kubernetes-native CI/CD system (like GitLab Runners on Kubernetes or self-hosted GitHub Actions runners).

    Kaniko builds an image layer by layer within a container, pushing each layer to the registry as it's completed. Its killer feature for our use case is the remote cache.

    Here is the complete, optimized .gitlab-ci.yml:

    yaml
    # .gitlab-ci.yml (Optimized)
    
    variables:
      # The remote repository Kaniko will use to store cached layers
      KANIKO_CACHE_REPO: "$CI_REGISTRY_IMAGE/cache"
    
    build_llm_service_optimized:
      stage: build
      # Use the official Kaniko debug image which includes a shell
      image: gcr.io/kaniko-project/executor:v1.9.0-debug
      script:
        - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
        - /kaniko/executor
            --context $CI_PROJECT_DIR
            --dockerfile $CI_PROJECT_DIR/Dockerfile.optimized
            --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
            --destination $CI_REGISTRY_IMAGE:latest
            --cache=true
            --cache-repo=$KANIKO_CACHE_REPO

    Dissecting the Kaniko Command:

    * --context: The build context directory.

    * --dockerfile: Path to our optimized Dockerfile.

    * --destination: The final image tag(s).

    * --cache=true: Enables caching.

    * --cache-repo=$KANIKO_CACHE_REPO: This is the magic. Kaniko will push the cached layers (like our stable base and builder stages) to this separate repository in our container registry. On subsequent runs, even on a completely new CI runner pod, Kaniko will first check this remote repository for existing layers before building them from scratch. This gives us lightning-fast builds across our entire CI fleet.

    Performance Gains:

    * First Run: Still takes ~15-20 minutes to build and push all the layers to the cache repo.

    * Subsequent Runs (Code Change): The CI job now takes ~1-2 minutes. Kaniko pulls the cached base and builder layers from the remote cache, rebuilds only the small final stage, and pushes the new manifest and the few changed layers. The difference is staggering.

    * Subsequent Runs (Dependency Change): The job takes ~5-7 minutes. The base layer is pulled from cache, but the builder stage is rebuilt and its new layers are pushed to the cache repo.

    Part 4: Edge Cases - Mitigating Cold Start Latency with an `initContainer`

    We've solved the build and image size problem, but we've introduced a new one: cold start latency. When a new pod starts, our load() method now has to download 14GB of model weights from GCS/S3, which can take several minutes. During this time, the pod is not ready to serve traffic, which is disastrous for autoscaling.

    We can solve this elegantly using a Kubernetes initContainer.

    An initContainer runs to completion before the main application container starts. We'll use it to pre-load the model weights into a shared volume.

    yaml
    # kubernetes-deployment.yaml
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b-inference
      template:
        metadata:
          labels:
            app: mistral-7b-inference
        spec:
          # A volume that both containers can access
          volumes:
            - name: model-weights-volume
              emptyDir: {}
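              # emptyDir is pod-scoped, so each new pod downloads the weights once at startup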
          
          # This container runs first and downloads the model
          initContainers:
            - name: model-downloader
              image: google/cloud-sdk:slim # A small image with gcloud tools
              command: ["/bin/sh", "-c"]
              args:
                - >
                  gsutil -m cp -r gs://my-llm-models/mistral-7b-instruct-v0.1/* /weights/
              # Workload Identity (or a mounted service-account key) is assumed for GCS access
              volumeMounts:
                - name: model-weights-volume
                  mountPath: /weights
    
          # This is our main application container
          containers:
            - name: mistral-truss-server
              image: your-registry/your-llm-repo:latest
              ports:
                - containerPort: 8080
              env:
                # GCS env vars are no longer needed here; the initContainer stages the files locally.
                # This tells the app where to find them (see the model.py sketch after the walkthrough below).
                - name: MODEL_WEIGHTS_DIR
                  value: "/app/model-weights"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: model-weights-volume
                  mountPath: /app/model-weights # Mount the weights into the expected path
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4 # Or your GPU type

    How this Pattern Works:

  • When the pod is scheduled, Kubernetes provisions an emptyDir volume.
  • The model-downloader initContainer starts. It uses the gsutil tool to efficiently download the model weights into the shared /weights directory (the mounted volume).
  • Only after the initContainer exits successfully does the main mistral-truss-server container start.
  • The shared volume is mounted at /app/model-weights inside the main container. Our load() method now finds the weights already on the local filesystem and skips the download, loading them directly into the GPU.
  • The Kubernetes readiness probe will only start checking the main container after the initContainer has finished. This means the pod will only be marked as 'Ready' and added to the service's load balancer pool once the weights are downloaded and the model is loaded into memory. This completely solves the cold start problem from a traffic-serving perspective.
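
    The deployment also exports a MODEL_WEIGHTS_DIR environment variable. The model.py shown earlier hardcodes the same path, so the pattern works as-is, but if you want the manifest to drive the location explicitly, a small adjustment (a sketch, not the generated Truss code) is enough:

    python
    # model/model.py (excerpt) -- read the weights directory from the environment,
    # falling back to the previously hardcoded path
    import os
    
    class Model:
        def __init__(self, **kwargs):
            self._model = None
            self._tokenizer = None
            self._model_dir = os.environ.get("MODEL_WEIGHTS_DIR", "/app/model-weights")
            self._gcs_bucket_name = os.environ.get("GCS_BUCKET_NAME")
            self._gcs_model_path = os.environ.get("GCS_MODEL_PATH")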

    Final Performance Comparison

    Let's quantify the results of our efforts:

    Metric                        | Naive Approach                           | Optimized Approach
    CI Build Time (Code Change)   | 25-40 minutes                            | 1-2 minutes
    CI Build Time (Deps Change)   | 25-40 minutes                            | 5-7 minutes
    Final Image Size              | ~28 GB                                   | ~13 GB (and model-version independent)
    Image Pull Time on Node       | Very slow (5-10 mins)                    | Fast (1-2 mins)
    Pod Cold Start Latency        | N/A (weights baked in)                   | High without initContainer, low with initContainer
    Flexibility                   | Low (code & model tied)                  | High (update model by changing env var, no rebuild)
    CI Security                   | Requires Docker-in-Docker (less secure)  | Daemonless with Kaniko (more secure)

    Conclusion: Production-Grade MLOps is About Process, Not Just Code

    We have systematically transformed a slow, brittle, and unmaintainable LLM deployment process into a fast, reliable, and scalable one. The key was not a magical new tool, but the application of advanced software engineering and DevOps principles to the unique challenges of machine learning artifacts.

    The key takeaways for senior engineers are:

  • Isolate Volatility: Structure your Dockerfile around the rate of change of its components. Isolate the OS, system dependencies, language dependencies, and application code into separate, cacheable stages.
  • Decouple Code and Data: Never bake large, stateful artifacts like model weights into your application image. Treat them as data, store them in a dedicated artifact store, and fetch them at runtime.
  • Embrace Daemonless Builds: For secure and scalable CI in Kubernetes environments, move away from Docker-in-Docker and adopt tools like Kaniko. Leverage remote caching to share build artifacts across your entire CI infrastructure.
  • Manage Runtime State Intelligently: Acknowledge the operational consequences of your design choices. If decoupling weights introduces cold-start latency, use platform-native features like initContainers to mitigate it at the infrastructure level.

    By implementing this strategy, you move from a state of friction and delay to one of high-velocity iteration, enabling your teams to deploy, test, and improve your LLM services at the speed modern development demands.
