GitOps-Driven MLOps: Atomic Feature Store Versioning with DVC & ArgoCD


The Reproducibility Crisis in Production MLOps

For senior engineers operating ML systems at scale, the fundamental promise of CI/CD—predictable, repeatable deployments—often shatters. A git push triggering a Jenkins or GitHub Actions pipeline is insufficient because an ML system's state is a complex tuple: (code, model, configuration, feature_logic, training_data). Traditional CI/CD pipelines excel at managing the code and configuration components but treat model, feature_logic, and training_data as external, opaque artifacts.

This leads to critical production failure modes:

* Training-Serving Skew: The most insidious issue. A model is trained on a feature (user_transaction_count_90d) but deployed into an environment where the feature pipeline calculates it slightly differently (user_transaction_count_89d due to a subtle logic change). The model's performance silently degrades.

* Non-Reproducible Models: A request to retrain a model from six months ago fails because the exact version of the training data has been overwritten, or the feature transformation logic that existed at that point is lost to history.

* Untraceable Lineage: When a model behaves unexpectedly, it's nearly impossible to definitively trace its behavior back to the exact version of the data and feature definitions that created it. Audits and debugging become forensic nightmares.

The solution is not to simply bolt on ML-specific tools. The solution is to redefine our source of truth. We must architect a system where a Git commit hash deterministically represents the entire state of the ML application. This is the core principle of GitOps, applied with surgical precision to the MLOps lifecycle.

This article details an advanced, production-proven architecture that achieves this by integrating Feast (for feature store management), DVC (for data versioning), and ArgoCD (for GitOps orchestration) into a single, cohesive, and atomically consistent deployment workflow.


Architecting the GitOps-Native ML State

Our goal is to create a single Git repository that acts as the declarative source of truth for the entire ML system. A change to any component—from a Kubernetes deployment spec to a feature definition—is managed through a pull request, providing auditability, peer review, and a clear rollback path.

The Monorepo Structure

A well-structured monorepo is the foundation. Here is a battle-tested layout:

plaintext
ml-system-repo/
├── .dvc/                   # DVC internal configuration
├── .github/                # CI checks (linting, unit tests)
├── apps/
│   └── inference-service/  # Inference service code (e.g., FastAPI)
├── data/
│   ├── raw/
│   │   └── user_profiles.parquet.dvc  # DVC pointer to raw data in S3
│   └── processed/
│       └── training_set.parquet.dvc # DVC pointer to processed data
├── features/
│   ├── definitions/
│   │   ├── user_features.py         # Feast feature view definitions
│   │   └── transaction_features.py
│   └── feature_store.yaml           # Feast repository configuration
├── infra/
│   ├── base/
│   │   ├── kustomization.yaml
│   │   └── namespace.yaml
│   ├── staging/
│   │   ├── feast_apply_job.yaml     # K8s Job to apply feature changes
│   │   ├── kustomization.yaml
│   │   └── training_pipeline_trigger.yaml
│   └── production/
│       └── ...
├── models/
│   └── fraud_detector.joblib.dvc # DVC pointer to the trained model artifact
├── pipelines/
│   └── training_workflow.yaml    # Argo Workflow definition for training
└── argocd/
    └── app-of-apps.yaml        # ArgoCD ApplicationSet to manage everything

Key Components:

  • data/ & models/ with DVC: These directories don't contain the large binary files. Instead, they hold small .dvc pointer files: plaintext files recording the content hash, size, and path of the actual data/model files, which live in remote storage (e.g., an S3 bucket) configured in .dvc/config. This keeps the Git repository small and fast while providing cryptographic guarantees about the data version (an example pointer file follows this list).
  • features/ with Feast: This directory defines our feature store as code. feature_store.yaml configures connections to offline (Snowflake, BigQuery) and online (Redis, DynamoDB) stores. The .py files contain the declarative definitions of feature views and entities.
  • infra/ with Kustomize: We use Kustomize to manage environment-specific configurations (staging vs. production) for our Kubernetes resources without duplicating YAML.
  • pipelines/ with Argo Workflows: The entire ML training process is codified as a DAG in an Argo Workflow manifest. This workflow will have steps to pull DVC data, run feature engineering, train the model, and push the resulting artifact back to DVC.
  • argocd/: The brain of the operation. This holds the ArgoCD Application or ApplicationSet manifests that tell ArgoCD how to reconcile the state of our Git repo with the live Kubernetes cluster.
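
To make the pointer-file concept concrete, here is roughly what DVC writes into one of these files after a dvc add (the hash and size below are illustrative placeholders):

yaml
# data/raw/user_profiles.parquet.dvc -- committed to Git; the real Parquet file stays in S3
outs:
- md5: 3f2b1c9d8e7a6f5043210fedcba98765
  size: 10737418240
  path: user_profiles.parquet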

Deep Dive: Atomic Feature Versioning with Feast and Git

    The most common point of failure is a mismatch between feature logic in the offline training environment and the online serving environment. We solve this by making Feast's state part of the GitOps loop.

    1. Defining Features as Code

    First, we define our features declaratively. This example defines a user entity and a feature view that computes aggregations from a raw data source.

    features/feature_store.yaml

    yaml
    project: fraud_detection
    registry: data/registry.db # Using a file-based registry for simplicity, use SQL for production
    provider: gcp # or 'aws', 'local'
    online_store:
        type: redis
        connection_string: "redis-master.redis.svc.cluster.local:6379"
    offline_store:
        type: bigquery
        project_id: "my-gcp-project"
        dataset: "fraud_detection_offline_store"

    features/definitions/user_features.py

    python
    from google.protobuf.duration_pb2 import Duration
    from feast import Entity, Feature, FeatureView, FileSource, ValueType
    
    # Define the raw data source. In a real scenario, this would be a BigQuerySource or similar.
    user_data_source = FileSource(
        path="/data/raw/user_profiles.parquet", # This path will be mounted into our job
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created_timestamp",
    )
    
    # Define an entity for the user
    user = Entity(name="user_id", value_type=ValueType.INT64, description="User ID")
    
    # Define a Feature View for user features
    user_feature_view = FeatureView(
        name="user_profile_features",
        entities=["user_id"],
        ttl=Duration(seconds=86400 * 7), # 7 days
        features=[
            Feature(name="age", dtype=ValueType.INT32),
            Feature(name="avg_daily_spend_90d", dtype=ValueType.FLOAT),
            # NEW FEATURE TO BE ADDED IN A PR
            # Feature(name="account_age_days", dtype=ValueType.INT32),
        ],
        online=True,
        source=user_data_source,
        tags={},
    )

    2. The `feast apply` Synchronization Problem

    When a developer adds the new account_age_days feature and merges the PR, how does the Feast registry get updated? We cannot rely on manual feast apply commands. This must be automated and tied to the Git commit.

    We solve this by creating a Kubernetes Job that is triggered by ArgoCD whenever it detects a change in the features/ directory. This job will mount the Git repository, cd into the features directory, and execute feast apply.

    infra/staging/feast_apply_job.yaml

    yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: feast-apply-job
      namespace: ml-ops
      annotations:
        argocd.argoproj.io/hook: Sync # This is an ArgoCD Sync Hook
        argocd.argoproj.io/hook-delete-policy: HookSucceeded
        # This ensures the job is deleted after a successful run to avoid conflicts on next sync
    
    spec:
      template:
        spec:
          # The repository is not mounted automatically; an initContainer clones it into a
          # shared emptyDir volume so the applier can run `feast apply` from /src.
          # (The branch is assumed here; in practice, pin it to the revision ArgoCD is syncing.)
          initContainers:
          - name: git-clone
            image: alpine/git:latest
            command: ["git", "clone", "--depth", "1", "--branch", "staging",
                      "https://github.com/your-org/ml-system-repo.git", "/src"]
            volumeMounts:
            - name: repo-source
              mountPath: /src
          containers:
          - name: feast-applier
            image: my-custom-ml-image:1.2.0 # An image with the Feast CLI and GCP/AWS creds
            command: ["/bin/sh", "-c"]
            args:
              - |
                echo "Synchronizing Feast feature repository..."
                cd /src/features
                feast apply
                echo "Feast apply completed successfully."
            volumeMounts:
            - name: repo-source
              mountPath: /src
          volumes:
          - name: repo-source
            emptyDir: {}
          restartPolicy: Never
          # NOTE: In production, you'd use a Kubernetes service account with permissions
          # to modify the feature store's underlying infrastructure (e.g., BigQuery tables).
      backoffLimit: 2

    Crucial Elements:

    * argocd.argoproj.io/hook: Sync: This annotation tells ArgoCD to run this Job as part of its synchronization process. It will be executed before other resources (like the training pipeline or inference service) are reconciled.

    * hook-delete-policy: HookSucceeded: This cleans up the Job object from the cluster after it succeeds, making the process idempotent.

    * The container image must have the Feast CLI installed and the necessary cloud credentials to interact with the online/offline stores.

    Now, a git push that changes any file under features/ will automatically and atomically update the live feature store registry to match the definition in Git.
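
    One way to sanity-check this loop after a merge, assuming you have kubectl access to the cluster and a local Feast CLI configured against the same registry, is to tail the hook Job and then list the registered feature views:

    bash
    # Follow the ArgoCD sync hook (names taken from the Job manifest above)
    $ kubectl -n ml-ops logs job/feast-apply-job --follow

    # Confirm the new definition is now in the registry
    $ cd features && feast feature-views list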


    Integrating DVC for Data and Model Immutability

    Versioning feature definitions is only half the battle. We need to lock the exact version of the data used for training to the same Git commit.

    1. Versioning Data with DVC

    When a data scientist acquires a new dataset, the process is:

    bash
    # Configure DVC to use an S3 bucket for remote storage
    $ dvc remote add -d s3storage s3://my-ml-data-versioned
    
    # Start tracking the raw data file
    $ dvc add data/raw/user_profiles.parquet
    
    # This creates the pointer file `data/raw/user_profiles.parquet.dvc`
    # Now, push the actual data to S3
    $ dvc push
    
    # Commit the small .dvc file to Git
    $ git add data/raw/user_profiles.parquet.dvc .dvc/config
    $ git commit -m "feat(data): Add initial user profiles dataset v1.0"
    $ git push

    The Git repository now contains a pointer, not the 10GB Parquet file. Anyone who clones the repo can retrieve the exact version of the data with dvc pull.
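
    This is also what makes the "retrain the model from six months ago" scenario tractable: checking out an old commit restores the old .dvc pointers, and dvc pull restores the exact bytes they reference. A minimal sketch, assuming the historical commit hash is known:

    bash
    # Restore the historical state of code, feature definitions, and .dvc pointers
    $ git checkout <old-commit-sha>

    # Fetch the matching data and model artifacts from remote storage
    $ dvc pull

    # The workspace now mirrors that commit's exact (code, features, data) state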

    2. The DVC-Aware Training Pipeline

    Our Argo Workflow for training must be able to use this DVC pointer to fetch the correct data. We achieve this using an initContainer.

    pipelines/training_workflow.yaml

    yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: training-workflow
      namespace: ml-ops
    spec:
      entrypoint: training-dag
      # The revision parameter referenced below must be declared here; the default is a
      # placeholder and is overridden per run (e.g., by the CI trigger).
      arguments:
        parameters:
        - name: git_revision
          value: main
      volumes:
      - name: shared-data
        emptyDir: {}
      - name: repo-source
        emptyDir: {}
    
      templates:
      - name: training-dag
        dag:
          tasks:
          - name: run-training-pipeline
            template: training-container-template
    
      - name: training-container-template
        inputs:
          artifacts:
          - name: repo
            path: /src
            git:
              repo: https://github.com/your-org/ml-system-repo.git
              revision: "{{workflow.parameters.git_revision}}" # Parameterized revision
    
        initContainers:
        - name: dvc-pull
          image: iterativeai/dvc:latest # Official DVC image
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -ex
              cd /src
              # Configure credentials for remote storage (e.g., S3)
              # In a real setup, use secrets!
              dvc remote modify s3storage endpointurl $S3_ENDPOINT
              dvc remote modify s3storage access_key_id $S3_ACCESS_KEY_ID
              dvc remote modify s3storage secret_access_key $S3_SECRET_ACCESS_KEY
              
              echo "Pulling DVC-tracked data..."
              dvc pull data/raw/user_profiles.parquet -f
              echo "Data pull complete."
          env:
          - name: S3_ENDPOINT
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: endpoint } }
          - name: S3_ACCESS_KEY_ID
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: accessKey } }
          - name: S3_SECRET_ACCESS_KEY
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: secretKey } }
          volumeMounts:
          - name: repo-source
            mountPath: /src
    
        container:
          image: my-custom-ml-image:1.2.0
          command: ["python", "/src/apps/training/train.py"]
          args:
            - "--input-data=/src/data/raw/user_profiles.parquet"
            - "--model-output-path=/src/models/fraud_detector.joblib"
          volumeMounts:
          - name: repo-source
            mountPath: /src
    
        # This is the critical final step: push the new model back to DVC
        # and create the new .dvc file in the workspace.
        # A subsequent CI/CD step would commit this back to Git.
        outputs:
          artifacts:
          - name: updated-repo
            path: /src

    Workflow Analysis:

  • Git Checkout: The workflow starts by checking out the specific Git revision that triggered it.
  • initContainer: Before the main training container starts, the dvc-pull container runs. It uses the .dvc file from the checked-out repo to pull the corresponding large data file from S3 into a shared volume.
  • Training Container: The main container now runs. It sees the user_profiles.parquet file on the shared volume as if it were a local file. It trains the model and saves the artifact (fraud_detector.joblib).
  • Model Versioning (Post-training): The training script, after saving the model, must execute dvc add models/fraud_detector.joblib and dvc push. This versions the output of the pipeline. The final step, committing the new models/fraud_detector.joblib.dvc file back to the repository, is typically handled by a subsequent step in the pipeline that uses a Git token (a sketch of this step follows this list).
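
    A minimal sketch of that post-training step, assuming the workflow pod is given a Git token with write access to the repository (paths follow the monorepo layout above):

    bash
    # Run after train.py has written the new model artifact
    $ dvc add models/fraud_detector.joblib
    $ dvc push                                  # upload the artifact to remote storage
    $ git add models/fraud_detector.joblib.dvc
    $ git commit -m "chore(model): update fraud_detector artifact"
    $ git push origin staging                   # authenticated with the CI bot token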

The ArgoCD Orchestration Layer: An End-to-End Vision

    ArgoCD is the conductor that orchestrates this entire symphony. We use an ApplicationSet to manage the deployment across different environments based on Git branches or directories.

    argocd/app-of-apps.yaml

    yaml
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: ml-system
      namespace: argocd
    spec:
      generators:
      - git:
          repoURL: https://github.com/your-org/ml-system-repo.git
          revision: HEAD
          directories:
          - path: infra/*
          - path: infra/base
            exclude: true # base is only consumed by the environment overlays, not deployed directly
      template:
        metadata:
          name: 'ml-system-{{path.basename}}' # e.g., ml-system-staging
        spec:
          project: default
          source:
            repoURL: https://github.com/your-org/ml-system-repo.git
            targetRevision: HEAD
            path: '{{path}}'
            kustomize: {}
          destination:
            server: https://kubernetes.default.svc
            namespace: ml-ops
          syncPolicy:
            automated:
              prune: true
              selfHeal: true
            syncOptions:
            - CreateNamespace=true

    This ApplicationSet will automatically create an ArgoCD Application for each directory inside infra/ (e.g., staging, production).
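
    Each matched directory is an ordinary Kustomize overlay. As a minimal sketch, infra/staging/kustomization.yaml might look like the following, using the file names from the repository layout above:

    yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    namespace: ml-ops
    resources:
    - ../base
    - feast_apply_job.yaml
    - training_pipeline_trigger.yaml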

    The Complete, Atomically Consistent Workflow

    Let's trace the full lifecycle of adding a new feature:

    1. PR Created: A data scientist creates a branch feature/add-account-age. They:

       * Uncomment the account_age_days feature in features/definitions/user_features.py.
       * Update the training script apps/training/train.py to use this new feature.
       * Push the branch and create a PR.

    2. CI Checks: GitHub Actions runs basic linting and unit tests.

    3. ArgoCD Sync (Staging): The PR is merged into the staging branch. ArgoCD, watching this branch, detects a change.

    4. Sync Wave 1 (Hooks): ArgoCD first runs the feast-apply-job because it is a Sync hook. This job updates the staging Feast registry with the new account_age_days feature definition.

    5. Sync Wave 2 (Workflows): After the hook succeeds, ArgoCD triggers the training Argo Workflow. This workflow:

       * Checks out the staging branch commit.
       * Pulls the data corresponding to this commit via the dvc-pull initContainer.
       * Runs the training script, which connects to the staging Feast service, now aware of account_age_days, and successfully uses it for training.
       * Versions the new model artifact with dvc add and dvc push.
       * Commits the new models/fraud_detector.joblib.dvc file back to the staging branch in a subsequent step.

    6. Sync Wave 3 (Deployments): ArgoCD deploys the inference service, pointing it to the newly trained model version.

    7. Promotion to Production: The process is repeated by merging the staging branch into main. ArgoCD performs the exact same sequence of operations, but targeted at the production infrastructure and feature store.

    Because the feature store update, the data version pull, and the model training are all orchestrated declaratively from a single Git commit, we have achieved atomic consistency. It is impossible to deploy a model version without its corresponding feature definitions and data lineage being in place first.


    Edge Cases and Performance Considerations

    This architecture is powerful but introduces complexities that require careful management.

    * Schema Evolution and Rollbacks: What happens if you need to roll back a change? A git revert is your primary tool. Reverting the merge commit will trigger ArgoCD to re-run the feast-apply-job with the old feature definitions, effectively rolling back the feature store schema. It will then re-trigger the training pipeline, which will check out the reverted code, pull the old data version via DVC, and rebuild and redeploy the previous model version. The key is ensuring your feast apply process can handle an older definition being applied over a newer one (Feast is generally good at this). A sketch of this revert flow follows this list.

    * Materialization Latency: When a new feature is applied via feast apply, Feast needs to backfill historical data for it, a process called materialization. This can take minutes or hours. The training pipeline must not start until materialization is complete. A robust solution involves modifying the feast-apply-job to be an Argo Workflow itself. This workflow would run feast apply, then poll the status of the materialization job using the Feast SDK, only exiting successfully once the data is ready. The main training workflow would be downstream of this materialization workflow.

    * DVC Performance at Scale: For terabyte-scale datasets, dvc pull can be a bottleneck. Consider these optimizations:

    * Shared Volumes: In Kubernetes, use a ReadWriteMany (RWX) volume (like NFS or EFS) as a DVC cache across multiple workflow pods. Subsequent runs won't need to re-download data.

    * Direct Data Access: For data lakes like Delta Lake or Iceberg, you might not need to dvc pull the data at all. DVC can version the pointer to a specific table version or snapshot, and your training job (e.g., a Spark job) can read directly from the data lake using that version identifier. DVC's role shifts from data transport to data pointer versioning.

    * Breaking Feature Changes: Deleting a feature or changing its data type is a breaking change. This requires a more careful, multi-step deployment. You might first deploy a new version of the inference service that no longer requests the old feature, then in a subsequent commit, remove the feature from the Feast definition. This prevents production failures where the service requests a feature that has just been deleted.
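
    To make the rollback path above concrete: reverting a problematic feature-addition merge on the staging branch is an ordinary Git operation, and ArgoCD does the rest on its next sync. A minimal sketch:

    bash
    # Identify the merge commit that introduced the change
    $ git log --oneline --merges staging

    # Revert it (-m 1 keeps the mainline parent) and push
    $ git revert -m 1 <merge-commit-sha>
    $ git push origin staging

    # ArgoCD now re-runs the feast-apply hook with the old definitions and
    # re-triggers the training workflow against the old DVC-tracked data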

    Conclusion: True Reproducibility as a Solved Problem

    By rejecting the traditional separation of code, data, and infrastructure, we have constructed a system where the Git repository is the ultimate, holistic source of truth. The integration of a declarative feature store (Feast), immutable data versioning (DVC), and a GitOps controller (ArgoCD) elevates MLOps from a series of imperative scripts to a truly declarative, auditable, and reproducible practice.

    This architecture isn't simple, but it directly addresses the most challenging and costly failure modes in production ML. For senior engineers tasked with building reliable and maintainable ML systems, this GitOps-driven approach provides a blueprint for moving past the chaos of artifact management and achieving the same level of deterministic control we expect from modern software engineering.
