GitOps-Driven MLOps: Atomic Feature Store Versioning with DVC & ArgoCD


The Reproducibility Crisis in Production MLOps

For senior engineers operating ML systems at scale, the fundamental promise of CI/CD—predictable, repeatable deployments—often shatters. A git push triggering a Jenkins or GitHub Actions pipeline is insufficient because an ML system's state is a complex tuple: (code, model, configuration, feature_logic, training_data). Traditional CI/CD pipelines excel at managing the code and configuration components but treat model, feature_logic, and training_data as external, opaque artifacts.

This leads to critical production failure modes:

* Training-Serving Skew: The most insidious issue. A model is trained on a feature (user_transaction_count_90d) but deployed into an environment where the feature pipeline calculates it slightly differently (user_transaction_count_89d due to a subtle logic change). The model's performance silently degrades.

* Non-Reproducible Models: A request to retrain a model from six months ago fails because the exact version of the training data has been overwritten, or the feature transformation logic that existed at that point is lost to history.

* Untraceable Lineage: When a model behaves unexpectedly, it's nearly impossible to definitively trace its behavior back to the exact version of the data and feature definitions that created it. Audits and debugging become forensic nightmares.

The solution is not to simply bolt on ML-specific tools. The solution is to redefine our source of truth. We must architect a system where a Git commit hash deterministically represents the entire state of the ML application. This is the core principle of GitOps, applied with surgical precision to the MLOps lifecycle.

This article details an advanced, production-proven architecture that achieves this by integrating Feast (for feature store management), DVC (for data versioning), and ArgoCD (for GitOps orchestration) into a single, cohesive, and atomically consistent deployment workflow.


Architecting the GitOps-Native ML State

Our goal is to create a single Git repository that acts as the declarative source of truth for the entire ML system. A change to any component—from a Kubernetes deployment spec to a feature definition—is managed through a pull request, providing auditability, peer review, and a clear rollback path.

The Monorepo Structure

A well-structured monorepo is the foundation. Here is a battle-tested layout:

plaintext
ml-system-repo/
├── .dvc/                   # DVC internal configuration
├── .github/                # CI checks (linting, unit tests)
├── apps/
│   └── inference-service/  # Inference service code (e.g., FastAPI)
├── data/
│   ├── raw/
│   │   └── user_profiles.parquet.dvc  # DVC pointer to raw data in S3
│   └── processed/
│       └── training_set.parquet.dvc # DVC pointer to processed data
├── features/
│   ├── definitions/
│   │   ├── user_features.py         # Feast feature view definitions
│   │   └── transaction_features.py
│   └── feature_store.yaml           # Feast repository configuration
├── infra/
│   ├── base/
│   │   ├── kustomization.yaml
│   │   └── namespace.yaml
│   ├── staging/
│   │   ├── feast_apply_job.yaml     # K8s Job to apply feature changes
│   │   ├── kustomization.yaml
│   │   └── training_pipeline_trigger.yaml
│   └── production/
│       └── ...
├── models/
│   └── fraud_detector.joblib.dvc # DVC pointer to the trained model artifact
├── pipelines/
│   └── training_workflow.yaml    # Argo Workflow definition for training
└── argocd/
    └── app-of-apps.yaml        # ArgoCD ApplicationSet to manage everything

Key Components:

  • data/ & models/ with DVC: These directories don't contain the large binary files. Instead, they hold small .dvc pointer files: plaintext files recording the content hash, size, and path of the actual data/model files, which live in remote storage (e.g., an S3 bucket) configured in .dvc/config. This keeps the Git repository small and fast while providing cryptographic guarantees about the data version (an example pointer file follows this list).
  • features/ with Feast: This directory defines our feature store as code. feature_store.yaml configures connections to offline (Snowflake, BigQuery) and online (Redis, DynamoDB) stores. The .py files contain the declarative definitions of feature views and entities.
  • infra/ with Kustomize: We use Kustomize to manage environment-specific configurations (staging vs. production) for our Kubernetes resources without duplicating YAML.
  • pipelines/ with Argo Workflows: The entire ML training process is codified as a DAG in an Argo Workflow manifest. This workflow will have steps to pull DVC data, run feature engineering, train the model, and push the resulting artifact back to DVC.
  • argocd/: The brain of the operation. This holds the ArgoCD Application or ApplicationSet manifests that tell ArgoCD how to reconcile the state of our Git repo with the live Kubernetes cluster.
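
To make the pointer-file concept concrete, here is roughly what DVC writes into one of these files after a dvc add (the hash and size below are illustrative placeholders):

yaml
# data/raw/user_profiles.parquet.dvc -- committed to Git; the real Parquet file stays in S3
outs:
- md5: 3f2b1c9d8e7a6f5043210fedcba98765
  size: 10737418240
  path: user_profiles.parquet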

Deep Dive: Atomic Feature Versioning with Feast and Git

    The most common point of failure is a mismatch between feature logic in the offline training environment and the online serving environment. We solve this by making Feast's state part of the GitOps loop.

    1. Defining Features as Code

    First, we define our features declaratively. This example defines a user entity and a feature view that computes aggregations from a raw data source.

    features/feature_store.yaml

    yaml
    project: fraud_detection
    registry: data/registry.db # Using a file-based registry for simplicity, use SQL for production
    provider: gcp # or 'aws', 'local'
    online_store:
        type: redis
        connection_string: "redis-master.redis.svc.cluster.local:6379"
    offline_store:
        type: bigquery
        project_id: "my-gcp-project"
        dataset: "fraud_detection_offline_store"

    features/definitions/user_features.py

    python
    from google.protobuf.duration_pb2 import Duration
    from feast import Entity, Feature, FeatureView, FileSource, ValueType
    
    # Define the raw data source. In a real scenario, this would be a BigQuerySource or similar.
    user_data_source = FileSource(
        path="/data/raw/user_profiles.parquet", # This path will be mounted into our job
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created_timestamp",
    )
    
    # Define an entity for the user
    user = Entity(name="user_id", value_type=ValueType.INT64, description="User ID")
    
    # Define a Feature View for user features
    user_feature_view = FeatureView(
        name="user_profile_features",
        entities=["user_id"],
        ttl=Duration(seconds=86400 * 7), # 7 days
        features=[
            Feature(name="age", dtype=ValueType.INT32),
            Feature(name="avg_daily_spend_90d", dtype=ValueType.FLOAT),
            # NEW FEATURE TO BE ADDED IN A PR
            # Feature(name="account_age_days", dtype=ValueType.INT32),
        ],
        online=True,
        source=user_data_source,
        tags={},
    )

    2. The `feast apply` Synchronization Problem

    When a developer adds the new account_age_days feature and merges the PR, how does the Feast registry get updated? We cannot rely on manual feast apply commands. This must be automated and tied to the Git commit.

    We solve this by creating a Kubernetes Job that is triggered by ArgoCD whenever it detects a change in the features/ directory. This job will mount the Git repository, cd into the features directory, and execute feast apply.

    infra/staging/feast_apply_job.yaml

    yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: feast-apply-job
      namespace: ml-ops
      annotations:
        argocd.argoproj.io/hook: Sync # This is an ArgoCD Sync Hook
        argocd.argoproj.io/hook-delete-policy: HookSucceeded
        # This ensures the job is deleted after a successful run to avoid conflicts on next sync
    
    spec:
      template:
        spec:
          # The repository is not mounted automatically; an initContainer clones it into a
          # shared emptyDir volume so the applier can run `feast apply` from /src.
          # (The branch is assumed here; in practice, pin it to the revision ArgoCD is syncing.)
          initContainers:
          - name: git-clone
            image: alpine/git:latest
            command: ["git", "clone", "--depth", "1", "--branch", "staging",
                      "https://github.com/your-org/ml-system-repo.git", "/src"]
            volumeMounts:
            - name: repo-source
              mountPath: /src
          containers:
          - name: feast-applier
            image: my-custom-ml-image:1.2.0 # An image with the Feast CLI and GCP/AWS creds
            command: ["/bin/sh", "-c"]
            args:
              - |
                echo "Synchronizing Feast feature repository..."
                cd /src/features
                feast apply
                echo "Feast apply completed successfully."
            volumeMounts:
            - name: repo-source
              mountPath: /src
          volumes:
          - name: repo-source
            emptyDir: {}
          restartPolicy: Never
          # NOTE: In production, you'd use a Kubernetes service account with permissions
          # to modify the feature store's underlying infrastructure (e.g., BigQuery tables).
      backoffLimit: 2

    Crucial Elements:

    * argocd.argoproj.io/hook: Sync: This annotation tells ArgoCD to run this Job as part of its synchronization process. It will be executed before other resources (like the training pipeline or inference service) are reconciled.

    * hook-delete-policy: HookSucceeded: This cleans up the Job object from the cluster after it succeeds, making the process idempotent.

    * The container image must have the Feast CLI installed and the necessary cloud credentials to interact with the online/offline stores.

    Now, a git push that changes any file under features/ will automatically and atomically update the live feature store registry to match the definition in Git.
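
    One way to sanity-check this loop after a merge, assuming you have kubectl access to the cluster and a local Feast CLI configured against the same registry, is to tail the hook Job and then list the registered feature views:

    bash
    # Follow the ArgoCD sync hook (names taken from the Job manifest above)
    $ kubectl -n ml-ops logs job/feast-apply-job --follow

    # Confirm the new definition is now in the registry
    $ cd features && feast feature-views list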


    Integrating DVC for Data and Model Immutability

    Versioning feature definitions is only half the battle. We need to lock the exact version of the data used for training to the same Git commit.

    1. Versioning Data with DVC

    When a data scientist acquires a new dataset, the process is:

    bash
    # Configure DVC to use an S3 bucket for remote storage
    $ dvc remote add -d s3storage s3://my-ml-data-versioned
    
    # Start tracking the raw data file
    $ dvc add data/raw/user_profiles.parquet
    
    # This creates the pointer file `data/raw/user_profiles.parquet.dvc`
    # Now, push the actual data to S3
    $ dvc push
    
    # Commit the small .dvc file to Git
    $ git add data/raw/user_profiles.parquet.dvc .dvc/config
    $ git commit -m "feat(data): Add initial user profiles dataset v1.0"
    $ git push

    The Git repository now contains a pointer, not the 10GB Parquet file. Anyone who clones the repo can retrieve the exact version of the data with dvc pull.
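
    This is also what makes the "retrain the model from six months ago" scenario tractable: checking out an old commit restores the old .dvc pointers, and dvc pull restores the exact bytes they reference. A minimal sketch, assuming the historical commit hash is known:

    bash
    # Restore the historical state of code, feature definitions, and .dvc pointers
    $ git checkout <old-commit-sha>

    # Fetch the matching data and model artifacts from remote storage
    $ dvc pull

    # The workspace now mirrors that commit's exact (code, features, data) state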

    2. The DVC-Aware Training Pipeline

    Our Argo Workflow for training must be able to use this DVC pointer to fetch the correct data. We achieve this using an initContainer.

    pipelines/training_workflow.yaml

    yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: training-workflow
      namespace: ml-ops
    spec:
      entrypoint: training-dag
      # The revision parameter referenced below must be declared here; the default is a
      # placeholder and is overridden per run (e.g., by the CI trigger).
      arguments:
        parameters:
        - name: git_revision
          value: main
      volumes:
      - name: shared-data
        emptyDir: {}
      - name: repo-source
        emptyDir: {}
    
      templates:
      - name: training-dag
        dag:
          tasks:
          - name: run-training-pipeline
            template: training-container-template
    
      - name: training-container-template
        inputs:
          artifacts:
          - name: repo
            path: /src
            git:
              repo: https://github.com/your-org/ml-system-repo.git
              revision: "{{workflow.parameters.git_revision}}" # Parameterized revision
    
        initContainers:
        - name: dvc-pull
          image: iterativeai/dvc:latest # Official DVC image
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -ex
              cd /src
              # Configure credentials for remote storage (e.g., S3)
              # In a real setup, use secrets!
              dvc remote modify s3storage endpointurl $S3_ENDPOINT
              dvc remote modify s3storage access_key_id $S3_ACCESS_KEY_ID
              dvc remote modify s3storage secret_access_key $S3_SECRET_ACCESS_KEY
              
              echo "Pulling DVC-tracked data..."
              dvc pull data/raw/user_profiles.parquet -f
              echo "Data pull complete."
          env:
          - name: S3_ENDPOINT
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: endpoint } }
          - name: S3_ACCESS_KEY_ID
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: accessKey } }
          - name: S3_SECRET_ACCESS_KEY
            valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: secretKey } }
          volumeMounts:
          - name: repo-source
            mountPath: /src
    
        container:
          image: my-custom-ml-image:1.2.0
          command: ["python", "/src/apps/training/train.py"]
          args:
            - "--input-data=/src/data/raw/user_profiles.parquet"
            - "--model-output-path=/src/models/fraud_detector.joblib"
          volumeMounts:
          - name: repo-source
            mountPath: /src
    
        # This is the critical final step: push the new model back to DVC
        # and create the new .dvc file in the workspace.
        # A subsequent CI/CD step would commit this back to Git.
        outputs:
          artifacts:
          - name: updated-repo
            path: /src

    Workflow Analysis:

  • Git Checkout: The workflow starts by checking out the specific Git revision that triggered it.
  • initContainer: Before the main training container starts, the dvc-pull container runs. It uses the .dvc file from the checked-out repo to pull the corresponding large data file from S3 into a shared volume.
  • Training Container: The main container now runs. It sees the user_profiles.parquet file on the shared volume as if it were a local file. It trains the model and saves the artifact (fraud_detector.joblib).
  • Model Versioning (Post-training): The training script, after saving the model, must execute dvc add models/fraud_detector.joblib and dvc push. This versions the output of the pipeline. The final step, committing the new models/fraud_detector.joblib.dvc file back to the repository, is typically handled by a subsequent step in the pipeline that uses a Git token (a sketch of this step follows this list).
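
    A minimal sketch of that post-training step, assuming the workflow pod is given a Git token with write access to the repository (paths follow the monorepo layout above):

    bash
    # Run after train.py has written the new model artifact
    $ dvc add models/fraud_detector.joblib
    $ dvc push                                  # upload the artifact to remote storage
    $ git add models/fraud_detector.joblib.dvc
    $ git commit -m "chore(model): update fraud_detector artifact"
    $ git push origin staging                   # authenticated with the CI bot token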

The ArgoCD Orchestration Layer: An End-to-End Vision

    ArgoCD is the conductor that orchestrates this entire symphony. We use an ApplicationSet to manage the deployment across different environments based on Git branches or directories.

    argocd/app-of-apps.yaml

    yaml
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: ml-system
      namespace: argocd
    spec:
      generators:
      - git:
          repoURL: https://github.com/your-org/ml-system-repo.git
          revision: HEAD
          directories:
          - path: infra/*
          - path: infra/base
            exclude: true # base is only consumed by the environment overlays, not deployed directly
      template:
        metadata:
          name: 'ml-system-{{path.basename}}' # e.g., ml-system-staging
        spec:
          project: default
          source:
            repoURL: https://github.com/your-org/ml-system-repo.git
            targetRevision: HEAD
            path: '{{path}}'
            kustomize: {}
          destination:
            server: https://kubernetes.default.svc
            namespace: ml-ops
          syncPolicy:
            automated:
              prune: true
              selfHeal: true
            syncOptions:
            - CreateNamespace=true

    This ApplicationSet will automatically create an ArgoCD Application for each directory inside infra/ (e.g., staging, production).
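
    Each matched directory is an ordinary Kustomize overlay. As a minimal sketch, infra/staging/kustomization.yaml might look like the following, using the file names from the repository layout above:

    yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    namespace: ml-ops
    resources:
    - ../base
    - feast_apply_job.yaml
    - training_pipeline_trigger.yaml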

    The Complete, Atomically Consistent Workflow

    Let's trace the full lifecycle of adding a new feature:

    1. PR Created: A data scientist creates a branch feature/add-account-age. They:

       * Uncomment the account_age_days feature in features/definitions/user_features.py.
       * Update the training script apps/training/train.py to use this new feature.
       * Push the branch and create a PR.

    2. CI Checks: GitHub Actions runs basic linting and unit tests.

    3. ArgoCD Sync (Staging): The PR is merged into the staging branch. ArgoCD, watching this branch, detects a change.

    4. Sync Wave 1 (Hooks): ArgoCD first runs the feast-apply-job because it is a Sync hook. This job updates the staging Feast registry with the new account_age_days feature definition.

    5. Sync Wave 2 (Workflows): After the hook succeeds, ArgoCD triggers the training Argo Workflow. This workflow:

       * Checks out the staging branch commit.
       * Pulls the data corresponding to this commit via the dvc-pull initContainer.
       * Runs the training script, which connects to the staging Feast service, now aware of account_age_days, and successfully uses it for training.
       * Versions the new model artifact with dvc add and dvc push.
       * Commits the new models/fraud_detector.joblib.dvc file back to the staging branch in a subsequent step.

    6. Sync Wave 3 (Deployments): ArgoCD deploys the inference service, pointing it to the newly trained model version.

    7. Promotion to Production: The process is repeated by merging the staging branch into main. ArgoCD performs the exact same sequence of operations, but targeted at the production infrastructure and feature store.

    Because the feature store update, the data version pull, and the model training are all orchestrated declaratively from a single Git commit, we have achieved atomic consistency. It is impossible to deploy a model version without its corresponding feature definitions and data lineage being in place first.


    Edge Cases and Performance Considerations

    This architecture is powerful but introduces complexities that require careful management.

    * Schema Evolution and Rollbacks: What happens if you need to roll back a change? A git revert is your primary tool. Reverting the merge commit will trigger ArgoCD to re-run the feast-apply-job with the old feature definitions, effectively rolling back the feature store schema. It will then re-trigger the training pipeline, which will check out the reverted code, pull the old data version via DVC, and rebuild and redeploy the previous model version. The key is ensuring your feast apply process can handle an older definition being applied over a newer one (Feast is generally good at this). A sketch of this revert flow follows this list.

    * Materialization Latency: When a new feature is applied via feast apply, Feast needs to backfill historical data for it, a process called materialization. This can take minutes or hours. The training pipeline must not start until materialization is complete. A robust solution involves modifying the feast-apply-job to be an Argo Workflow itself. This workflow would run feast apply, then poll the status of the materialization job using the Feast SDK, only exiting successfully once the data is ready. The main training workflow would be downstream of this materialization workflow.

    * DVC Performance at Scale: For terabyte-scale datasets, dvc pull can be a bottleneck. Consider these optimizations:

    * Shared Volumes: In Kubernetes, use a ReadWriteMany (RWX) volume (like NFS or EFS) as a DVC cache across multiple workflow pods. Subsequent runs won't need to re-download data.

    * Direct Data Access: For data lakes like Delta Lake or Iceberg, you might not need to dvc pull the data at all. DVC can version the pointer to a specific table version or snapshot, and your training job (e.g., a Spark job) can read directly from the data lake using that version identifier. DVC's role shifts from data transport to data pointer versioning.

    * Breaking Feature Changes: Deleting a feature or changing its data type is a breaking change. This requires a more careful, multi-step deployment. You might first deploy a new version of the inference service that no longer requests the old feature, then in a subsequent commit, remove the feature from the Feast definition. This prevents production failures where the service requests a feature that has just been deleted.
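
    To make the rollback path above concrete: reverting a problematic feature-addition merge on the staging branch is an ordinary Git operation, and ArgoCD does the rest on its next sync. A minimal sketch:

    bash
    # Identify the merge commit that introduced the change
    $ git log --oneline --merges staging

    # Revert it (-m 1 keeps the mainline parent) and push
    $ git revert -m 1 <merge-commit-sha>
    $ git push origin staging

    # ArgoCD now re-runs the feast-apply hook with the old definitions and
    # re-triggers the training workflow against the old DVC-tracked data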

    Conclusion: True Reproducibility as a Solved Problem

    By rejecting the traditional separation of code, data, and infrastructure, we have constructed a system where the Git repository is the ultimate, holistic source of truth. The integration of a declarative feature store (Feast), immutable data versioning (DVC), and a GitOps controller (ArgoCD) elevates MLOps from a series of imperative scripts to a truly declarative, auditable, and reproducible practice.

    This architecture isn't simple, but it directly addresses the most challenging and costly failure modes in production ML. For senior engineers tasked with building reliable and maintainable ML systems, this GitOps-driven approach provides a blueprint for moving past the chaos of artifact management and achieving the same level of deterministic control we expect from modern software engineering.
