GitOps-Driven MLOps: Atomic Feature Store Versioning with DVC & ArgoCD
The Reproducibility Crisis in Production MLOps
For senior engineers operating ML systems at scale, the fundamental promise of CI/CD (predictable, repeatable deployments) often shatters. A git push triggering a Jenkins or GitHub Actions pipeline is insufficient because an ML system's state is a complex tuple: (code, model, configuration, feature_logic, training_data). Traditional CI/CD pipelines excel at managing the code and configuration components but treat model, feature_logic, and training_data as external, opaque artifacts.
This leads to critical production failure modes:
* Training-Serving Skew: The most insidious issue. A model is trained on a feature (user_transaction_count_90d) but deployed into an environment where the feature pipeline calculates it slightly differently (user_transaction_count_89d due to a subtle logic change). The model's performance silently degrades.
* Non-Reproducible Models: A request to retrain a model from six months ago fails because the exact version of the training data has been overwritten, or the feature transformation logic that existed at that point is lost to history.
* Untraceable Lineage: When a model behaves unexpectedly, it's nearly impossible to definitively trace its behavior back to the exact version of the data and feature definitions that created it. Audits and debugging become forensic nightmares.
The solution is not to simply bolt on ML-specific tools. The solution is to redefine our source of truth. We must architect a system where a Git commit hash deterministically represents the entire state of the ML application. This is the core principle of GitOps, applied with surgical precision to the MLOps lifecycle.
This article details an advanced, production-proven architecture that achieves this by integrating Feast (for feature store management), DVC (for data versioning), and ArgoCD (for GitOps orchestration) into a single, cohesive, and atomically consistent deployment workflow.
Architecting the GitOps-Native ML State
Our goal is to create a single Git repository that acts as the declarative source of truth for the entire ML system. A change to any component—from a Kubernetes deployment spec to a feature definition—is managed through a pull request, providing auditability, peer review, and a clear rollback path.
The Monorepo Structure
A well-structured monorepo is the foundation. Here is a battle-tested layout:
ml-system-repo/
├── .dvc/ # DVC internal configuration
├── .github/ # CI checks (linting, unit tests)
├── apps/
│ └── inference-service/ # Inference service code (e.g., FastAPI)
├── data/
│ ├── raw/
│ │ └── user_profiles.parquet.dvc # DVC pointer to raw data in S3
│ └── processed/
│ └── training_set.parquet.dvc # DVC pointer to processed data
├── features/
│ ├── definitions/
│ │ ├── user_features.py # Feast feature view definitions
│ │ └── transaction_features.py
│ └── feature_store.yaml # Feast repository configuration
├── infra/
│ ├── base/
│ │ ├── kustomization.yaml
│ │ └── namespace.yaml
│ ├── staging/
│ │ ├── feast_apply_job.yaml # K8s Job to apply feature changes
│ │ ├── kustomization.yaml
│ │ └── training_pipeline_trigger.yaml
│ └── production/
│ └── ...
├── models/
│ └── fraud_detector.joblib.dvc # DVC pointer to the trained model artifact
├── pipelines/
│ └── training_workflow.yaml # Argo Workflow definition for training
└── argocd/
└── app-of-apps.yaml # ArgoCD ApplicationSet to manage everything
Key Components:
* data/ & models/ with DVC: These directories don't contain the large binary files. Instead, they hold small .dvc pointer files. These are plaintext files that contain hashes and remote storage locations (e.g., an S3 bucket) for the actual data/model files. This keeps the Git repository small and fast while providing cryptographic guarantees about the data version.
* features/ with Feast: This directory defines our feature store as code. feature_store.yaml configures connections to offline (Snowflake, BigQuery) and online (Redis, DynamoDB) stores. The .py files contain the declarative definitions of feature views and entities.
* infra/ with Kustomize: We use Kustomize to manage environment-specific configurations (staging vs. production) for our Kubernetes resources without duplicating YAML.
* pipelines/ with Argo Workflows: The entire ML training process is codified as a DAG in an Argo Workflow manifest. This workflow has steps to pull DVC data, run feature engineering, train the model, and push the resulting artifact back to DVC.
* argocd/: The brain of the operation. This holds the ArgoCD Application or ApplicationSet manifests that tell ArgoCD how to reconcile the state of our Git repo with the live Kubernetes cluster.
Deep Dive: Atomic Feature Versioning with Feast and Git
The most common point of failure is a mismatch between feature logic in the offline training environment and the online serving environment. We solve this by making Feast's state part of the GitOps loop.
1. Defining Features as Code
First, we define our features declaratively. This example defines a user entity and a feature view that computes aggregations from a raw data source.
features/feature_store.yaml
project: fraud_detection
registry: data/registry.db # Using a file-based registry for simplicity, use SQL for production
provider: gcp # or 'aws', 'local'
online_store:
type: redis
connection_string: "redis-master.redis.svc.cluster.local:6379"
offline_store:
type: bigquery
project_id: "my-gcp-project"
dataset: "fraud_detection_offline_store"
features/definitions/user_features.py
from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureView, FileSource, ValueType
# Define the raw data source. In a real scenario, this would be a BigQuerySource or similar.
user_data_source = FileSource(
path="/data/raw/user_profiles.parquet", # This path will be mounted into our job
event_timestamp_column="event_timestamp",
created_timestamp_column="created_timestamp",
)
# Define an entity for the user
user = Entity(name="user_id", value_type=ValueType.INT64, description="User ID")
# Define a Feature View for user features
user_feature_view = FeatureView(
name="user_profile_features",
entities=["user_id"],
ttl=Duration(seconds=86400 * 7), # 7 days
features=[
Feature(name="age", dtype=ValueType.INT32),
Feature(name="avg_daily_spend_90d", dtype=ValueType.FLOAT),
# NEW FEATURE TO BE ADDED IN A PR
# Feature(name="account_age_days", dtype=ValueType.INT32),
],
online=True,
source=user_data_source,
tags={},
)
2. The `feast apply` Synchronization Problem
When a developer adds the new account_age_days feature and merges the PR, how does the Feast registry get updated? We cannot rely on manual feast apply commands. This must be automated and tied to the Git commit.
We solve this by creating a Kubernetes Job that is triggered by ArgoCD whenever it detects a change in the features/ directory. This job will mount the Git repository, cd into the features directory, and execute feast apply.
infra/staging/feast_apply_job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: feast-apply-job
namespace: ml-ops
annotations:
argocd.argoproj.io/hook: Sync # This is an ArgoCD Sync Hook
argocd.argoproj.io/hook-delete-policy: HookSucceeded
# This ensures the job is deleted after a successful run to avoid conflicts on next sync
spec:
template:
spec:
containers:
- name: feast-applier
image: my-custom-ml-image:1.2.0 # An image with feast CLI and GCP/AWS creds
command: ["/bin/sh", "-c"]
args:
- |
echo "Synchronizing Feast feature repository..."
cd /src/features
feast apply
echo "Feast apply completed successfully."
volumeMounts:
- name: repo-source
mountPath: /src
restartPolicy: Never
# NOTE: In production, you'd use a Kubernetes service account with permissions
# to modify the feature store's underlying infrastructure (e.g., BigQuery tables).
backoffLimit: 2
Crucial Elements:
* argocd.argoproj.io/hook: Sync: This annotation tells ArgoCD to run this Job as part of its synchronization process. It will be executed before other resources (like the training pipeline or inference service) are reconciled.
* hook-delete-policy: HookSucceeded: This cleans up the Job object from the cluster after it succeeds, making the process idempotent.
* The container image must have the Feast CLI installed and the necessary cloud credentials to interact with the online/offline stores.
Now, a git push that changes any file under features/ will automatically and atomically update the live feature store registry to match the definition in Git.
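If you want to confirm the registry now matches Git, a minimal check is to run the Feast CLI against the same feature repository (a sketch, assuming your local environment has the same feature_store.yaml credentials the job uses):

# From the repository root, point the CLI at the feature repo and list what the registry knows about
$ cd features
$ feast feature-views list
# The updated user_profile_features view should appear in the output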
Integrating DVC for Data and Model Immutability
Versioning feature definitions is only half the battle. We need to lock the exact version of the data used for training to the same Git commit.
1. Versioning Data with DVC
When a data scientist acquires a new dataset, the process is:
# Configure DVC to use an S3 bucket for remote storage
$ dvc remote add -d s3storage s3://my-ml-data-versioned
# Start tracking the raw data file
$ dvc add data/raw/user_profiles.parquet
# This creates the pointer file `data/raw/user_profiles.parquet.dvc`
# Now, push the actual data to S3
$ dvc push
# Commit the small .dvc file to Git
$ git add data/raw/user_profiles.parquet.dvc .dvc/config
$ git commit -m "feat(data): Add initial user profiles dataset v1.0"
$ git push
The Git repository now contains a pointer, not the 10GB Parquet file. Anyone who clones the repo can retrieve the exact version of the data with dvc pull.
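To make that guarantee concrete, here is a minimal sketch of inspecting the pointer and reconstructing the data as it existed at an older commit (the hash values and commit reference are placeholders):

# The pointer file Git tracks is plain text; its contents look roughly like this
$ cat data/raw/user_profiles.parquet.dvc
# outs:
# - md5: <content hash of the versioned file>
#   size: <size in bytes>
#   path: user_profiles.parquet

# Reproduce the exact dataset for any historical commit
$ git checkout <commit-from-six-months-ago>
$ dvc pull data/raw/user_profiles.parquet   # fetches the matching version from the DVC remote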
2. The DVC-Aware Training Pipeline
Our Argo Workflow for training must be able to use this DVC pointer to fetch the correct data. We achieve this using an initContainer.
pipelines/training_workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: training-workflow
namespace: ml-ops
spec:
entrypoint: training-dag
volumes:
- name: shared-data
emptyDir: {}
- name: repo-source
emptyDir: {}
templates:
- name: training-dag
dag:
tasks:
- name: run-training-pipeline
template: training-container-template
- name: training-container-template
inputs:
artifacts:
- name: repo
path: /src
git:
repo: https://github.com/your-org/ml-system-repo.git
revision: "{{workflow.parameters.git_revision}}" # Parameterized revision
initContainers:
- name: dvc-pull
image: iterativeai/dvc:latest # Official DVC image
command: ["/bin/sh", "-c"]
args:
- |
set -ex
cd /src
# Configure credentials for remote storage (e.g., S3)
# In a real setup, use secrets!
dvc remote modify s3storage endpointurl $S3_ENDPOINT
dvc remote modify s3storage access_key_id $S3_ACCESS_KEY_ID
dvc remote modify s3storage secret_access_key $S3_SECRET_ACCESS_KEY
echo "Pulling DVC-tracked data..."
dvc pull data/raw/user_profiles.parquet -f
echo "Data pull complete."
env:
- name: S3_ENDPOINT
valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: endpoint } }
- name: S3_ACCESS_KEY_ID
valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: accessKey } }
- name: S3_SECRET_ACCESS_KEY
valueFrom: { secretKeyRef: { name: dvc-s3-creds, key: secretKey } }
volumeMounts:
- name: repo-source
mountPath: /src
container:
image: my-custom-ml-image:1.2.0
command: ["python", "/src/apps/training/train.py"]
args:
- "--input-data=/src/data/raw/user_profiles.parquet"
- "--model-output-path=/src/models/fraud_detector.joblib"
volumeMounts:
- name: repo-source
mountPath: /src
# This is the critical final step: push the new model back to DVC
# and create the new .dvc file in the workspace.
# A subsequent CI/CD step would commit this back to Git.
outputs:
artifacts:
- name: updated-repo
path: /src
Workflow Analysis:
* initContainer: Before the main training container starts, the dvc-pull container runs. It uses the .dvc file from the checked-out repo to pull the corresponding large data file from S3 into a shared volume.
* Main container: The training script reads the user_profiles.parquet file on the shared volume as if it were a local file. It trains the model and saves the artifact (fraud_detector.joblib).
* Model versioning: The pipeline then runs dvc add models/fraud_detector.joblib and dvc push. This versions the output of the pipeline. The final step, committing the new models/fraud_detector.joblib.dvc file back to the repository, is typically handled by a subsequent step in the pipeline that uses a Git token (a sketch of that step follows below).
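A minimal sketch of that commit-back step, assuming the pipeline runs with a bot identity and a GIT_TOKEN secret (both names are hypothetical):

# Version the freshly trained artifact and upload it to the DVC remote
$ dvc add models/fraud_detector.joblib
$ dvc push

# Commit the updated pointer file back to the branch that triggered the run
$ git config user.name "ml-pipeline-bot"
$ git config user.email "ml-pipeline-bot@your-org.com"
$ git add models/fraud_detector.joblib.dvc
$ git commit -m "chore(model): update fraud_detector artifact pointer"
$ git push https://$GIT_TOKEN@github.com/your-org/ml-system-repo.git HEAD:staging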
The ArgoCD Orchestration Layer: An End-to-End Vision
ArgoCD is the conductor that orchestrates this entire symphony. We use an ApplicationSet to manage the deployment across different environments based on Git branches or directories.
argocd/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: ml-system
namespace: argocd
spec:
generators:
- git:
repoURL: https://github.com/your-org/ml-system-repo.git
revision: HEAD
directories:
- path: infra/*
template:
metadata:
name: 'ml-system-{{path.basename}}' # e.g., ml-system-staging
spec:
project: default
source:
repoURL: https://github.com/your-org/ml-system-repo.git
targetRevision: HEAD
path: '{{path}}'
kustomize: {}
destination:
server: https://kubernetes.default.svc
namespace: ml-ops
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
This ApplicationSet will automatically create an ArgoCD Application for each directory inside infra/ (e.g., staging, production).
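A quick way to confirm the generated Applications and their sync state, sketched with the argocd CLI (assuming it is installed and logged in to your ArgoCD instance):

# List the Applications generated from the infra/ directories
$ argocd app list | grep ml-system
# Inspect the sync and health status of the staging application
$ argocd app get ml-system-staging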
The Complete, Atomically Consistent Workflow
Let's trace the full lifecycle of adding a new feature:
1. Branch: A data scientist creates a branch, feature/add-account-age. They:
* Uncomment the account_age_days feature in features/definitions/user_features.py.
* Update the training script apps/training/train.py to use this new feature.
* Push the branch and create a PR.
2. Merge to Staging: The PR is reviewed and merged into the staging branch. ArgoCD, watching this branch, detects a change.
* Sync Wave 1 (Hooks): ArgoCD first runs the feast-apply-job because it's a Sync hook. This job updates the staging Feast registry with the new account_age_days feature definition.
* Sync Wave 2 (Workflows): After the hook succeeds, ArgoCD triggers the training Argo Workflow. This workflow:
  * Checks out the staging branch commit.
  * Runs the dvc-pull initContainer, which pulls the data corresponding to this commit.
  * Runs the training script. It connects to the staging Feast service, which now knows about account_age_days, and successfully uses it for training.
  * Versions the new model artifact with dvc add and dvc push.
  * In a subsequent step, commits the new models/fraud_detector.joblib.dvc file back to the staging branch.
* Sync Wave 3 (Deployments): ArgoCD deploys the inference service, pointing it to the newly trained model version.
3. Promote to Production: Once validated, the staging branch is merged into main. ArgoCD will perform the exact same sequence of operations, but targeted at the production infrastructure and feature store.
Because the feature store update, the data version pull, and the model training are all orchestrated declaratively from a single Git commit, we have achieved atomic consistency. It is impossible to deploy a model version without its corresponding feature definitions and data lineage being in place first.
Edge Cases and Performance Considerations
This architecture is powerful but introduces complexities that require careful management.
* Schema Evolution and Rollbacks: What happens if you need to roll back a change? A git revert is your primary tool. Reverting the merge commit will trigger ArgoCD to re-run the feast-apply-job with the old feature definitions, effectively rolling back the feature store schema. It will then re-trigger the training pipeline, which will check out the reverted code, pull the old data version via DVC, and rebuild/re-deploy the previous model version. The key is ensuring your feast apply process can handle an older definition being applied over a newer one (Feast is generally good at this).
* Materialization Latency: When a new feature is applied via feast apply, Feast needs to backfill historical data for it, a process called materialization. This can take minutes or hours. The training pipeline must not start until materialization is complete. A robust solution involves modifying the feast-apply-job to be an Argo Workflow itself. This workflow would run feast apply, then poll the status of the materialization job using the Feast SDK, only exiting successfully once the data is ready. The main training workflow would be downstream of this materialization workflow (see the sketch after this list).
* DVC Performance at Scale: For terabyte-scale datasets, dvc pull can be a bottleneck. Consider these optimizations:
  * Shared Volumes: In Kubernetes, use a ReadWriteMany (RWX) volume (like NFS or EFS) as a DVC cache across multiple workflow pods. Subsequent runs won't need to re-download data (a configuration sketch follows this list).
  * Direct Data Access: For data lakes like Delta Lake or Iceberg, you might not need to dvc pull the data at all. DVC can version the pointer to a specific table version or snapshot, and your training job (e.g., a Spark job) can read directly from the data lake using that version identifier. DVC's role shifts from data transport to data pointer versioning.
* Breaking Feature Changes: Deleting a feature or changing its data type is a breaking change. This requires a more careful, multi-step deployment. You might first deploy a new version of the inference service that no longer requests the old feature, then in a subsequent commit, remove the feature from the Feast definition. This prevents production failures where the service requests a feature that has just been deleted.
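As referenced above, a minimal sketch of the materialization gate. With the open-source Feast CLI, materialize-incremental only returns once the backfill is finished, so the gate can be as simple as command ordering; a managed or asynchronous Feast deployment would instead poll via the SDK:

# Gate step for the feast-apply workflow (sketch)
$ cd /src/features
$ feast apply
# Backfill the online store up to now; this command blocks until materialization completes
$ feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
$ echo "Feature definitions applied and materialized; training may proceed."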
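And a sketch of the shared-cache optimization for the DVC bottleneck, assuming an RWX volume is already mounted at /mnt/dvc-cache (a hypothetical path):

# Point DVC at the shared cache on the RWX volume and allow group access across pods
$ dvc cache dir /mnt/dvc-cache
$ dvc config cache.shared group
# Link workspace files out of the cache instead of copying them
$ dvc config cache.type symlink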
Conclusion: True Reproducibility as a Solved Problem
By rejecting the traditional separation of code, data, and infrastructure, we have constructed a system where the Git repository is the ultimate, holistic source of truth. The integration of a declarative feature store (Feast), immutable data versioning (DVC), and a GitOps controller (ArgoCD) elevates MLOps from a series of imperative scripts to a truly declarative, auditable, and reproducible practice.
This architecture isn't simple, but it directly addresses the most challenging and costly failure modes in production ML. For senior engineers tasked with building reliable and maintainable ML systems, this GitOps-driven approach provides a blueprint for moving past the chaos of artifact management and achieving the same level of deterministic control we expect from modern software engineering.