Production-Ready Optimistic Locking in DynamoDB with Conditional Writes
The Inevitability of Race Conditions in High-Throughput Systems
In any distributed system operating at scale, the classic read-modify-write cycle is a ticking time bomb for data integrity. Consider an inventory management system built on DynamoDB: two concurrent requests attempt to decrement the stock count for the last available item. Without a concurrency control mechanism, both requests read the stock count as '1', both calculate the new stock as '0' in their respective processes, and both attempt to write '0' back to the database. The first write succeeds. The second write, unaware of the first, also succeeds, overwriting the same value. The result? Only one item existed, but the system accepted two orders, guaranteeing that one of them cannot be fulfilled. This is a classic lost update anomaly.
Traditional relational databases solve this with pessimistic locking (e.g., SELECT ... FOR UPDATE), where a row is locked for the duration of a transaction, forcing other transactions to wait. This approach guarantees consistency but introduces contention, reduces throughput, and is fundamentally at odds with DynamoDB's architecture, which is optimized for massive parallelism and low-latency access patterns.
The DynamoDB-native solution is Optimistic Concurrency Control (OCC), also known as optimistic locking. Instead of preventing concurrent access, OCC allows it and provides a mechanism to verify that the data has not been modified by another process since it was read. If a conflict is detected, the transaction is aborted, and the client is responsible for handling the conflict, typically by retrying the operation.
This article is not an introduction to OCC. It is a detailed guide for implementing a robust, production-ready OCC pattern in DynamoDB using its core features: version attributes, Conditional Writes, and the explicit handling of ConditionalCheckFailedException. We will dissect the implementation, build resilient retry mechanisms, and explore its application in complex transactional scenarios.
Anatomy of the Lost Update Anomaly
Let's formalize the inventory problem with a concrete DynamoDB item structure and a naive implementation that demonstrates the flaw.
Item Structure:
* productId (Partition Key, String)
* stockCount (Number)
* lastUpdatedAt (String, ISO 8601)
Our goal is to implement a function, decrease_inventory, that safely decrements stockCount.
A naive, and incorrect, implementation using Python's boto3 might look like this:
# WARNING: This code contains a race condition and is NOT for production use.
import boto3
def naive_decrease_inventory(table, product_id: str, quantity_to_decrease: int):
"""This function is vulnerable to lost updates."""
# 1. Read Phase
response = table.get_item(Key={'productId': product_id})
item = response.get('Item')
if not item:
raise ValueError(f"Product {product_id} not found.")
current_stock = int(item.get('stockCount', 0))
if current_stock < quantity_to_decrease:
raise ValueError("Insufficient stock.")
# 2. Modify Phase (in-memory)
new_stock = current_stock - quantity_to_decrease
# 3. Write Phase
table.put_item(
Item={
'productId': product_id,
'stockCount': new_stock,
'lastUpdatedAt': '...' # current timestamp
}
)
print(f"Successfully updated stock for {product_id} to {new_stock}")
return new_stock
Visualizing the Race Condition:
Let's assume stockCount is 10.
| Timeline | Process A (Request for 3 items) | Process B (Request for 5 items) | Database State (stockCount) |
|---|---|---|---|
| T1 | get_item -> reads stockCount: 10 | | 10 |
| T2 | | get_item -> reads stockCount: 10 | 10 |
| T3 | Calculates new_stock = 10 - 3 = 7 | | 10 |
| T4 | | Calculates new_stock = 10 - 5 = 5 | 10 |
| T5 | put_item with stockCount: 7 | | 7 |
| T6 | | put_item with stockCount: 5 | 5 |
Outcome: The final stockCount is 5. We have sold 8 items (3 + 5), but the database reflects a reduction of only 5. Process A's update was completely lost. This is the exact scenario OCC is designed to prevent.
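The interleaving above can be reproduced deterministically without touching DynamoDB at all. The sketch below uses a plain dict as a stand-in for the table and replays the six steps by hand, demonstrating the lost update:

```python
# A deterministic replay of the interleaving above, using a plain dict
# as a stand-in for the DynamoDB table (no AWS access required).
table = {'PROD123': {'stockCount': 10}}

# T1/T2: both processes read the same snapshot
stock_seen_by_a = table['PROD123']['stockCount']  # 10
stock_seen_by_b = table['PROD123']['stockCount']  # 10

# T3/T4: each computes its new value in memory
new_stock_a = stock_seen_by_a - 3  # 7
new_stock_b = stock_seen_by_b - 5  # 5

# T5/T6: both blind writes succeed; B silently overwrites A
table['PROD123']['stockCount'] = new_stock_a
table['PROD123']['stockCount'] = new_stock_b

print(table['PROD123']['stockCount'])  # 5, even though 8 items were sold
```

Eight items were "sold" but the store only records a reduction of five, exactly as in the timeline above.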
The Versioned Optimistic Locking Pattern
The solution is to introduce a version attribute. A common convention is to name it _version or version. This is simply a number that is incremented with every successful write.
New Item Structure:
* productId (PK, String)
* stockCount (Number)
* _version (Number)
* lastUpdatedAt (String)
The modified, correct workflow is as follows:
1. Read the item, including its current _version.
2. Compute the new state and increment the _version number in memory.
3. Issue a put_item (or update_item) operation with a ConditionExpression. This expression asserts that the _version of the item in the database must match the _version that was read in step 1.
If the condition is met, the write succeeds. If another process modified the item between our read and write (the race), its own write would have incremented the _version. Our condition will then fail, and DynamoDB will reject our write, throwing a ConditionalCheckFailedException.
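The check-then-write semantics at the heart of this workflow can be modeled in a few lines of plain Python. In this sketch, an in-memory dict stands in for the table and a hypothetical ConditionalCheckFailed exception stands in for DynamoDB's ConditionalCheckFailedException:

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

def conditional_put(store: dict, key: str, new_item: dict, expected_version: int):
    """Write new_item only if the stored _version matches expected_version."""
    current = store.get(key, {'_version': 0})
    if current['_version'] != expected_version:
        raise ConditionalCheckFailed(f"stale version {expected_version}")
    # The write and the version bump happen together, atomically.
    store[key] = {**new_item, '_version': expected_version + 1}

store = {'PROD123': {'stockCount': 10, '_version': 3}}
conditional_put(store, 'PROD123', {'stockCount': 7}, expected_version=3)  # succeeds
try:
    # A second writer that also read version 3 now loses the race.
    conditional_put(store, 'PROD123', {'stockCount': 5}, expected_version=3)
except ConditionalCheckFailed:
    print("second writer lost the race")
```

In real DynamoDB the version comparison and the write are performed atomically on the server, which is what makes the pattern safe.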
Production-Grade Implementation
Here is the corrected implementation incorporating the versioning pattern.
import boto3
from botocore.exceptions import ClientError
import time
import random
dynamodb = boto3.resource('dynamodb')
class InsufficientStockError(Exception):
pass
class ConcurrencyError(Exception):
pass
def decrease_inventory_with_occ(table_name: str, product_id: str, quantity_to_decrease: int, max_retries: int = 5):
"""Safely decrements inventory using optimistic locking with versioning."""
table = dynamodb.Table(table_name)
retries = 0
while retries < max_retries:
try:
# 1. Read Phase
response = table.get_item(Key={'productId': product_id})
item = response.get('Item')
if not item:
raise ValueError(f"Product {product_id} not found.")
current_stock = int(item.get('stockCount', 0))
expected_version = int(item.get('_version', 0))
if current_stock < quantity_to_decrease:
raise InsufficientStockError("Insufficient stock.")
# 2. Modify Phase (in-memory)
new_stock = current_stock - quantity_to_decrease
new_version = expected_version + 1
# 3. Conditional Write Phase
print(f"Attempt {retries + 1}: Trying to update {product_id} from version {expected_version} to {new_version}")
table.put_item(
Item={
'productId': product_id,
'stockCount': new_stock,
'_version': new_version,
'lastUpdatedAt': '...' # current timestamp
},
ConditionExpression='#v = :ev',
ExpressionAttributeNames={'#v': '_version'},
ExpressionAttributeValues={':ev': expected_version}
)
print(f"Successfully updated stock for {product_id} to {new_stock}. Final version: {new_version}")
return new_stock
except ClientError as e:
if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
print(f"Concurrency conflict for {product_id}. Version {expected_version} is stale. Retrying...")
retries += 1
# Exponential backoff with jitter
sleep_time = (2 ** retries) * 0.1 + random.uniform(0, 0.1)
time.sleep(sleep_time)
else:
# Re-raise other DynamoDB errors
raise
except (InsufficientStockError, ValueError) as e:
# Business logic errors that should not be retried
raise e
raise ConcurrencyError(f"Failed to update {product_id} after {max_retries} retries due to high contention.")
Dissecting the ConditionExpression:
* ConditionExpression='#v = :ev': This is the core of the lock. It instructs DynamoDB to only proceed with the put_item operation IF the condition is true.
* ExpressionAttributeNames={'#v': '_version'}: This is a best practice to avoid conflicts with DynamoDB reserved words. We map the placeholder #v to the actual attribute name _version.
* ExpressionAttributeValues={':ev': expected_version}: This maps the placeholder :ev to the actual value of the version we read from the database. It's crucial that this value is a number, matching the type of the _version attribute.
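One subtlety worth calling out: the condition #v = :ev can never hold for an item that has not been written yet, so a bare version check blocks the very first write. A common variant widens the condition to also allow the write when the item is absent. The helper below (build_versioned_put_kwargs is a hypothetical name, shown as a sketch) builds the keyword arguments for such a call:

```python
def build_versioned_put_kwargs(item: dict, expected_version: int) -> dict:
    """Build kwargs for a versioned conditional put_item call.

    The attribute_not_exists branch permits the first-ever write of an
    item (conventionally attempted with expected_version = 0), while the
    #v = :ev branch enforces optimistic locking on all later writes.
    """
    return {
        'Item': {**item, '_version': expected_version + 1},
        'ConditionExpression': 'attribute_not_exists(productId) OR #v = :ev',
        'ExpressionAttributeNames': {'#v': '_version'},
        'ExpressionAttributeValues': {':ev': expected_version},
    }

kwargs = build_versioned_put_kwargs(
    {'productId': 'PROD123', 'stockCount': 10}, expected_version=0
)
print(kwargs['ConditionExpression'])
```

These kwargs can be splatted directly into table.put_item(**kwargs) in place of the inline arguments shown above.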
`ConditionalCheckFailedException`: A Feature, Not a Bug
The most critical mindset shift for engineers new to this pattern is understanding that ConditionalCheckFailedException is not an error to be logged and alerted on. It is an expected, normal part of the control flow. It is the signal from DynamoDB that a race condition occurred and your process lost. Your application logic must handle this exception gracefully.
Our implementation above demonstrates the standard response: a client-side retry loop.
Designing a Resilient Retry Strategy
A naive while True retry loop is dangerous. In a high-contention scenario, it can lead to a thundering herd problem, where multiple clients are retrying aggressively, increasing database load and the probability of further collisions. A production-grade retry mechanism must include:
* A bounded number of attempts, e.g. max_retries = 5, so that the operation fails fast under sustained contention instead of spinning forever.
* Exponential backoff with jitter, so that competing clients spread their retries out over time instead of colliding again in lockstep. Our implementation uses (2 ** retries) * 0.1 + random.uniform(0, 0.1) for this purpose.
This retry logic should be encapsulated within your data access layer, making it transparent to the higher-level business logic.
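The backoff formula from the implementation can be factored into a standalone helper, which makes the schedule easy to inspect and unit test in isolation:

```python
import random

def backoff_seconds(attempt: int, base: float = 0.1, jitter: float = 0.1) -> float:
    """Exponential backoff with jitter: (2 ** attempt) * base, plus a
    uniform random component in [0, jitter) to de-synchronize clients."""
    return (2 ** attempt) * base + random.uniform(0, jitter)

# Attempts 1..5 back off roughly 0.2s, 0.4s, 0.8s, 1.6s, 3.2s, plus jitter.
for attempt in range(1, 6):
    print(f"attempt {attempt}: sleep ~{backoff_seconds(attempt):.2f}s")
```

The jitter matters more than it looks: without it, clients that collided once tend to wake up simultaneously and collide again.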
Advanced Scenarios and Edge Cases
While the versioned put_item covers many use cases, real-world systems often have more complex requirements.
Edge Case 1: Atomic Counters with `UpdateItem`
For simple atomic increments or decrements, the read-modify-write cycle can be optimized away. DynamoDB's UpdateItem operation can perform atomic operations directly on the server.
def atomic_decrease_inventory(table, product_id: str, quantity: int):
"""Atomically decreases stock, but only if sufficient stock exists."""
try:
response = table.update_item(
Key={'productId': product_id},
UpdateExpression='SET stockCount = stockCount - :q',
# Condition ensures we don't go below zero
ConditionExpression='stockCount >= :q',
ExpressionAttributeValues={':q': quantity},
ReturnValues='UPDATED_NEW'
)
return response['Attributes']['stockCount']
except ClientError as e:
if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
# This means stockCount was less than quantity
raise InsufficientStockError("Insufficient stock for atomic update.")
else:
raise
This is highly efficient. However, it doesn't solve the full OCC problem. What if you need to update the stockCount AND another attribute (e.g., status) based on the original state of the item? In that case, you still need the versioning pattern. You can combine them:
# ... inside the retry loop ...
table.update_item(
Key={'productId': product_id},
UpdateExpression='SET stockCount = :ns, #v = :nv, #s = :new_status',
ConditionExpression='#v = :ev',
ExpressionAttributeNames={
'#v': '_version',
'#s': 'status'
},
ExpressionAttributeValues={
':ns': new_stock,
':nv': new_version,
':new_status': 'LOW_STOCK', # Example of another change
':ev': expected_version
}
)
Here, the update_item call performs the entire state transition atomically, but only if the version matches. This is often more efficient than put_item as it only sends the changed attributes over the wire.
Edge Case 2: Transactions with `TransactWriteItems`
Optimistic locking truly shines when coordinating changes across multiple items. Imagine a user placing an order. This requires two operations that must succeed or fail together:
- Decrement the inventory for the product.
- Create a new order item for the user.
DynamoDB's TransactWriteItems allows you to group up to 100 write operations into a single, all-or-nothing transaction. Each operation within the transaction can have its own ConditionExpression.
Item Structures:
* Products Table: productId (PK), stockCount, _version
* Orders Table: orderId (PK), userId, productId, status
import uuid
def place_order_transaction(product_id: str, user_id: str, quantity: int):
transact_client = boto3.client('dynamodb')
products_table = dynamodb.Table('Products')
# In a real app, this would be inside a retry loop
# For brevity, showing a single attempt.
# 1. READ PHASE (outside the transaction)
product_response = products_table.get_item(Key={'productId': product_id})
product = product_response.get('Item')
if not product or product['stockCount'] < quantity:
raise InsufficientStockError("Insufficient stock.")
expected_version = int(product.get('_version', 0))
new_stock = int(product['stockCount']) - quantity
new_version = expected_version + 1
new_order_id = str(uuid.uuid4())
# 2. TRANSACTIONAL WRITE PHASE
try:
transact_client.transact_write_items(
TransactItems=[
{
'Update': {
'TableName': 'Products',
'Key': {'productId': {'S': product_id}},
'UpdateExpression': 'SET stockCount = :ns, #v = :nv',
'ConditionExpression': '#v = :ev',
'ExpressionAttributeNames': {'#v': '_version'},
'ExpressionAttributeValues': {
':ns': {'N': str(new_stock)},
':nv': {'N': str(new_version)},
':ev': {'N': str(expected_version)}
}
}
},
{
'Put': {
'TableName': 'Orders',
'Item': {
'orderId': {'S': new_order_id},
'userId': {'S': user_id},
'productId': {'S': product_id},
'status': {'S': 'PENDING'}
},
# Ensure this order doesn't already exist (idempotency)
'ConditionExpression': 'attribute_not_exists(orderId)'
}
}
]
)
print(f"Order {new_order_id} placed successfully.")
return new_order_id
except ClientError as e:
if e.response['Error']['Code'] == 'TransactionCanceledException':
# The entire transaction was rolled back.
# Check the CancellationReasons to see which condition failed.
reasons = e.response['CancellationReasons']
print(f"Transaction failed: {reasons}")
# One of the reasons will be 'ConditionalCheckFailed'.
# This is the signal to retry the entire read-and-transact operation.
raise ConcurrencyError("Order transaction failed due to contention.")
else:
raise
In this advanced pattern, if another process updates the product's stock (and thus its version) between our read and our transaction, the Update operation's condition will fail. This causes the entire transaction to be rolled back atomically. The Put operation for the new order will never be committed. This guarantees consistency across tables without complex distributed locks.
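When a transaction is cancelled, boto3 attaches one entry in CancellationReasons per item in the transaction, in order; entries for operations that did not cause the cancellation carry the code 'None'. A small helper (a sketch operating on that structure, useful inside the except branch above) pinpoints which operations failed their condition:

```python
def conditional_failures(cancellation_reasons: list) -> list:
    """Return indices of transaction items whose ConditionExpression
    failed, based on the CancellationReasons list boto3 attaches to a
    TransactionCanceledException."""
    return [
        i for i, reason in enumerate(cancellation_reasons)
        if reason.get('Code') == 'ConditionalCheckFailed'
    ]

# Example: the version check on item 0 (the Update) failed; item 1 was fine.
reasons = [
    {'Code': 'ConditionalCheckFailed', 'Message': 'The conditional request failed'},
    {'Code': 'None'},
]
print(conditional_failures(reasons))  # [0]
```

Logging which index failed is invaluable in production: a failure at index 0 means version contention (retry), while a failure at index 1 would mean a duplicate orderId (do not retry).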
Performance and Cost Considerations
* Failed Writes Consume Capacity: A crucial point to remember is that a failed conditional write is not free: it still consumes write capacity, based on the size of the item it would have written. If your system has extremely high contention, you could be paying for a large number of rejected writes. This is a signal that you may need to reconsider your data model or access patterns.
* Monitoring Contention: Monitor the ConditionalCheckFailedRequests metric in CloudWatch for your DynamoDB tables. A persistently high number indicates significant contention. This is your primary tool for understanding the level of concurrency conflicts in your system.
* The Cost of Reads: The OCC pattern requires at least one GetItem call before every write attempt. In a low-contention environment, this is a small overhead. In a high-contention scenario requiring multiple retries, you will perform multiple reads for a single logical update, increasing your Read Capacity Unit (RCU) consumption.
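To reason about this read/write amplification, a simple (and admittedly idealized) model treats each attempt as failing its condition independently with probability p; the expected number of attempts per logical update is then geometric, 1/(1-p), and each attempt costs one read plus one write:

```python
def expected_attempts(p_conflict: float) -> float:
    """Expected attempts per logical update, assuming each attempt
    independently fails its condition with probability p_conflict
    (a geometric model; real contention is usually correlated)."""
    assert 0.0 <= p_conflict < 1.0
    return 1.0 / (1.0 - p_conflict)

# At a 10% conflict rate, ~1.11 reads and writes per logical update.
# At 50%, every update costs 2 reads and 2 writes on average, meaning
# half of the consumed write capacity goes to rejected writes.
print(expected_attempts(0.1))
print(expected_attempts(0.5))
```

When this number climbs, the fix is usually structural (sharding hot items, switching to atomic counters) rather than more aggressive retrying.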
Testing Your Concurrency Logic
Never assume your OCC logic is correct without testing it under concurrent load. You can simulate this locally using Python's multiprocessing or threading modules.
Here's a conceptual test setup:
from multiprocessing import Pool, cpu_count
# Assume you have a function setup_test_table() that creates a table
# with a product 'PROD123' with stockCount = 100 and _version = 1
TABLE_NAME = 'InventoryTest'
def worker_task(task_id):
"""A single worker trying to decrement inventory."""
try:
print(f"Worker {task_id} starting...")
final_stock = decrease_inventory_with_occ(
table_name=TABLE_NAME,
product_id='PROD123',
quantity_to_decrease=1
)
print(f"Worker {task_id} finished successfully. Stock is now {final_stock}")
return 'SUCCESS'
except Exception as e:
print(f"Worker {task_id} failed: {e}")
return 'FAILURE'
if __name__ == '__main__':
# Setup initial state in DynamoDB
# ... setup_test_table(initial_stock=100)
num_workers = 20 # More workers than available stock to force contention
with Pool(processes=cpu_count()) as pool:
results = pool.map(worker_task, range(num_workers))
success_count = results.count('SUCCESS')
print(f"\nTotal successful decrements: {success_count}")
# Verify final state in DynamoDB
# The final stockCount should be 100 - success_count
# The final _version should be 1 + success_count
# ... verification logic ...
Running this test will produce logs showing the retry attempts and ConditionalCheckFailedException being handled, proving that your locking mechanism correctly serializes access and prevents lost updates.
Conclusion: Embrace Controlled Failure
Optimistic Concurrency Control in DynamoDB is more than a feature; it's a design philosophy. It requires shifting from a mindset of preventing concurrent access to one of embracing and managing the resulting conflicts. By instrumenting your data with a version attribute and using ConditionExpression, you gain a powerful, scalable mechanism to ensure data integrity without sacrificing the performance and parallelism that make DynamoDB a compelling choice for high-throughput applications.
The key takeaways for senior engineers are:
* ConditionalCheckFailedException is the success path for handling contention. It must be caught and handled with a robust retry strategy.
* The versioning pattern composes naturally with TransactWriteItems, allowing for complex, atomic operations across your data model without distributed locks.
* Monitor ConditionalCheckFailedRequests. This metric is your window into the level of contention in your application and is a critical input for performance tuning.
By mastering this pattern, you are equipping your applications to handle the concurrency inherent in distributed systems, ensuring that your data remains correct and consistent, even under extreme load.