PostgreSQL Advisory Locks for Fault-Tolerant Distributed Cron Jobs
The Ubiquitous Challenge: Cron Jobs in a Scaled-Out World
In a monolithic, single-instance architecture, scheduling tasks is trivial. A standard cron daemon or an in-process scheduler like node-cron works flawlessly. However, the moment you containerize and horizontally scale that service to N instances for high availability and load distribution, this simplicity shatters. The same cron job, defined in the same codebase, will now attempt to execute simultaneously on all N instances. For tasks like 'send a daily summary email' or 'process a nightly data aggregation batch', this leads to disastrous duplication of work, data corruption, and wasted resources.
The knee-jerk reaction is often to build a simplistic locking mechanism. A common anti-pattern involves adding a is_running boolean or a locked_at timestamp to a database table. A worker tries to acquire the lock by setting the flag within a transaction. This approach is fraught with peril and fundamentally broken in a concurrent environment:
* Race conditions: Two workers can read the false flag simultaneously before either has a chance to write true, leading to both proceeding with the task.
To solve this correctly, we need an atomic, database-native locking primitive. While external systems like ZooKeeper, etcd, or Redis (with the Redlock algorithm) can provide distributed locking, they introduce significant operational overhead and architectural complexity. For services already relying on PostgreSQL, a far more elegant, lightweight, and robust solution exists: Advisory Locks.
This article is not an introduction to advisory locks. It is a deep dive into a production-ready pattern for leveraging them to build a fault-tolerant distributed scheduler, focusing on the nuanced differences between lock types and handling real-world failure modes.
Why Advisory Locks Are Uniquely Suited for This Task
PostgreSQL's advisory locks are a cooperative, application-level locking mechanism. Unlike standard row or table locks, they do not lock any specific data tuple. Instead, they lock an arbitrary bigint or a pair of integer values that exist only in the lock manager's shared memory. This is their superpower for our use case:
* Decoupled from Data: We can acquire a lock for a conceptual entity—like "the-daily-report-job"—without needing a corresponding table or row. This means our locking mechanism has zero impact on data access performance.
* High Performance: Since they operate entirely in memory, acquiring and releasing advisory locks is exceptionally fast, typically measured in microseconds.
* Database-Native Atomicity: They are managed by PostgreSQL itself, inheriting its battle-tested concurrency control and transactional guarantees.
PostgreSQL provides several functions for managing these locks, but for our scheduler, the most critical distinction is between session-level and transaction-level locks, particularly their non-blocking variants.
* pg_try_advisory_lock(key): Attempts to acquire a session-level lock. Returns true if successful, false if the lock is already held by another session. The lock persists until explicitly released with pg_advisory_unlock(key) or until the database session ends.
* pg_try_advisory_xact_lock(key): Attempts to acquire a transaction-level lock. Returns true if successful, false otherwise. The lock is automatically released at the end of the current transaction (either COMMIT or ROLLBACK).
This automatic release mechanism is the cornerstone of our fault-tolerant pattern.
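To make the lifecycle difference concrete, here is a minimal sketch using the node-postgres client (the lock key 42 is an arbitrary example value). Note that transaction-level locks have no explicit unlock function at all; the transaction's end is the release:
import { Client } from 'pg';

async function demoLockLifecycles(client: Client): Promise<void> {
  // Session-level: held until pg_advisory_unlock() is called or the session ends.
  await client.query('SELECT pg_try_advisory_lock($1)', [42]);
  await client.query('SELECT pg_advisory_unlock($1)', [42]);

  // Transaction-level: held only until the surrounding transaction ends.
  await client.query('BEGIN');
  await client.query('SELECT pg_try_advisory_xact_lock($1)', [42]);
  await client.query('COMMIT'); // lock released here; there is no manual unlock for xact locks
}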
Pattern 1: A Basic Scheduler with Session-Level Locks
Let's start with a foundational implementation using session-level locks. This pattern is suitable for very short-lived tasks where the risk of a worker crash during execution is low and a temporary job outage due to a stale lock is acceptable.
We'll build a higher-order function in TypeScript that wraps a job's logic, ensuring it only executes if it can acquire a unique lock.
Lock ID Generation
First, we need a deterministic way to convert a job's name into a unique bigint key. A non-cryptographic hash is perfect for this. We'll use a simple string hashing function. For production, you might consider a more robust hashing library like fnv1a to generate a 64-bit integer, but this illustrates the principle.
// A simple string to 32-bit integer hash function.
// PostgreSQL advisory locks can use a bigint (64-bit), so a more robust hash is recommended for production
// to avoid collisions, but this serves for demonstration.
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str.charCodeAt(i);
    hash = (hash << 5) - hash + char;
    hash |= 0; // Convert to 32bit integer
  }
  return Math.abs(hash); // Use a positive integer for the lock key
}
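If you want to use the full 64-bit key space, one option (a sketch, not the fnv1a library itself) is to derive the key from a cryptographic digest using Node's built-in crypto module. Because JavaScript numbers cannot safely represent every 64-bit integer, the key is kept as a BigInt and sent to PostgreSQL as a string with an explicit cast:
import { createHash } from 'crypto';

// Derive a signed 64-bit lock key from a job name (illustrative alternative to fnv1a).
function lockKeyFor(jobName: string): string {
  const digest = createHash('sha256').update(jobName).digest();
  // Interpret the first 8 bytes of the digest as a signed 64-bit integer.
  return digest.readBigInt64BE(0).toString();
}

// Hypothetical usage: note the explicit ::bigint cast on the parameter.
// await client.query('SELECT pg_try_advisory_xact_lock($1::bigint)', [lockKeyFor('daily-report')]);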
The Scheduler Wrapper
We'll use the pg library for Node.js. The wrapper will accept a connection pool, a unique job name, and the async function to execute.
import { Pool } from 'pg';
// (simpleHash function from above)
async function withDistributedLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
  const lockId = simpleHash(jobName);
  const client = await pool.connect();
  try {
    // Attempt to acquire the lock without blocking
    const { rows } = await client.query('SELECT pg_try_advisory_lock($1)', [lockId]);
    const lockAcquired = rows[0].pg_try_advisory_lock;
    if (lockAcquired) {
      console.log(`[${new Date().toISOString()}] Lock for job '${jobName}' (ID: ${lockId}) acquired. Running job.`);
      try {
        // Execute the actual job logic
        await jobFn();
      } catch (error) {
        console.error(`[${new Date().toISOString()}] Error executing job '${jobName}':`, error);
      } finally {
        // CRITICAL: Always release the lock
        await client.query('SELECT pg_advisory_unlock($1)', [lockId]);
        console.log(`[${new Date().toISOString()}] Lock for job '${jobName}' (ID: ${lockId}) released.`);
      }
    } else {
      console.log(`[${new Date().toISOString()}] Could not acquire lock for job '${jobName}' (ID: ${lockId}). Another instance is running.`);
    }
  } finally {
    // Release the client back to the pool
    client.release();
  }
}
// --- Example Usage ---
const pool = new Pool({
  connectionString: 'postgresql://user:password@host:port/database'
});
const dailyReportJob = async () => {
  console.log('Generating daily report...');
  // Simulate work
  await new Promise(resolve => setTimeout(resolve, 5000));
  console.log('Daily report generation complete.');
};
// In a real app, this would be triggered by a cron scheduler (e.g., node-cron)
// setInterval will simulate multiple instances trying to run the job every 10 seconds
setInterval(() => {
  withDistributedLock(pool, 'daily-report', dailyReportJob).catch(console.error);
}, 10000);
Analysis of the Session-Level Pattern
Pros:
* Simple to implement.
* Low overhead; no transactions are explicitly managed by our code.
Cons (Critical):
* No Fault Tolerance: If the Node.js process crashes after acquiring the lock but *before* the finally block that releases it, the session-level lock is orphaned. It will persist until the database server detects the dead TCP connection and terminates the backend process, which, depending on settings like tcp_keepalives_idle, can take minutes or even hours by default, effectively disabling the job for that duration.
* Connection Pooling Issues: The pg_advisory_unlock call *must* be executed on the exact same connection that acquired the lock. While pool.connect() and client.release() manage this correctly for a single execution, complex application logic could inadvertently use a different client from the pool, failing to release the lock.
This pattern is simply not robust enough for most production systems. It highlights the problem that transaction-level locks are designed to solve.
Pattern 2: The Fault-Tolerant Scheduler with Transaction-Level Locks
This is the definitive, production-grade pattern. By scoping the lock to a database transaction, we leverage PostgreSQL's own lifecycle management to guarantee lock release, even in the face of catastrophic worker failure.
The logic is slightly different:
- Start a database transaction.
- Attempt to acquire the lock (pg_try_advisory_xact_lock).
- If successful, run the job.
- If the job succeeds, COMMIT the transaction. This releases the lock.
- If the job fails, ROLLBACK the transaction. This also releases the lock.
- If the worker process dies mid-job, the database connection is severed. PostgreSQL will automatically roll back the orphaned transaction, which—you guessed it—also releases the lock.
This provides an ironclad guarantee of lock cleanup.
The Production-Ready Implementation
import { Pool } from 'pg';
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str.charCodeAt(i);
    hash = (hash << 5) - hash + char;
    hash |= 0; // Convert to 32bit integer
  }
  return Math.abs(hash);
}
// Advanced implementation using transaction-level locks
async function withFaultTolerantLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
  const lockId = simpleHash(jobName);
  const client = await pool.connect();
  try {
    // Start a transaction
    await client.query('BEGIN');
    // Attempt to acquire the transaction-level lock
    const { rows } = await client.query('SELECT pg_try_advisory_xact_lock($1)', [lockId]);
    const lockAcquired = rows[0].pg_try_advisory_xact_lock;
    if (lockAcquired) {
      console.log(`[${new Date().toISOString()}] Tx-Lock for job '${jobName}' (ID: ${lockId}) acquired. Running job.`);
      try {
        // Execute job logic within the transaction's scope
        await jobFn();
        // If successful, commit the transaction to release the lock
        await client.query('COMMIT');
        console.log(`[${new Date().toISOString()}] Job '${jobName}' completed successfully. Lock released via COMMIT.`);
      } catch (error) {
        console.error(`[${new Date().toISOString()}] Error in job '${jobName}', rolling back.`, error);
        // If an error occurs, roll back to release the lock
        await client.query('ROLLBACK');
        // Re-throw the error so the caller knows the job failed
        throw error;
      }
    } else {
      // Lock not acquired, another worker is active. This is not an error.
      // We must end the transaction we started.
      await client.query('ROLLBACK'); // or COMMIT, it doesn't matter as we did nothing.
      console.log(`[${new Date().toISOString()}] Could not acquire Tx-Lock for job '${jobName}' (ID: ${lockId}). Skipping.`);
    }
  } finally {
    client.release();
  }
}
// --- Example Usage ---
const pool = new Pool({
  connectionString: 'postgresql://user:password@host:port/database'
});
const criticalBatchJob = async () => {
  console.log('Starting critical batch processing...');
  await new Promise(resolve => setTimeout(resolve, 15000)); // A longer job
  // Simulate a potential failure
  if (Math.random() > 0.8) {
    throw new Error('Failed to process a critical record!');
  }
  console.log('Critical batch processing finished.');
};
// Simulate multiple instances trying to run every 20 seconds
setInterval(() => {
  console.log(`\n--- New scheduling cycle at ${new Date().toISOString()} ---`);
  withFaultTolerantLock(pool, 'critical-batch-job', criticalBatchJob)
    .catch(err => {
      // Catch errors from the job itself (after rollback)
      console.error(`[${new Date().toISOString()}] Job execution failed at the top level: ${err.message}`);
    });
}, 20000);
Now, if you run this code and manually kill the Node.js process while it's in the middle of the 15-second setTimeout, PostgreSQL will terminate the backend connection, roll back the transaction, and release the lock. The next instance that tries to run the job after 20 seconds will successfully acquire the lock.
Advanced Considerations and Edge Cases
Mastering this pattern requires understanding its interaction with the broader system.
1. Lock ID Namespacing
Using a single bigint for lock IDs across an entire database is risky. Different applications or even different parts of the same application could suffer a hash collision, causing unintended locking. PostgreSQL's advisory lock functions are overloaded to accept two integer arguments, which is the preferred method for namespacing.
pg_try_advisory_lock(classId, objectId)
* classId: A unique integer representing your application or module (e.g., a hash of 'billing-service').
* objectId: A unique integer for the specific job within that class (e.g., a hash of 'process-monthly-invoices').
Revised Lock Acquisition:
const APP_CLASS_ID = simpleHash('my-awesome-scheduler');
async function withNamespacedLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
  const jobObjectId = simpleHash(jobName);
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const { rows } = await client.query('SELECT pg_try_advisory_xact_lock($1, $2)', [APP_CLASS_ID, jobObjectId]);
    const lockAcquired = rows[0].pg_try_advisory_xact_lock;
    // ... rest of the logic
  } finally {
    client.release();
  }
}
This effectively prevents cross-application lock contention.
2. Monitoring and Debugging Stuck Locks
Even with the fault-tolerant pattern, you need a way to inspect the state of advisory locks for debugging. The pg_locks system view is your essential tool.
SELECT
  locktype,
  classid,
  objid,
  pid AS process_id,
  mode,
  granted,
  (SELECT datname FROM pg_database WHERE oid = database) AS database_name,
  (SELECT query FROM pg_stat_activity WHERE pid = pg_locks.pid) AS current_query
FROM pg_locks
WHERE locktype = 'advisory';
Running this query will show you exactly which process ID (pid) is holding which advisory lock (classid, objid). If you suspect a lock is stuck (which should be nearly impossible with the transaction-level pattern unless a transaction is genuinely long-running), you can cross-reference the pid with pg_stat_activity to see exactly what the holding process is doing. This is an invaluable diagnostic tool.
3. Interaction with Transactional Outbox Pattern
Many modern systems use the Transactional Outbox pattern to ensure reliable message publishing. If your scheduled job needs to perform database writes and publish events (e.g., to Kafka), the advisory lock pattern integrates beautifully.
The job's logic would:
- Perform its business logic (e.g., update user records).
- Insert corresponding event records into the outbox table.
All of this happens inside the same transaction that holds the advisory lock. The COMMIT at the end atomically saves the data changes and the outbox events, and releases the lock. A separate outbox processor service (which might itself use advisory locks to coordinate its workers) can then safely read from the outbox table and publish the messages.
const outboxJob = async (client: PoolClient) => { // Note: the transaction's client is passed in
  // 1. Update business data
  await client.query(`UPDATE subscriptions SET status = 'expired' WHERE expires_at < NOW()`);
  // 2. Insert event into outbox table
  await client.query(`
    INSERT INTO outbox (event_type, payload)
    VALUES ('subscription.expired', '{"details": "..."}')
  `);
};
// The wrapper needs to be modified to pass the client to the job function
// so all work happens in the same transaction.
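One way the wrapper might be adapted is sketched below; it builds on withFaultTolerantLock from Pattern 2, and the name withFaultTolerantLockTx and the job-function signature are purely illustrative (simpleHash is the helper defined earlier in this article):
import { Pool, PoolClient } from 'pg';

async function withFaultTolerantLockTx(
  pool: Pool,
  jobName: string,
  jobFn: (client: PoolClient) => Promise<void>
): Promise<void> {
  const lockId = simpleHash(jobName); // simpleHash as defined earlier
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const { rows } = await client.query('SELECT pg_try_advisory_xact_lock($1)', [lockId]);
    if (rows[0].pg_try_advisory_xact_lock) {
      try {
        // The job runs on the same client, inside the transaction that holds the lock.
        await jobFn(client);
        // Business updates, outbox rows, and the lock release commit atomically.
        await client.query('COMMIT');
      } catch (error) {
        await client.query('ROLLBACK');
        throw error;
      }
    } else {
      await client.query('ROLLBACK'); // another worker holds the lock; nothing to do
    }
  } finally {
    client.release();
  }
}

// Usage: withFaultTolerantLockTx(pool, 'expire-subscriptions', outboxJob);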
4. Performance and Scalability
* The pg_try_advisory_* functions are non-blocking. If 100 service instances try to run a job at the same time, one will get the lock instantly, and 99 will instantly get a false response. This is extremely efficient; the database is not burdened by 99 waiting connections.
* The transaction-level pattern adds the small overhead of an explicit transaction (BEGIN/COMMIT) on every scheduling attempt. For jobs that run infrequently (every minute or more), this is completely irrelevant. For jobs that need to run every second, you might measure the impact, but it's still likely to be minimal compared to the job's actual work.
When Not to Use This Pattern
Despite its power and elegance, this pattern is not a silver bullet. It's a distributed locking mechanism, not a full-fledged job queue. Consider a more robust solution if you need:
* The features of a full-fledged job queue, such as retries, queuing of missed runs, or job execution history; in that case a dedicated library is a better fit (e.g., pg-boss).
Conclusion
For the common problem of ensuring a scheduled task runs exactly once across a fleet of horizontally scaled application instances, PostgreSQL's transaction-level advisory locks provide a uniquely powerful, lightweight, and fault-tolerant solution. By binding the lock's lifecycle to a database transaction, we gain automatic cleanup in the event of worker failure, eliminating the risk of stale locks that plagues more naive implementations.
This pattern leverages the transactional guarantees of a tool already present in your stack, avoiding the introduction of new infrastructure and complexity. It is a testament to the depth of features within PostgreSQL and a prime example of a senior-level engineering pattern: simple, robust, and deeply aware of underlying system behaviors and failure modes.