PostgreSQL Advisory Locks for Fault-Tolerant Distributed Cron Jobs

17 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Ubiquitous Challenge: Cron Jobs in a Scaled-Out World

In a monolithic, single-instance architecture, scheduling tasks is trivial. A standard cron daemon or an in-process scheduler like node-cron works flawlessly. However, the moment you containerize and horizontally scale that service to N instances for high availability and load distribution, this simplicity shatters. The same cron job, defined in the same codebase, will now attempt to execute simultaneously on all N instances. For tasks like 'send a daily summary email' or 'process a nightly data aggregation batch', this leads to disastrous duplication of work, data corruption, and wasted resources.

The knee-jerk reaction is often to build a simplistic locking mechanism. A common anti-pattern involves adding a is_running boolean or a locked_at timestamp to a database table. A worker tries to acquire the lock by setting the flag within a transaction. This approach is fraught with peril and fundamentally broken in a concurrent environment:

  • Race Conditions: Two instances can read the false flag simultaneously before either has a chance to write true, leading to both proceeding with the task.
  • Stale Locks: If a worker acquires the lock and then crashes before it can release it, the lock remains indefinitely, permanently disabling the job.
  • To solve this correctly, we need an atomic, database-native locking primitive. While external systems like Zookeeper, etcd, or Redis (with the Redlock algorithm) can provide distributed locking, they introduce significant operational overhead and architectural complexity. For services already relying on PostgreSQL, a far more elegant, lightweight, and robust solution exists: Advisory Locks.

    This article is not an introduction to advisory locks. It is a deep dive into a production-ready pattern for leveraging them to build a fault-tolerant distributed scheduler, focusing on the nuanced differences between lock types and handling real-world failure modes.

    Why Advisory Locks Are Uniquely Suited for This Task

    PostgreSQL's advisory locks are a cooperative, application-level locking mechanism. Unlike standard row or table locks, they do not lock any specific data tuple. Instead, they lock an arbitrary bigint or a pair of integer values that exist only in the lock manager's shared memory. This is their superpower for our use case:

    * Decoupled from Data: We can acquire a lock for a conceptual entity—like "the-daily-report-job"—without needing a corresponding table or row. This means our locking mechanism has zero impact on data access performance.

    * High Performance: Since they operate entirely in memory, acquiring and releasing advisory locks is exceptionally fast, typically measured in microseconds.

    * Database-Native Atomicity: They are managed by PostgreSQL itself, inheriting its battle-tested concurrency control and transactional guarantees.

    PostgreSQL provides several functions for managing these locks, but for our scheduler, the most critical distinction is between session-level and transaction-level locks, particularly their non-blocking variants.

  • pg_try_advisory_lock(key): Attempts to acquire a session-level lock. Returns true if successful, false if the lock is already held by another session. The lock persists until explicitly released with pg_advisory_unlock(key) or until the database session ends.
  • pg_try_advisory_xact_lock(key): Attempts to acquire a transaction-level lock. Returns true if successful, false otherwise. The lock is automatically released at the end of the current transaction (either COMMIT or ROLLBACK).
  • This automatic release mechanism is the cornerstone of our fault-tolerant pattern.

    Pattern 1: A Basic Scheduler with Session-Level Locks

    Let's start with a foundational implementation using session-level locks. This pattern is suitable for very short-lived tasks where the risk of a worker crash during execution is low and a temporary job outage due to a stale lock is acceptable.

    We'll build a higher-order function in TypeScript that wraps a job's logic, ensuring it only executes if it can acquire a unique lock.

    Lock ID Generation

    First, we need a deterministic way to convert a job's name into a unique bigint key. A non-cryptographic hash is perfect for this. We'll use a simple string hashing function. For production, you might consider a more robust hashing library like fnv1a to generate a 64-bit integer, but this illustrates the principle.

    typescript
    // A simple string to 32-bit integer hash function.
    // PostgreSQL advisory locks can use a bigint (64-bit), so a more robust hash is recommended for production
    // to avoid collisions, but this serves for demonstration.
    function simpleHash(str: string): number {
        let hash = 0;
        for (let i = 0; i < str.length; i++) {
            const char = str.charCodeAt(i);
            hash = (hash << 5) - hash + char;
            hash |= 0; // Convert to 32bit integer
        }
        return Math.abs(hash); // Use a positive integer for the lock key
    }

    The Scheduler Wrapper

    We'll use the pg library for Node.js. The wrapper will accept a database pool client, a unique job name, and the async function to execute.

    typescript
    import { Pool } from 'pg';
    
    // (simpleHash function from above)
    
    async function withDistributedLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
        const lockId = simpleHash(jobName);
        const client = await pool.connect();
    
        try {
            // Attempt to acquire the lock without blocking
            const { rows } = await client.query('SELECT pg_try_advisory_lock($1)', [lockId]);
            const lockAcquired = rows[0].pg_try_advisory_lock;
    
            if (lockAcquired) {
                console.log(`[${new Date().toISOString()}] Lock for job '${jobName}' (ID: ${lockId}) acquired. Running job.`);
                try {
                    // Execute the actual job logic
                    await jobFn();
                } catch (error) {
                    console.error(`[${new Date().toISOString()}] Error executing job '${jobName}':`, error);
                } finally {
                    // CRITICAL: Always release the lock
                    await client.query('SELECT pg_advisory_unlock($1)', [lockId]);
                    console.log(`[${new Date().toISOString()}] Lock for job '${jobName}' (ID: ${lockId}) released.`);
                }
            } else {
                console.log(`[${new Date().toISOString()}] Could not acquire lock for job '${jobName}' (ID: ${lockId}). Another instance is running.`);
            }
        } finally {
            // Release the client back to the pool
            client.release();
        }
    }
    
    // --- Example Usage ---
    
    const pool = new Pool({
        connectionString: 'postgresql://user:password@host:port/database'
    });
    
    const dailyReportJob = async () => {
        console.log('Generating daily report...');
        // Simulate work
        await new Promise(resolve => setTimeout(resolve, 5000)); 
        console.log('Daily report generation complete.');
    };
    
    // In a real app, this would be triggered by a cron scheduler (e.g., node-cron)
    // setInterval will simulate multiple instances trying to run the job every 10 seconds
    setInterval(() => {
        withDistributedLock(pool, 'daily-report', dailyReportJob).catch(console.error);
    }, 10000);

    Analysis of the Session-Level Pattern

    Pros:

    * Simple to implement.

    * Low overhead; no transactions are explicitly managed by our code.

    Cons (Critical):

    No Fault Tolerance: If the Node.js process crashes after acquiring the lock but before* the finally block that releases it, the session-level lock is orphaned. It will persist until the database server detects the dead TCP connection and terminates the backend process. This tcp_keepalives_idle setting can be minutes or even hours by default, effectively disabling the job for that duration.

    Connection Pooling Issues: The pg_advisory_unlock call must* be executed on the exact same connection that acquired the lock. While pool.connect() and client.release() manage this correctly for a single execution, complex application logic could inadvertently use a different client from the pool, failing to release the lock.

    This pattern is simply not robust enough for most production systems. It highlights the problem that transaction-level locks are designed to solve.

    Pattern 2: The Fault-Tolerant Scheduler with Transaction-Level Locks

    This is the definitive, production-grade pattern. By scoping the lock to a database transaction, we leverage PostgreSQL's own lifecycle management to guarantee lock release, even in the face of catastrophic worker failure.

    The logic is slightly different:

    • Start a database transaction.
  • Attempt to acquire a transaction-level lock (pg_try_advisory_xact_lock).
    • If successful, run the job.
  • If the job succeeds, COMMIT the transaction. This releases the lock.
  • If the job fails, ROLLBACK the transaction. This also releases the lock.
    • If the worker process dies mid-job, the database connection is severed. PostgreSQL will automatically roll back the orphaned transaction, which—you guessed it—also releases the lock.

    This provides an ironclad guarantee of lock cleanup.

    The Production-Ready Implementation

    typescript
    import { Pool } from 'pg';
    
    function simpleHash(str: string): number {
        let hash = 0;
        for (let i = 0; i < str.length; i++) {
            const char = str.charCodeAt(i);
            hash = (hash << 5) - hash + char;
            hash |= 0; // Convert to 32bit integer
        }
        return Math.abs(hash);
    }
    
    // Advanced implementation using transaction-level locks
    async function withFaultTolerantLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
        const lockId = simpleHash(jobName);
        const client = await pool.connect();
    
        try {
            // Start a transaction
            await client.query('BEGIN');
    
            // Attempt to acquire the transaction-level lock
            const { rows } = await client.query('SELECT pg_try_advisory_xact_lock($1)', [lockId]);
            const lockAcquired = rows[0].pg_try_advisory_xact_lock;
    
            if (lockAcquired) {
                console.log(`[${new Date().toISOString()}] Tx-Lock for job '${jobName}' (ID: ${lockId}) acquired. Running job.`);
                try {
                    // Execute job logic within the transaction's scope
                    await jobFn();
                    // If successful, commit the transaction to release the lock
                    await client.query('COMMIT');
                    console.log(`[${new Date().toISOString()}] Job '${jobName}' completed successfully. Lock released via COMMIT.`);
                } catch (error) {
                    console.error(`[${new Date().toISOString()}] Error in job '${jobName}', rolling back.`, error);
                    // If an error occurs, roll back to release the lock
                    await client.query('ROLLBACK');
                    // Re-throw the error so the caller knows the job failed
                    throw error;
                }
            } else {
                // Lock not acquired, another worker is active. This is not an error.
                // We must end the transaction we started.
                await client.query('ROLLBACK'); // or COMMIT, it doesn't matter as we did nothing.
                console.log(`[${new Date().toISOString()}] Could not acquire Tx-Lock for job '${jobName}' (ID: ${lockId}). Skipping.`);
            }
        } finally {
            client.release();
        }
    }
    
    // --- Example Usage ---
    
    const pool = new Pool({
        connectionString: 'postgresql://user:password@host:port/database'
    });
    
    const criticalBatchJob = async () => {
        console.log('Starting critical batch processing...');
        await new Promise(resolve => setTimeout(resolve, 15000)); // A longer job
    
        // Simulate a potential failure
        if (Math.random() > 0.8) {
            throw new Error('Failed to process a critical record!');
        }
    
        console.log('Critical batch processing finished.');
    };
    
    // Simulate multiple instances trying to run every 20 seconds
    setInterval(() => {
        console.log(`\n--- New scheduling cycle at ${new Date().toISOString()} ---`);
        withFaultTolerantLock(pool, 'critical-batch-job', criticalBatchJob)
            .catch(err => {
                // Catch errors from the job itself (after rollback)
                console.error(`[${new Date().toISOString()}] Job execution failed at the top level: ${err.message}`);
            });
    }, 20000);

    Now, if you run this code and manually kill the Node.js process while it's in the middle of the 15-second setTimeout, PostgreSQL will terminate the backend connection, roll back the transaction, and release the lock. The next instance that tries to run the job after 20 seconds will successfully acquire the lock.

    Advanced Considerations and Edge Cases

    Mastering this pattern requires understanding its interaction with the broader system.

    1. Lock ID Namespacing

    Using a single bigint for lock IDs across an entire database is risky. Different applications or even different parts of the same application could suffer a hash collision, causing unintended locking. PostgreSQL's advisory lock functions are overloaded to accept two integer arguments, which is the preferred method for namespacing.

    pg_try_advisory_lock(classId, objectId)

  • classId: A unique integer representing your application or module (e.g., a hash of 'billing-service').
  • objectId: A unique integer for the specific job within that class (e.g., a hash of 'process-monthly-invoices').
  • Revised Lock Acquisition:

    typescript
    const APP_CLASS_ID = simpleHash('my-awesome-scheduler');
    
    async function withNamespacedLock(pool: Pool, jobName: string, jobFn: () => Promise<void>): Promise<void> {
        const jobObjectId = simpleHash(jobName);
        const client = await pool.connect();
    
        try {
            await client.query('BEGIN');
            const { rows } = await client.query('SELECT pg_try_advisory_xact_lock($1, $2)', [APP_CLASS_ID, jobObjectId]);
            const lockAcquired = rows[0].pg_try_advisory_xact_lock;
            // ... rest of the logic
        } finally {
            client.release();
        }
    }

    This effectively prevents cross-application lock contention.

    2. Monitoring and Debugging Stuck Locks

    Even with the fault-tolerant pattern, you need a way to inspect the state of advisory locks for debugging. The pg_locks system view is your essential tool.

    sql
    SELECT
        locktype,
        classid,
        objid,
        pid AS process_id,
        mode,
        granted,
        (SELECT datname FROM pg_database WHERE oid = database) as database_name,
        (SELECT query FROM pg_stat_activity WHERE pid = pg_locks.pid) as current_query
    FROM pg_locks
    WHERE locktype = 'advisory';

    Running this query will show you exactly which process ID (pid) is holding which advisory lock (classid, objid). If you suspect a lock is stuck (which should be nearly impossible with the transaction-level pattern unless a transaction is genuinely long-running), you can cross-reference the pid with pg_stat_activity to see exactly what the holding process is doing. This is an invaluable diagnostic tool.

    3. Interaction with Transactional Outbox Pattern

    Many modern systems use the Transactional Outbox pattern to ensure reliable message publishing. If your scheduled job needs to perform database writes and publish events (e.g., to Kafka), the advisory lock pattern integrates beautifully.

    The job's logic would:

    • Perform its business logic (e.g., update user records).
  • Insert event messages into an outbox table.
  • All of this happens inside the same transaction that holds the advisory lock. The COMMIT at the end atomically saves the data changes, the outbox events, and releases the lock. A separate outbox processor service (which might itself use advisory locks to coordinate its workers) can then safely read from the outbox table and publish the messages.

    typescript
    const outboxJob = async (client: pg.Client) => { // Note: client is passed in
        // 1. Update business data
        await client.query(`UPDATE subscriptions SET status = 'expired' WHERE expires_at < NOW()`);
    
        // 2. Insert event into outbox table
        await client.query(`
            INSERT INTO outbox (event_type, payload)
            VALUES ('subscription.expired', '{"details": "..."}')
        `);
    };
    
    // The wrapper needs to be modified to pass the client to the job function
    // so all work happens in the same transaction.

    4. Performance and Scalability

  • Lock Contention: The pg_try_advisory_ functions are non-blocking. If 100 service instances try to run a job at the same time, one will get the lock instantly, and 99 will instantly get a false response. This is extremely efficient. The database is not burdened by 99 waiting connections.
  • Connection Pool Sizing: Each instance attempting to run a job will briefly check out a connection from the pool. Ensure your connection pool is large enough to handle this burst of requests from all service instances at the top of the minute (or whenever your jobs are scheduled).
  • Database Overhead: The overhead of the lock itself is negligible. The main performance consideration is the transaction overhead (BEGIN/COMMIT). For jobs that run infrequently (every minute or more), this is completely irrelevant. For jobs that need to run every second, you might measure the impact, but it's still likely to be minimal compared to the job's actual work.
  • When Not to Use This Pattern

    Despite its power and elegance, this pattern is not a silver bullet. It's a distributed locking mechanism, not a full-fledged job queue. Consider a more robust solution if you need:

  • Persistent Job Queuing: If a job needs to be queued and guaranteed to run later even if all workers are down at the scheduled time, you need a persistent queue like RabbitMQ or a database-backed queue (e.g., pg-boss).
  • Automatic Retries with Backoff: Implementing sophisticated retry logic within the simple lock wrapper can get complex. Libraries like Celery or BullMQ have this built-in.
  • Delayed/Scheduled Jobs: This pattern works for cron-style execution, not for scheduling a specific task to run once at a future time (e.g., "send a reminder email in 3 days").
  • Complex Workflows and Dependencies: If jobs form a directed acyclic graph (DAG) of dependencies, you need a workflow orchestration engine like Airflow or Temporal.
  • Conclusion

    For the common problem of ensuring a scheduled task runs exactly once across a fleet of horizontally-scaled application instances, PostgreSQL's transaction-level advisory locks provide a uniquely powerful, lightweight, and fault-tolerant solution. By binding the lock's lifecycle to a database transaction, we gain automatic cleanup in the event of worker failure, eliminating the risk of stale locks that plagues more naive implementations.

    This pattern leverages the transactional guarantees of a tool already present in your stack, avoiding the introduction of new infrastructure and complexity. It is a testament to the depth of features within PostgreSQL and a prime example of a senior-level engineering pattern: simple, robust, and deeply aware of underlying system behaviors and failure modes.

    Found this article helpful?

    Share it with others who might benefit from it.

    More Articles