PostgreSQL RLS: Performance Tuning Policies in Multi-Tenant Systems
The Double-Edged Sword of RLS in Multi-Tenancy
Row-Level Security (RLS) is one of PostgreSQL's most compelling features for building multi-tenant SaaS applications. It offers a robust, centralized, and non-bypassable mechanism for data isolation directly within the database. Instead of littering application code with WHERE tenant_id = ? clauses and praying a developer never forgets one, you can enforce this logic at the lowest possible layer.
However, this power comes with a significant performance cost if implemented naively. A poorly constructed RLS policy can render your indexes useless, force nested loops where index scans should occur, and degrade query performance by an order of magnitude or more. For senior engineers, the question isn't if they should use RLS, but how to implement it so that it scales with user and data growth.
This article is not an introduction to RLS. We assume you know how to enable RLS and write a basic policy. Instead, we will dissect the performance characteristics of common but inefficient patterns, then systematically replace them with production-hardened, high-performance alternatives. We will live inside EXPLAIN ANALYZE outputs to prove our optimizations and explore advanced concepts like LEAKPROOF functions and the strategic use of SECURITY DEFINER.
The Scenario: A Multi-Tenant Document Management System
To ground our discussion, let's model a simple multi-tenant system. We have tenants, users who belong to those tenants, and documents that are owned by a tenant.
Here is our initial schema:
-- Use pgcrypto for UUIDs
CREATE EXTENSION IF NOT EXISTS "pgcrypto";
-- Tenants table
CREATE TABLE tenants (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Users table, linked to a tenant
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
email TEXT NOT NULL UNIQUE,
role TEXT NOT NULL DEFAULT 'member' CHECK (role IN ('member', 'admin')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_users_tenant_id ON users(tenant_id);
-- Documents table, the core resource we need to protect
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
author_id UUID REFERENCES users(id) ON DELETE SET NULL,
title TEXT NOT NULL,
content TEXT,
status TEXT NOT NULL DEFAULT 'draft' CHECK (status IN ('draft', 'published', 'archived')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_documents_tenant_id ON documents(tenant_id);
CREATE INDEX idx_documents_author_id ON documents(author_id);
CREATE INDEX idx_documents_status ON documents(status);
-- Populate with some data for two tenants
INSERT INTO tenants (id, name) VALUES
('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', 'Tenant Alpha'),
('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a12', 'Tenant Bravo');
INSERT INTO users (id, tenant_id, email, role) VALUES
('b0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', 'admin@tenant-alpha.example', 'admin'),
('b0eebc99-9c0b-4ef8-bb6d-6bb9bd380a12', 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', 'member@tenant-alpha.example', 'member'),
('c0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11', 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a12', 'member@tenant-bravo.example', 'member');
-- Generate a large number of documents for realistic performance testing
INSERT INTO documents (tenant_id, author_id, title, status)
SELECT
CASE WHEN n % 2 = 0 THEN 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid ELSE 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a12'::uuid END,
CASE WHEN n % 2 = 0 THEN 'b0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid ELSE 'c0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid END,
'Document ' || n,
CASE WHEN n % 10 = 0 THEN 'published' ELSE 'draft' END
FROM generate_series(1, 1000000) n;
ANALYZE documents;
ANALYZE users;
The Performance Trap: Subqueries in Policies
The most intuitive way to write an RLS policy is to check if the document's tenant_id matches the tenant_id of the current user. A common first attempt looks like this:
-- Enable RLS on the documents table
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
-- A common, but inefficient, policy
CREATE POLICY select_documents_policy ON documents
FOR SELECT
USING (
tenant_id = (SELECT tenant_id FROM users WHERE email = current_user)
);
This seems logical. For every row in documents being considered, PostgreSQL will execute the subquery to fetch the tenant_id for the user identified by current_user. Let's see how this performs.
-- Simulate a specific user running a query
SET ROLE "admin@tenant-alpha.example"; -- SET ROLE takes an identifier, not a string literal; assumes this role exists
-- Run a typical query
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM documents WHERE status = 'published';
On a reasonably sized table, the output will be alarming:
Finalize Aggregate (cost=139366.53..139366.54 rows=1 width=8) (actual time=2431.503..2431.504 rows=1 loops=1)
Buffers: shared hit=4430
-> Gather (cost=139366.31..139366.52 rows=2 width=8) (actual time=2431.429..2431.496 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4430
-> Partial Aggregate (cost=138366.31..138366.32 rows=1 width=8) (actual time=2415.658..2415.659 rows=1 loops=3)
Buffers: shared hit=4430
-> Parallel Seq Scan on documents (cost=0.00..138241.31 rows=50000 width=0) (actual time=0.180..2411.396 rows=16667 loops=3)
Filter: ((status = 'published'::text) AND (tenant_id = (SubPlan 1)))
Rows Removed by Filter: 316666
Buffers: shared hit=4430
SubPlan 1
-> Index Scan using users_email_key on users (cost=0.29..0.31 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=1000000)
Index Cond: (email = CURRENT_USER)
Buffers: shared hit=2000002
Dissecting the Disaster:
* Parallel Seq Scan on documents: The planner has completely ignored our idx_documents_tenant_id. It's scanning the entire table. Why?
* SubPlan 1: This is our subquery (SELECT tenant_id FROM users ...).
* loops=1000000: This is the smoking gun. The subplan was executed one million times, once for every single row in the documents table. The planner, in its default conservative mode, assumes the subquery's result could change for each row, so it re-evaluates it repeatedly.

This is an O(N*M) operation where N is the number of documents and M is the cost of the user lookup. Even though the user lookup is fast (an index scan), executing it a million times is catastrophic. The application will feel sluggish for small tables and completely time out for large ones.
Pattern 1: Caching Tenant Context in Session Variables
The root cause of the problem is the repeated execution of the subquery. The tenant_id for the current user is constant for the duration of a database session (or at least a transaction). We can exploit this.
The solution is to fetch the user's context (like tenant_id and role) once at the beginning of the session and store it in a session variable. Then, our RLS policy can reference a simple, fast function that reads this variable.
Step 1: Create Helper Functions
We'll create a function to safely read a setting. We use current_setting(setting_name, missing_ok) which allows us to handle cases where the setting might not be present.
-- Helper functions live in their own schema
CREATE SCHEMA IF NOT EXISTS auth;
-- Create a function to get the current tenant ID from a session variable.
-- It's defined as STABLE, meaning its result is treated as constant for all
-- rows within a single statement. This is a critical hint for the query planner.
CREATE OR REPLACE FUNCTION auth.current_tenant_id()
RETURNS UUID AS $$
BEGIN
-- The `true` (missing_ok) flag makes current_setting return NULL instead of
-- erroring when the setting was never set. This covers migrations, cron jobs,
-- and superuser sessions where the application never ran SET LOCAL; the
-- policy can then decide how to treat a NULL tenant context.
-- NULLIF guards against the empty string a reset custom setting can report.
RETURN NULLIF(current_setting('app.current_tenant_id', true), '')::UUID;
END;
$$ LANGUAGE plpgsql STABLE SECURITY INVOKER;
Key Design Choices:
* STABLE Volatility: This tells the planner that the function's return value will not change within the current transaction. This is the magic that allows the planner to evaluate it once and reuse the result, avoiding the per-row execution.
* SECURITY INVOKER: The function runs with the permissions of the user calling it. This is the safest default. We'll discuss SECURITY DEFINER later.
* Error Handling: The current_setting('...', true) pattern prevents errors if the variable isn't set, which is crucial for tools, migrations, and superuser access.
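You can confirm what PostgreSQL actually recorded for the function by inspecting the system catalogs. This query is a quick sanity check that the STABLE and SECURITY INVOKER flags took effect:

```sql
-- Inspect the volatility and security mode recorded for the helper function.
-- provolatile: 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE (the default).
-- prosecdef:   true only for SECURITY DEFINER functions.
SELECT p.proname, p.provolatile, p.prosecdef
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'auth' AND p.proname = 'current_tenant_id';
```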
Step 2: Set the Session Variable in Your Application
Your application's authentication middleware is the perfect place to set this variable. After a user authenticates, for every connection they use from the pool, you must run a SET command before any other query.
Example (Node.js with pg library):
// In your middleware, after authenticating the user and getting their tenantId
async function getClientForUser(user) {
const client = await pool.connect();
// SET cannot take bind parameters, so never interpolate user-controlled
// values into the SQL string: that would be an injection vector. Use the
// set_config() function instead; passing `true` as the third argument makes
// the setting transaction-scoped, equivalent to SET LOCAL.
await client.query('SELECT set_config($1, $2, true)', ['app.current_tenant_id', user.tenantId]);
// You might also set other context here:
// await client.query('SELECT set_config($1, $2, true)', ['app.current_user_id', user.id]);
// await client.query('SELECT set_config($1, $2, true)', ['app.current_user_role', user.role]);
return client;
}
SET LOCAL vs SET: SET LOCAL is crucial in environments with connection pooling and transaction-based workflows. It ensures the setting is automatically cleared at the end of the transaction (COMMIT or ROLLBACK), preventing a user's session variable from leaking into another user's session if the connection is reused.
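You can verify the transaction scoping directly from psql. Note that after the transaction ends, current_setting with missing_ok may report either NULL or an empty string for a custom setting, depending on the session's history, which is why a NULLIF guard in helper functions is a sensible precaution:

```sql
BEGIN;
SET LOCAL app.current_tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
-- Inside the transaction the value is visible:
SELECT current_setting('app.current_tenant_id', true);
COMMIT;
-- After COMMIT the LOCAL value is discarded; this now yields NULL or an
-- empty string, so no tenant context leaks to the next user of the connection:
SELECT current_setting('app.current_tenant_id', true);
```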
Step 3: Rewrite the RLS Policy
Now we can create a blazingly fast policy.
-- First, drop the old, slow policy
DROP POLICY IF EXISTS select_documents_policy ON documents;
-- Create the new, performant policy
CREATE POLICY select_documents_policy ON documents
FOR SELECT
USING ( tenant_id = auth.current_tenant_id() );
-- Also create policies for other operations
CREATE POLICY insert_documents_policy ON documents
FOR INSERT
WITH CHECK ( tenant_id = auth.current_tenant_id() );
CREATE POLICY update_documents_policy ON documents
FOR UPDATE
USING ( tenant_id = auth.current_tenant_id() )
WITH CHECK ( tenant_id = auth.current_tenant_id() );
CREATE POLICY delete_documents_policy ON documents
FOR DELETE
USING ( tenant_id = auth.current_tenant_id() );
Step 4: Analyze the Performance Gain
Let's re-run our test. We need to simulate the application's behavior of setting the variable.
-- Simulate the application setting the session context
BEGIN;
SET LOCAL app.current_tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
-- Now run the same query
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM documents WHERE status = 'published';
COMMIT;
The new query plan is a night-and-day difference:
Finalize Aggregate (cost=10635.83..10635.84 rows=1 width=8) (actual time=27.534..27.535 rows=1 loops=1)
Buffers: shared hit=2209
-> Gather (cost=10635.62..10635.83 rows=2 width=8) (actual time=27.493..27.528 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2209
-> Partial Aggregate (cost=9635.62..9635.63 rows=1 width=8) (actual time=21.996..21.996 rows=1 loops=3)
Buffers: shared hit=2209
-> Parallel Bitmap Heap Scan on documents (cost=1060.46..9594.04 rows=16631 width=0) (actual time=3.589..19.682 rows=16667 loops=3)
Recheck Cond: ((tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid) AND (status = 'published'::text))
Rows Removed by Index Recheck: 147570
Heap Blocks: exact=2184
Buffers: shared hit=2209
-> Bitmap Index Scan on idx_documents_tenant_id_status (cost=0.00..1050.46 rows=49894 width=0) (actual time=7.217..7.217 rows=50000 loops=1)
Index Cond: ((tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid) AND (status = 'published'::text))
Buffers: shared hit=25
Note: this plan assumes a composite index, CREATE INDEX idx_documents_tenant_id_status ON documents(tenant_id, status);, was created beforehand; without it the planner falls back to the single-column tenant index.
Dissecting the Victory:
* Execution Time: Dropped from ~2400ms to ~27ms. That's nearly a 100x improvement.
* Bitmap Index Scan: The planner is now using our indexes! Because the auth.current_tenant_id() function is STABLE, the planner evaluates it once, gets the UUID constant, and then uses that constant to build a highly efficient query plan that can perform an index scan.
* No Subplan: The dreaded per-row subplan is completely gone.
This single pattern is the most important optimization for RLS in any multi-tenant system.
Pattern 2: Partial Indexes for Complex, High-Cardinality Policies
Our previous query was WHERE status = 'published'. What happens if most documents are in the 'draft' state? Let's say 95% are drafts, 4% are published, and 1% are archived. A standard composite index on (tenant_id, status) is good, but it still has to store entries for every single row.
When you have a policy that frequently filters on a low-cardinality column (a column with few distinct values, like a status enum), a partial index can be significantly more efficient. It's a smaller, faster index that only contains entries for rows matching its WHERE clause.
Let's imagine a common feature: a dashboard that only shows 'published' documents.
-- A policy for a specific role that can only see published documents
CREATE POLICY select_published_documents_for_members ON documents
FOR SELECT USING (
tenant_id = auth.current_tenant_id() AND
status = 'published'
);
To optimize for this, we create an index specifically for this access pattern:
-- Drop the general composite index if it exists
DROP INDEX IF EXISTS idx_documents_tenant_id_status;
-- Create a partial index only for published documents
CREATE INDEX idx_documents_published_by_tenant
ON documents (tenant_id)
WHERE status = 'published';
Why is this better?
* Size: The index only contains entries for the small fraction of rows where status = 'published', keyed on (tenant_id). This means it fits better in memory, and scans are faster.
* Built-in filtering: The index predicate already encodes WHERE status = 'published', so the planner can use it for any query whose WHERE clause implies that condition, without the index storing a status column at all.

Let's analyze the plan for a query that benefits from this:
BEGIN;
SET LOCAL app.current_tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title FROM documents WHERE status = 'published' LIMIT 10;
COMMIT;
The planner will now explicitly choose our partial index, resulting in an extremely fast lookup because it's only scanning the small subset of rows that are relevant.
Limit (cost=0.29..5.53 rows=10 width=29) (actual time=0.027..0.038 rows=10 loops=1)
Buffers: shared hit=5
-> Index Scan using idx_documents_published_by_tenant on documents (cost=0.29..2724.79 rows=5208 width=29) (actual time=0.026..0.035 rows=10 loops=1)
Index Cond: (tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid)
Buffers: shared hit=5
The planner uses an Index Scan on our tiny idx_documents_published_by_tenant and finds 10 rows almost instantaneously. This is a small change with a huge impact on frequently-run queries in a UI.
Advanced Edge Cases and Production Hardening
`SECURITY DEFINER` vs. `SECURITY INVOKER`
Our helper function was SECURITY INVOKER. What if the function needed to look up information the current user doesn't have direct access to? For example, a central roles table.
-- A central table not directly exposed to users
CREATE SCHEMA IF NOT EXISTS internal;
CREATE TABLE internal.role_permissions (
role TEXT PRIMARY KEY,
can_view_archived BOOLEAN DEFAULT false
);
-- A dedicated, non-login role that will own the function
CREATE ROLE function_runner_role NOLOGIN;
GRANT USAGE ON SCHEMA internal TO function_runner_role;
GRANT SELECT ON internal.role_permissions TO function_runner_role;
A SECURITY DEFINER function runs with the privileges of the user who defined the function, not the one calling it.
CREATE OR REPLACE FUNCTION auth.can_view_archived(role_name TEXT)
RETURNS BOOLEAN AS $$
DECLARE
permission_exists BOOLEAN;
BEGIN
SELECT can_view_archived INTO permission_exists
FROM internal.role_permissions WHERE role = role_name;
RETURN COALESCE(permission_exists, false);
END;
$$ LANGUAGE plpgsql STABLE SECURITY DEFINER;
-- Set the owner of the function to the privileged role
ALTER FUNCTION auth.can_view_archived(TEXT) OWNER TO function_runner_role;
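Wired into RLS, the definer function lets ordinary users benefit from a lookup they cannot perform directly. Because permissive policies are OR'ed together, a plain permissive policy could not actually hide archived rows from the earlier tenant-wide SELECT policy; a RESTRICTIVE policy, which is AND'ed with the others, does the job. The app.current_user_role variable is an assumption, set by the application the same way as the tenant id:

```sql
-- Hypothetical RESTRICTIVE policy gating archived documents.
-- Restrictive policies are AND'ed with all permissive policies, so this row
-- filter applies on top of the tenant_id check.
-- Assumes the application sets app.current_user_role via SET LOCAL.
CREATE POLICY restrict_archived_documents ON documents
AS RESTRICTIVE
FOR SELECT
USING (
status <> 'archived'
OR auth.can_view_archived(current_setting('app.current_user_role', true))
);
```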
CRITICAL SECURITY NOTE: When using SECURITY DEFINER, you must sanitize all inputs and pin a secure search_path to prevent privilege-escalation attacks. The recommended practice is to attach it to the function definition itself, so it applies on every call:
ALTER FUNCTION auth.can_view_archived(TEXT)
SET search_path = pg_catalog, pg_temp;
This prevents a malicious user from creating a table named role_permissions in their own schema that the function might accidentally pick up. pg_temp is listed explicitly, and last, so temporary objects can never shadow the objects you intend to reference.
`LEAKPROOF` Functions and Information Leaks
Consider this policy:
USING (tenant_id = auth.current_tenant_id() AND sensitive_check_function(column_x) = true)
If sensitive_check_function throws an error for certain values of column_x, a user could potentially infer data from rows they shouldn't see: a division-by-zero error, for example, reveals that column_x was zero on some row, even one the tenant_id check would ultimately have filtered out, because the planner is free to reorder ANDed conditions. The same risk exists for functions a user supplies in their own WHERE clause, which is why the planner refuses to evaluate non-LEAKPROOF user-supplied functions before the security quals from RLS policies. That protection carries a performance cost: it constrains predicate pushdown and can prevent index usage on expressions involving such functions.
If you have audited a function and can guarantee it raises no errors and reports no messages that depend on its arguments, you can declare it leakproof.
CREATE FUNCTION ... LEAKPROOF;
This is a promise to the planner that the function has no side effects and cannot reveal anything about its inputs. In return, the planner regains the freedom to evaluate it ahead of the security quals, restoring its normal optimizations. Only superusers may set LEAKPROOF, and you should only use it on functions you have carefully audited for side-channel vulnerabilities.
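For a function that already exists, the property can be toggled with ALTER FUNCTION. The signature below is hypothetical, since the argument type of sensitive_check_function was never specified:

```sql
-- LEAKPROOF can only be set by a superuser, precisely because it is a
-- security assertion the planner will trust without verification.
ALTER FUNCTION sensitive_check_function(integer) LEAKPROOF;
```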
RLS and Views
RLS policies on underlying tables are still consulted when a view is queried, but with an important caveat: by default, a view accesses its base tables with the privileges of the view's owner, not the caller. If the owner also owns the table (and the table is not set to FORCE ROW LEVEL SECURITY) or has BYPASSRLS, the policies are skipped entirely. On PostgreSQL 15 and later, you can create the view with security_invoker = true so policies are applied against the querying user.
CREATE VIEW published_documents AS
SELECT id, tenant_id, title, created_at FROM documents WHERE status = 'published';
When a user runs SELECT * FROM published_documents, the RLS policy on documents is still applied, correctly filtering by tenant_id.
However, be careful with views defined with security_barrier = true. That option guarantees the view's own WHERE clause is applied before any non-LEAKPROOF functions the caller supplies in their query, which closes a class of information leaks but limits the planner's freedom. For most multi-tenant scenarios, default view behavior plus RLS on the base tables is sufficient and simpler to reason about.
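If you do need a security-barrier view, it is set as a reloption on the view itself, shown here on the published_documents view from above:

```sql
-- Recreate the view with the security_barrier option. Its WHERE clause is
-- then guaranteed to run before any non-LEAKPROOF functions the caller adds.
CREATE OR REPLACE VIEW published_documents WITH (security_barrier = true) AS
SELECT id, tenant_id, title, created_at
FROM documents
WHERE status = 'published';
```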
A Production-Ready RLS Implementation Checklist
* Session Context: Use SET LOCAL to store user/tenant context (app.current_tenant_id, etc.) in session variables.
* STABLE Helper Functions: Create STABLE SQL functions (e.g., auth.current_tenant_id()) to read these session variables. Avoid any complex logic, joins, or subqueries inside these functions.
* Simple Policy Predicates: Keep policies to direct comparisons against those helpers (USING (tenant_id = auth.current_tenant_id())).
* Index for Your Policies: Every RLS-protected table needs an index led by the tenant column ((tenant_id)). For more complex, recurring filter conditions in your policies, create matching composite or partial indexes (e.g., ON documents (tenant_id) WHERE status = 'published').
* Default to SECURITY INVOKER: Use SECURITY INVOKER by default. Only use SECURITY DEFINER when absolutely necessary, and always set a secure search_path and sanitize inputs.
* EXPLAIN EVERYTHING: Never assume a policy is performant. Use EXPLAIN (ANALYZE, BUFFERS) on all common query patterns to verify that the planner is using indexes and avoiding sequential scans. The query plan is your ground truth.
* Plan for Missing Context: The current_setting(..., true) pattern which returns NULL is a good start. For migrations and other major data operations, use a role with the BYPASSRLS attribute. (Note that SET session_replication_role = 'replica' disables triggers, not RLS, so it is not a way to bypass policies.)

By moving beyond naive subquery-based policies and adopting this session-context pattern, you can build a multi-tenant PostgreSQL architecture that is not only secure and maintainable but also scales efficiently as your application grows. RLS ceases to be a performance liability and becomes the robust foundation of your data isolation strategy.