SQL Performance on Large Tables: How to Optimize 50M+ Row Queries

Introduction: When SQL Slows to a Crawl

You write what looks like a perfectly reasonable SQL query. It runs fine in development against a few thousand rows. Then it hits production — and suddenly you're waiting 45 seconds for a result that should take milliseconds. Sound familiar?

SQL performance on large tables is one of the most common pain points for developers and data analysts. Once a table crosses the 10M row mark — let alone 50M or 100M — the rules change. A query that worked yesterday becomes a bottleneck today as data grows.

The good news: most slow SQL queries can be fixed with a handful of well-understood techniques. This guide covers the core strategies for SQL query optimization on large tables, complete with real before/after examples you can apply immediately.

Why Large Tables Slow Down SQL Queries

Before fixing the problem, it helps to understand why it happens. When a table has 50M+ rows, several things work against you:

  • Full table scans: Without proper indexes, the database engine reads every single row to find your results.

  • Join explosions: Joining two large tables without indexed join columns can produce intermediate result sets with billions of rows.

  • Lock contention: High-write tables with long-running read queries create blocking chains.

  • Memory pressure: Queries that can't fit their working set in memory spill to disk.

  • Statistics staleness: Query planners use table statistics to choose execution plans. Outdated stats lead to bad plan choices.
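
That last point is easy to check. In PostgreSQL, for example, the pg_stat_user_tables view records when each table's statistics were last refreshed (a quick diagnostic sketch):

SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 10;

If last_analyze and last_autoanalyze are old (or NULL) on a table that changes frequently, stale statistics are a likely suspect.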

5 Key SQL Optimization Strategies for Large Tables

1. Indexing: The Single Biggest Win

Proper indexing is the foundation of SQL performance on large tables. An index lets the database jump directly to the rows you need instead of scanning everything.

Before (no index, full table scan):

-- orders table: 55M rows, no index on customer_id
SELECT order_id, total_amount, created_at
FROM orders
WHERE customer_id = 10482
ORDER BY created_at DESC;
-- Execution time: 38.4 seconds

After (with index):

CREATE INDEX idx_orders_customer_created
  ON orders (customer_id, created_at DESC);

SELECT order_id, total_amount, created_at
FROM orders
WHERE customer_id = 10482
ORDER BY created_at DESC;
-- Execution time: 12ms

Indexing rules for large tables:

  • Index columns in WHERE, JOIN ON, and ORDER BY clauses.

  • Use composite indexes when filtering on multiple columns — column order matters.

  • Avoid over-indexing: each index slows down writes.

  • Consider partial indexes for tables where you only query a subset of rows.

CREATE INDEX idx_orders_active
  ON orders (created_at)
  WHERE status = 'active';
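
Column order in a composite index determines which filters it can serve. A rough sketch, assuming the idx_orders_customer_created index defined earlier:

-- The (customer_id, created_at DESC) index can serve:
SELECT order_id FROM orders
WHERE customer_id = 10482;                 -- leading column: index usable

SELECT order_id FROM orders
WHERE customer_id = 10482
  AND created_at >= '2024-01-01';          -- both columns: index usable

-- A filter on the trailing column alone typically cannot use it:
SELECT order_id FROM orders
WHERE created_at >= '2024-01-01';          -- likely falls back to a scan

Put the most selective, most frequently filtered column first.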

2. Table Partitioning

Partitioning splits a large table into smaller physical chunks based on a column value — typically a date.

CREATE TABLE orders (
  order_id BIGINT, customer_id INT,
  total_amount NUMERIC, created_at TIMESTAMPTZ
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_q1 PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
  FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

SELECT * FROM orders
WHERE created_at >= '2024-04-01' AND created_at < '2024-07-01';

Queries that filter on the partition key only scan the relevant partition, not the whole table.
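
One caveat: an insert whose partition key falls outside every defined range will fail. In PostgreSQL you can add a DEFAULT partition as a catch-all while you provision new quarterly partitions ahead of time:

CREATE TABLE orders_default PARTITION OF orders DEFAULT;

Rows landing in the default partition are a signal that a dedicated partition is overdue.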

3. Query Rewriting

Sometimes the query itself is the problem. Small rewrites can have dramatic effects.

Avoid SELECT * on large tables:

-- Bad
SELECT * FROM events WHERE user_id = 991 LIMIT 100;

-- Good
SELECT event_type, created_at FROM events
WHERE user_id = 991 ORDER BY created_at DESC LIMIT 100;

Replace correlated subqueries with JOINs:

-- Bad: runs once per row
SELECT u.user_id, u.email,
  (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.user_id) AS order_count
FROM users u;

-- Good: single pass
SELECT u.user_id, u.email, COUNT(o.order_id) AS order_count
FROM users u LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id, u.email;

Push filters early with CTEs:

WITH recent_orders AS (
  SELECT order_id, customer_id, total_amount
  FROM orders WHERE created_at >= NOW() - INTERVAL '30 days'
)
SELECT c.name, SUM(r.total_amount) AS revenue
FROM customers c JOIN recent_orders r ON r.customer_id = c.customer_id
GROUP BY c.name ORDER BY revenue DESC;
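
Note that CTE behavior varies by version: PostgreSQL 12+ inlines a CTE like the one above into the outer query by default, while older versions treat it as an optimization fence. On 12+ you can control this explicitly:

WITH recent_orders AS NOT MATERIALIZED (
  SELECT order_id, customer_id, total_amount
  FROM orders WHERE created_at >= NOW() - INTERVAL '30 days'
)
SELECT c.name, SUM(r.total_amount) AS revenue
FROM customers c JOIN recent_orders r ON r.customer_id = c.customer_id
GROUP BY c.name ORDER BY revenue DESC;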

4. EXPLAIN ANALYZE: See What's Actually Happening

You cannot reliably optimize SQL query speed without looking at the execution plan.

EXPLAIN ANALYZE
SELECT order_id, total_amount FROM orders
WHERE customer_id = 10482 AND status = 'completed'
ORDER BY created_at DESC LIMIT 50;

Key things to look for:

  • Seq Scan on a large table — need an index.

  • Rows estimate vs actual — large mismatches mean stale statistics. Run ANALYZE table_name;

  • Nested Loop on large tables — usually slower than Hash Join.

  • High cost nodes — focus optimization effort here.
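
When row estimates and actuals diverge, refreshing statistics is the quick fix. For tables this size it can also help to make autoanalyze run more often (PostgreSQL syntax; the thresholds below are illustrative values, not a recommendation for every workload):

ANALYZE orders;

ALTER TABLE orders SET (
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_scale_factor  = 0.02
);

The default scale factors mean a 50M row table can drift millions of rows before autoanalyze fires; lowering them keeps the planner's estimates current.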

5. Caching and Materialized Views

For expensive queries that don't need real-time results:

CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT DATE_TRUNC('month', created_at) AS month,
  SUM(total_amount) AS revenue, COUNT(*) AS order_count
FROM orders WHERE status = 'completed'
GROUP BY 1 ORDER BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_monthly_revenue_month
  ON monthly_revenue (month);

REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;

SELECT * FROM monthly_revenue
WHERE month >= NOW() - INTERVAL '12 months';

Common Mistakes to Avoid

  • Functions on indexed columns: WHERE DATE(created_at) = '2024-01-01' prevents index use. Use range instead.

  • Leading wildcards: WHERE name LIKE '%smith' forces a full scan.

  • Implicit type conversions: Comparing INT to string bypasses the index.

  • Missing LIMIT: SELECT * without a LIMIT on a 50M row table can flood the client and tie up server resources.

  • Running VACUUM/ANALYZE too infrequently on PostgreSQL.
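
The first two mistakes share a fix: rewrite the predicate so the indexed column stands alone on one side of the comparison. For the date example:

-- Prevents index use: the function hides the column from the index
SELECT order_id FROM orders
WHERE DATE(created_at) = '2024-01-01';

-- Index-friendly: a half-open range on the raw column
SELECT order_id FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-01-02';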

How AI Can Help Write Optimized Queries

Writing optimized SQL from scratch requires experience that takes years to build. AI2SQL generates production-ready, optimized SQL queries from plain English descriptions. Instead of writing a correlated subquery and then rewriting it as a JOIN, you describe what you want and AI2SQL produces optimized SQL that follows best practices automatically.

It's particularly useful when you're working with an unfamiliar schema, need to fix a slow SQL query quickly, or want a starting point to then tune with EXPLAIN ANALYZE.

Conclusion

SQL performance on large tables is a solvable problem. Strategic indexing, partitioning by date, rewriting inefficient patterns, reading execution plans, and caching expensive aggregations cover the vast majority of slow query issues.

Start with EXPLAIN ANALYZE to diagnose, add the right index as your first fix, and layer in partitioning and caching as data grows. Most 30-second queries can become sub-second with the right approach.

Want to skip the trial-and-error? Give AI2SQL a try — describe your query in plain English and get optimized, production-ready SQL in seconds.
