Spark SQL Refresher: What Data Engineers Need to Know in 2026


Why Spark SQL Still Matters in 2026

Spark SQL remains one of the most important tools in a data engineer's toolkit. With the rise of lakehouse architectures built on Delta Lake and Apache Iceberg, Spark SQL has evolved from a batch query layer into the standard interface for querying open table formats at scale.

DataFrames API vs SQL API

DataFrame API

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg
spark = SparkSession.builder.appName("refresher").getOrCreate()
df = spark.read.parquet("s3://data-lake/orders/")
# Wrap the chain in parentheses so the method calls on the
# following lines parse as one expression
result = (
    df.filter(col("status") == "completed")
      .groupBy("customer_id")
      .agg(sum("amount").alias("total_spent"), avg("amount").alias("avg_order"))
      .orderBy(col("total_spent").desc())
)
result.show()

SQL API

df.createOrReplaceTempView("orders")
result = spark.sql("""
  SELECT customer_id, SUM(amount) AS total_spent, AVG(amount) AS avg_order
  FROM orders WHERE status = 'completed'
  GROUP BY customer_id ORDER BY total_spent DESC
""")

Both compile to the same plan via the Catalyst optimizer. If you prefer writing plain SQL, AI2SQL can help generate syntactically correct SQL to paste into spark.sql().
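You can verify this equivalence yourself by inspecting the optimized plan. A sketch, assuming the `orders` temp view registered above:

```sql
-- Compare this output with result.explain() on the DataFrame side;
-- the physical plans should match
EXPLAIN FORMATTED
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id;
```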

Key Differences from Standard SQL

  • No indexes: Use partitioning, bucketing, Z-ordering instead

  • LATERAL VIEW explode() instead of UNNEST

  • Temp views are session-scoped

  • Schema evolution via mergeSchema option
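For example, where ANSI SQL would use UNNEST to flatten an array, Spark uses LATERAL VIEW with explode(). A sketch, assuming an `orders` table with an array column `item_ids`:

```sql
-- Produces one output row per element of the item_ids array
SELECT o.order_id, item.item_id
FROM orders o
LATERAL VIEW explode(o.item_ids) item AS item_id;
```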

Window Functions

SELECT rep_id, sale_date, revenue,
  SUM(revenue) OVER (PARTITION BY rep_id ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
  RANK() OVER (PARTITION BY DATE_TRUNC('month', sale_date)
    ORDER BY revenue DESC) AS monthly_rank
FROM sales;

UDFs: Use With Caution

Prefer Pandas UDFs (vectorized) over standard Python UDFs. They can be 10-100x faster.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def normalize_email_vectorized(emails: pd.Series) -> pd.Series:
    # Operates on whole pandas Series at once via Arrow,
    # avoiding per-row Python serialization overhead
    return emails.str.strip().str.lower()

Partitioning and Bucketing

df.write.mode("overwrite").partitionBy("year", "month").parquet("s3://data-lake/events/")
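Bucketing, by contrast, is declared at table-creation time rather than at write time. A sketch in Spark SQL (the table name and bucket count are illustrative, not from the original):

```sql
-- Pre-shuffle rows into 16 buckets on customer_id so joins and
-- aggregations on that key can avoid a full shuffle
CREATE TABLE orders_bucketed
USING parquet
CLUSTERED BY (customer_id) INTO 16 BUCKETS
AS SELECT * FROM orders;
```

Note that bucketed tables must be saved to the catalog (saveAsTable in the DataFrame API); plain file writes do not preserve bucketing metadata.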

Caching

Cache only what you reuse. Caching everything wastes executor memory.

CACHE TABLE orders;
-- run queries
UNCACHE TABLE orders;

Quick Reference

  • Catalyst Optimizer: SQL and DataFrame APIs use the same optimizer

  • AQE: Enabled by default since Spark 3.2

  • shuffle.partitions: Default 200, set to 2-3x your cores
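These settings can be adjusted per session. A sketch (the value 64 is illustrative; tune it to your cluster):

```sql
-- AQE is on by default since Spark 3.2; shown here for completeness
SET spark.sql.adaptive.enabled = true;
-- e.g. 64 for a cluster with roughly 24-32 cores, instead of the default 200
SET spark.sql.shuffle.partitions = 64;
```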

AI2SQL can generate baseline SQL to port to Spark.
