Why Spark SQL Still Matters in 2026
Spark SQL remains one of the most important tools in a data engineer's toolkit. With lakehouse table formats such as Delta Lake and Apache Iceberg layered on top, Spark SQL now covers ACID transactions, schema evolution, and time travel alongside classic batch analytics.
DataFrames API vs SQL API
DataFrame API
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

spark = SparkSession.builder.appName("refresher").getOrCreate()

df = spark.read.parquet("s3://data-lake/orders/")

# Wrap the chain in parentheses so the leading-dot continuation lines parse
result = (
    df.filter(col("status") == "completed")
    .groupBy("customer_id")
    .agg(sum("amount").alias("total_spent"), avg("amount").alias("avg_order"))
    .orderBy(col("total_spent").desc())
)
result.show()
SQL API
df.createOrReplaceTempView("orders")
result = spark.sql("""
SELECT customer_id, SUM(amount) AS total_spent, AVG(amount) AS avg_order
FROM orders WHERE status = 'completed'
GROUP BY customer_id ORDER BY total_spent DESC
""")
Both compile to the same plan via the Catalyst optimizer. If you prefer writing plain SQL, AI2SQL can help generate syntactically correct SQL to paste into spark.sql().
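You can verify the shared plan yourself: EXPLAIN on the SQL side (or .explain() on a DataFrame) prints what Catalyst produced. A minimal check against the orders temp view registered above:

```sql
-- Inspect the optimized and physical plans Catalyst generates
EXPLAIN FORMATTED
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id;
```

Running this for both the SQL string and the equivalent DataFrame chain should show matching physical plans.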
Key Differences from Standard SQL
No indexes: Use partitioning, bucketing, Z-ordering instead
LATERAL VIEW explode() instead of UNNEST
Temp views are session-scoped
Schema evolution via mergeSchema option
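The UNNEST difference trips people up most often when porting queries. A sketch, assuming a hypothetical orders table with an items array column:

```sql
-- One output row per array element (items is an assumed ARRAY column)
SELECT order_id, item
FROM orders
LATERAL VIEW explode(items) exploded AS item;
```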
Window Functions
SELECT rep_id, sale_date, revenue,
SUM(revenue) OVER (PARTITION BY rep_id ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
RANK() OVER (PARTITION BY DATE_TRUNC('month', sale_date)
ORDER BY revenue DESC) AS monthly_rank
FROM sales;
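The frame clause is worth internalizing: swapping UNBOUNDED PRECEDING for a fixed offset turns the running total above into a trailing window. A sketch of a trailing 7-row average over the same assumed sales table:

```sql
-- Average of the current row plus the 6 preceding rows, per rep
SELECT rep_id, sale_date, revenue,
       AVG(revenue) OVER (PARTITION BY rep_id ORDER BY sale_date
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_avg
FROM sales;
```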
UDFs: Use With Caution
Prefer built-in functions first; when you genuinely need a UDF, use Pandas UDFs (vectorized via Apache Arrow) rather than row-at-a-time Python UDFs. By avoiding per-row serialization, they can be 10-100x faster.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def normalize_email_vectorized(emails: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series at once instead of row by row
    return emails.str.strip().str.lower()
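A side benefit of the vectorized form: the body is plain pandas, so you can unit-test the logic without a SparkSession. A minimal sketch (the standalone normalize_emails helper is illustrative, not part of the UDF above):

```python
import pandas as pd

def normalize_emails(emails: pd.Series) -> pd.Series:
    # Same vectorized logic as the pandas UDF body above
    return emails.str.strip().str.lower()

print(normalize_emails(pd.Series(["  Alice@Example.COM ", "BOB@test.io"])).tolist())
```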
Partitioning and Bucketing
df.write.mode("overwrite").partitionBy("year", "month").parquet("s3://data-lake/events/")
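Bucketing, unlike path-based partitioning, writes through the table catalog. A sketch in Spark SQL (table and column names assumed):

```sql
-- 16 buckets on customer_id lets joins keyed on customer_id avoid a shuffle
CREATE TABLE orders_bucketed
USING parquet
CLUSTERED BY (customer_id) INTO 16 BUCKETS
AS SELECT * FROM orders;
```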
Caching
Cache only what you reuse. Caching everything wastes executor memory.
CACHE TABLE orders;
-- run queries
UNCACHE TABLE orders;
Quick Reference
Catalyst Optimizer: SQL and DataFrame APIs use the same optimizer
AQE: Enabled by default since Spark 3.2
spark.sql.shuffle.partitions: Default 200; set to roughly 2-3x your total executor cores, or rely on AQE to coalesce small partitions
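The shuffle-partition advice above is a one-line session setting. A sketch (64 is an assumed value for a roughly 24-core cluster):

```sql
-- Tune shuffle parallelism for this session; AQE can still coalesce further
SET spark.sql.shuffle.partitions = 64;
-- Print the current value to confirm AQE is on (default since 3.2)
SET spark.sql.adaptive.enabled;
```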
AI2SQL can generate baseline SQL to port to Spark.