Spark JDBC Data Source: The Complete Optimization Guide for Reads, Writes, and Pushdown

Cazpian Engineering · Platform Engineering Team · 43 min read


You have a 500 million row table in PostgreSQL. You write spark.read.jdbc(url, "orders", properties) and hit run. Thirty minutes later, the job is still running. One executor is at 100% CPU. The other 49 are idle. Your database server is pegged at a single core, streaming rows through a single JDBC connection while your 50-node Spark cluster sits there doing nothing.

This is the default behavior of Spark JDBC reads. No partitioning. No parallelism. One thread, one connection, one query: SELECT * FROM orders. Every row flows through a single pipe. It is the number one performance mistake data engineers make with Spark JDBC, and it is the default.
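To see why the default collapses to one pipe, it helps to look at how Spark turns partitioning options into per-partition WHERE clauses. The sketch below is a simplified, pure-Python approximation of the logic in Spark's JDBCRelation (the real implementation uses Long arithmetic and handles edge cases this omits); the function name and stride calculation are illustrative assumptions, not Spark's actual API. The key point: with no partitionColumn/lowerBound/upperBound/numPartitions set, Spark has exactly one partition and therefore issues exactly one query.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark JDBC derives per-partition
    WHERE clauses. With num_partitions <= 1 (the effective default
    when no partitioning options are supplied), there is a single
    partition with no predicate: one connection, one SELECT *."""
    if num_partitions <= 1:
        return [None]  # one unbounded query -> one task, one executor busy

    stride = (upper - lower) // num_partitions
    boundary = lower + stride
    predicates = []

    # First partition is open-ended below and also picks up NULLs,
    # so no rows are silently dropped.
    predicates.append(f"{column} < {boundary} OR {column} IS NULL")

    # Middle partitions each cover one stride-sized range.
    for _ in range(num_partitions - 2):
        predicates.append(
            f"{column} >= {boundary} AND {column} < {boundary + stride}"
        )
        boundary += stride

    # Last partition is open-ended above.
    predicates.append(f"{column} >= {boundary}")
    return predicates
```

For example, partitioning an orders table on order_id over the range 0 to 100 with four partitions yields four non-overlapping predicates, so Spark can open four connections and read the ranges concurrently, instead of the single `[None]` predicate the default produces.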

This post covers everything you need to know to fix it and to optimize every aspect of Spark JDBC reads and writes. We start with why the default is so slow, then go deep on parallel reads, all pushdown optimizations, fetchSize and batchSize tuning, database-specific configurations, write optimizations, advanced patterns, monitoring and debugging, anti-patterns, and a complete configuration reference.