Spark Runtime Metrics Collection with DriverPlugin, ExecutorPlugin, and SparkListener
You tuned your Spark cluster. You picked the right join strategies. You enabled AQE. But you are still flying blind. When a job takes twice as long as yesterday, you open the Spark UI, scroll through 200 stages, and guess. When an Iceberg scan suddenly plans for 12 seconds instead of 2, you have no history to compare against. You cannot trend what you do not collect.
Apache Spark ships a powerful but underused plugin system — DriverPlugin, ExecutorPlugin, and SparkListener — that lets you tap into every metric the engine produces at runtime. Combined with Iceberg's MetricsReporter, you get a unified view of compute and storage performance for every query, every task, and every table scan. This post shows you how to build that pipeline from scratch, store the metrics at scale, and turn raw numbers into actionable performance insights.
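To make the listener half concrete, here is a minimal sketch of what such a hook looks like. `SparkListener`, `SparkListenerTaskEnd`, and the task-metrics fields are real Spark APIs; the class name `TaskMetricsListener` and the plain `println` sink are placeholders for illustration — a production version would ship these records to your metrics store instead.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical example listener: logs per-task runtime and shuffle-read volume.
// Register it without touching application code via:
//   spark.extraListeners=com.example.TaskMetricsListener
class TaskMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be null for failed/killed tasks
      println(
        s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runMs=${m.executorRunTime} " +
        s"shuffleReadBytes=${m.shuffleReadMetrics.totalBytesRead}")
    }
  }
}
```

The rest of this post builds on this pattern: the same callback-driven model applies to `DriverPlugin` and `ExecutorPlugin`, which additionally let you register custom metric sources and run code inside each executor's JVM.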