PySpark World · Made for Humans

Learn PySpark, the human way

One concept at a time — story, visual, code, quiz, practice. No dry docs. No setup required.

Chapter 01

Lazy Evaluation & the DAG

Why is Spark so fast? Because it refuses to work until you absolutely make it. Understanding this unlocks every other Spark optimization.

1
Story
The Lazy Chef — Why Spark Waits

Imagine you walk into a restaurant and give the chef a complex order — "steam the rice, julienne the carrots, simmer the broth, combine them, reduce the sauce, then plate."

A normal chef starts Step 1 immediately. But a genius chef pauses, reads the whole order first, realizes — "I can steam the rice while the broth simmers. I only need carrots, so I won't even prep the potatoes." — then starts. The dish arrives faster with less wasted effort.

The Spark Analogy

Your code = the order slip. Every time you write .filter(), .groupBy(), or .select() you're adding a line to the order. Spark reads it and says "noted" but doesn't cook. No rows are read, and no job runs on the cluster.

Calling .show() or .collect() = "Serve it!" This is an Action. Only now does Spark spring into motion. Before touching a single row, it reviews your entire order, builds an optimized plan called the DAG, and executes.

The core insight: Spark separates defining work (Transformations — lazy) from doing work (Actions — eager). This separation lets the Catalyst Optimizer rewrite your full pipeline into something far more efficient before touching data.

Every Spark operation is one of two types:

  • TRANSFORMATION: Returns a new DataFrame. Lazy, so nothing executes. e.g. .filter() .select() .groupBy() .join() .withColumn()
  • ACTION: Triggers execution of the full DAG. Eager. e.g. .show() .count() .collect() .write.parquet()
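
The split between lazy transformations and eager actions can be sketched in plain Python. The `LazyPipeline` class below is a hypothetical toy model, not Spark's implementation, and the two employee rows are made up. Transformations only append a step to a recorded plan; the action `collect()` is the first point where any row is touched.

```python
# Toy model of lazy evaluation (hypothetical class, NOT how Spark is implemented).
class LazyPipeline:
    def __init__(self, rows, plan=()):
        self._rows = rows          # source data, untouched until an action runs
        self._plan = tuple(plan)   # recorded transformations, in order

    # Transformations: return a NEW pipeline with a longer plan; no rows are read.
    def filter(self, predicate):
        return LazyPipeline(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyPipeline(self._rows, self._plan + (("select", columns),))

    # Action: runs every recorded step against the data, in order.
    def collect(self):
        rows = self._rows
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)


# Made-up sample rows for illustration.
employees = [
    {"name": "Ada", "dept": "Engineering"},
    {"name": "Bo", "dept": "HR"},
]

pipeline = (
    LazyPipeline(employees)
    .filter(lambda r: r["dept"] == "Engineering")
    .select("name")
)
print(len(pipeline._plan))   # number of steps recorded so far; nothing has run
print(pipeline.collect())    # the action executes the whole recorded plan
```

Because the whole plan is available before `collect()` runs, a smarter version could reorder or merge steps first. That is the opening the Catalyst Optimizer uses in real Spark.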
2
Visual
The DAG — Spark's Optimized Execution Blueprint

As you chain transformations, Spark records them in a DAG (Directed Acyclic Graph): a map of every transformation and its dependencies. When an Action fires, the Catalyst Optimizer rewrites that graph into an efficient physical plan before execution.

You write this (nothing executes yet):

  employees.parquet (10 million rows · 8 columns · on disk)
  → TRANSFORMATION .filter(dept == "Engineering")
  → TRANSFORMATION .groupBy("dept")
  → TRANSFORMATION .count()
  → ★ ACTION .show() (triggers the whole DAG!)

Catalyst rewrites your plan before any data is read:

  • Filter Pushdown: Moves the filter to scan time. 10M rows → ~3K at the disk, roughly 99.97% fewer rows processed.
  • Column Pruning: Only reads "dept" from disk. 8 columns → 1 column read.
  • Stage Fusion: Filter + partial aggregation merged into one pipeline pass.

DAG = Directed Acyclic Graph. "Directed" = each step flows forward only. "Acyclic" = no loops. Spark builds this graph silently as you chain transformations — then the Catalyst Optimizer rewrites it into something far more efficient before any execution begins.
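
To make "directed" and "acyclic" concrete, here is a small sketch that models the pipeline from the visual as a dependency graph using Python's standard-library `graphlib` (Python 3.9+). It only illustrates the graph idea; the node labels are plain strings chosen for this example, and none of this is Spark's scheduler.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each key lists the steps it depends on. Edges only point one way ("directed").
dag = {
    "filter": {"scan employees.parquet"},
    "groupBy": {"filter"},
    "count": {"groupBy"},
    "show": {"count"},
}

# A valid execution order exists only because the graph has no loops ("acyclic").
order = list(TopologicalSorter(dag).static_order())
print(order)

# A graph with a loop has no valid order, so it is rejected.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
print(cycle_rejected)
```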
3
Code
See It Live — Annotated Real Code

The transformations chain silently with no output, no logs, no data movement. Only .show() fires the engine. Then .explain() reveals the optimized plan Spark built.

Lazy Evaluation Demo
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# ── Step 1: Load data ─────────────────────────────────────────────────
# No rows are read here. Spark only checks the file's schema and notes "read from this file".
df = spark.read.parquet("employees.parquet")
print(type(df))  # <class 'pyspark.sql.dataframe.DataFrame'>  ← just a plan!

# ── Step 2: Chain Transformations ────────────────────────────────────
# Still nothing. Spark is collecting your recipe, not cooking it.
filtered = df.filter(col("dept") == "Engineering")  # lazy
grouped  = filtered.groupBy("dept")                 # lazy
result   = grouped.count()                          # lazy (count here = transform)
# No rows read from disk. No rows processed. No Spark job launched.

# ── Step 3: ACTION ───────────────────────────────────────────────────
# ★ Only NOW does Spark execute everything at once ★
result.show()     # triggers the full DAG
result.explain()  # reveals the physical plan Spark built
Output
+-----------+-----+
|       dept|count|
+-----------+-----+
|Engineering| 3421|
+-----------+-----+

== Physical Plan ==
*(2) HashAggregate(keys=[dept#5], functions=[count(1)])
+- Exchange hashpartitioning(dept#5, 200)
   +- *(1) HashAggregate(keys=[dept#5], functions=[partial_count(1)])
      +- *(1) Filter (isnotnull(dept#5) AND (dept#5 = Engineering))
         +- FileScan parquet [dept#5]
                PushedFilters: [IsNotNull(dept), EqualTo(dept,Engineering)]
                ReadSchema: struct<dept:string>   ← only 1 of 8 columns!
Look at the last two lines: PushedFilters moved the filter to scan time, and ReadSchema: struct<dept:string> means Spark read only 1 column instead of 8. You didn't write that optimization — Spark inferred it from your lazy pipeline.

⚠ Common Mistake — The Loop Trap

# ✗ BAD: 4 Actions = 4 full scans of the entire dataset
for dept in ["Engineering", "HR", "Finance", "Sales"]:
    df.filter(col("dept") == dept).count()   # 4 × .count() = 4 jobs!

# ✓ GOOD: 1 Action = 1 scan, all departments answered at once
df.groupBy("dept").count().show()
4
Quiz · 5 Questions
Test Your Understanding

Question 1 / 5
When does Spark actually execute the transformations you've written?
Transformations are lazy — they just extend the plan. Spark executes only when an Action is called. This delay is what lets Catalyst optimize your full pipeline before touching a single row of data.
Question 2 / 5
What is the DAG in Apache Spark?
DAG = Directed Acyclic Graph. It maps every transformation and its dependencies, built silently during the lazy phase. The Catalyst Optimizer rewrites it into the most efficient physical plan before any execution begins.
Question 3 / 5
Which of these is an Action (not a Transformation)?
.collect() is an Action: it triggers execution and returns all rows to the driver as a Python list of Row objects. Warning: avoid calling .collect() on a huge DataFrame; it pulls everything into driver memory and can crash the driver with an out-of-memory error.
Question 4 / 5
You run df.select("name","salary").filter(col("salary") > 50000).show(). The Parquet file has 50 columns. How many columns does Spark read from disk?
Spark reads just 2 columns: name and salary. Column pruning is one of Catalyst's most impactful tricks. Because Parquet is a columnar format, Spark skips the 48 unneeded columns entirely at the I/O level and never reads them from disk.
Question 5 / 5
A junior engineer writes this code. What's the problem?
for dept in ["Eng", "HR", "Finance", "Sales"]:
    df.filter(col("dept") == dept).count()
Every .count() is an Action = a new Spark job = a full scan of all partitions. The fix: df.groupBy("dept").count().show() — one pass through the data, one optimized job, all four answers at once.
5
Practice
Now Write It Yourself

Theory absorbed, quiz done — time to write real code. Q06 in the Question Bank is dedicated to lazy evaluation. Open it, run .explain(), and watch filter pushdown appear yourself.

Go to Question Bank → find Q06 (opens PySpark Lab; look for Q06 "What is lazy evaluation?" and run explain() live)

Go Deeper

After running the basic example, try these in the playground:

1. Call .count() twice on the same DataFrame — does Spark run 2 jobs?
2. Add .cache() before the second call — what changes in the plan?
3. Use .explain("extended") to see both the logical and physical plans side-by-side.

Chapter Complete
You understand Lazy Evaluation!
You know why Spark is fast, what the DAG is, and the most common performance trap to avoid. You're thinking like a data engineer now.
Up Next
02
Partitions & Shuffles
Why .groupBy() on millions of rows can be painfully slow — and how controlling partitions fixes it.
Coming Soon
More chapters, releasing one by one:

  • Chapter 02: Partitions & Shuffles
  • Chapter 03: DataFrames & Schemas
  • Chapter 04: Joins & Broadcast
  • Chapter 05: Window Functions
  • Chapter 06: Performance Tuning