PySpark Playground

region	total ($)	avg_sale ($)
North America	4,821,540	312.40
Europe	3,109,220	287.15
Asia Pacific	2,650,800	241.30
Latin America	980,440	198.75
Middle East	430,120	176.50

Deep Dive

Core Spark Concepts

Most asked: Catalyst optimizer, shuffle partitions and join strategies (80% of senior interviews).

Depth matters: interviewers love "why" answers. Know the internal mechanics, not just the API.

Partitioning & Shuffling

Analogy

Think of your DataFrame as a giant book. Spark splits it into chapters (partitions) and hands each chapter to a different reader (executor). All readers work simultaneously, so the job finishes in roughly the time it takes to read the longest chapter, not the whole book.

What is a Partition?

A partition is a logical chunk of your DataFrame stored on one executor node. Spark processes each partition as one Task. More partitions = more parallelism (up to a point).

Diagram: DataFrame split across 4 executors in parallel. A 4-million-row DataFrame splits into 4 partitions (Partition 0 to Partition 3). Each partition is assigned to one executor (Executor 1 to Executor 4) and processed as one task (Task 1 to Task 4). All 4 tasks run simultaneously.

Default Partitions

spark.sql.shuffle.partitions defaults to 200, so every shuffle produces 200 partitions unless you change the setting or Adaptive Query Execution coalesces them. That fixed number is often wrong for your cluster and data size.

Rule of thumb: aim for 128–256 MB per partition and 2–4× your total executor cores. A 10 GB dataset on 20 cores needs ~50 partitions, not 200. Over-partitioning causes scheduler overhead; under-partitioning wastes cores.
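The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch in plain Python (no Spark needed); the function name suggest_partitions and the 200 MB target size are illustrative assumptions, not Spark settings. When the two rules disagree, the sketch simply clamps the size-based estimate into the 2–4× cores range.

```python
import math

def suggest_partitions(dataset_mb, total_cores, target_mb=200):
    """Rough shuffle-partition count from the two rules of thumb.

    dataset_mb  : size of the data being shuffled, in MB
    total_cores : executor cores available across the cluster
    target_mb   : assumed target partition size (inside the 128-256 MB band)
    """
    by_size = math.ceil(dataset_mb / target_mb)    # size-based estimate
    low, high = 2 * total_cores, 4 * total_cores   # 2-4x total cores
    return max(low, min(by_size, high))            # clamp into the core-based range

# 10 GB on 20 cores: 10240 / 200 = 51.2, rounded up to 52, inside the 40-80 range
n = suggest_partitions(10 * 1024, 20)
print(n)  # → 52
```

The result could then be applied with spark.conf.set("spark.sql.shuffle.partitions", n) before the shuffle-heavy part of the job.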

repartition() vs coalesce()

repartition(n)

Full shuffle — data crosses the network
Can increase or decrease partition count
Produces roughly equal-sized partitions
Use to rebalance skewed data or increase partitions

coalesce(n)

No full shuffle: merges existing partitions instead of redistributing rows
Can only decrease partition count
May produce uneven partitions
Use before writing to reduce output file count
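To make the difference concrete, here is a toy model in plain Python where a "partition" is just a list of rows. The two functions are simplified stand-ins written for illustration, not Spark's actual implementation: coalesce glues whole neighbouring partitions together, while repartition redistributes every row.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions without moving individual rows."""
    n = min(n, len(partitions))                  # coalesce can only reduce the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)   # whole partitions stay together
    return merged

def repartition(partitions, n):
    """Redistribute every row round-robin (models a full shuffle)."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]

# Uneven input: 100, 2, 1 and 3 rows
parts = [list(range(100)), [100, 101], [102], [103, 104, 105]]

print([len(p) for p in coalesce(parts, 2)])      # → [102, 4]
print([len(p) for p in repartition(parts, 2)])   # → [53, 53]
```

The uneven output of coalesce is the price of skipping the shuffle, which is usually acceptable when the goal is only to write fewer files.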

What Causes a Shuffle?

A shuffle moves data across the network between executors — the most expensive operation in Spark. It involves serialization, disk I/O, and network transfer.

Wide transformations trigger shuffles: groupBy, join, distinct, orderBy, repartition
Narrow transformations are cheap: filter, select, withColumn. Each partition works independently, with no network traffic

Reduce shuffles: filter and select early (before joins), broadcast small tables (broadcast(df)), and pre-partition datasets on join keys to avoid repeated reshuffling.
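A small plain-Python simulation shows why a groupBy needs a shuffle: rows with the same key start out scattered across partitions and must be moved to a single owner before they can be summed. This is a sketch under stated assumptions: zlib.crc32 stands in for Spark's internal hash function, and the partition index stands in for "which executor holds the row".

```python
import zlib

def target_partition(key, num_partitions):
    """Deterministic hash partitioner (crc32 is a stand-in for Spark's hash)."""
    return zlib.crc32(key.encode()) % num_partitions

def shuffle_by_key(input_partitions, num_partitions):
    """Move every (key, value) row to the partition that owns its key."""
    output = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(input_partitions):
        for key, value in part:
            dst = target_partition(key, num_partitions)
            moved += dst != src                  # a row that changes partition crosses the network
            output[dst].append((key, value))
    return output, moved

inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5), ("a", 6)]]
shuffled, moved = shuffle_by_key(inputs, 3)

# After the shuffle each key lives in exactly one partition,
# so summing per partition gives the global sum per key.
totals = {}
for part in shuffled:
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
print(totals, "rows moved:", moved)
```

A filter or select never needs this step, because each output row depends only on a row already in the same partition.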

Caching & Persistence

Analogy

Without caching, asking Spark the same question twice is like a chef cooking the entire meal from scratch both times you ask. With cache, the first batch is kept warm — the second serving is instant.

Why Cache?

Spark's lazy evaluation re-runs the full transformation chain for every Action. Calling .count() and then .show() on the same DataFrame reads the source file and reruns every transform — twice. Cache breaks this cycle.

Without cache (2× work)

S3 read → filter → join → .count()
S3 read → filter → join → .show()

Source read and transforms run twice.

With cache (1× work)

S3 read → filter → join → [cached]
[cached] → .count()
[cached] → .show()

Source read once, reused for both actions.
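The same 2× versus 1× behaviour can be reproduced with a toy class in plain Python. LazyFrame is a hypothetical stand-in for a DataFrame, written only for this illustration; it counts how often the "source" is read.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: recomputes from source unless cached."""

    def __init__(self, source):
        self.source = source          # callable that "reads" the data
        self.source_reads = 0
        self._cached = None
        self._use_cache = False

    def cache(self):
        self._use_cache = True        # like df.cache(): only marks the frame, nothing runs yet
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached       # reuse the materialized result
        self.source_reads += 1
        rows = [r for r in self.source() if r % 2 == 0]   # stands in for filter -> join
        if self._use_cache:
            self._cached = rows       # materialized by the first action
        return rows

    def count(self):
        return len(self._compute())

    def show(self):
        print(self._compute())

plain = LazyFrame(lambda: range(10))
plain.count()
plain.show()

cached = LazyFrame(lambda: range(10)).cache()
cached.count()
cached.show()

print(plain.source_reads, cached.source_reads)   # → 2 1
```

Note that cache() itself does no work; the first action after it pays the full cost and fills the cache.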

cache() vs persist()

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK). Use persist() when you need a specific storage level.

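Storage Levels

MEMORY_ONLY: fastest; partitions that do not fit in memory are not cached and get recomputed when needed
MEMORY_AND_DISK: safe default
MEMORY_ONLY_SER: smaller footprint (Scala/Java API; PySpark always stores data serialized and does not expose this level)
DISK_ONLY: slow, saves RAM
OFF_HEAP: no GC pressure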

Checkpointing vs Caching

Cache

Stored on executor memory / disk
Lineage is preserved
Lost if executor dies
Fast — no remote write

Checkpoint

Saved to HDFS / S3
Lineage is truncated — Spark forgets how it got there
Survives executor failure
Use in long iterative jobs and Streaming

Joins & Broadcast Strategy

Analogy

Broadcast join = photocopying a small menu and handing a copy to each chef. Each chef has everything they need and never leaves their station. Zero network traffic for the big table.

Sort-merge join = sending all chefs to a central filing cabinet, sorted alphabetically, to find their matching order. Works for any size, but there's a lot of walking (network shuffle).

Join Types

inner — only rows that match on both sides
left / right — all rows from one side; nulls where no match on the other
full — all rows from both sides, nulls wherever no match exists
left_semi — left rows where a match exists in right (equivalent to WHERE EXISTS)
left_anti — left rows where no match in right (equivalent to WHERE NOT EXISTS)
cross — Cartesian product: every left row × every right row — expensive, use deliberately

Join Strategies

Broadcast Hash Join

Small table (<10 MB default) copied to every executor
Large table has zero shuffle
Fastest strategy; best for dimension tables
Force it: broadcast(df) hint
Threshold: spark.sql.autoBroadcastJoinThreshold

Sort-Merge Join

Both sides shuffled by join key, then sorted
Default for large ↔ large joins
Predictable and memory-efficient
Bucketing both tables on the join key (with matching bucket counts) lets Spark skip the shuffle

Shuffle Hash Join

Shuffles both sides on the join key, then builds a hash map on the smaller side. Faster than sort-merge when one side is clearly smaller but too big to broadcast. Needs more memory — can OOM on very large datasets.
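The mechanics of the first two strategies can be sketched in plain Python. These functions are simplified illustrations, not Spark internals: rows are (key, value) tuples, the sort-merge version assumes unique keys on the right side, and the shuffle that precedes a real sort-merge join is left out.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Join each large partition against a full copy of the small table."""
    lookup = dict(small_rows)                     # built once, "broadcast" to every task
    return [
        [(key, value, lookup[key]) for key, value in part if key in lookup]
        for part in large_partitions              # partitions are never regrouped: no shuffle
    ]

def sort_merge_join(left, right):
    """Inner join by sorting both sides and walking them with two cursors."""
    left, right = sorted(left), sorted(right)
    out, j = [], 0
    for key, value in left:
        while j < len(right) and right[j][0] < key:
            j += 1                                # advance the right cursor past smaller keys
        if j < len(right) and right[j][0] == key:
            out.append((key, value, right[j][1]))
    return out

sales = [[(1, 100), (2, 50)], [(1, 70), (3, 20)]]   # large side, two partitions
depts = [(1, "Eng"), (2, "HR")]                     # small side

bhj = broadcast_hash_join(sales, depts)
smj = sort_merge_join([row for part in sales for row in part], depts)
print(bhj)   # → [[(1, 100, 'Eng'), (2, 50, 'HR')], [(1, 70, 'Eng')]]
print(smj)   # → [(1, 70, 'Eng'), (1, 100, 'Eng'), (2, 50, 'HR')]
```

Both produce the same rows; the difference is that the broadcast version leaves the large table's partitions where they are.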

Broadcast Variable vs Broadcast Join

A broadcast variable (sc.broadcast(value)) ships any read-only Python object — a dict, lookup map — to all executors once. A broadcast join is the DataFrame-level application of this: ship a small DataFrame so Spark never needs to shuffle the large one.

Data Skew & Salting

Analogy

Skew is like 4 checkout lanes at a supermarket — but one cashier has 200 people in line and the others have 5 each. The store closes when the last customer is served, so the busiest lane sets your total wait time. The other 3 cashiers are wasted.

What is Data Skew?

Skew occurs when some partition keys carry vastly more rows than others. During groupBy or join, one executor processes 80% of the data while others sit idle. Your job's duration = the slowest task's duration.

Skewed partitions: one task takes 100× longer

P0: 7.8M rows · P1: 80K · P2: 60K · P3: 70K

Executors 2–4 finish in seconds and sit idle. Job waits on P0.

Detecting Skew

df.groupBy("join_key").count().orderBy("count", ascending=False).show(10)

Also check Spark UI → Stages → Task metrics. If the 75th-percentile task duration is 2s but the max is 4 minutes, you have a hot key.
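The Spark UI check boils down to comparing the slowest task with a typical one. A minimal plain-Python sketch of that comparison follows; the skew_ratio helper and the sample durations are made up for illustration, and any cutoff you pick (for example "investigate above 5×") is a judgment call rather than a Spark rule.

```python
import statistics

def skew_ratio(task_seconds):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_seconds) / statistics.median(task_seconds)

balanced = [2.0, 2.1, 1.9, 2.2]
skewed = [2.0, 2.1, 1.9, 240.0]      # one task runs for 4 minutes

print(round(skew_ratio(balanced), 2))
print(round(skew_ratio(skewed), 2))
```

A ratio close to 1 means the tasks are balanced; a ratio in the hundreds points at a hot key.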

Salting — How to Fix It

Append a random integer ("salt") to the hot key to spread it across N sub-partitions. For an aggregation, run a partial aggregation on the salted key, then strip the salt and do the final rollup. For a join, salt the large side and replicate the small side once per salt value:

import pyspark.sql.functions as F

SALT = 10  # spread hot key across 10 sub-partitions

# 1: salt the large (skewed) DataFrame with a random value in [0, SALT)
df_salted = df.withColumn("salt", (F.rand() * SALT).cast("int")) \
              .withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt").cast("string")))

# 2: explode the small (lookup) DataFrame to match every salt value
lookup_exploded = lookup.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT)]))) \
                        .withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt").cast("string"))) \
                        .drop("join_key", "salt")  # avoid duplicate column names after the join

# 3: join on the salted key, then drop the helper columns
result = df_salted.join(lookup_exploded, "salted_key").drop("salt", "salted_key")

What are Window Functions?

Window functions compute values across a sliding window of rows related to the current row — without reducing the row count. Unlike groupBy + agg, every input row produces exactly one output row with new computed columns.

Defining a Window

from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank, dense_rank, row_number, lag, lead, sum, avg

# Partition by dept, order by salary — highest first
w = Window.partitionBy("dept").orderBy(col("salary").desc())

df.withColumn("rank",         rank().over(w))       \
  .withColumn("dense_rank",   dense_rank().over(w)) \
  .withColumn("row_num",      row_number().over(w)) \
  .withColumn("prev_salary",  lag("salary", 1).over(w)) \
  .withColumn("running_sum",  sum("salary").over(
      w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

rank() vs dense_rank() vs row_number()

These three look similar but behave differently when rows tie. The table below shows the same dataset through each lens:

Name	Salary	rank()	dense_rank()	row_number()
Alice	90,000	1	1	1
Bob	90,000	1	1	2
Carol	75,000	3	2	3
Dave	60,000	4	3	4
		Gap after tie (skips 2)	No gap ever	Always unique

Interview trap: "Get the top-N per group" questions — use row_number() for strictly unique results, rank() to include all tied entries at the cutoff position.
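The tie-handling rules in the table are easy to verify by hand. The plain-Python sketch below reimplements the three functions for a single partition sorted by salary descending; rank_all is a hypothetical helper written for this illustration, not a Spark API.

```python
def rank_all(salaries):
    """Return (rank, dense_rank, row_number) lists for salaries sorted descending."""
    ordered = sorted(salaries, reverse=True)
    ranks, dense, rows = [], [], []
    for i, value in enumerate(ordered):
        rows.append(i + 1)                               # row_number: position, always unique
        if i > 0 and value == ordered[i - 1]:
            ranks.append(ranks[-1])                      # tie: repeat the previous rank
            dense.append(dense[-1])
        else:
            ranks.append(i + 1)                          # rank: jumps to the row position after a tie
            dense.append(dense[-1] + 1 if dense else 1)  # dense_rank: next consecutive integer
    return ranks, dense, rows

print(rank_all([90000, 90000, 75000, 60000]))
# → ([1, 1, 3, 4], [1, 1, 2, 3], [1, 2, 3, 4])
```

One caveat worth mentioning in an interview: with ties, which row gets row_number 1 versus 2 is arbitrary unless the ORDER BY includes a tie-breaker column.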

Frame Specification

rowsBetween(start, end) counts physical rows (position-based). rangeBetween(start, end) counts by value distance. Use Window.unboundedPreceding / Window.unboundedFollowing to span the entire partition.
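The difference between the two frame types shows up as soon as the ordering column has ties. The sketch below computes a running sum from the start of the partition to the current row both ways, in plain Python; the helper names are illustrative, and the range version is written specifically for a descending sort.

```python
def running_sum_rows(values):
    """ROWS frame: everything up to and including the current physical row."""
    return [sum(values[: i + 1]) for i in range(len(values))]

def running_sum_range(values):
    """RANGE frame on a descending sort: every row whose value is >= the current one."""
    return [sum(v for v in values if v >= current) for current in values]

salaries = [90000, 90000, 75000, 60000]      # one partition, already sorted descending

print(running_sum_rows(salaries))    # → [90000, 180000, 255000, 315000]
print(running_sum_range(salaries))   # → [180000, 180000, 255000, 315000]
```

With a range frame the two tied rows are peers, so both see the same total; with a rows frame each row sees only what physically precedes it.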

Delta Lake Basics

Analogy

Delta Lake is like Git for your data files. Every write adds a commit to a transaction log. You can check out any past version, roll back a bad write, and multiple readers/writers work safely at the same time without corrupting each other.

What is Delta Lake?

Delta Lake is an open-source storage layer that adds ACID transactions to data lakes. It sits on top of Parquet files and adds a _delta_log/ JSON transaction log that records every change — inserts, updates, deletes, schema changes.

Delta Lake storage structure

Data files: part-0001.parquet, part-0002.parquet, part-0003.parquet
_delta_log/: 00000.json (v0 schema), 00001.json (insert), 00002.json (update)
Checkpoints: 00010.checkpoint.parquet, a state snapshot written every 10 commits
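How a reader turns the log into "the table at version N" can be sketched in a few lines of plain Python. This is a deliberately simplified model: real commit files hold one JSON action per line with many more fields, and the dictionary layout below (lists under "add" and "remove") is an assumption made for the example.

```python
import json

# Each entry stands in for one commit file in _delta_log/ (contents simplified)
commits = [
    json.dumps({"add": ["part-0001.parquet"], "remove": []}),                     # v0
    json.dumps({"add": ["part-0002.parquet"], "remove": []}),                     # v1: insert
    json.dumps({"add": ["part-0003.parquet"], "remove": ["part-0001.parquet"]}),  # v2: update rewrites a file
]

def files_at_version(log, version):
    """Replay commits 0..version to find which data files make up the table."""
    live = set()
    for entry in log[: version + 1]:
        actions = json.loads(entry)
        live |= set(actions["add"])
        live -= set(actions["remove"])
    return sorted(live)

print(files_at_version(commits, 1))   # → ['part-0001.parquet', 'part-0002.parquet']
print(files_at_version(commits, 2))   # → ['part-0002.parquet', 'part-0003.parquet']
```

Because an update only marks the old file as removed in the log, the file itself stays on storage, which is what makes time travel to an earlier version possible until the file is vacuumed.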

ACID in Delta Lake

Atomicity: writes are all-or-nothing — a failed write is rolled back, no partial data lands
Consistency: schema enforcement blocks bad data at write time
Isolation: snapshot isolation — readers always see a consistent view even during concurrent writes
Durability: committed transactions persist in the log on S3 / ADLS / HDFS

Time Travel

# Read a past version
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/table")

# Or by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/data/table")

# Rollback to version 3
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/data/table").restoreToVersion(3)

Z-Ordering (Clustering)

Z-ordering co-locates related data into the same Parquet files. Queries with filters on a Z-ordered column skip entire files — reading far less data.

delta_table.optimize().executeZOrderBy("user_id", "event_date")

Run OPTIMIZE with Z-ORDER after large bulk loads. Run VACUUM to delete data files the table no longer references and reclaim storage (default retention: 7 days).

Structured Streaming

Analogy

Think of a Kafka topic as a conveyor belt that never stops. Structured Streaming reads the belt in short bursts (micro-batches), processes each burst exactly like a batch query, and remembers where it stopped so it can pick up where it left off if the job crashes.

The Unbounded Table Model

Structured Streaming treats a live stream as an unbounded table that grows as new events arrive. You write a normal batch query; Spark executes it incrementally in micro-batches under the hood.

Micro-batch execution model

SOURCE (Kafka / S3) → MICRO-BATCH (trigger every N sec) → QUERY (filter / agg / join) → SINK (Delta / Console)

Trigger Modes

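Trigger.ProcessingTime("10 seconds"): micro-batch every 10 seconds; standard choice. In PySpark: .trigger(processingTime="10 seconds")
Trigger.Once(): process all pending data then stop; batch + streaming hybrid pattern. In PySpark: .trigger(once=True)
Trigger.AvailableNow(): like Once but splits the backlog into multiple micro-batches (Spark 3.3+). In PySpark: .trigger(availableNow=True)
Trigger.Continuous("1 second"): millisecond latency, experimental; uses the continuous processing engine. In PySpark: .trigger(continuous="1 second")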

Watermarking — Handling Late Data

Streaming aggregations must bound the state they keep in memory. Watermarking tells Spark: "discard events that arrive more than X late", capping memory growth.

from pyspark.sql.functions import col, count, window

df.withWatermark("event_time", "10 minutes") \
  .groupBy(window(col("event_time"), "5 minutes"), col("user_id")) \
  .agg(count("*").alias("event_count"))
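The watermark rule itself is simple arithmetic: the threshold is the latest event time seen so far minus the allowed delay. Here is a plain-Python sketch of that idea. It is a simplification: it checks every event individually, whereas Spark advances the watermark once per micro-batch and uses it to decide when window state can be dropped. Event times are plain numbers (think minutes), and apply_watermark is a hypothetical helper.

```python
def apply_watermark(events, delay):
    """Keep events that are at most `delay` behind the latest event time seen so far.

    events: list of (event_time, payload) tuples in arrival order.
    """
    max_seen = float("-inf")
    kept, dropped = [], []
    for event_time, payload in events:
        watermark = max_seen - delay          # threshold from events seen before this one
        if event_time < watermark:
            dropped.append(payload)           # too late: state for that time is already gone
        else:
            kept.append(payload)
        max_seen = max(max_seen, event_time)
    return kept, dropped

arrivals = [(100, "a"), (112, "b"), (105, "c"), (101, "d"), (120, "e")]
kept, dropped = apply_watermark(arrivals, delay=10)
print(kept, dropped)   # → ['a', 'b', 'c', 'e'] ['d']
```

Event "c" is late but within the 10-unit allowance, so it is kept; event "d" falls behind the threshold of 102 and is discarded.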

Checkpointing for Fault Tolerance

Structured Streaming checkpoints progress (Kafka offsets, state store) to durable storage. A crash + restart picks up exactly where the job left off — guaranteeing exactly-once semantics end-to-end when combined with an idempotent sink.

Always set a checkpoint location in production: .option("checkpointLocation", "s3://bucket/checkpoints/job1"). Without it, a restart replays from the beginning and duplicates data at the sink.

Interview Questions

PySpark Question Bank

60 questions from Amazon, Databricks, TCS, Uber and more — Fresher to Senior level.

Real interviews: questions sourced from actual Amazon, Databricks & Uber interview rounds.

Strategy: start with Easy → Medium. Hard questions build on these foundations.

Fresher · Easy · TCS · Infosys

What is PySpark and why is it used?

Python API for Apache Spark — distributed big data processing that scales beyond a single machine's memory.

Answer

PySpark is the Python API for Apache Spark — a distributed computing engine that processes data across many machines simultaneously.

Scale: handles terabytes and petabytes — far beyond what pandas can load into a single machine's RAM
Speed: in-memory computation is up to 100× faster than Hadoop MapReduce for iterative workloads
Fault tolerance: if a machine fails mid-job, Spark automatically recomputes lost data using lineage (the DAG)
Unified engine: batch processing, streaming, ML (MLlib), SQL, and graph (GraphX) all in one framework

Quick Start

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").master("local[*]").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.filter(df.salary > 50000).groupBy("dept").count().show()

Interview Edge

When asked "why PySpark over pandas?" say: "pandas is limited to one machine's RAM — PySpark distributes across a cluster. For datasets above ~10 GB, PySpark is the right choice. For smaller datasets, pandas is actually faster because there is no serialization overhead." This shows you know the tradeoff, not just the buzzword.

Fresher · Easy · Amazon · TCS

Explain Apache Spark's architecture

Driver → Cluster Manager → Worker Nodes with Executors → Tasks. Execution: Application → Jobs → Stages → Tasks.

Answer

Driver: Master process — runs your code, creates SparkSession, builds the DAG, coordinates execution, collects results.
Cluster Manager: Manages cluster resources (YARN, Kubernetes, Standalone, Mesos).
Executors: JVM processes on worker nodes — run Tasks, store cached data.
Hierarchy: Application → Jobs (one per Action) → Stages (split by shuffles) → Tasks (one per partition)

Fresher · Easy · Infosys · Wipro

What is an RDD?

Resilient Distributed Dataset — immutable, fault-tolerant distributed collection. The foundational abstraction of Spark.

Answer

An RDD is an immutable, fault-tolerant distributed collection of objects. Fault tolerance comes from lineage: if a partition is lost, Spark recomputes it from the original source. RDDs carry no schema and bypass the Catalyst optimizer. Prefer DataFrames; use RDDs only for custom low-level operations or unstructured data (text, binary).

RDD Example

rdd = spark.sparkContext.parallelize([1,2,3,4,5], numSlices=3)
result = rdd.map(lambda x: x*2).filter(lambda x: x%2==0)
print(result.collect())  # [2, 4, 6, 8, 10]

Fresher · Easy · Amazon · Accenture

What is a DataFrame? How does it differ from an RDD?

Distributed table with named columns + schema + Catalyst optimization. Usually faster than the equivalent RDD code for structured data.

Answer

A DataFrame is a distributed collection with named, typed columns, like a relational table. It runs through the Catalyst optimizer and the Tungsten execution engine, which usually makes it significantly faster than equivalent RDD code. It is SQL-compatible and immutable: transformations return new DataFrames. Use DataFrames by default for all structured and semi-structured data.

Fresher · Easy · TCS · Google

What is SparkSession? How do you create one?

Unified entry point for all Spark since 2.0. Replaces SparkContext + SQLContext. Use .getOrCreate() to avoid duplicates.

SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()  # returns existing or creates new
print(spark.version)  # e.g. 3.5.0, depending on the installed Spark

Fresher · Easy · Amazon · Databricks

What is lazy evaluation? What is the DAG?

Transformations build a plan (DAG) but don't execute. An Action triggers Catalyst to optimize and execute the full plan.

Start here — the restaurant analogy

When you order food at a restaurant, the kitchen does not start cooking the second you speak. The waiter writes your order down. Only when the waiter takes the slip to the kitchen — "serve it now" — does any cooking happen.

PySpark works exactly the same way. Calling filter(), select(), or groupBy() is writing down your order. Nothing touches the actual data. The DAG is the written order slip. An Action (show(), count(), collect()) is handing that slip to the kitchen — that is when PySpark reads data, runs calculations, and returns a result.

This design lets PySpark look at your entire chain of steps before executing any of them, allowing the Catalyst optimizer to rewrite the plan in ways a human would rarely think of.

Execution flow — Transformations build the DAG, Actions fire it

Formal Definition

Lazy Evaluation — PySpark does not execute transformations when they are called. Each call (filter, select, groupBy…) adds a node to the DAG but touches zero data.

DAG (Directed Acyclic Graph): the dependency graph of all pending transformations. "Directed" means each edge points one way, from input to output; "acyclic" means the graph never loops back on itself. Spark's Catalyst optimizer reads the entire plan and rewrites it before anything runs: pushing filters earlier, dropping unused columns, collapsing adjacent operations.

Global optimization: Catalyst can reorder steps automatically — e.g., apply a filter before a join to reduce data volume, even if you wrote it the other way
Avoid wasted work: If you write 5 transformations then never call an action, PySpark does exactly zero work
Fault recovery: Lost partitions are rebuilt by replaying the DAG lineage from the original source — no data duplication needed
Actions that trigger execution: show(), count(), collect(), take(n), write.*(), toPandas()

Step-by-step example — 10 million employee rows

You have a CSV with 10 million employee records and 50 columns. You write these 3 lines:

df.filter(df.age > 30) — PySpark records "keep rows where age > 30". Nothing is read from disk.
.select("name","dept","salary") — records "keep only 3 columns". Still nothing runs.
.groupBy("dept").count() — records "group by dept and count". Still nothing runs.

At this point nothing has run. When you call .show(), Catalyst inspects the entire plan first. It works out that only the columns the query actually uses need to be read, and that rows can be filtered before grouping, so far less data reaches the shuffle (in this example, say about 70% less). Only then does Spark read the data, using the optimized plan. On 10M rows this can cut execution time from minutes to seconds, especially with a columnar source such as Parquet, where unused columns are never read from disk.
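The "record now, run later" behaviour can be imitated in plain Python. LazyPlan below is a toy class written for this illustration, not a Spark API: its transformations only append to a list of steps, and collect() is the single action that executes them. There is no optimizer in this sketch; it only demonstrates laziness.

```python
class LazyPlan:
    """Toy model of lazy evaluation: transformations only record steps."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def filter(self, predicate):
        return LazyPlan(self.rows, self.steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyPlan(self.rows, self.steps + (("select", columns),))

    def collect(self):
        """Action: run every recorded step now."""
        result = self.rows
        for kind, arg in self.steps:
            if kind == "filter":
                result = [r for r in result if arg(r)]
            else:
                result = [{c: r[c] for c in arg} for r in result]
        return result

calls = []

def older_than_30(row):
    calls.append(row["name"])        # records when the predicate really runs
    return row["age"] > 30

rows = [{"name": "Alice", "age": 35, "dept": "Eng"},
        {"name": "Bob", "age": 28, "dept": "HR"}]

plan = LazyPlan(rows).filter(older_than_30).select("name", "dept")
print(len(calls))        # → 0
result = plan.collect()
print(result)            # → [{'name': 'Alice', 'dept': 'Eng'}]
```

The predicate is not called while the plan is being built; it runs only inside collect().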

Inspect the Execution Plan

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LazyDemo").getOrCreate()

# Create sample data — imagine this is a 10M row file
data = [("Alice",35,"Eng",90000),("Bob",28,"HR",60000),
        ("Carol",42,"Eng",110000),("Dave",25,"Eng",70000)]
df = spark.createDataFrame(data, ["name","age","dept","salary"])

# These 3 lines build the DAG — NOTHING executes yet
plan = df.filter(df.age > 30) \
         .select("name","dept","salary") \
         .groupBy("dept").count()

# See what Catalyst will actually run (optimized plan)
plan.explain()         # physical plan only
plan.explain(extended=True)  # logical + optimized + physical

# ← THIS is the Action that triggers execution
plan.show()            # data is read and processed here

Interview Edge

What separates a good answer from a great one: Most candidates say "transformations are lazy." The follow-up that impresses interviewers is explaining why this matters: "Catalyst sees the full plan before executing, so it can apply optimizations a developer would never code manually — like reading only 3 of 50 columns from a Parquet file, or pushing a filter before a shuffle. That's the real performance win."

Fresher · Easy · Infosys · Meta

Transformations vs Actions — 5 examples of each

Transformations: lazy → new DataFrame. Actions: trigger execution → result to driver or storage.

Answer

Transformations (lazy): filter, select, groupBy, join, withColumn, orderBy, distinct, union, repartition, map, flatMap

Actions (trigger execution): show(), collect(), count(), take(n), first(), write.*(), toPandas(), foreach(), reduce()

Narrow: filter, select, map — no shuffle, each partition maps to one output partition.
Wide: groupBy, join, distinct, orderBy — cause shuffle across executors, much more expensive.

Fresher · Easy · TCS · Wipro

How do you read and write data in PySpark?

spark.read.format().load() and df.write.format().mode().save(). Know the 4 write modes: overwrite, append, ignore, error.

Read & Write

# Read
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("s3://bucket/data/")
df = spark.read.format("delta").load("s3://bucket/delta/")
# Write (modes: overwrite, append, ignore, error)
df.write.mode("overwrite").parquet("output/")
df.write.mode("append").partitionBy("year","month").parquet("output/")

Fresher · Easy · TCS · Accenture

How do you handle NULL values in PySpark?

dropna(), fillna(), isNull()/isNotNull(), coalesce() — four different tools for different null scenarios.

NULL Handling

from pyspark.sql.functions import col, when, coalesce, lit, count
df.dropna()                               # drop rows with ANY null
df.dropna(subset=["name","salary"])     # drop if specific cols null
df.fillna({"salary":0, "dept":"Unknown"}) # fill per column
df.withColumn("pay", coalesce(col("salary"), col("base_pay"), lit(0)))
# Count nulls per column
df.select([count(when(col(c).isNull(),c)).alias(c) for c in df.columns]).show()

Fresher · Easy · Amazon · Google

What is a UDF (User-Defined Function)?

Custom Python function for DataFrame columns. Flexible but slow — bypasses Catalyst, JVM↔Python serialization per row.

Answer

A UDF applies a custom Python function to DataFrame columns. Downside: row-at-a-time processing + JVM↔Python serialization per row → 10–100× slower than built-in functions. Always prefer Spark's built-in pyspark.sql.functions. For performance, use pandas_udf (vectorized, Arrow-based).

UDF

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
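# Guard against NULL salaries: comparing None with an int raises TypeError inside the UDF
band_udf = udf(lambda s: None if s is None else "High" if s >= 100000 else "Mid" if s >= 50000 else "Low", StringType())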
df.withColumn("band", band_udf("salary")).show()

Fresher · Easy · Databricks · Infosys

PySpark vs pandas — key differences

Distributed vs single-machine. Lazy vs eager. No row index in PySpark. Bridge: toPandas() and createDataFrame().

Answer

Execution: pandas eager; PySpark lazy (DAG)
Scale: pandas single machine RAM; PySpark petabyte-scale across cluster
Mutability: pandas mutable; PySpark DataFrames immutable
Row index: pandas has index; PySpark has no inherent row order
Bridge: df.toPandas() (small results only), spark.createDataFrame(pdf), Spark 3.2+ pyspark.pandas API

Fresher · Easy · Amazon · TCS

What is the difference between cache() and persist()?

cache() = persist(MEMORY_AND_DISK). persist() lets you choose storage level. Both avoid recomputation on repeated actions.

Cache vs Persist

from pyspark import StorageLevel
df_hot = df.join(other,"id").groupBy("dept").agg(...)
df_hot.cache()           # MEMORY_AND_DISK shorthand
df_hot.count()           # action triggers caching
df_hot.show()            # uses cache — fast!
df_hot.unpersist()       # release when done
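# persist() takes an explicit level; pick one per DataFrame
# (PySpark has no MEMORY_ONLY_SER: Python data is always stored serialized)
df.persist(StorageLevel.MEMORY_ONLY)      # fastest, no disk spill
other.persist(StorageLevel.OFF_HEAP)      # avoid GC pressure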

Fresher · Easy · Wipro · Accenture

What file formats does PySpark support? Which is preferred?

Parquet for analytics, Delta for data lakes with ACID, Avro for Kafka/streaming, CSV/JSON only for ingestion.

Answer

Parquet (recommended): columnar, compressed, schema-embedded, predicate pushdown, column pruning.
Delta Lake: Parquet + ACID + time travel + schema enforcement. Best for data lakes.
Avro: row-based, excellent for streaming/Kafka, schema evolution.
ORC: similar to Parquet, popular with Hive.
CSV/JSON: use only at ingestion. Slow to parse, no embedded schema, no columnar compression.

Fresher · Easy · TCS · Infosys

How do you create a DataFrame from a list, CSV, or pandas?

spark.createDataFrame(), spark.read.csv(), rdd.toDF(), spark.range() — multiple entry points.

DataFrame Creation

# From list
df = spark.createDataFrame([("Alice",30),("Bob",25)], ["name","age"])
# From pandas
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({"name":["Alice"],"age":[30]}))
# From RDD
df = spark.sparkContext.parallelize([("Alice",30)]).toDF(["name","age"])
# Quick range (1M rows)
df = spark.range(1000000)
df.printSchema(); df.show(5)

Fresher · Easy · TCS · Infosys

Core DataFrame operations — select, filter, groupBy, withColumn

The building blocks of every PySpark pipeline. when() / otherwise() for conditional logic.

Core Operations

from pyspark.sql.functions import col, avg, count, when, desc
df.select("name", (col("salary")*1.1).alias("new_salary"))
df.filter((col("dept")=="Eng") & (col("age")>25))
df.withColumn("grade", when(col("score")>=90,"A").when(col("score")>=70,"B").otherwise("C"))
df.groupBy("dept").agg(count("*").alias("n"),avg("salary").alias("avg_sal")) \
  .orderBy(desc("avg_sal")).show()
df.dropDuplicates(["employee_id"])
df.withColumnRenamed("salary","annual_pay")

Intermediate · Medium · Amazon · Databricks

Explain the Catalyst Optimizer and its 4 stages

Analysis → Logical Optimization → Physical Planning → Code Generation. This is why DataFrames beat RDDs in performance.

Answer

Catalyst is Spark's query optimizer — it automatically rewrites and optimizes DataFrame queries before execution:

Analysis: Resolves column names, table references, and data types against the catalog. Catches errors early.
Logical Optimization: Rule-based rewrites — predicate pushdown (filter early), constant folding, null propagation, column pruning (drop unused columns).
Physical Planning: Chooses the execution strategy — which join type (broadcast vs sort-merge), partition strategy. Generates multiple physical plans and picks the cheapest one using Cost-Based Optimization (CBO).
Code Generation: Tungsten's whole-stage code generation compiles the physical plan to optimized JVM bytecode — eliminates virtual function calls, uses CPU registers efficiently.

This is why identical logic runs 5–20× faster as a DataFrame than as an RDD — RDDs bypass all four stages.
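To give a feel for what a rule-based rewrite looks like, here is constant folding (one of the logical optimizations listed above) as a tiny plain-Python function. The tuple encoding of expressions is an assumption made for this sketch; Catalyst itself works on Scala tree nodes and applies many such rules repeatedly.

```python
def fold_constants(expr):
    """Collapse ("add"/"mul", literal, literal) subtrees into a single literal.

    Expressions are tuples: ("lit", value), ("col", name) or (op, left, right).
    """
    if expr[0] in ("lit", "col"):
        return expr                                   # leaves are already as simple as possible
    op, left, right = expr[0], fold_constants(expr[1]), fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if op == "add" else left[1] * right[1]
        return ("lit", value)                         # computed once at planning time
    return (op, left, right)

# salary * (1 + 0.5)  becomes  salary * 1.5
expr = ("mul", ("col", "salary"), ("add", ("lit", 1), ("lit", 0.5)))
print(fold_constants(expr))   # → ('mul', ('col', 'salary'), ('lit', 1.5))
```

The payoff is that the addition is evaluated once during planning instead of once per row at execution time.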

Intermediate · Medium · Amazon · Uber

What is Broadcasting and when should you use it?

Send a small DataFrame to every executor — eliminates shuffle for the large table in a join. Threshold: 10MB default (autoBroadcastJoinThreshold).

Answer

Broadcasting sends a small, read-only table to every executor node so the large table never needs to shuffle across the network. Use when one side of a join is small enough to fit in executor memory (<10MB by default, configurable via spark.sql.autoBroadcastJoinThreshold). Typical speedup: 3–8× on fact-dimension joins.

Broadcast Join

from pyspark.sql.functions import broadcast
# large_df: 10GB fact table; dim_df: 5MB dimension table
result = large_df.join(broadcast(dim_df), "dept_id", "inner")
# dim_df sent to each executor once — no shuffle on large_df

# Tune threshold (default 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")
# Disable auto-broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

Intermediate · Medium · Amazon · Netflix

Window Functions in PySpark — concept, syntax, and examples

Calculations across a row window without collapsing rows. rank() vs dense_rank() vs row_number() — know the difference.

Answer

Window functions compute values across a set of rows related to the current row — without collapsing them into one row (unlike groupBy+agg). Each row keeps its identity while gaining computed columns. Use cases: ranking, running totals, moving averages, lag/lead comparisons.

rank() vs dense_rank() vs row_number():
rank(): gaps after ties: 1,1,3; dense_rank(): no gaps: 1,1,2; row_number(): always unique: 1,2,3

Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number, lag, sum, col

w = Window.partitionBy("dept").orderBy(col("salary").desc())
df.withColumn("rank",        rank().over(w)) \
  .withColumn("dense_rank",  dense_rank().over(w)) \
  .withColumn("row_num",     row_number().over(w)) \
  .withColumn("prev_salary", lag("salary",1).over(w)) \
  .withColumn("running_sum", sum("salary").over(
      w.rowsBetween(Window.unboundedPreceding, Window.currentRow))) \
  .show()

Intermediate · Medium · TCS · Amazon

repartition() vs coalesce() — deep comparison

repartition(): full shuffle, can increase or decrease. coalesce(): minimal shuffle, only decreases. Use coalesce() before write.

Answer

repartition(n): full shuffle — data crosses network; can increase OR decrease partition count; produces equal-sized partitions. Use to rebalance skewed data or increase parallelism.
coalesce(n): merges local partitions with minimal shuffle; can only DECREASE partition count; may produce uneven partitions. Use before writing to reduce output file count.
Default shuffle partitions: spark.sql.shuffle.partitions = 200. Tune to 2–4× executor cores for your cluster size.

Repartition vs Coalesce

df.repartition(200)              # full shuffle, 200 equal partitions
df.repartition(50, "dept")      # shuffle + partition by column
df.coalesce(10)                  # merge locally, minimal shuffle
# Before writing: reduce small files (coalesce(1) sends everything through one task, so use it only for small outputs)
df.coalesce(1).write.mode("overwrite").parquet("output/")
# Check current partition count
print(df.rdd.getNumPartitions())

Intermediate · Medium · Amazon · LinkedIn

What is Data Skew and how do you handle it?

Uneven data distribution causing hot partitions. Solutions: salting, broadcast joins, AQE skew handling, separate processing of hot keys.

Answer

Data skew: some partition keys have vastly more rows than others. During groupBy/join, one executor processes 80% of the data while others are idle. Job time = slowest task.
Detect: df.groupBy("key").count().orderBy("count",ascending=False).show(). Check Spark UI → Stages → Task metrics for outlier task durations.

Solutions:
1. Salting: add random prefix to keys, partial agg, remove prefix, final agg
2. Broadcast join: if small side fits in memory, broadcast it
3. AQE (Spark 3.0+): spark.sql.adaptive.skewJoin.enabled=true auto-splits skewed partitions
4. Separate hot keys: filter hot keys out, process separately, union results

Intermediate · Medium · Uber · LinkedIn

What is the Salting technique for skewed data?

Add random integer prefix to skewed keys → partial aggregation → remove prefix → final aggregation. Spreads hot partitions evenly.

Salting for Skewed groupBy

import pyspark.sql.functions as F
SALT = 10
# Step 1: append a random salt (0–9) to the key
df_s = df.withColumn("salted_key",
    F.concat(F.col("join_key").cast("string"), F.lit("_"),
             (F.rand() * SALT).cast("int").cast("string")))
# Step 2: partial aggregation on salted key
partial = df_s.groupBy("salted_key").agg(F.sum("amount").alias("part_sum"))
# Step 3: strip the salt (assumes keys contain no "_"), final aggregation
result = partial \
    .withColumn("orig_key", F.split(F.col("salted_key"), "_")[0]) \
    .groupBy("orig_key").agg(F.sum("part_sum").alias("total"))
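To see why the two-stage aggregation gives the same answer as a direct groupBy, here is a plain-Python simulation that needs no Spark. The data and the fixed seed are made up for illustration.

```python
import random
from collections import defaultdict

random.seed(0)
SALT = 10
# One hot key ("A") with 1,000 rows and one small key ("B")
rows = [("A", 1)] * 1000 + [("B", 5)] * 3

# Stage 1: partial sums per (key, salt); the hot key spreads over up to SALT groups
partial = defaultdict(int)
for key, amount in rows:
    partial[(key, random.randrange(SALT))] += amount

# Stage 2: drop the salt and combine the partial sums
total = defaultdict(int)
for (key, _salt), part_sum in partial.items():
    total[key] += part_sum

print(dict(total))  # {'A': 1000, 'B': 15}
```

Because addition is associative, splitting the hot key's rows into sub-groups and summing the sub-totals cannot change the result. The same trick works for count, min and max, but not directly for aggregates such as a median.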

Intermediate · Medium · Infosys · Amazon

Broadcast Variables vs Accumulators

Broadcast: read-only shared lookup sent to all executors once. Accumulator: write-only counter/sum aggregated back to driver.

Answer

Broadcast Variable: read-only data structure (dict, list) sent to every executor once and cached in memory. Workers can read it efficiently. Use for lookup tables, config maps, ML model weights.

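Accumulator: workers can only add to it (write-only); the driver reads the final accumulated value. Use for counters and sums across tasks. Note: Spark applies each task's update exactly once only when the accumulator is updated inside an action. Updates made inside transformations can be applied more than once if a task or stage is re-executed, so update accumulators inside foreach or foreachPartition for reliable counts.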

Broadcast & Accumulator

# Broadcast variable
lookup = {"NYC":"New York", "LA":"Los Angeles"}
bc = spark.sparkContext.broadcast(lookup)
df.rdd.map(lambda r: bc.value.get(r.city, "Unknown")).collect()

# Accumulator
null_counter = spark.sparkContext.accumulator(0)
def count_nulls(row):
    if row.salary is None: null_counter.add(1)
df.rdd.foreach(count_nulls)
print(f"Null salaries: {null_counter.value}")

Intermediate · Medium · Amazon · Flipkart

All join types in PySpark — with use cases

inner, left, right, full, left_semi, left_anti, cross — know what each returns and when to use it.

Answer

inner: only rows matching in both — standard join
left/right: all from one side, nulls for non-matching other side
full/outer: all rows from both sides, nulls where no match
left_semi: rows from left WHERE match exists in right (like WHERE EXISTS) — does NOT include right columns
left_anti: rows from left WHERE NO match in right (like WHERE NOT EXISTS) — great for finding missing records
cross: Cartesian product, every left row × every right row — use with extreme care

Join Types

df_a.join(df_b, "id", "inner")
df_a.join(df_b, "id", "left")
df_a.join(df_b, "id", "full")
df_a.join(df_b, "id", "left_semi")   # a's rows where id exists in b
df_a.join(df_b, "id", "left_anti")   # a's rows where id NOT in b
# Multi-column join
df_a.join(df_b, ["dept","year"], "inner")
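The difference between inner, left_semi and left_anti is easiest to see on a tiny example. The sketch below uses plain Python lists instead of DataFrames, with made-up rows, and only mimics the row-level semantics described above.

```python
left = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
right = [{"id": 2, "dept": "HR"}, {"id": 2, "dept": "IT"}, {"id": 4, "dept": "Ops"}]

right_ids = {b["id"] for b in right}

# inner: one output row per matching pair, with columns from both sides
inner = [{**a, **b} for a in left for b in right if a["id"] == b["id"]]
# left_semi: left rows that have a match; never duplicated, no right columns
left_semi = [a for a in left if a["id"] in right_ids]
# left_anti: left rows that have no match
left_anti = [a for a in left if a["id"] not in right_ids]

print(len(inner))                        # 2 (Bob matches twice)
print([a["name"] for a in left_semi])    # ['Bob']
print([a["name"] for a in left_anti])    # ['Ann', 'Cy']
```

Note that the inner join returns Bob twice, once per matching right row, while left_semi returns him once. That is the practical reason to prefer left_semi for existence checks.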

Intermediate · Medium · TCS · Google

Schema inference vs explicit schema — when to use each?

Schema inference reads data to determine types (slow, error-prone). Explicit StructType is faster, safer, and recommended for production.

Answer

inferSchema=True: Spark scans the data (all of it by default for CSV) and guesses types. Convenient for exploration, but it adds an extra pass over the data, which is slow on large files, and it may guess wrong types (e.g., "001" read as integer instead of string).

Explicit schema: Define with StructType/StructField. Faster (no inference pass), predictable, catches schema mismatches early. Use in production always.

Explicit Schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
schema = StructType([
    StructField("emp_id",   StringType(),    nullable=False),
    StructField("name",     StringType(),    nullable=True),
    StructField("salary",   DoubleType(),    nullable=True),
    StructField("dept",     StringType(),    nullable=True),
    StructField("hire_date",TimestampType(), nullable=True),
])
df = spark.read.schema(schema).csv("employees.csv", header=True)
df.printSchema()

Intermediate · Medium · Amazon · Databricks

How does fault tolerance work in PySpark?

RDD lineage + recomputation. Checkpointing truncates lineage. DataFrames also have lineage stored in the physical plan.

Answer

RDD Lineage: Spark records every transformation that produced each RDD. If an executor dies and a partition is lost, Spark uses the lineage to recompute only that partition from the last stable data — no need to restart the entire job.

Checkpointing: Saves an RDD/DataFrame to HDFS/S3 and truncates the lineage. Required for very long lineages (iterative ML algorithms) or Structured Streaming (to recover stream state). Unlike cache, checkpointing survives executor failure permanently.

Replication: Each Spark block can be replicated across executors (default: no replication for performance). Delta Lake adds further durability with ACID transaction log.

Intermediate · Medium · Google · Databricks

What is Predicate Pushdown and Column Pruning?

Filter pushed to storage layer to reduce data read. Column pruning reads only needed columns. Both are automatic in Catalyst for Parquet/Delta.

Answer

Predicate Pushdown: Catalyst pushes filter conditions down to the data source (Parquet, Delta, JDBC), so data that cannot pass the filter is skipped as early as possible instead of being read and then discarded. For Parquet: uses min/max statistics to skip entire row groups. For Delta: also uses partition pruning and Z-order data skipping.

Column Pruning: Only reads columns referenced in the query. If your table has 100 columns but you select("name","salary"), Catalyst tells the reader to skip the other 98. For columnar formats (Parquet, ORC), this is a massive I/O reduction; row-based formats such as CSV still have to read every line.

Both happen automatically — no code change needed. Avoid df.rdd and Python UDFs which break these optimizations.
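Row-group skipping with min/max statistics can be sketched in a few lines of plain Python. The row groups and salary values below are invented for illustration, and real Parquet readers are considerably more sophisticated.

```python
# Each "row group" keeps min/max statistics for the salary column
row_groups = [
    {"min": 30_000, "max": 59_000, "rows": [30_000, 45_000, 59_000]},
    {"min": 60_000, "max": 89_000, "rows": [60_000, 72_000, 89_000]},
    {"min": 90_000, "max": 150_000, "rows": [90_000, 120_000, 150_000]},
]

def scan_salary_over(threshold):
    """Return matching rows and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group["max"] <= threshold:
            continue  # no row can satisfy salary > threshold: skip the group
        groups_read += 1
        matches.extend(s for s in group["rows"] if s > threshold)
    return matches, groups_read

print(scan_salary_over(100_000))  # ([120000, 150000], 1)
```

Only one of the three groups is read. The statistics help most when the data is sorted or clustered on the filter column, because then each group covers a narrow value range.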

Intermediate · Medium · Databricks · Amazon

What are pandas_udf (vectorized UDFs)? Why are they faster?

Apache Arrow-based batched UDFs — process column chunks at once instead of row-at-a-time. 10–100× faster than regular Python UDFs.

Answer

Regular Python UDFs: JVM serializes each row → Python processes one row → deserializes result back → repeat for every row. Massive overhead.

pandas_udf: Uses Apache Arrow to transfer entire column batches between JVM and Python in a compact columnar format, avoiding per-row pickling. Python processes a pandas Series/DataFrame at once using vectorized operations. Result: often 10–100× faster, and it uses Python's full numerical ecosystem.

pandas_udf

from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import DoubleType

# Scalar pandas_udf — operates on pandas Series
@pandas_udf(DoubleType())
def apply_tax(salary: pd.Series) -> pd.Series:
    return salary * 0.7  # vectorized — no loop!

df.withColumn("take_home", apply_tax("salary")).show()

# Grouped map: group → pandas DataFrame → pandas DataFrame
# applyInPandas takes a plain Python function, so no @pandas_udf decorator here
def normalize_dept(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["salary"] = (pdf["salary"] - pdf["salary"].mean()) / pdf["salary"].std()
    return pdf
df.groupBy("dept").applyInPandas(normalize_dept, schema=df.schema).show()

Intermediate · Medium · Amazon · TCS

PySpark job optimization — 10-point checklist

Broadcast joins, DataFrame over RDD, caching, partition tuning, avoid Python UDFs, filter early, explain() to inspect plans.

Answer

1. Broadcast small tables: eliminate shuffle for small-large joins
2. Use DataFrames over RDDs: get Catalyst + Tungsten for free
3. Filter early: reduce data volume before expensive joins/aggregations
4. Cache reused DataFrames: avoid recomputing expensive intermediate results
5. Tune shuffle partitions: spark.sql.shuffle.partitions should be 2–4× executor cores
6. Use Parquet/Delta: enables predicate pushdown and column pruning
7. Avoid Python UDFs: use built-in functions or pandas_udf instead
8. Partition on write: partitionBy("year","month") for better read performance
9. Use explain(): inspect the physical plan to find missing filters or expensive joins
10. Enable AQE: spark.sql.adaptive.enabled=true for automatic runtime optimization

Intermediate · Medium · Amazon · Netflix

Complex data types — ArrayType, MapType, StructType with explode()

Nested data is common in JSON ingestion. explode() flattens arrays to rows. Dot notation for nested structs.

Complex Types

from pyspark.sql.functions import explode, col, map_keys, map_values

# Schema with nested types
json_data = """{"user":"Alice","tags":["sql","spark"],"scores":{"math":95,"eng":88}}"""
df = spark.read.json(spark.sparkContext.parallelize([json_data]))
df.printSchema()
# root: user: string, tags: array[string], scores: struct{eng,math}

# Explode array → one row per element
df.select("user", explode("tags").alias("tag")).show()
# +-----+-----+
# | user|  tag|
# +-----+-----+
# |Alice|  sql|
# |Alice|spark|
# +-----+-----+

# Access nested struct fields
df.select("user", col("scores.math"), col("scores.eng")).show()

Intermediate · Medium · Amazon · Walmart

Pivot: convert rows to columns

df.groupBy().pivot().agg() — turns category values into column headers. Handle nulls with fillna after pivot.

Pivot Example

from pyspark.sql.functions import sum

# Source: user | product | amount
# Goal: user | Electronics | Clothing | Food
pivoted = df.groupBy("user") \
    .pivot("product", ["Electronics","Clothing","Food"]) \
    .agg(sum("amount")) \
    .fillna(0)
pivoted.show()
# Specify values for performance (avoids full scan to find distinct values)

Experienced · Hard · Databricks · Google

What is Adaptive Query Execution (AQE)?

Runtime re-optimization in Spark 3.0+. Adjusts join strategies, coalesces partitions, handles skew — based on actual runtime statistics, not estimates.

Answer

AQE (Spark 3.0+, default ON in 3.2+) re-optimizes query plans at runtime using actual data statistics collected during execution — not static estimates.

Three key features:

Dynamic Coalescing: automatically coalesces small shuffle partitions after each shuffle stage, reducing the 200 default partitions to the actual optimal number
Dynamic Skew Handling: detects skewed partitions at runtime and automatically splits them into smaller sub-tasks (spark.sql.adaptive.skewJoin.enabled=true)
Dynamic Join Strategy Switching: can switch from sort-merge to broadcast join mid-execution if one side turns out to be small enough after filtering

Enable AQE

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE is ON by default in Spark 3.2+ and Databricks Runtime 8.0+
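Dynamic coalescing can be pictured as a greedy merge of adjacent small shuffle partitions. The function below is a simplified plain-Python sketch, not Spark's actual algorithm. The 64 MB target mirrors the documented default of spark.sql.adaptive.advisoryPartitionSizeInBytes, and the partition sizes are invented.

```python
def coalesce_partitions(sizes_mb, target_mb=64):
    """Greedily merge adjacent shuffle partitions up to about target_mb each."""
    merged, current = [], 0
    for size in sizes_mb:
        if current and current + size > target_mb:
            merged.append(current)   # close the current group
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# Ten small shuffle partitions (sizes in MB) collapse into four
print(coalesce_partitions([5, 10, 20, 30, 8, 8, 40, 2, 3, 60]))  # [35, 46, 45, 60]
```

The key point is that the sizes are only known after the shuffle has written its output, which is why this decision has to be made at runtime rather than during planning.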

Experienced · Hard · Databricks · Amazon

Delta Lake ACID properties — how is each implemented?

Transaction log (_delta_log) is the foundation. Snapshot isolation for readers. Optimistic concurrency control for writers.

Answer

Atomicity: every write operation is recorded in the transaction log (_delta_log) as a single JSON commit. If a write fails mid-way, the partial data is ignored — the commit entry never appears in the log.
Consistency: schema enforcement rejects writes whose schema does not match the table, which prevents bad data. Schema evolution is opt-in (for example via the mergeSchema write option). Delta validates data types on every write.
Isolation: Snapshot Isolation — readers always see a consistent snapshot of the table as of when their query started, even during concurrent writes. Uses optimistic concurrency control for write-write conflicts.
Durability: commits are persisted in durable storage (S3, ADLS, HDFS) before acknowledgement. The log is the source of truth: Parquet data files are immutable, so updates and deletes write new files and record the old ones as removed in the log.
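The atomicity and write-conflict ideas can be illustrated with a toy commit log. This is only a sketch under simplifying assumptions: a temporary local directory stands in for _delta_log, the file names are hypothetical, and real Delta commits use a richer action format and storage-specific commit protocols.

```python
import json
import os
import tempfile

log_dir = tempfile.mkdtemp()   # stands in for a table's _delta_log directory

def commit(version, actions):
    """Write one commit file; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:          # mode "x": exclusive create
        json.dump(actions, f)

commit(0, [{"add": "part-0000.parquet"}])

# A second writer racing for version 0 loses and must retry as version 1
try:
    commit(0, [{"add": "part-0001.parquet"}])
    conflict = False
except FileExistsError:
    conflict = True
    commit(1, [{"add": "part-0001.parquet"}])

print(conflict, len(os.listdir(log_dir)))  # True 2
```

A data file that was written but never referenced by a commit is simply invisible to readers, which is the sense in which a failed write leaves no partial result.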

Experienced · Hard · Databricks · Netflix

Time travel in Delta Lake — mechanism and use cases

VERSION AS OF / TIMESTAMP AS OF. Stored in _delta_log. Use for auditing, rollback, and ML reproducibility.

Answer

Delta Lake's transaction log records every change with a version number and timestamp. Time travel lets you query data as it existed at any past point by reading the transaction log to reconstruct the table state at that version.

Use cases: audit trails (who changed what, when), rollback after bad writes, reproduce ML experiments with same training data snapshot, compare data before/after pipeline run.

Time Travel

# By version number
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/table")
# By timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/data/table")
# SQL syntax
spark.sql("SELECT * FROM my_table VERSION AS OF 5")
spark.sql("SELECT * FROM my_table TIMESTAMP AS OF '2025-01-01'")
# Rollback (restore previous version)
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/data/table").restoreToVersion(5)
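Reconstructing a table at an older version amounts to replaying the log up to that commit. The following plain-Python sketch uses a made-up three-commit log with hypothetical file names to show the idea.

```python
# Each commit lists data files added and removed (hypothetical file names)
log = [
    {"add": ["f1", "f2"], "remove": []},        # version 0
    {"add": ["f3"], "remove": []},              # version 1: append
    {"add": ["f4"], "remove": ["f1", "f2"]},    # version 2: rewrite of f1 and f2
]

def files_as_of(version):
    """Replay commits 0..version to find the files that make up that snapshot."""
    live = set()
    for entry in log[: version + 1]:
        live |= set(entry["add"])
        live -= set(entry["remove"])
    return sorted(live)

print(files_as_of(1))  # ['f1', 'f2', 'f3']
print(files_as_of(2))  # ['f3', 'f4']
```

Reading version 1 still needs f1 and f2 on disk, which is why time travel stops working for versions whose data files have been removed by VACUUM.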

Experienced · Hard · Uber · LinkedIn

Structured Streaming architecture — unbounded table model

Treats a stream as an unbounded, append-only table. Write batch-like queries; Spark handles incremental execution. Trigger modes control latency.

Answer

Structured Streaming models a live data stream as an unbounded table — new data arrives as new rows. You write a static batch query; Spark handles incremental execution automatically in micro-batches.

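Trigger modes (keyword arguments to trigger() in PySpark): processingTime="10 seconds": start a micro-batch every 10s; once=True: process all available data in one batch, then stop; availableNow=True: like once, but split across multiple batches (Spark 3.3+); continuous="1 second": continuous processing with millisecond-level latency, where the interval is the checkpoint interval (experimental)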

Output modes: Append (only new rows), Complete (all rows rewritten), Update (only changed rows)

Streaming Example

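from pyspark.sql.functions import window, col, count, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
# Assumed JSON payload in the Kafka value: {"user_id": ..., "event_time": ...}
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])
# Read from Kafka (key and value arrive as binary)
raw_df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events").load()
# Parse the value into typed columns
stream_df = raw_df \
    .select(from_json(col("value").cast("string"), event_schema).alias("e")) \
    .select("e.*")
# Aggregate: events per user per 5-minute window
query = stream_df \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window(col("event_time"), "5 minutes"), "user_id") \
    .agg(count("*").alias("events")) \
    .writeStream.outputMode("append") \
    .format("delta").option("checkpointLocation", "/checkpoints/q") \
    .trigger(processingTime="10 seconds") \
    .start("/data/event_counts")  # output path is a placeholder
query.awaitTermination()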

Experienced · Hard · Uber · Amazon

Watermarking in Structured Streaming — when and why?

Threshold for late data tolerance. Bounds state size. Without watermarking, Spark must keep state forever — memory grows unbounded.

Answer

Watermarking defines the maximum delay Spark will wait for late-arriving data: "data older than MAX(event_time) - watermark_threshold is too late and will be dropped."

Without watermarking, stateful operations (window aggregations, stream-stream joins) must keep state for all time windows indefinitely → unbounded memory growth → OOM crash.

Trade-off: larger watermark = more state kept = more late data accepted = higher memory use. Smaller watermark = less state = lower latency = some late data dropped.

Requirement: watermark MUST be set before append output mode works with aggregations (Spark enforces this).
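The watermark rule can be simulated without Spark. The sketch below is a simplification with invented timestamps: it fixes the watermark for each micro-batch from the maximum event time seen in earlier batches and drops everything older. In real Spark, data within the threshold is guaranteed to be kept, while data beyond it may or may not be dropped.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=10)

def process_batches(batches):
    """Yield (accepted, dropped) per micro-batch using a simple watermark."""
    max_event_time = None
    for batch in batches:
        # Watermark is fixed for the batch, based on data seen in earlier batches
        watermark = None if max_event_time is None else max_event_time - THRESHOLD
        accepted = [t for t in batch if watermark is None or t >= watermark]
        dropped = [t for t in batch if watermark is not None and t < watermark]
        for t in batch:
            if max_event_time is None or t > max_event_time:
                max_event_time = t
        yield accepted, dropped

def at(hhmm):
    return datetime.strptime(hhmm, "%H:%M")

batches = [[at("12:00"), at("12:20")],    # max event time becomes 12:20
           [at("12:15"), at("12:05")]]    # watermark is 12:10, so 12:05 is too late
for accepted, dropped in process_batches(batches):
    print(len(accepted), len(dropped))
# 2 0
# 1 1
```

Once the watermark passes the end of a window, no more rows can be accepted for it, so its state can be finalized and evicted. That is what keeps the state store bounded.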

Experienced · Hard · Databricks · Google

Dynamic Partition Pruning (DPP)

Eliminates partitions not needed based on join condition filters, a massive speedup for star schema queries. Available in Spark 3.0+ and enabled by default.

Answer

DPP (Spark 3.0+) eliminates partitions from the probe side of a join based on the result of filtering the build side. Example: if you join a 10TB fact table with a dimension table and filter WHERE dim.category = 'Electronics', DPP uses that filter to skip non-Electronics partitions in the fact table before reading them — potentially scanning <1% of the data instead of 100%.

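Requires: one side is partitioned on the join key; the query has a selective filter on the other side; spark.sql.optimizer.dynamicPartitionPruning.enabled is true (the default). DPP is a separate optimization from AQE and does not depend on it.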
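The two steps of DPP, filter the dimension first and then prune fact partitions by the surviving keys, can be sketched in plain Python. The tables and the category_id partition column are hypothetical.

```python
# Fact table stored as one directory per category_id (the partition column)
fact_partitions = {
    1: ["sale-a", "sale-b"],
    2: ["sale-c"],
    3: ["sale-d", "sale-e", "sale-f"],
}
dim = [
    {"category_id": 1, "category": "Electronics"},
    {"category_id": 2, "category": "Clothing"},
    {"category_id": 3, "category": "Food"},
]

# Step 1: apply the selective filter to the small dimension table
wanted_ids = {d["category_id"] for d in dim if d["category"] == "Electronics"}

# Step 2: use the surviving keys to decide which fact partitions to read at all
scanned = {pid: rows for pid, rows in fact_partitions.items() if pid in wanted_ids}

print(sorted(scanned))                                           # [1]
print(f"{len(scanned)} of {len(fact_partitions)} partitions scanned")
```

The pruning is "dynamic" because the set of wanted keys is not known from the query text. It only becomes available once the dimension filter has been evaluated.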

Experienced · Hard · Amazon · Netflix

Broadcast vs Sort-Merge vs Shuffle Hash Join — when does Spark choose each?

Broadcast: small table (<10MB) — fastest, no shuffle. Sort-Merge: large-large — stable, scalable. Shuffle Hash: medium tables — needs memory.

Answer

Broadcast Hash Join: one side ≤ autoBroadcastJoinThreshold (10MB). Small table broadcast to all executors. No shuffle on large side. Fastest. Force with broadcast() hint.
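Shuffle Hash Join: both sides shuffled by join key, then smaller side builds an in-memory hash map per partition. Avoids the sort step of sort-merge but requires the hash table to fit in memory. Because spark.sql.join.preferSortMergeJoin is true by default, Spark usually picks it only when hinted or when one side is significantly smaller.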
Sort-Merge Join: both sides shuffled and sorted by join key, then merged. Memory-efficient (streams rather than builds hash). Default for large-large joins. Can pre-sort/bucket to eliminate shuffle.
Broadcast Nested Loop Join: fallback for non-equi joins (e.g., range conditions). Very slow — O(n×m).

Force a strategy: df.hint("broadcast"), df.hint("merge"), df.hint("shuffle_hash")
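The core algorithms behind hash joins and sort-merge joins are small enough to sketch on Python lists. This is an illustration of the two techniques on made-up data, not of Spark's distributed implementation.

```python
from collections import defaultdict

big = [(3, "x"), (1, "y"), (2, "z"), (3, "w")]
small = [(1, "one"), (3, "three")]

def hash_join(probe, build):
    """Build a hash map on the smaller side, then stream the larger side past it."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(k, v, b) for k, v in probe for b in table.get(k, [])]

def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors in step."""
    left, right = sorted(left), sorted(right)
    out, j = [], 0
    for k, v in left:
        while j < len(right) and right[j][0] < k:
            j += 1                      # skip right rows with smaller keys
        i = j
        while i < len(right) and right[i][0] == k:
            out.append((k, v, right[i][1]))
            i += 1
    return out

print(sorted(hash_join(big, small)) == sort_merge_join(big, small))  # True
```

The hash join needs the whole build side in memory but never sorts, while the sort-merge join only needs a cursor on each side once the inputs are sorted. That trade-off is what the strategy choice above comes down to.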

Experienced · Hard · Databricks · Amazon

Python UDF vs pandas_udf — architecture and performance comparison

Python UDF: row-at-a-time via Pickle. pandas_udf: batch via Arrow. 10–100× speed difference. Both bypass Catalyst.

Answer

Python UDF pipeline (slow): JVM executor → pickle serialize one row → Python worker → deserialize → apply function → serialize result → JVM. 2 serializations per row, Python process overhead, row-at-a-time.

pandas_udf pipeline (fast): JVM executor → Apache Arrow columnar batch transfer → Python worker receives pandas Series → vectorized numpy operations → Arrow transfer back. One batch transfer covers thousands of rows, and numpy can use SIMD CPU instructions.

When to use each: Built-in functions > pandas_udf > Python UDF. Scala UDFs are even faster than pandas_udf (no JVM→Python boundary) but require Scala knowledge.

Both bypass Catalyst — the optimizer cannot see inside a UDF function body, so it can't push predicates into it or optimize across UDF boundaries.

Experienced · Hard · Databricks · Amazon

Z-ordering in Delta Lake — what it does and when to use it

Co-locates related data in the same Parquet files for data skipping. Reduces files scanned for point lookups. Best for high-cardinality columns.

Answer

Z-ordering is a data clustering technique that co-locates rows with similar values of specified columns in the same Parquet files. Delta Lake then records the min/max values of those columns per file in the transaction log. When you query with a filter on a Z-ordered column, Delta uses data skipping to eliminate entire files without reading them.

When to use: high-cardinality columns you frequently filter on (user_id, product_id, order_id). Columns already used for partitioning don't benefit further. Works best when the table is regularly OPTIMIZEd.

Limitation: Z-ordering on multiple columns is a trade-off — effectiveness decreases as more columns are added. Typically <= 4 columns.

Z-Order Optimization

from delta.tables import DeltaTable
dt = DeltaTable.forPath(spark, "/data/events")
# Compact small files + Z-order by user_id
dt.optimize().executeZOrderBy("user_id", "event_date")
# SQL equivalent
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")
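The name comes from the Z-order (Morton) space-filling curve, which maps several columns to one sortable value by interleaving their bits. The sketch below shows that idea for two small integer columns. It is an illustration of the curve only and makes no claim about how Delta Lake computes its clustering internally.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z

points = [(0, 0), (3, 3), (0, 1), (2, 3), (1, 0), (1, 1)]
# Sorting by z-value keeps points that are close in both x and y near each other
print(sorted(points, key=lambda p: z_value(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3), (3, 3)]
```

Writing rows to files in this order gives each file a narrow min/max range on both columns at once, which is what makes data skipping effective for filters on either column.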

Experienced · Hard · Uber · LinkedIn

Exactly-once semantics in Structured Streaming

Idempotent sink + checkpointing + offset tracking. Delta Lake as sink gives exactly-once out of the box.

Answer

Exactly-once = each input message produces exactly one output effect, even if the system crashes and restarts. Structured Streaming achieves this through:
1. Checkpointing: saves processed offsets (e.g., Kafka offset) to durable storage. On restart, reads from last committed offset — no re-processing of old messages.
2. Idempotent sinks: if a micro-batch is re-run due to failure, the sink must be safe to write twice. Delta Lake (ACID) and HDFS/S3 file sinks support this natively.
3. End-to-end exactly-once: Kafka source (committed offsets) + checkpointing + Delta Lake sink = full exactly-once guarantee.
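The role of an idempotent sink is easy to demonstrate with a toy class that remembers which batch ids it has already written. IdempotentSink is a hypothetical name used only for this sketch. In PySpark the closest real hook is the batch_id argument passed to a foreachBatch function.

```python
class IdempotentSink:
    """Toy sink that remembers which batch ids it has already committed."""

    def __init__(self):
        self.committed = set()
        self.rows = []

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return False             # replayed batch: ignore, output unchanged
        self.rows.extend(rows)
        self.committed.add(batch_id)
        return True

sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(1, ["c"])
# Simulated crash before the checkpoint recorded batch 1: the batch is replayed
replayed = sink.write(1, ["c"])

print(replayed, sink.rows)  # False ['a', 'b', 'c']
```

Replay on its own would give at-least-once delivery. It is the sink's ability to recognize and ignore a batch it has already committed that turns this into exactly-once output.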

Coding · Medium · Amazon · Flipkart

Remove duplicate rows — keep the one with the latest timestamp

Window + ROW_NUMBER partitioned by ID, ordered by timestamp DESC, then filter WHERE rn = 1.

Deduplication

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col, desc

w = Window.partitionBy("employee_id").orderBy(desc("updated_at"))

deduped = df.withColumn("rn", row_number().over(w)) \
            .filter(col("rn") == 1) \
            .drop("rn")

deduped.show()
# For simple dedup (no timestamp preference):
df.dropDuplicates(["employee_id"])

Coding · Easy · Amazon · Uber

Broadcast join — implement and verify it's used

from pyspark.sql.functions import broadcast — wrap the small DataFrame and use explain() to confirm BroadcastHashJoin in the plan.

Broadcast Join

from pyspark.sql.functions import broadcast

# Large fact table: millions of rows
fact_df = spark.read.parquet("s3://data/transactions/")
# Small dimension table: <10MB
dim_df = spark.read.parquet("s3://data/departments/")

result = fact_df.join(broadcast(dim_df), "dept_id", "inner")

# Verify BroadcastHashJoin appears in physical plan
result.explain()
# Physical Plan should show: BroadcastHashJoin, BuildRight
result.show()

Coding · Medium · Amazon · Netflix

Top-N records per group using ROW_NUMBER

Window.partitionBy("dept").orderBy(desc("salary")), filter rn <= N, drop the row_number column.

Top-N Per Group

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col, desc

N = 3
w = Window.partitionBy("dept").orderBy(desc("salary"))

top_n = df.withColumn("rn", row_number().over(w)) \
          .filter(col("rn") <= N) \
          .drop("rn")

top_n.orderBy("dept", desc("salary")).show()
# Returns top 3 earners per department

Coding · Easy · TCS · Infosys

Conditional column with when() / otherwise()

Equivalent to SQL CASE WHEN. Chain multiple when() calls before the final otherwise(). Nest inside withColumn().

Conditional Logic

from pyspark.sql.functions import when, col

df.withColumn("grade",
    when(col("score") >= 90, "A")
    .when(col("score") >= 80, "B")
    .when(col("score") >= 70, "C")
    .otherwise("F")) \
.withColumn("salary_band",
    when(col("salary") >= 100000, "High")
    .when(col("salary") >= 50000,  "Mid")
    .otherwise("Low")).show()

Coding · Easy · TCS · Accenture

Define explicit schema with StructType and read JSON

Always define schema explicitly in production. StructType([StructField(...)]) — no inferSchema needed.

Explicit Schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

schema = StructType([
    StructField("emp_id",    StringType(),    nullable=False),
    StructField("name",      StringType(),    nullable=True),
    StructField("salary",    DoubleType(),    nullable=True),
    StructField("dept",      StringType(),    nullable=True),
    StructField("hire_date", TimestampType(), nullable=True),
])
df = spark.read.schema(schema).json("s3://data/employees.json")
df.printSchema()
df.show(5)

Coding · Medium · Amazon · Databricks

Flatten nested JSON (explode arrays + dot notation for structs)

explode() one row per array element. col("struct.field") for nested struct access. explode_outer() preserves rows with empty arrays.

Flatten Nested JSON

from pyspark.sql.functions import explode, explode_outer, col

# JSON: {"user":"Alice", "orders":[{"id":1,"amt":50},{"id":2,"amt":80}], "addr":{"city":"NY"}}
df = spark.read.json("orders.json")

# Explode array: one row per order
df_exploded = df.select(
    col("user"),
    explode("orders").alias("order"),      # drops rows with empty arrays
    col("addr.city").alias("city")          # dot notation for struct
)

# Flatten nested struct fields
df_flat = df_exploded.select(
    "user", "city",
    col("order.id").alias("order_id"),
    col("order.amt").alias("amount")
)
df_flat.show()

Coding · Medium · Amazon · Walmart

Pivot: convert transaction rows to wide-format columns

groupBy().pivot().agg() — always specify pivot values explicitly for performance, then fillna(0).

Pivot

from pyspark.sql.functions import sum

# Input: user | product_category | amount
# Output: user | Electronics | Clothing | Food

pivoted = df.groupBy("user") \
    .pivot("product_category", ["Electronics","Clothing","Food"]) \
    .agg(sum("amount")) \
    .fillna(0)  # replace nulls (categories user never bought)

pivoted.show()
# Specifying values avoids a full scan to find distinct category values

Coding · Hard · Uber · LinkedIn

Implement salting for a skewed groupBy

Add a random salt from 0 to N-1 to the key → partial agg → strip salt → final agg. Spreads a hot key across up to N partitions.

Salting for Skewed GroupBy

import pyspark.sql.functions as F
from pyspark.sql.functions import col  # col() is used unqualified below

SALT_BUCKETS = 20  # spread hot key across 20 partitions

# Step 1: add random salt to the skewed column
df_salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int")) \
              .withColumn("salted_key",
                  F.concat(col("customer_id"), F.lit("_"), col("salt")))

# Step 2: partial aggregation on salted key
partial = df_salted.groupBy("salted_key", "product") \
    .agg(F.sum("amount").alias("partial_sum"))

# Step 3: strip the salt suffix (assumes customer_id contains no "_"), final aggregation
result = partial \
    .withColumn("customer_id", F.split(col("salted_key"), "_")[0]) \
    .groupBy("customer_id", "product") \
    .agg(F.sum("partial_sum").alias("total_amount"))
result.show()

Coding · Medium · TCS · Infosys

Data validation framework — null checks, row counts, schema validation

Reusable quality checks: row count, null percentage per column, duplicate detection, schema match.

Data Validation

from pyspark.sql.functions import col, count, when, sum as spark_sum

def validate_df(df, expected_schema=None, pk_cols=None):
    results = {}
    total = df.count()
    results["row_count"] = total

    # Null percentage per column
    null_pct = df.select([
        (count(when(col(c).isNull(), c)) / total * 100).alias(c)
        for c in df.columns
    ]).collect()[0].asDict()
    results["null_pct"] = null_pct

    # Duplicate check
    if pk_cols:
        dup_count = total - df.dropDuplicates(pk_cols).count()
        results["duplicates"] = dup_count

    # Schema validation
    if expected_schema:
        schema_ok = df.schema == expected_schema
        results["schema_match"] = schema_ok

    return results

report = validate_df(df, pk_cols=["emp_id"])
print(report)

Coding · Easy · Wipro · TCS

Reduce small files with coalesce before writing

coalesce(n) merges locally — no full shuffle. Use before write to control output file count and size.

Coalesce Small Files

# Problem: job creates 1000 tiny files (small file problem)
print(df.rdd.getNumPartitions())  # 1000

# Solution: coalesce before write (no full shuffle)
df.coalesce(10).write.mode("overwrite").parquet("output/")
# Creates 10 output files instead of 1000

# For a single-file output (use with care on large data)
df.coalesce(1).write.mode("overwrite").csv("report.csv", header=True)

# Delta Lake: compact existing small files with the OPTIMIZE command
spark.sql("OPTIMIZE my_delta_table")

MCQ · Easy

Which of the following causes a data shuffle?

Only wide transformations shuffle data across executor boundaries.

A. filter()
B. select()
C. groupBy().agg()
D. withColumn()

Correct: C — groupBy().agg()
groupBy is a wide transformation — it must collect all rows with the same key from across all partitions into the same partition, which requires shuffling data over the network. filter, select, and withColumn are narrow transformations — each output partition depends on only one input partition, no network transfer needed.

MCQ · Easy

cache() is equivalent to persist() with which storage level?

Think about the default behavior: memory first, then disk.

A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. OFF_HEAP

Correct: B — MEMORY_AND_DISK
cache() stores data in memory and spills to disk when memory is full. This is the MEMORY_AND_DISK storage level. Note: this is the default for DataFrames; RDD.cache() defaults to MEMORY_ONLY.

MCQ · Easy

Which of the following is NOT an Action in PySpark?

Actions trigger execution and return results. Transformations build the DAG.

A. collect()
B. show()
C. filter()
D. count()

Correct: C — filter()
filter() is a transformation — it returns a new DataFrame lazily without executing anything. collect(), show(), and count() are all Actions that trigger execution of the full DAG and return results to the driver.

MCQ · Medium

What technology does pandas_udf use for fast data transfer between JVM and Python?

Columnar format designed for in-memory analytics.

A. Apache Arrow
B. Apache Parquet
C. Python Pickle
D. Protocol Buffers

Correct: A — Apache Arrow
Apache Arrow is a cross-language in-memory columnar format. pandas_udf uses Arrow to transfer entire batches of data between the JVM and the Python process with very little serialization overhead, unlike Python UDFs, which use Pickle for row-by-row serialization.

MCQ · Medium

Which method can ONLY decrease the number of partitions?

One causes a full shuffle, the other merges locally.

A. repartition()
B. coalesce()
C. partitionBy()
D. sortWithinPartitions()

Correct: B — coalesce()
coalesce(n) can only decrease partitions. It merges partitions with minimal data movement (combining local partitions on the same executor). repartition() can both increase and decrease partitions but always does a full shuffle.

MCQ · Medium

AQE stands for what, and which Spark version introduced it as default?

It re-optimizes at runtime using actual data statistics.

A. Adaptive Query Execution, default ON in Spark 3.2
B. Adaptive Queue Engine, default ON in Spark 3.0
C. Automatic Query Execution, default ON in Spark 2.4
D. Advanced Query Estimation, default ON in Spark 3.1

Correct: A — Adaptive Query Execution, default ON in Spark 3.2
AQE was introduced in Spark 3.0 but was opt-in. It became ON by default in Spark 3.2. AQE re-optimizes the physical plan at runtime based on actual statistics collected during shuffle stages — enabling dynamic partition coalescing, skew join handling, and join strategy switching.

MCQ · Hard

What isolation level does Delta Lake use for concurrent readers and writers?

Readers see a point-in-time consistent view.

A. Read Committed
B. Snapshot Isolation
C. Serializable
D. Repeatable Read

Correct: B — Snapshot Isolation
Delta Lake uses Snapshot Isolation: each reader sees a consistent snapshot of the table as it existed at the time their query started, even if writers are simultaneously committing new data. This is weaker than Serializable (which prevents phantom reads and write skew) but sufficient for most data lake use cases and much more performant.

MCQ · Medium

Which join type is chosen when one table is smaller than autoBroadcastJoinThreshold?

No shuffle required on the larger table.

A. Broadcast Hash Join
B. Sort-Merge Join
C. Shuffle Hash Join
D. Nested Loop Join

Correct: A — Broadcast Hash Join
When one side of a join is ≤ spark.sql.autoBroadcastJoinThreshold (default 10MB), Catalyst automatically chooses Broadcast Hash Join. The small table is sent to every executor and held in memory. The large table is scanned locally on each executor — no shuffle needed. This is the fastest join strategy and can be forced with the broadcast() hint even if the threshold isn't met.

MCQ · Hard

In Structured Streaming, what happens if you don't set a watermark for a windowed aggregation?

Think about what Spark must keep track of for all possible late data.

A. Spark automatically limits state to 1 hour
B. State is dropped after each micro-batch
C. State grows unbounded, with potential OOM
D. Only complete output mode is available

Correct: C — State grows unbounded
Without a watermark, Spark must keep state for every window it has ever seen, in case late data arrives for that window at any future point. This causes the state store to grow without bound, eventually leading to out-of-memory errors. Watermarking tells Spark it's safe to evict state for windows older than MAX(event_time) - threshold.

MCQ · Medium

Which optimizer stage in Catalyst generates optimized JVM bytecode?

The final stage, where Tungsten's whole-stage code generation happens.

A. Analysis
B. Logical Optimization
C. Physical Planning
D. Code Generation

Correct: D — Code Generation
The fourth and final stage of the Catalyst optimizer is Code Generation (via Tungsten's whole-stage code gen). It compiles the physical plan into optimized JVM bytecode at runtime — eliminating virtual function calls, using CPU registers efficiently, and enabling SIMD-friendly data layout. This is a primary reason why DataFrame operations are so much faster than equivalent RDD code.


