HDFS, by the byte.
The HDFS exam question is almost always the same: a file of size F, a replication factor of R — how much network and how much disk does the cluster pay? The answer falls out of one design choice: pipelining.
A naive design would have the client send the whole file R times — once to each replica. HDFS doesn’t. The client sends the file once, to the first DataNode chosen by the NameNode. That DataNode forwards the bytes to the second, which forwards them to the third. The bytes flow through a pipeline, and the client pays the network cost only of the first hop.
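The forwarding chain can be sketched as a toy model (nothing like the real HDFS wire protocol, just byte accounting): each DataNode appends what it receives to its local "disk" and relays the same bytes to the next node in the pipeline.

```python
# Toy model of the HDFS write pipeline: count the bytes that cross the
# network and the bytes that land on disk. Names here are illustrative.
def pipeline_write(data: bytes, replicas: int):
    network_bytes = 0                              # bytes crossing the wire
    disks = [bytearray() for _ in range(replicas)]  # one "disk" per DataNode

    # The client streams to DataNode 1 only: one hop, F bytes.
    network_bytes += len(data)
    disks[0].extend(data)

    # Each DataNode forwards to the next as it receives: R - 1 more hops.
    for i in range(1, replicas):
        network_bytes += len(data)
        disks[i].extend(data)

    disk_bytes = sum(len(d) for d in disks)
    return network_bytes, disk_bytes

# For F = 1000 bytes and R = 3: network = F * R, disk = F * R.
net, disk = pipeline_write(b"x" * 1000, replicas=3)
```

Note the client appears in exactly one line of the model: its cost is one hop, no matter how large `replicas` is.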
Set the file size and replication factor below and press Write. Watch the bytes cross the wire from the client to the first DataNode, then propagate down the pipeline. Watch the byte counters fill in. The numbers are the same numbers Tyler asks about on the exam.
§ III.1 What just happened
The client only sees one DataNode
No matter how big R is, the client uploads F bytes once. The NameNode tells the client which DataNode to use; the client streams the file to that DataNode and walks away. The client’s bandwidth bill is F, full stop.
The pipeline pays one hop per replica
DataNode 1 forwards the bytes to DataNode 2 as it receives them; DataNode 2 forwards to DataNode 3; and so on. That’s R hops over the cluster network — one from the client to DN 1, and then R − 1 between DataNodes. Total cluster network = F × R.
Disk costs scale with R, too
Each of the R DataNodes writes the bytes to its local disk. Total cluster disk = F × R. This is the single biggest reason replication is expensive — not the network, but the storage.
Reads only pay F
To read the file the client only needs one replica. The NameNode points the client at the closest DataNode that holds it. Read network = F. And so the all-in cost of writing then reading once is F × (R + 1) on the cluster network.
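Plugging concrete numbers into the rules above, for a 1 GB file at the default replication factor:

```python
GB = 1024 ** 3
F, R = 1 * GB, 3                  # 1 GB file, default replication factor

client_upload   = F               # the client sends the file once
write_network   = F * R           # client -> DN1, then DN1 -> DN2, DN2 -> DN3
cluster_disk    = F * R           # each of the R replicas hits one disk
read_network    = F               # one replica serves the read
write_plus_read = F * (R + 1)     # 3 GB of write traffic + 1 GB of read

assert write_plus_read == write_network + read_network   # 4 GB total
```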
Smaller blocks, better balance
HDFS splits a file into 128 MB blocks (the default) and chooses placement per block. The math above holds for each block. Smaller blocks let the NameNode spread load more evenly, at the cost of more metadata it has to track in RAM.
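The per-block split is a ceiling division, and the NameNode's metadata burden grows with it: for each block it tracks R replica locations. A quick sketch, assuming the default 128 MB block size:

```python
import math

BLOCK = 128 * 1024 ** 2           # default block size, 128 MB

def num_blocks(file_bytes: int) -> int:
    # A file becomes ceil(F / block size) blocks; the last may be short.
    return math.ceil(file_bytes / BLOCK)

# A 1 GB file is exactly 8 blocks; 300 MB is 2 full blocks + a 44 MB tail.
assert num_blocks(1024 ** 3) == 8
assert num_blocks(300 * 1024 ** 2) == 3

# With R = 3, the NameNode tracks num_blocks * 3 replica locations in RAM.
replica_records = num_blocks(1024 ** 3) * 3   # 24 for the 1 GB file
```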
§ III.2 The rules to remember
client uploads : F (independent of R)
cluster network : F × R (one hop per replica)
cluster disk : F × R (each replica writes once)
read network : F (one replica is enough)
write + 1 read : F × (R + 1)
default block size : 128 MB
default R : 3
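The whole table condenses into one function of F and R, which is a handy form to memorize (a sketch in byte counts, ignoring protocol overhead and checksums):

```python
def hdfs_costs(F: int, R: int) -> dict:
    """The rules table as code: byte costs of writing, then reading once."""
    return {
        "client_upload":   F,            # independent of R
        "cluster_network": F * R,        # one hop per replica
        "cluster_disk":    F * R,        # each replica writes once
        "read_network":    F,            # one replica is enough
        "write_plus_read": F * (R + 1),
    }

costs = hdfs_costs(F=10 ** 9, R=3)       # a 1 GB file at the default R
```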