HDFS, by the byte.
The HDFS exam question is almost always the same: a file of size F, a replication factor of R — how much network and how much disk does the cluster pay? The answer falls out of one design choice: pipelining.
A naive design would have the client send the whole file R times — once to each replica. HDFS doesn’t. The client sends the file once, to the first DataNode chosen by the NameNode. That DataNode forwards the bytes to the second, which forwards them to the third. The bytes flow through a pipeline, and the client pays the network cost only of the first hop.
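The forwarding chain can be sketched as a toy model (nothing like the real HDFS wire protocol, just byte accounting): each DataNode appends what it receives to its local "disk" and relays the same bytes to the next node in the pipeline.

```python
# Toy model of the HDFS write pipeline: count the bytes that cross the
# network and the bytes that land on disk. Names here are illustrative.
def pipeline_write(data: bytes, replicas: int):
    network_bytes = 0                              # bytes crossing the wire
    disks = [bytearray() for _ in range(replicas)]  # one "disk" per DataNode

    # The client streams to DataNode 1 only: one hop, F bytes.
    network_bytes += len(data)
    disks[0].extend(data)

    # Each DataNode forwards to the next as it receives: R - 1 more hops.
    for i in range(1, replicas):
        network_bytes += len(data)
        disks[i].extend(data)

    disk_bytes = sum(len(d) for d in disks)
    return network_bytes, disk_bytes

# For F = 1000 bytes and R = 3: network = F * R, disk = F * R.
net, disk = pipeline_write(b"x" * 1000, replicas=3)
```

Note the client appears in exactly one line of the model: its cost is one hop, no matter how large `replicas` is.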
Set the file size and replication factor below and press Write. Watch the bytes cross the wire from the client to the first DataNode, then propagate down the pipeline. Watch the byte counters fill in. The numbers are the same numbers Tyler asks about on the exam.
§ III.1 What just happened
The client only sees one DataNode
No matter how big R is, the client uploads F bytes once. The NameNode tells the client which DataNode to use; the client streams the file to that DataNode and walks away. The client’s bandwidth bill is F, full stop.
The pipeline pays one hop per replica
DataNode 1 forwards the bytes to DataNode 2 as it receives them; DataNode 2 forwards to DataNode 3; and so on. That’s R hops over the cluster network — one from the client to DN 1, and then R − 1 between DataNodes. Total cluster network = F × R.
Disk costs scale with R, too
Each of the R DataNodes writes the bytes to its local disk. Total cluster disk = F × R. This is the single biggest reason replication is expensive — not the network, but the storage.
Reads only pay F
To read the file the client only needs one replica. The NameNode points the client at the closest DataNode that holds it. Read network = F. And so the all-in cost of writing then reading once is F × (R + 1) on the cluster network.
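Plugging concrete numbers into the rules above, for a 1 GB file at the default replication factor:

```python
GB = 1024 ** 3
F, R = 1 * GB, 3                  # 1 GB file, default replication factor

client_upload   = F               # the client sends the file once
write_network   = F * R           # client -> DN1, then DN1 -> DN2, DN2 -> DN3
cluster_disk    = F * R           # each of the R replicas hits one disk
read_network    = F               # one replica serves the read
write_plus_read = F * (R + 1)     # 3 GB of write traffic + 1 GB of read

assert write_plus_read == write_network + read_network   # 4 GB total
```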
Smaller blocks, better balance
HDFS splits a file into 128 MB blocks (the default) and chooses placement per block. The math above holds for each block. Smaller blocks let the NameNode spread load more evenly, at the cost of more metadata it has to track in RAM.
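The per-block split is a ceiling division, and the NameNode's metadata burden grows with it: for each block it tracks R replica locations. A quick sketch, assuming the default 128 MB block size:

```python
import math

BLOCK = 128 * 1024 ** 2           # default block size, 128 MB

def num_blocks(file_bytes: int) -> int:
    # A file becomes ceil(F / block size) blocks; the last may be short.
    return math.ceil(file_bytes / BLOCK)

# A 1 GB file is exactly 8 blocks; 300 MB is 2 full blocks + a 44 MB tail.
assert num_blocks(1024 ** 3) == 8
assert num_blocks(300 * 1024 ** 2) == 3

# With R = 3, the NameNode tracks num_blocks * 3 replica locations in RAM.
replica_records = num_blocks(1024 ** 3) * 3   # 24 for the 1 GB file
```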
§ III.2 The rules to remember
client uploads : F (independent of R)
cluster network : F × R (one hop per replica)
cluster disk : F × R (each replica writes once)
read network : F (one replica is enough)
write + 1 read : F × (R + 1)
default block size : 128 MB
default R : 3
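The whole table condenses into one function of F and R, which is a handy form to memorize (a sketch in byte counts, ignoring protocol overhead and checksums):

```python
def hdfs_costs(F: int, R: int) -> dict:
    """The rules table as code: byte costs of writing, then reading once."""
    return {
        "client_upload":   F,            # independent of R
        "cluster_network": F * R,        # one hop per replica
        "cluster_disk":    F * R,        # each replica writes once
        "read_network":    F,            # one replica is enough
        "write_plus_read": F * (R + 1),
    }

costs = hdfs_costs(F=10 ** 9, R=3)       # a 1 GB file at the default R
```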