A short note on HDFS, or why you need a distributed file system

Why do you need HDFS (the Hadoop Distributed File System)? If the amount of data is small and fits on a single computer, you do not need a distributed file system. But if you want to process an amount of data that cannot be stored on one computer, you need to think about a distributed file system.

From the user's perspective, a distributed file system looks like the normal file system on your computer. This is because of a logical layer that hides all implementation details. When you copy a file into HDFS, HDFS automatically distributes it across the computers in the cluster. Two parameters are very important here: the first is the block size, the second is the replication factor.
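
To illustrate the "looks like a normal file system" point, here is a minimal sketch using the Hadoop FileSystem API in Java. The NameNode address and the file paths are placeholders chosen for illustration, not part of any particular setup.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; splitting and distribution happen behind the scenes.
        fs.copyFromLocalFile(new Path("/tmp/measurements.csv"),
                             new Path("/data/measurements.csv"));

        // From the user's point of view it still looks like an ordinary file system.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

The copy and list calls behave much like their local counterparts; the distribution of the data is invisible at this level.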

HDFS starts by dividing your file into chunks of data whose size is specified by the block size parameter. Next, HDFS copies these chunks to different computers in the cluster. If the replication factor is equal to one, each chunk is stored only once. This is not good from a recovery point of view: if the computer holding a chunk fails, your data is lost. By default, the replication factor is equal to three, which means HDFS stores the same chunk on three different computers.
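
The two parameters can be set on the client side when writing files. The sketch below assumes the standard HDFS configuration keys dfs.blocksize and dfs.replication; the 64 MB block size and the file paths are illustrative values, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockAndReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults for newly written files.
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // split files into 64 MB chunks
        conf.setInt("dfs.replication", 3);                // keep 3 copies of each chunk

        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/events.log"));

        // Replication can also be changed per file after it has been written.
        fs.setReplication(new Path("/data/events.log"), (short) 2);
        fs.close();
    }
}
```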

A side effect of replication is improved computational speed. With several copies of each chunk in the cluster, it is more likely that the computation can run on the same computer where the data is located, so no network transfer is needed.
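
This is the information a scheduler uses to place tasks next to the data. As a sketch, the FileSystem API can report which hosts hold each block of a file; the file path below is again a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

        // One BlockLocation per chunk; each lists the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```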