This is one of the major tests you may want to run to see how HDFS is performing. We will now look at how to use these tests to measure how efficiently HDFS is able to write and read data.
As seen in the preceding screenshot, the library provides tools to test DFS through an option called TestDFSIO. Now, let's execute the write test to understand how efficiently HDFS is able to write big files. The following is the command to execute the write test:
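A typical invocation looks like the following sketch. The exact path and version of the job client test jar vary by Hadoop distribution, so adjust it to match your installation:

```shell
# Write test: create 2 files of 1 GB each in HDFS.
# The jar path below is an assumption -- locate the
# hadoop-mapreduce-client-jobclient-*-tests.jar shipped with your distribution.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 2 -fileSize 1GB
```

On older Hadoop releases, `-fileSize` takes a plain number of megabytes (for example, `-fileSize 1024`) instead of a suffixed value.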
Once you initiate the preceding command, a MapReduce job will start, which will write two 1 GB files to HDFS. You can choose different values based on your cluster size. These tests create data in HDFS under the /benchmarks directory. Once the execution is complete, you will see these results:
The preceding data is calculated from the raw data generated by the MapReduce program. You can also view the raw data as follows:
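Assuming the default /benchmarks location mentioned above, the per-task raw measurements can be read straight out of HDFS; the exact part-file name may differ on your cluster:

```shell
# Raw per-task output of the write test (default TestDFSIO location).
hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part-00000
```

The summarized results are also appended to a local TestDFSIO_results.log file in the directory from which you ran the job.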
Tip
The following formulae are used to calculate the throughput, the average IO rate, and the standard deviation. Here, size is in bytes and time is in milliseconds, so dividing by 1048576 converts bytes to MB and dividing by 1000 converts milliseconds to seconds:

Throughput (MB/s) = (size * 1000) / (time * 1048576)

Average IO rate (MB/s) = rate / (1000 * tasks)

Standard deviation = sqrt(abs(sqrate / (1000 * tasks) – (Average IO rate)^2))
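The formulae above can be sketched in a few lines of Python. The accumulator names (size, time, rate, sqrate, tasks) mirror the raw values TestDFSIO collects, and the sample inputs are hypothetical:

```python
import math

def dfsio_stats(size_bytes, time_ms, rate, sqrate, tasks):
    """Compute TestDFSIO-style summary statistics from raw accumulators.

    size_bytes: total bytes moved; time_ms: total task time in ms;
    rate / sqrate: sums of the per-task rate and rate-squared values,
    scaled by 1000 as in the formulae above; tasks: number of map tasks.
    """
    throughput = (size_bytes * 1000) / (time_ms * 1048576)       # MB/s
    avg_io_rate = rate / (1000 * tasks)                          # MB/s
    std_dev = math.sqrt(abs(sqrate / (1000 * tasks) - avg_io_rate ** 2))
    return throughput, avg_io_rate, std_dev

# Hypothetical raw values for 2 tasks writing 1 GB each in 20 seconds total:
t, avg, sd = dfsio_stats(size_bytes=2 * 1024**3, time_ms=20000,
                         rate=210000.0, sqrate=22060000.0, tasks=2)
# throughput = 2048 MB / 20 s = 102.4 MB/s
```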
Similarly, you can perform benchmarking of HDFS read operations as well:
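The read test reuses the files created by the write test, so run it with the same file count and size. As before, the jar path is an assumption to adapt to your installation:

```shell
# Read test: read back the 2 x 1 GB files written by the write test.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 2 -fileSize 1GB
```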
At the end of the execution, a reducer will collect the data from the raw results, and you will see the calculated numbers for the DFSIO reads:
Here, we can take a look at the raw data as well:
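Mirroring the write test, the raw read measurements live under the same default benchmark directory; again, the part-file name may differ on your cluster:

```shell
# Raw per-task output of the read test (default TestDFSIO location).
hdfs dfs -cat /benchmarks/TestDFSIO/io_read/part-00000
```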
The same formulae are used to calculate the throughput, average IO rate, and standard deviation.
This way, you can benchmark the DFSIO reads and writes.
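Once you are done benchmarking, TestDFSIO provides a clean-up mode that removes the benchmark data it created under /benchmarks (same jar-path caveat as above):

```shell
# Remove the /benchmarks/TestDFSIO data once benchmarking is complete.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -clean
```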