One of the well-known best practices for HDFS is to store data in few large files, rather than a large number of small ones. There are a few problems related to using many small files but the ultimate HDFS killer is that the memory consumption on the name node is proportional to the number of files stored in the cluster and it doesn’t scale well when that number increases rapidly.
MapR has its own implementation of the Hadoop filesystem (called MapR-FS) and one of its claims to fame is to scale and work well with small files. In practice, though, there are a few things you should do to ensure that the performance of your map-reduce jobs does not degrade when they are dealing with too many small files, and I’d like to cover some of those.
I stumbled upon this when investigating the performance of a job in production that was taking several hours to run on a 40-node cluster. The cluster had spare capacity but the job was progressing very slowly and using only 3 of the 40 available nodes.
When I looked into the data that was being processed by the active mappers, I noticed that vast majority of the splits being read by the mappers were in blocks that were replicated into the same 3 cluster nodes. There was a significant data distribution skew towards those 3 nodes and since the map-reduce tasks prefer to execute on nodes where the data is local, the rest of the cluster sat idle while those 3 nodes were IO bound and processing heavily.
Differently from HDFS, MapR-FS doesn’t have name nodes. The file metadata is distributed across different data nodes instead. This is the key for getting rid of the name node memory limitation of HDFS, and let MapR-FS handle a lot more files, small or large, than a HDFS cluster.
Files in MapR-FS have, by default, blocks of 256MB. Blocks are organised in logical structures called “containers”. When a new block is created it is automatically assigned to one existing container within the volume that contains that file. The container determines the replication factor (3 by default) and the nodes where the replicas will be physically stored. Containers are bound to a MapR volume and cannot span multiple volumes.
There’s also a special container in MapR-FS called a “name container”, which is where the volume namespace and file chunk locations are stored. Besides the metadata, the name container always stores the first 64KB of the file’s data.
Also, there’s only a single name container per MaprFS volume. So the metadata for all the files in a volume, along with the files’ first 64KB of data, will be all stored in the same name container. The larger the number of files in a volume, the more data this container will be replicating across the same 3 cluster nodes (by default).
So, if your data set is comprised of a very large number of small files (with sizes around 64KB or less) and is all in the sae volume, most of the data will be stored in the same 3 cluster nodes, regardless of the cluster size. Even if you had a very large cluster, whenever you ran a map-reduce job to process those files, the job’s tasks would be pretty much allocated on only 3 nodes of the cluster due to data locality. Those 3 nodes would be under heavy load while the rest of the cluster would sit idle.
To give you an idea of the dimension of this problem, the first time I noticed this in production was due to a Hive query that was causing high load only in 3 nodes of 40-node cluster. The job took 5 hours to complete. When I looked into the problem I found that the table used by the Hive query had tens of thousands of very small files, many of them smaller than 64K, due to the way the data was being ingested.
We coalesced the table to combine all those small files into a much smaller number of bigger ones. The job ran again after that, without any changes, and completed in just 15 minutes!! To be completely fair, we also changed the table’s file format from SequenceFile to RCFile at the same time we coalesced the data, which certainly brought some additional performance improvements. But, from the 3-node contention I saw during the first job run, I’m fairly convinced that the main issue in this case was the data distribution skew due to the large amount of small files.
This kind of problem is mitigated when large files are used, since only a small fraction of the data (everything below the 64KB mark) will be stored in the name container, with the rest distributed across other containers and, therefore, other nodes. We’ll also have a smaller number of files (for a similar data volume), which reduces the problem even more.
If your data is ingested in a way that creates many small files, plan to coalesce those files into larger ones on a regular basis. One good tool for that is Edward Capriolo’s File Crusher. This is also (and especially) applicable to HDFS.
Best practice #1: Keep you data stored into large files. Pay special attention to incremental ingestion pipelines, which may create many small files, and coalesce them on a regular basis.
A quick and dirty workaround for the 3-node contention issue explained above would be to increase the replication factor for the name container. This would allow more nodes to run map-reduce tasks on that data. However, it would also use a lot more disk space just to achieve the additional data locality across a larger number of nodes. This is NOT an approach I would recommend to solve this particular problem.
Instead, the proper way to solve this in Mapr-FS is to split your data across different volumes. Especially if you’re dealing with a large number of small files that cannot be coalesced, splitting them across multiple volumes will keep the number of files per volume (and per name container) under control and it will also spread the small files’ data evenly across the cluster, since each volume will have its own name container, replicate across a different set of nodes.
The volumes may, or may not, follow your data lifecycle, with monthly, weekly or even daily volumes, depending on the amount of data being ingested and files being created.
Best practice #2: Use Mapr-FS volumes to plan your data distribution and keep the number of files per volume under control.