Friday 12 August 2016

Increasing the Disk Space in HDFS



After successfully applying for and receiving money from an angel investor who, for the sake of anonymity, we’ll call “my wife”, I was able to complete my Hadoop-At-Home (HAH!) cluster this past week. The major infrastructure upgrade was necessary so I could move to four 1TB drives and stop worrying about running out of space as I get deeper into my learning. I decided to avoid trying to image my drives and instead just start with a fresh, clean install. This turned out to be an especially good idea, since I had borked a few things along the way, and figuring out how to unbork them would probably have taken longer than starting from scratch. I followed my inclination, and luckily the reinstall went smoothly and fairly quickly.

But then tragedy struck: last night I finished provisioning my HAH! cluster with Ambari, using the new Hortonworks HDP 2.0 release, and soon discovered that all the glorious gigabytes I had envisioned were sadly showing up as only 50GB per machine.



As usual, this turned out to be user error and a lack of understanding on my part: the result of accepting all of the defaults in the Ambari provisioning process without understanding how partitions are allocated in Linux. Essentially, HDFS is configured by default to use /hadoop/hdfs/data as its data directory. As I found, that folder lives on the root (/) file system, which my CentOS installation had only allocated 50GB by default; all the massive usable space ended up allocated to the /home file system instead. Please note that my Linux terminology here may be imprecise or inaccurate, so feel free to correct me; the screenshot below should give you an idea of what I’m saying.
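If you want to see where your space actually went, a couple of standard Linux commands (nothing Hadoop-specific here) will show the partition layout and mount points:


df -h     # human-readable used/free space for each mounted file system
lsblk     # block devices and the partitions/LVM volumes carved out of them
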


The configuration for where HDFS stores its data lives in the hdfs-site.xml file, which for me is located in /etc/hadoop/conf. Inside that file, the dfs.datanode.data.dir setting holds the directory.
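For reference, the property looks something like the snippet below; the value shown is the /hadoop/hdfs/data default that caused my problem, and yours may differ:


<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data</value>
</property>
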


To correct this, I basically created a new folder for the data to live in under the home directory and updated the configuration to point there. The /home/hdfs directory was already there, so I just added a data directory and set the permissions.


# create the new data directory under /home/hdfs and set its permissions
cd /home/hdfs
mkdir data
chmod 755 /home/hdfs/data

Rather than edit the configuration directly in the file with vim, I found that Ambari lets you edit the config through the Services tab. So first I stopped the DataNode, NameNode, and Secondary NameNode services from the Hosts tab in Ambari. Next, I updated the DataNode directories setting under Configs on the Services tab, saved the change, and restarted HDFS. Rather than overwrite the existing directory (and maybe break something), I just appended the additional directory after the original location, separated by a comma, as shown below.
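So, assuming the default original location, the DataNode directories value ends up looking roughly like this (Ambari pushes it out as dfs.datanode.data.dir in hdfs-site.xml when the services come back up):


/hadoop/hdfs/data,/home/hdfs/data
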



Be aware that you need to make sure each DataNode has this new folder, so that HDFS can find the correct location on every server when you restart its services. Upon restarting HDFS, I now have all the space I was expecting on this single node! Granted, I did get an error regarding an HDFS Check Execute on the restart, but it seems to be Puppet-related, and what’s Hadooping without an error to look at and correct later?
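
If you have several nodes, a quick sketch for handling both steps is below. The host names are made up for illustration, so substitute your own; it also assumes the standard HDP hdfs user and hadoop group, and that dfsadmin is run as the HDFS superuser:


# create the new data directory on every DataNode (hypothetical host names)
for host in hah-node1 hah-node2 hah-node3 hah-node4; do
    ssh root@"$host" "mkdir -p /home/hdfs/data && chown hdfs:hadoop /home/hdfs/data && chmod 755 /home/hdfs/data"
done

# after restarting HDFS, confirm the extra capacity is actually visible
sudo -u hdfs hdfs dfsadmin -report | grep "Configured Capacity"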


Original Article Written By: Fred LaCrosse
