The instructions below used to appear at the beginning of my guide to installing Hadoop. However, I have come to realize that you can install HDFS without moving on to install and configure Hadoop's other services (YARN, MapReduce, etc.), so I decided to split the instructions. If you just need to install HDFS, for instance to use it in conjunction with Spark, read on.
Even though we are just installing HDFS, that still means downloading a version of Hadoop. In July 2015, the latest version of Hadoop was 2.7.1. So:
- Head to <http://hadoop.apache.org/releases.html>
- Download the binary version of Hadoop: hadoop-2.7.1.tar.gz
- scp the archive to all of the machines that you want to participate in the HDFS cluster (see the example below).
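For example, assuming the archive is in your current directory and you can log in to each node as the ubuntu account (adjust the user names, and use host names or IP addresses to match your set-up), the copy might look like this:
scp hadoop-2.7.1.tar.gz ubuntu@master:
scp hadoop-2.7.1.tar.gz ubuntu@worker01:
scp hadoop-2.7.1.tar.gz ubuntu@worker02: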
WARNING: As you will see, installing HDFS is NOT for the faint-hearted.
NOTE: These instructions are adapted from the information that appears in Chapter 10 of Hadoop: The Definitive Guide, 4th Edition. The text in that chapter is written at a high level of abstraction. My instructions below contain way more details. 😃
We will be configuring multiple machines to run as an HDFS cluster. In my case, I created five VMs: one master node and four worker nodes (worker01, worker02, etc.). NOTE: Unless otherwise indicated, you need to follow the instructions below on each node in your cluster.
Edit the /etc/hosts file on each node in your cluster so you can refer to each machine by its name rather than its IP address. Edit the file (sudo vi /etc/hosts) to look something like this:
123.1.1.25 master
123.1.1.26 worker01
123.1.1.27 worker02
123.1.1.28 worker03
123.1.1.29 worker04
Note: You will need to change the IP addresses listed above to match your set-up.
Once you have saved the file, you can test that the changes worked with the ping command:
ping master
ping worker01
ping ...
Make sure that your machines have Java installed. See my instructions or use the ~1.5B pages on the Internet that also discuss how to do that.
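A quick way to confirm on each node that Java is installed and on the path:
java -version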
We need to create user accounts for hadoop and hdfs. You will need to create passwords for each account and keep track of them separately.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
sudo adduser --ingroup hadoop hdfs
Let's place the previously downloaded Hadoop archive into a known location. I decided to use /usr/local/src for that. You can use a different directory if you want.
- cd to the directory that contains the Hadoop archive.
sudo mv <archive> /usr/local/src
cd /usr/local/src
sudo tar xzf <archive>
sudo rm <archive> (Optional)
sudo chown -R hadoop:hadoop hadoop-2.7.1 (change to match the version you downloaded)
We need to make sure that the accounts on each machine have access to Hadoop (and by extension, HDFS) and its programs.
- Add the following line to /etc/environment (sudo vi /etc/environment):
HADOOP_HOME="/usr/local/src/hadoop-2.7.1"
- Add the following line to /etc/bash.bashrc:
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Log out and log back in, then check that you have hadoop in your path (see the example below).
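A quick sanity check, assuming the PATH change above took effect when you logged back in:
which hadoop
hadoop version
The first command should print a path under /usr/local/src/hadoop-2.7.1/bin and the second should report version 2.7.1.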
Note: Be sure your version of HADOOP_HOME matches the location where you unpacked the Hadoop archive. Recall that I used /usr/local/src but you are free to use other directories. I won't mention this issue again moving forward but be aware of it.
We need to create an RSA key pair for the hdfs account. The private and public keys are stored on the master node. The public key for this account must then be copied to the corresponding location on each of the worker nodes.
On the master node, run the following commands:
- Switch to the hdfs user:
su - hdfs
- Generate an RSA key pair:
ssh-keygen -t rsa -f ~/.ssh/id_rsa
- Be sure to enter a passphrase for the key pair and record it in a safe location.
- Append the public key to the hdfs user's authorized_keys file on the master:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
We need to copy the public key from the master to each of the worker nodes. We will use ssh-copy-id for this and simply enter the password that you created for the hdfs account earlier.
On the master node, run the following commands FOR EACH WORKER NODE:
su - hdfs
ssh-copy-id hdfs@worker01
- Say 'yes' if you are asked to accept the fingerprint of the worker01 machine
- Enter the password for the hdfs account on worker01 if asked
- Test the ability to login to the worker01 machine:
ssh worker01
- You will be asked for your passphrase. Type it and you will be logged in on the remote machine.
As the ubuntu account on your master node, install keychain
sudo apt-get update
sudo apt-get install keychain
Then log in to the hdfs account and edit its .profile to make use of keychain:
- su - hdfs
- vi .profile
# Add the following lines to the end of the .profile file
/usr/bin/keychain $HOME/.ssh/id_rsa
source $HOME/.keychain/$HOSTNAME-sh
- exit
- su - hdfs
- Keychain will ask you to enter your passphrase
- Now if you type ssh worker01, you should be logged in without having to type your passphrase
- You will no longer need to enter your passphrase until your virtual machine is rebooted
Based on my own Internet searches, the information in this section is hard to come by. You're welcome. 😃
There are up to three different directories that now need to be created on each node in the cluster:
| Directory | Description |
|---|---|
| tmp | A tmp directory on the master node. |
| namenode | A directory for the namenode server on the master node. |
| datanode | A directory for the datanode server on the master node and all worker nodes. |
A simple location for these directories is in the /home/hdfs/ account. But only create them in this directory if /home/hdfs is on a large disk partition in your set-up. For instance, in my set-up, I created 500 GB data volumes for each of my worker nodes. I then placed their datanode directories on those data volumes.
For now, the instructions here and below will assume that all three directories are created in /home/hdfs. But, given the information above, you can place them in different directories. Just make sure you adjust the information in the configuration files below with the locations that you used.
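For example, suppose a worker's 500 GB data volume were mounted at /data (a hypothetical mount point; substitute your own). You might create the datanode directory there instead:
sudo mkdir -p /data/hdfs/datanode
sudo chown -R hdfs:hadoop /data/hdfs/datanode
and then use file:///data/hdfs/datanode as the value of dfs.datanode.data.dir in hdfs-site.xml below.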
On the master node, create a directory for HDFS to store temporary files.
su - hdfs
mkdir tmp
chmod 777 tmp
On the master node, stay logged in as the hdfs user and create a directory for the namenode server
su - hdfs
mkdir namenode
chmod 777 namenode
On the master node and on all worker nodes, create a directory for the datanode server
su - hdfs
mkdir datanode
chmod 777 datanode
We need a single location for Hadoop-related log files on all nodes in the cluster.
su - hadoop
cd /usr/local/src/hadoop-2.7.1
mkdir logs
chmod 777 logs
Edit the file /usr/local/src/hadoop-2.7.1/etc/hadoop/core-site.xml on all nodes of the cluster such that the configuration tag now looks like this:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Hadoop Directory for Temporary Files</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:54310</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>
This tells Hadoop where the temp directory is located and that we are going to be using HDFS to store our files.
Edit the file /usr/local/src/hadoop-2.7.1/etc/hadoop/hdfs-site.xml on all nodes of the cluster such that the configuration tag now looks like this:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hdfs/datanode</value>
</property>
</configuration>
This tells HDFS to store two copies of each file, and it tells HDFS where to store its namenode files and its datanode files. I will need to update these instructions to place these directories on disk partitions that have a lot of disk space. For now, I'm just trying to get a simple multi-node configuration up and running.
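If you want to double-check that HDFS is picking up these values, the standard hdfs getconf tool can read them back (this assumes hadoop is on your path, as set up earlier):
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication
The first should print hdfs://master:54310 and the second should print 2.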
On the master node, edit the slaves file (/usr/local/src/hadoop-2.7.1/etc/hadoop/slaves) to look like this:
master
worker01
If you have more worker nodes, then list them below worker01, one per line.
On the master node, perform these commands to format the namenode directory
su - hdfs
hdfs namenode -format
If all goes correctly, you should see the string "/home/hdfs/namenode has been successfully formatted." somewhere in the massive amount of output generated by the second command. You can also cd to the namenode directory and see that it is no longer empty.
On the master node, execute these commands:
su - hdfs
start-dfs.sh
On the master, run the command jps as the hdfs user. You should see something similar to:
26791 DataNode
26983 SecondaryNameNode
27098 Jps
26635 NameNode
This indicates that on the master, there is a name node, a secondary name node, and a data node running.
On one of the worker nodes, run the command jps as the hdfs user. You should see something similar to:
15073 Jps
14978 DataNode
This indicates that on this worker, a data node was successfully launched.
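You can also get a cluster-wide view. As the hdfs user on the master, hdfs dfsadmin -report prints capacity information and the list of live datanodes, which should include the master and every worker:
hdfs dfsadmin -report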
If you ever need to shut the HDFS cluster down, follow these steps. On the master node:
su - hdfs
stop-dfs.sh
At this point, you are done! Your HDFS cluster is now up and running! Congrats!
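If you want a quick smoke test before moving on, run the following as the hdfs user on the master (the directory and file names here are just placeholders):
hdfs dfs -mkdir -p /user/hdfs
echo "hello, hdfs" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/hdfs/
hdfs dfs -ls /user/hdfs
hdfs dfs -cat /user/hdfs/hello.txt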
Now, head over to my page on using HDFS to get started on using your new cluster.