How to Install and Configure Apache Hadoop on Ubuntu 20.04

Apache Hadoop is an open-source framework used to store, manage, and process data for a wide range of big data applications running on clustered systems. It is written in Java, with some native code in C and shell scripts. It uses the Hadoop Distributed File System (HDFS) and scales from a single server to thousands of machines.

Apache Hadoop is based on four main components:

  • Hadoop Common: A collection of utilities and libraries needed by the other Hadoop modules.
  • HDFS: The Hadoop Distributed File System, which stores data distributed across multiple nodes.
  • MapReduce: A framework used to write applications that process large amounts of data.
  • Hadoop YARN: Yet Another Resource Negotiator, the Hadoop resource-management layer.

In this tutorial, we will explain how to set up a single-node Hadoop cluster on Ubuntu 20.04.

Prerequisites

  • A server running Ubuntu 20.04 with 4 GB of RAM.
  • A root password configured on the server.

Update System Packages

Before starting, it is recommended to update your system packages to the latest versions. You can do this with the following commands:

apt-get update -y
apt-get upgrade -y

After your system has been updated, restart it to apply the changes.
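
If you prefer to reboot from the command line, the following command works (note that it will close your current SSH session):

reboot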

Install Java

Apache Hadoop is a Java-based application, so you need to install Java on your system. You can install it with the following command:

apt-get install default-jdk default-jre -y

Once installed, you can verify the version of Java that is installed with the following command:

java -version

You should get the following output:

openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-3ubuntu1, mixed mode, sharing)
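
If more than one Java version ends up installed on your system, you can check or change the default one with the update-alternatives tool, for example:

update-alternatives --config java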

Create a Hadoop User and Set Up Passwordless SSH

First, create a new user named hadoop with the following command:

adduser hadoop

Next, add the hadoop user to the sudo group:

usermod -aG sudo hadoop

Next, log in as the hadoop user and generate an SSH key pair with the following commands:

su - hadoop
ssh-keygen -t rsa

You should get the following output:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:HG2K6x1aCGuJMqRKJb+GKIDRdKCd8LXnGsB7WSxApno hadoop@ubuntu2004
The key's randomart image is:
+---[RSA 3072]----+
|..=..            |
| O.+.o   .       |
|oo*.o + . o      |
|o .o * o +       |
|o+E.= o S        |
|=.+o * o         |
|*.o.= o o        |
|=+ o.. + .       |
|o ..  o .        |
+----[SHA256]-----+

Next, append this key to the authorized_keys file and set the appropriate permissions:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Next, verify passwordless SSH access with the following command:

ssh localhost

After logging in without a password, you can proceed to the next step.
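
Once the passwordless login works, you can close the nested SSH session and return to the hadoop user's original shell:

exit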

Install Hadoop

First, log in as the hadoop user and download Hadoop 3.2.1 with the following commands:

su - hadoop
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
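
Optionally, you can verify the integrity of the downloaded archive before extracting it by computing its SHA-512 checksum and comparing the result with the checksum published on the Apache Hadoop download page:

sha512sum hadoop-3.2.1.tar.gz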

After the download is complete, extract the downloaded file with the following command:

tar -xvzf hadoop-3.2.1.tar.gz

Next, move the extracted directory to /usr/local/:

sudo mv hadoop-3.2.1 /usr/local/hadoop

Next, create a directory to store logs with the following command:

sudo mkdir /usr/local/hadoop/logs

Next, change the ownership of the Hadoop directory to the hadoop user:

sudo chown -R hadoop:hadoop /usr/local/hadoop

Next, you need to configure the Hadoop environment variables. You can do this by editing the ~/.bashrc file:

nano ~/.bashrc

Add the following lines:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and close the file when you are finished. Then, activate the environment variables with the following command:

source ~/.bashrc
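
As a quick sanity check, you can confirm that the new environment variables are available in your shell, for example:

echo $HADOOP_HOME

This should print /usr/local/hadoop.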

Configure Hadoop

In this section, we will learn how to set up Hadoop on one node.

Configure Java Environment Variables

Next, you need to define the Java environment variables in hadoop-env.sh to configure YARN, HDFS, MapReduce, and other Hadoop-related project settings.

First, find the correct Java path using the following command:

which javac

You will see the following output:

/usr/bin/javac

Next, find the OpenJDK directory with the following command:

readlink -f /usr/bin/javac

You will see the following output:

/usr/lib/jvm/java-11-openjdk-amd64/bin/javac

Next, edit the hadoop-env.sh file and specify the Java path:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add the following lines:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Next, you also need to download the javax activation jar file into Hadoop's lib directory. You can do this with the following commands:

cd /usr/local/hadoop/lib
sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
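
To confirm that the jar file landed in Hadoop's lib directory, you can list it:

ls -l /usr/local/hadoop/lib/javax.activation-api-1.2.0.jar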

You can now verify the Hadoop version using the following command:

hadoop version

You should get the following output:

Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

Configure the core-site.xml file

Next, you must specify the URL for your NameNode. You can do this by editing the core-site.xml file:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following lines:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://0.0.0.0:9000</value>
      <description>The default file system URI</description>
   </property>

</configuration>

Save and close the file when you are finished:

Configure the hdfs-site.xml file

Next, you need to define the directories for storing node metadata, the fsimage file, and the edit log files. You can do this by editing the hdfs-site.xml file. First, create the directories that will hold the NameNode and DataNode data:

sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
sudo chown -R hadoop:hadoop /home/hadoop/hdfs

Next, edit the hdfs-site.xml file and specify the directory location:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following lines:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hdfs/datanode</value>
   </property>
</configuration>

Save and close the file.

Configure the mapred-site.xml file

Next, you need to define the MapReduce framework to use. You can do this by editing the mapred-site.xml file:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following lines:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Save and close the file.

Configure the yarn-site.xml File

Next, you need to edit the yarn-site.xml file and specify the YARN-related settings:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following lines:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

Save and close the file when you are finished.

Format NameNode HDFS

Next, you need to validate the Hadoop configuration and format the HDFS NameNode.

First, log in as the hadoop user and format the HDFS NameNode with the following commands:

su - hadoop
hdfs namenode -format

You should get the following output:

2020-06-07 11:35:57,691 INFO util.GSet: VM type       = 64-bit
2020-06-07 11:35:57,692 INFO util.GSet: 0.25% max memory 1.9 GB = 5.0 MB
2020-06-07 11:35:57,692 INFO util.GSet: capacity      = 2^19 = 524288 entries
2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2020-06-07 11:35:57,710 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-06-07 11:35:57,710 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-06-07 11:35:57,712 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-06-07 11:35:57,712 INFO util.GSet: VM type       = 64-bit
2020-06-07 11:35:57,712 INFO util.GSet: 0.029999999329447746% max memory 1.9 GB = 611.9 KB
2020-06-07 11:35:57,712 INFO util.GSet: capacity      = 2^16 = 65536 entries
2020-06-07 11:35:57,743 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1242120599-69.87.216.36-1591529757733
2020-06-07 11:35:57,763 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
2020-06-07 11:35:57,817 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
2020-06-07 11:35:57,972 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 398 bytes saved in 0 seconds .
2020-06-07 11:35:57,987 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-06-07 11:35:58,000 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-06-07 11:35:58,003 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu2004/69.87.216.36
************************************************************/
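
If you want to double-check the result, you can list the NameNode storage directory created earlier; after a successful format it contains a current subdirectory holding the fsimage files mentioned in the output above:

ls /home/hadoop/hdfs/namenode/current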

Start the Hadoop Cluster

First, start NameNode and DataNode with the following command:

start-dfs.sh

You should get the following output:

Starting namenodes on [0.0.0.0]
Starting datanodes
Starting secondary namenodes [ubuntu2004]

Next, start the YARN ResourceManager and NodeManagers by running the following command:

start-yarn.sh

You should get the following output:

Starting resourcemanager
Starting nodemanagers

You can now verify that all of the services are running with the following command:

jps

You should get the following output:

5047 NameNode
5850 Jps
5326 SecondaryNameNode
5151 DataNode
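
As a quick smoke test of the running cluster, you can create a directory in HDFS and list the root of the file system (the directory name /test here is just an example):

hdfs dfs -mkdir /test
hdfs dfs -ls /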

Access the Hadoop Web Interface

You can now access the Hadoop NameNode web interface using the URL http://your-server-ip:9870.

You can also access individual DataNodes using the URL http://your-server-ip:9864.

To access the YARN Resource Manager, use the URL http://your-server-ip:8088.
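
When you no longer need the cluster, you can stop the YARN and HDFS daemons with the matching stop scripts from Hadoop's sbin directory:

stop-yarn.sh
stop-dfs.sh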

Conclusion

Congratulations! You have successfully installed Hadoop on a single node. You can now start exploring basic HDFS commands and design a fully distributed Hadoop cluster. Feel free to ask me if you have any questions.
