Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

January 24, 2018

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features comparable to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

Hadoop is a project of the Apache Software Foundation. It is a Java-based framework that manages large data sets across a group of cluster machines. Configuring a full cluster with Hadoop is hard, but we can also install Hadoop on a single machine to perform some basic operations.

Hadoop may appear to be a single piece of software, but it is made up of several components:

Hadoop Common: A collection of utilities and libraries that support the other Hadoop modules.

HDFS: The Hadoop Distributed File System, responsible for storing the data on disk.

YARN: Short for Yet Another Resource Negotiator, the open-source framework that handles resource management and job scheduling for distributed processing.

MapReduce: A programming model for processing and generating big data sets in the cluster using parallel, distributed algorithms.
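The map-shuffle-reduce flow can be sketched with ordinary shell tools. This is a conceptual analogy only, not how Hadoop actually executes jobs: splitting the input into words plays the map role, sort performs the shuffle that groups identical keys, and uniq -c is the reduce step that aggregates each group.

```shell
# Word counting in the MapReduce style, using plain shell as an analogy:
#   map:     emit one key (word) per line
#   shuffle: bring identical keys together (sort)
#   reduce:  aggregate each group of keys (uniq -c)
printf 'to be or not to be\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

This prints each distinct word with its count (to and be twice, or and not once), sorted by frequency; a real MapReduce job distributes the same three phases across the cluster.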

These are the base modules, but many additional modules are available for the updated Hadoop version 2.0.

Java Installation
First, we have to update the package list:

$ sudo apt-get update

After that, install OpenJDK, the default Java Development Kit on Ubuntu 16.04:

$ sudo apt-get install default-jdk

Once the installation is done, confirm the version:

$ java -version

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Hadoop Installation
With Java in place, we will visit the Apache Hadoop Releases page to find the most recent stable release and follow the binary link for the current release. Here we are going to install Hadoop 2.7.3 on Ubuntu.

Figure 1: Hadoop Version Release
Figure 2: Select version

On the server, we will use wget to fetch it:

$ wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Note: we will be redirected to an available mirror, so the actual URL may not match the one given above.
To make sure the downloaded file hasn't been altered, we will do an SHA-256 check. Go back to the releases page, follow the Apache link, and browse to the Apache web directory, where we have to find the .mds file for the version we downloaded.

Copy the link of that file and use that with wget as mentioned below.
$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.mds

After that, do the verification using the command below:

$ shasum -a 256 hadoop-2.7.3.tar.gz
We will get the following output.
d489df3808244b906eb38f4d081ba49e50c4603db03efd5e594a1e98b09259c2 hadoop-2.7.3.tar.gz
Now, check the SHA-256 value in the .mds file:

$ cat hadoop-2.7.3.tar.gz.mds

...
hadoop-2.7.3.tar.gz: SHA256 = D489DF38 08244B90 6EB38F4D 081BA49E 50C4603D B03EFD5E 594A1E98 B09259C2
...

The two outputs should match. We can safely ignore the difference in case and the spaces: the output of the command we ran against the file downloaded from the mirror should match the value in the file we downloaded from apache.org.

Now that we've verified that the file wasn't corrupted or changed, we'll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we're extracting from a file. Use tab-completion or substitute the exact version number in the command below:

$ tar -xzvf hadoop-2.7.3.tar.gz
Finally, we will move the extracted files into /usr/local, the suitable place for locally installed software. Change the version number, if needed, to match the version we downloaded.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop

With the software in place, we're ready to configure its environment.

Configuring Hadoop's Java Home
Hadoop requires the path to Java to be set, either as an environment variable or in Hadoop's configuration file. The path /usr/bin/java is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path recursively, and sed to trim bin/java from the output, which gives us the correct value for Java home.

To find the default Java path:

$ readlink -f /usr/bin/java | sed "s:bin/java::"

Output:

/usr/lib/jvm/java-8-openjdk-amd64/jre/
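As a sanity check, the derived path can be captured in a variable and verified to actually contain a java binary. This is a quick sketch; the variable name is arbitrary, and the trailing slash left by the sed expression is intentional.

```shell
# Capture the derived Java home and confirm a java executable lives there.
JAVA_HOME_CANDIDATE=$(readlink -f /usr/bin/java | sed "s:bin/java::")
echo "Derived Java home: $JAVA_HOME_CANDIDATE"
# The sed output keeps a trailing slash, so concatenate without adding one.
if [ -x "${JAVA_HOME_CANDIDATE}bin/java" ]; then
  echo "java found under derived Java home"
fi
```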
We can copy this value into Hadoop's configuration to set the Java home statically, or use the readlink command inside the configuration file so the path is determined dynamically whenever Java is updated.
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

There are two options available:

Option 1: Setting Up a Static Value

/usr/local/hadoop/etc/hadoop/hadoop-env.sh
...
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
...

Option 2: Using Readlink to Set the Value Dynamically

/usr/local/hadoop/etc/hadoop/hadoop-env.sh
...
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
...
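Instead of editing the file in nano, the dynamic option can be applied non-interactively with a single sed command. This is a convenience sketch, not part of the official steps: it rewrites the existing JAVA_HOME line (commented out or not) and keeps a .bak backup of the original file.

```shell
# Rewrite the JAVA_HOME line in hadoop-env.sh to the dynamic readlink
# form, keeping a backup copy of the original file as *.bak.
HADOOP_ENV=/usr/local/hadoop/etc/hadoop/hadoop-env.sh
sudo sed -i.bak \
  's|^#\{0,1\}export JAVA_HOME=.*|export JAVA_HOME=$(readlink -f /usr/bin/java \| sed "s:bin/java::")|' \
  "$HADOOP_ENV"
# Show the resulting line to confirm the edit.
grep '^export JAVA_HOME' "$HADOOP_ENV"
```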

We have set the Java path, and now we can run Hadoop:

$ /usr/local/hadoop/bin/hadoop

We will get the following output:

Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon

Seeing this help output means we've successfully configured Hadoop to run in stand-alone mode. We will ensure that it is functioning properly by running the example MapReduce program it ships with. To do so, create a directory called input in our home directory and copy Hadoop's configuration files into it to use those files as our data:

$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Next, we can run the MapReduce program hadoop-mapreduce-examples, a Java archive with several options. We will call its grep program, one of the many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory, grep_example. The MapReduce grep program will count the matches of a literal word or regular expression. Finally, we will supply a regular expression to find occurrences of the word principal within or at the end of a declarative sentence.
The expression is case-sensitive, so we would not find the word if it were capitalized at the beginning of a sentence:

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep ~/input ~/grep_example 'principal[.]*'

Once the process completes, we will get the following output:

...
        File System Counters
                FILE: Number of bytes read=1247674
                FILE: Number of bytes written=2324248
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Map output bytes=37
                Map output materialized bytes=47
                Input split bytes=114
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=47
                Reduce input records=2
                Reduce output records=2
                Spilled Records=4
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=61
                Total committed heap usage (bytes)=263520256
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=151
        File Output Format Counters
                Bytes Written=37

If we instead get an error like the following, it means that the output directory already exists:

...
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

The results are stored in the output directory, and we can check them using cat:

$ cat ~/grep_example/*

Output:

6       principal
1       principal
The output shows how many times the grep program matched the word principal in the input files. This example verifies that the installation was done properly and that Hadoop is working well in stand-alone mode, where non-privileged users can run Hadoop for exploring and debugging.
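What 'principal[.]*' matches can be previewed locally with ordinary grep, independent of Hadoop. This is a sketch using made-up sample text; note that the capitalized Principal is skipped, confirming the case-sensitivity noted above.

```shell
# Preview what the regular expression 'principal[.]*' matches, using
# plain grep -o (print each match on its own line) over sample text.
printf 'The principal point.\nA principal.\nPrincipal rules\n' \
  | grep -o 'principal[.]*' \
  | sort | uniq -c
```

Here grep reports one match of principal on its own and one of principal followed by a period, while the capitalized line produces no match at all.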

Conclusion

In this post, we have learned how to install Hadoop in stand-alone mode and demonstrated a single-node Hadoop setup on Ubuntu. We have also tested that the configuration works by running an example MapReduce job.

