Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features comparable to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets.
Hadoop is an open-source project of the Apache Software Foundation. It is a Java-based framework that manages large data sets across a group of clustered machines. Configuring a full cluster with Hadoop is involved, but we can also install Hadoop on a single machine to perform some basic operations.
Hadoop may appear to be a single piece of software, but it is made up of several components:
Hadoop Common: a collection of utilities and libraries that support the other Hadoop modules.
HDFS: the Hadoop Distributed File System, responsible for storing data on disk across the cluster.
YARN: Yet Another Resource Negotiator, Hadoop's open-source resource management and job scheduling framework.
MapReduce: a model for processing and generating big data sets in the cluster using parallel, distributed algorithms.
These are the base modules, but many additional components are available for the updated Hadoop 2.x releases.
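The MapReduce model itself can be sketched in miniature with ordinary shell tools: a map step emits key/value pairs, the shuffle groups them by key, and a reduce step aggregates each group. This is only an illustration of the paradigm, not how Hadoop actually executes jobs:

```shell
# A word count in the MapReduce style:
#   map:     split each line into one word per line (emit keys)
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c aggregates a count per key
printf 'big data\nbig cluster\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2, $1}'
# prints:
# big 2
# cluster 1
# data 1
```

Hadoop runs the map and reduce steps as distributed tasks across the cluster, but the data flow is the same.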
Installation Process (Java and Hadoop)
First, we have to update the package index:
$ sudo apt-get update
After that, install OpenJDK, the default Java Development Kit on Ubuntu 16.04:
$ sudo apt-get install default-jdk
Once the installation is done, let's confirm the version:
$ java -version
We will get output like the following:
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
We have installed Java; next we have to install Hadoop. With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release and follow the link to the binary of the current release. Here, we are going to install Hadoop 2.7.3 on Ubuntu.
Figure 1: Hadoop Version Release
Figure 2: Select version
On the server, we will use wget to fetch it:
$ wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Note: we will be redirected to an available mirror, so the actual URL may not match the one given above.
We have to check whether the file was altered or corrupted during download. For that, we will perform an SHA-256 check. Go back to the releases page, follow the Apache link, and browse the Apache web directory to find the .mds file for the version we downloaded.
Copy the link to that file and fetch it with wget, as shown below.
$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.mds
After that, do the verification using the below command.
$ shasum -a 256 hadoop-2.7.3.tar.gz
This prints the SHA-256 digest of the downloaded file. Now, check the published SHA-256 value:
$ cat hadoop-2.7.3.tar.gz.mds
Both outputs should match:
hadoop-2.7.3.tar.gz: SHA256 = D489DF38 08244B90 6EB38F4D 081BA49E 50C4603D B03EFD5E 594A1E98 B09259C2
We can safely ignore the differences in case and spacing. The output of the command we ran against the file downloaded from the mirror should match the value in the file we downloaded from apache.org.
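If we want the comparison itself to be mechanical rather than visual, the case and spacing differences can be stripped before comparing. A small sketch using the digest shown above (the normalize helper is our own, not part of any Hadoop tooling):

```shell
# Strip all whitespace and lowercase a digest so that the .mds value
# and the shasum output can be compared as plain strings.
normalize() {
  printf '%s' "$1" | tr -d ' \t\n' | tr 'A-Z' 'a-z'
}

published="D489DF38 08244B90 6EB38F4D 081BA49E 50C4603D B03EFD5E 594A1E98 B09259C2"
computed="d489df3808244b906eb38f4d081ba49e50c4603db03efd5e594a1e98b09259c2"

if [ "$(normalize "$published")" = "$(normalize "$computed")" ]; then
  echo "checksum OK"
else
  echo "checksum MISMATCH" >&2
fi
# prints: checksum OK
```

In practice, $computed would come from `shasum -a 256 hadoop-2.7.3.tar.gz` and $published from the .mds file.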
Now that we've verified that the file wasn't corrupted or changed, we'll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we're extracting from a file. Use tab-completion or substitute the exact version number in the command below:
$ tar -xzvf hadoop-2.7.3.tar.gz
Finally, we will move the extracted files into /usr/local, the suitable place for locally installed software. Change the version number, if needed, to match the version we downloaded.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop
With the software in place, we’re ready to configure its environment.
Configuring Hadoop’s Java Home
We have to configure Hadoop's Java path, either in Hadoop's configuration file or through an environment variable. On Ubuntu, /usr/bin/java is a symlink to /etc/alternatives/java, which in turn points to the actual Java binary.
We will use readlink with the -f flag to follow every symlink in every part of the path recursively, and sed to trim bin/java from the output, which gives us the correct value for Hadoop's Java home. To get the default Java path, run:
$ readlink -f /usr/bin/java | sed "s:bin/java::"
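To see what the sed expression does in isolation, we can run it on a sample path (the path below is the typical OpenJDK 8 location on Ubuntu 16.04, used here purely for illustration):

```shell
# sed substitutes the literal text "bin/java" with nothing, leaving
# only the directory portion of the resolved path. The ':' characters
# are just an alternate delimiter for the s command, avoiding the need
# to escape the slashes in the path.
echo "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" | sed "s:bin/java::"
# prints: /usr/lib/jvm/java-8-openjdk-amd64/jre/
```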
We can set Hadoop's Java home to the static path this prints, or embed the readlink command itself so that the path is determined dynamically whenever the default Java is updated.
First, open the hadoop-env.sh:
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
There are two options available. Here they are:
Option 1: Set a static value, pasting in the path printed by the readlink command above (on Ubuntu 16.04 with OpenJDK 8 this is typically /usr/lib/jvm/java-8-openjdk-amd64/jre/):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
Option 2: Use readlink directly, so the value updates automatically whenever the default Java changes:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
We have set the Java path, and now we can run Hadoop:
$ /usr/local/hadoop/bin/hadoop
We will get the following usage output:
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                       YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl>  copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>  create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
This help output means we've successfully configured Hadoop to run in stand-alone mode. We will ensure that it is functioning properly by running the example MapReduce program it ships with. To do so, create a directory called input in our home directory and copy Hadoop's configuration files into it to use those files as our data.
$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
Next, we can use the following command to run the MapReduce hadoop-mapreduce-examples program, a Java archive with several options. We will call its grep program, one of the many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory, grep_example. The MapReduce grep program will count the matches of a literal word or regular expression. Finally, we will supply a regular expression to find occurrences of the word principal within or at the end of a declarative sentence. The expression is case-sensitive, so we would not find the word if it were capitalized at the beginning of a sentence:
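Before running the job, we can preview what the expression matches using ordinary grep on a few sample lines (the sample text here is invented for illustration):

```shell
# 'principal[.]*' matches the lowercase word "principal" followed by
# zero or more literal periods; a capitalized "Principal" does not match.
printf 'the principal value\nPrincipal parts differ\nits principal.\n' \
  | grep -o 'principal[.]*'
# prints:
# principal
# principal.
```

The MapReduce grep example applies the same kind of pattern matching, then counts how often each distinct match occurred.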
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep ~/input ~/grep_example 'principal[.]*'
Once the process completes, we will get output like the following.
File System Counters
    FILE: Number of bytes read=1247674
    FILE: Number of bytes written=2324248
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=2
    Map output records=2
    Map output bytes=37
    Map output materialized bytes=47
    Input split bytes=114
    Combine input records=0
    Combine output records=0
    Reduce input groups=2
    Reduce shuffle bytes=47
    Reduce input records=2
    Reduce output records=2
    Shuffled Maps =1
    Merged Map outputs=1
    GC time elapsed (ms)=61
    Total committed heap usage (bytes)=263520256
File Input Format Counters
File Output Format Counters
If the job instead fails with an error, the most likely cause is that the output directory already exists; remove ~/grep_example and run the command again.
The results will be stored in the output directory and we can check it using cat.
$ cat ~/grep_example/*
The output indicates that the word principal was found six times. The example verifies that the installation was done properly and that Hadoop is working well in stand-alone mode. A non-privileged user can run Hadoop this way for exploring and debugging.
In this post, we have learned how to install Hadoop in stand-alone mode and demonstrated a Hadoop single-node setup on Ubuntu. We have also tested that the configuration is fully working.