Hadoop installation on CentOS 8 Tutorial

In this tutorial we’ll install the Big Data framework Apache Hadoop on an existing CentOS 8 virtual machine, using Docker containers to build the cluster.

This is for testing purposes only, not for production. Be careful and don’t expose it to the Internet, since we won’t be setting up any security measures.

As a prerequisite, you’ll need a CentOS 8 virtual machine (it doesn’t matter whether you use KVM, LXC, VirtualBox or any other virtualization technology) and basic knowledge of the command line in GNU/Linux environments.

Hadoop installation on CentOS 8 Virtual Machine


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Extracted from https://hadoop.apache.org/

At the command line, we start by removing any old Docker installations from our CentOS 8 system:

sudo yum remove docker
sudo yum remove docker-client
sudo yum remove docker-client-latest
sudo yum remove docker-common
sudo yum remove docker-latest
sudo yum remove docker-latest-logrotate
sudo yum remove docker-logrotate
sudo yum remove docker-engine

Now we install yum-utils, configure Docker’s repository, and install Docker. On CentOS 8 the --allowerasing flag lets yum replace the conflicting podman and buildah packages that ship with the distribution:

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io --allowerasing
docker --version

We use systemctl to start Docker now and enable it at boot:

sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl status docker
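If a script needs to continue only once the daemon is really up, a small polling helper can wrap the status check. This is a minimal sketch; the `systemctl is-active --quiet docker` call in the comment is the assumed way to test the daemon:

```shell
# wait_cmd: retry a command until it succeeds, up to N attempts, 1 second apart.
wait_cmd() {
  tries=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}

# Assumed usage: block until the Docker daemon reports active.
# wait_cmd 10 systemctl is-active --quiet docker
```

The helper is generic, so it can also wait on the Hadoop web UIs later in this tutorial.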

Run our first container:

sudo docker run hello-world

Check containers:

sudo docker ps -a

Just for testing, start an nginx web server container on port 80 of the host:

docker run -d -p 80:80 --name myserver nginx

Check it by opening a browser and going to the virtual machine’s IP address:

http://virtual_machine_IP/

Install Git and curl, and after that, docker-compose:

sudo dnf install git curl -y
sudo curl -L https://github.com/docker/compose/releases/download/1.25.4/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version
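Pinning the docker-compose version in a variable makes later upgrades a one-line change; this sketch just builds the release URL used by the curl command above, following GitHub’s release naming scheme:

```shell
# Build the docker-compose download URL from a pinned version number.
COMPOSE_VERSION=1.25.4
COMPOSE_URL="https://github.com/docker/compose/releases/download/${COMPOSE_VERSION}/docker-compose-$(uname -s)-$(uname -m)"
echo "$COMPOSE_URL"

# Then download it as before:
# sudo curl -L "$COMPOSE_URL" -o /usr/local/bin/docker-compose
```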

Clone the big-data-europe repository and create the Hadoop cluster:

git clone https://github.com/big-data-europe/docker-hadoop.git

cd docker-hadoop
docker-compose up -d
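After `docker-compose up -d`, it is worth checking that the expected services actually came up. A small sketch; the service names below are the core ones from the big-data-europe compose file, and the commented command is the assumed way to obtain the running list:

```shell
# check_services: fail if any expected service name is missing from a list.
check_services() {
  list=$1; shift
  for s in "$@"; do
    case "$list" in
      *"$s"*) ;;
      *) echo "missing: $s"; return 1 ;;
    esac
  done
  echo "all services present"
}

# Assumed usage against the running cluster:
# check_services "$(sudo docker-compose ps --services)" namenode datanode resourcemanager
```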

****************************************************************************
If we want to use the container image from DockerHub instead:

docker pull bde2020/hadoop-datanode:latest

Or:

docker pull bde2020/hadoop-datanode:2.0.0-hadoop3.1.3-java8

Tags available in the bde2020/hadoop-datanode repository on Docker Hub:
https://hub.docker.com/r/bde2020/hadoop-datanode/tags
****************************************************************************

Access the Hadoop NameNode UI from a browser:

http://virtual_machine_IP:9870

Access the DataNode UI:

http://virtual_machine_IP:9864

Access the YARN Resource Manager:

http://virtual_machine_IP:8088
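The three web UIs above differ only in the port, so a tiny helper can print them all for a given host (same ports as listed above):

```shell
# Print the Hadoop web UI URLs for a host:
# 9870 = NameNode, 9864 = DataNode, 8088 = YARN Resource Manager.
ui_urls() {
  host=$1
  for port in 9870 9864 8088; do
    echo "http://$host:$port"
  done
}

ui_urls virtual_machine_IP
```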

Doing some work on HDFS

Open a shell in the running namenode container:

sudo docker exec -it namenode bash

Create a directory and some files:

mkdir input
echo "Hello world" > input/file1.txt
echo "hello Docker" > input/file2.txt

Create the input directory on HDFS:

hadoop fs -mkdir -p input

Put the input files onto HDFS, where they will be distributed across the datanodes:

hdfs dfs -put ./input/* input

exit

Using the Word Count example:

Download the Word Count example jar from maven.org:

wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/2.7.1/hadoop-mapreduce-examples-2.7.1-sources.jar

Get container ID for your namenode:

sudo docker container ls

Copy the jar file into the namenode container:

sudo docker cp hadoop-mapreduce-examples-2.7.1-sources.jar namenode:hadoop-mapreduce-examples-2.7.1-sources.jar

Run the job inside the container:

sudo docker exec -it namenode bash
hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar org.apache.hadoop.examples.WordCount input output

Print the Word Count output:

hdfs dfs -cat output/part-r-00000
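As a sanity check, the same counts can be computed locally from copies of the input files with standard tools. This is only a sketch (it skips HDFS entirely and pipes the text through sort and uniq), but it should agree with the job output; note that WordCount is case-sensitive, so "Hello" and "hello" are counted separately:

```shell
# local_wordcount: count word occurrences the way WordCount does, but locally.
local_wordcount() {
  cat "$@" | tr -s ' ' '\n' | sort | uniq -c
}

# Recreate the two sample files from earlier and count their words:
mkdir -p input
echo "Hello world" > input/file1.txt
echo "hello Docker" > input/file2.txt
local_wordcount input/file1.txt input/file2.txt
```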

To shut down the cluster, from the docker-hadoop directory:

sudo docker-compose down

Bibliography:

Install Docker Engine on CentOS
https://docs.docker.com/engine/install/centos/

How to set up a Hadoop cluster in Docker
https://clubhouse.io/developer-how-to/how-to-set-up-a-hadoop-cluster-in-docker/

Github big-data-europe / docker-hadoop
https://github.com/big-data-europe/docker-hadoop

Other resources:

Post-installation steps for Linux
https://docs.docker.com/engine/install/linux-postinstall/

Launching Applications Using Docker Containers
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainers.html

Docker For Beginners
https://www.liquidweb.com/kb/docker-for-beginners/

15 Docker Commands You Should Know
https://towardsdatascience.com/15-docker-commands-you-should-know-970ea5203421