In this tutorial we’ll install the Big Data framework Apache Hadoop on an existing CentOS 8 virtual machine, using Docker containers to create the cluster.
This setup is for testing purposes only, not for production. Be careful and don’t expose it to the Internet, since I’m not setting up any security measures for it.
As a prerequisite, you’ll need a CentOS 8 virtual machine (it doesn’t matter whether you use KVM, LXC, VirtualBox or any other virtualization technology) and basic knowledge of the command line in GNU/Linux environments.
Hadoop installation on CentOS 8 Virtual Machine
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Extracted from https://hadoop.apache.org/
At the command line, we start by removing any old Docker installations from our CentOS 8 machine:
sudo yum remove docker \
                docker-client \
                docker-client-latest \
                docker-common \
                docker-latest \
                docker-latest-logrotate \
                docker-logrotate \
                docker-engine
Now we install yum-utils, configure Docker’s repository and then install Docker:
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli --allowerasing
docker --version
We use systemctl to start Docker now and enable it at boot:
sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl status docker
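Optionally, as described in Docker’s post-installation steps for Linux (linked at the end of this post), you can add your user to the docker group so you don’t need sudo for every docker command (log out and back in for the change to take effect):
sudo usermod -aG docker $USER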
Run our first container:
sudo docker run hello-world
Check containers:
sudo docker ps -a
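The -a flag also lists stopped containers, such as the hello-world one we just ran. If you want to clean it up, remove it by ID or name (yours will differ):
sudo docker rm <container_id>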
Just for testing, start an nginx web server container on the host:
sudo docker run -d -p 80:80 --name myserver nginx
Check it by opening a browser and going to the CentOS 8 machine’s IP:
http://virtual_machine_IP/
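If your virtual machine has no graphical browser, you can check it from the command line instead (assuming curl is available; we install it below anyway):
curl -I http://virtual_machine_IP/
Once verified, stop and remove the test container:
sudo docker stop myserver
sudo docker rm myserver
Tip: if you ever want a published port to be reachable only from the machine itself, bind it to the loopback interface, e.g. -p 127.0.0.1:80:80.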
Install Git and curl, and after that, docker-compose:
sudo dnf install git curl -y
sudo curl -L https://github.com/docker/compose/releases/download/1.25.4/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version
Clone the big-data-europe repository and create the Hadoop cluster:
git clone https://github.com/big-data-europe/docker-hadoop.git
cd docker-hadoop
docker-compose up -d
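Once the images are downloaded, verify that all the cluster containers (at the time of writing, the compose file defines namenode, datanode, resourcemanager, nodemanager and historyserver) are up:
sudo docker-compose ps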
****************************************************************************
If we want to use the container image from DockerHub instead:
docker pull bde2020/hadoop-datanode:latest
Or:
docker pull bde2020/hadoop-datanode:2.0.0-hadoop3.1.3-java8
The bde2020/hadoop-datanode repository on DockerHub:
https://hub.docker.com/r/bde2020/hadoop-datanode/tags
****************************************************************************
Access the Hadoop NameNode UI from a browser:
http://virtual_machine_IP:9870
Access the DataNode UI:
http://virtual_machine_IP:9864
Access the YARN Resource Manager UI:
http://virtual_machine_IP:8088
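You can also check the cluster state from the command line with an HDFS report, executed inside the namenode container:
sudo docker exec namenode hdfs dfsadmin -report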
Doing some stuff on HDFS
Open a shell in the running namenode container:
sudo docker exec -it namenode bash
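Once inside, you can check the Hadoop version shipped with the image:
hadoop version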
Create a directory and some files:
mkdir input
echo "Hello world" > input/file1.txt
echo "hello Docker" > input/file2.txt
Create the input directory on HDFS:
hadoop fs -mkdir -p input
Put the input files onto HDFS:
hdfs dfs -put ./input/* input
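To verify the upload, list the HDFS directory and print one of the files:
hdfs dfs -ls input
hdfs dfs -cat input/file1.txt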
exit
Using the Word Count example:
Get the Word Count example jar from maven.org (we need the binary jar; the -sources jar only contains the .java source files and can’t be executed with hadoop jar):
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/2.7.1/hadoop-mapreduce-examples-2.7.1.jar
Get the container ID for your namenode:
sudo docker container ls
Copy the jar file into the namenode container:
sudo docker cp hadoop-mapreduce-examples-2.7.1.jar namenode:hadoop-mapreduce-examples-2.7.1.jar
Run the example inside the container:
sudo docker exec -it namenode bash
hadoop jar hadoop-mapreduce-examples-2.7.1.jar org.apache.hadoop.examples.WordCount input output
Print the Word Count output:
hdfs dfs -cat output/part-r-00000
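Besides the part-r-00000 file, the output directory contains a _SUCCESS marker written by a completed job; list it, and remove the directory if you want to re-run the example:
hdfs dfs -ls output
hdfs dfs -rm -r output
Then leave the container:
exit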
To shut down the cluster:
sudo docker-compose down
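Note that docker-compose down stops and removes the containers but keeps the named volumes holding the HDFS data; if you also want to wipe that data, add the -v flag:
sudo docker-compose down -v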
Bibliography:
Install Docker Engine on CentOS
https://docs.docker.com/engine/install/centos/
How to set up a Hadoop cluster in Docker
https://clubhouse.io/developer-how-to/how-to-set-up-a-hadoop-cluster-in-docker/
GitHub: big-data-europe / docker-hadoop
https://github.com/big-data-europe/docker-hadoop
Other resources:
Post-installation steps for Linux
https://docs.docker.com/engine/install/linux-postinstall/
Launching Applications Using Docker Containers
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
Docker For Beginners
https://www.liquidweb.com/kb/docker-for-beginners/
15 Docker Commands You Should Know
https://towardsdatascience.com/15-docker-commands-you-should-know-970ea5203421
Looking for an IT trainer (webinars, workshops, bootcamps, etc.)? Contact me.
If you like this post, you can help me with a donation. Many thanks!