Hadoop Installation Modes

The Apache Hadoop framework is creating a big buzz in the IT world. It offers a solution to the challenges that big data poses to the digital world. The framework allows large datasets distributed across clusters of computers to be analyzed using a simple programming model.

To start learning about Hadoop you will want to set up a Hadoop environment. There are basically three modes in which a Hadoop cluster can be installed: stand alone mode, pseudo distributed mode, and distributed mode. This book will guide you through the practical aspects of Hadoop.

Hadoop Stand Alone Mode:-

To understand the basics of Hadoop and to use it as a playground for running some exercises, the stand alone mode of Hadoop is sufficient. In this mode you install the bare minimum components on a single system. The following high level steps are required for a stand alone Hadoop setup:

1. Set up a virtual machine with any Linux environment (CentOS or Ubuntu).
2. Install Java on the virtual machine.
3. Create a dedicated hadoop user.
4. Download the Hadoop installation files and extract them on your virtual machine.
5. Grant permissions to the hadoop user on the folder where Hadoop is extracted.
6. Change the /home/hadoop/hadoop/conf/hadoop-env.sh file to set the HADOOP_HOME and JAVA_HOME variables.

After setting up the installation as advised in the above steps you should have a stand alone Hadoop installation.
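Steps 4 to 6 above can be sketched as shell commands. The paths, Hadoop version, and JDK location below are assumptions for illustration only; adjust them to your own environment. The tarball extraction and ownership change are shown commented out, since they need the downloaded archive and root privileges; the hadoop-env.sh edit uses a demo directory so the sketch is safe to run as-is.

```shell
#!/bin/sh
# Sketch of steps 4-6 (paths and Hadoop version are illustrative assumptions).

# Step 4: extract the downloaded tarball into the hadoop user's home
# (commented out here because it needs the actual archive):
#   tar -xzf hadoop-1.2.1.tar.gz -C /home/hadoop
#   mv /home/hadoop/hadoop-1.2.1 /home/hadoop/hadoop

# Step 5: grant the hadoop user ownership of the extracted folder:
#   chown -R hadoop:hadoop /home/hadoop/hadoop

# Step 6: set JAVA_HOME and HADOOP_HOME in conf/hadoop-env.sh.
# A demo directory stands in for /home/hadoop/hadoop so this runs anywhere.
HADOOP_HOME=/tmp/hadoop-demo
mkdir -p "$HADOOP_HOME/conf"
{
  echo "export JAVA_HOME=/usr/lib/jvm/java-6-openjdk"   # assumed JDK path
  echo "export HADOOP_HOME=$HADOOP_HOME"
} >> "$HADOOP_HOME/conf/hadoop-env.sh"
```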
You can check the installation by executing the hadoop command from the Hadoop home location, where Hadoop is installed. You should see the following output:

Usage: hadoop [--config confdir] COMMAND
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  distcp <srcurl> <desturl>  copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>  create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Hadoop Pseudo Distributed Mode:-

As the name suggests, pseudo distributed mode is not in reality a distributed Hadoop installation, but it simulates one. Distributed mode requires that you set up a Hadoop installation on a multi node cluster (minimum two nodes), but with pseudo distributed mode you can get a feel of a distributed Hadoop environment on a single node. Apart from the six steps listed above, you will additionally need to do the following:

7. Change the following configuration files. For details on the changes in the configuration files, refer to the article http://mainframewizard.com/content/setup-single-node-hadoop-cluster
   core-site.xml (/home/hadoop/hadoop/conf/core-site.xml)
   hdfs-site.xml (/home/hadoop/hadoop/conf/hdfs-site.xml)
   mapred-site.xml (/home/hadoop/hadoop/conf/mapred-site.xml)
   masters (/home/hadoop/hadoop/conf/masters) – not mandatory
   slaves (/home/hadoop/hadoop/conf/slaves) – not mandatory

With these changes you get a fully functional pseudo distributed Hadoop installation.

Hadoop Distributed Mode:-

A Hadoop installation in a production environment is actually a distributed installation with one name node and several data nodes. For setting up a distributed Hadoop cluster, you need at least two machines: one acting as both name node and data node, and the other machine acting as a data node. To start with, you will need to set up two machines in pseudo distributed mode as explained above. Following are the high level steps required for a distributed Hadoop installation.
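For reference, the configuration changes in step 7 of the pseudo distributed setup above are typically minimal. The fragments below are a common Hadoop 1.x example, not the only valid values; in particular, localhost ports 9000 and 9001 are conventional choices, not requirements:

```xml
<!-- core-site.xml: HDFS runs on the local machine -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: the JobTracker also runs locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```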
Generate an SSH key for password-less logon on the master node (the name node machine) using the following command:

$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa

Copy the generated key to the authorized keys on the master node:

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now copy the generated key to the authorized keys on all slave nodes (in this case, the other virtual machine with the pseudo distributed installation). I will be writing a detailed write-up on distributed Hadoop setup. Come back to our site for the article on distributed installation.
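The key setup above can be sketched end to end as follows. To keep the sketch safe to run, the key pair is written to a temporary demo directory rather than ~/.ssh, and the slave hostname hadoop@slave1 is a placeholder; the actual copy to the slave node is shown commented out.

```shell
#!/bin/sh
# Sketch of password-less SSH setup (demo directory and hostname are assumptions).
DEMO=$(mktemp -d)                       # stands in for ~/.ssh in this sketch

# Generate a key pair with an empty passphrase. The text uses -t dsa;
# rsa works the same way and is more widely accepted by modern sshd:
ssh-keygen -q -t rsa -P "" -f "$DEMO/id_rsa"

# Authorize the key for logins to this (the master) machine:
cat "$DEMO/id_rsa.pub" >> "$DEMO/authorized_keys"
chmod 600 "$DEMO/authorized_keys"

# On a real cluster, push the public key to every slave node, e.g.:
#   ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
```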