How to set up R + Hadoop System


This article shows how to set up an integrated R and Hadoop system on Windows.

Big data and Hadoop are widely discussed, so we are introducing a tutorial series on how to set up a Hadoop and R system on your Windows machine. Hadoop helps you manage the data, and R integrated with Hadoop helps you analyze it.

  1. Install Hadoop
  2. Download Hadoop

Download Hadoop from the Hadoop website. The downloaded file will be named hadoop-1.1.2-bin.tar.gz. Once you have downloaded the file, unzip it.
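The unzip step can be sketched in the shell as follows. This is a minimal sketch rather than part of the original instructions: the function name `extract_hadoop` is hypothetical, and it assumes the archive sits in the current directory under the file name given above.

```shell
# Hypothetical helper: unpack a Hadoop tarball into the current directory.
# The default file name is the one mentioned in the text above.
extract_hadoop() {
    archive="${1:-hadoop-1.1.2-bin.tar.gz}"
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # -x extract, -z decompress with gzip, -f read from the named file
    tar -xzf "$archive"
}
```

Calling `extract_hadoop` with no argument unpacks the default archive. Passing a path, for example `extract_hadoop ~/Downloads/hadoop-1.1.2-bin.tar.gz`, unpacks that file instead.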


3. Set JAVA_HOME

In conf/, add the line below:

export JAVA_HOME=/Library/Java/Home
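Before starting Hadoop, it can help to confirm that the JAVA_HOME value really points at a Java installation. The check below is a minimal sketch: the function name `check_java_home` is hypothetical, and it assumes that a valid Java home contains an executable bin/java.

```shell
# Hypothetical helper: succeed only if the directory looks like a Java home,
# i.e. it contains an executable bin/java.
# With no argument it falls back to the JAVA_HOME environment variable.
check_java_home() {
    home="${1:-${JAVA_HOME:-}}"
    if [ -z "$home" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$home/bin/java" ]; then
        echo "no executable bin/java under: $home" >&2
        return 1
    fi
    echo "JAVA_HOME looks valid: $home"
}
```

For the value used above, the call would be `check_java_home /Library/Java/Home`.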

4. How to set up remote login and self-login

Go to System Preferences, then open Sharing (under Internet & Network). In the list of services, check “Remote Login”. For tighter security, which matters in most data analytics projects, select the “Only these users” radio button and choose the hadoop user. Then generate an SSH key and add it to the authorized keys so that the machine can log in to itself:

ssh-keygen -t rsa -P ""

cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
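The key authorization step can be made repeatable with a small helper. This is a sketch under stated assumptions: the function name `authorize_key` is hypothetical, and it assumes the key pair was written to the ssh-keygen default location for RSA keys, id_rsa and id_rsa.pub, which may differ from the file the original command used.

```shell
# Hypothetical helper: append a public key to authorized_keys only once.
# Assumption: the public key is the ssh-keygen default, id_rsa.pub.
authorize_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    pub="$ssh_dir/id_rsa.pub"
    auth="$ssh_dir/authorized_keys"
    if [ ! -f "$pub" ]; then
        echo "public key not found: $pub" >&2
        return 1
    fi
    touch "$auth"
    # -F fixed strings, -x whole-line match, -q quiet, -f read patterns from file:
    # skip the append when the key is already listed
    if ! grep -Fxq -f "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    # sshd may reject an authorized_keys file that group or others can write to
    chmod 600 "$auth"
}
```

After `authorize_key`, running `ssh localhost` should log in without asking for a password if self-login is set up correctly.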

5. How to run Hadoop

Once you have completed the setup steps above, you can run the commands given below. These commands check whether Hadoop is installed properly on your machine or whether there are any installation problems.

## go to the Hadoop directory

cd hadoop-1.1.2

## list the available Hadoop commands

bin/hadoop

## check the version of Hadoop

bin/hadoop version

## start Hadoop

bin/start-all.sh

## make sure that Hadoop is running

jps

## stop Hadoop

bin/stop-all.sh

Once you run jps (a JDK tool that lists running Java processes), you can see the services listed below.

              Hadoop 1.1.2         Hadoop 2.2.0 or above
master node   NameNode             NameNode
              SecondaryNameNode    ResourceManager
              JobTracker           JobHistoryServer
slave node    DataNode             DataNode
              TaskTracker          NodeManager
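Reading the jps listing by eye is easy to get wrong, so the check can be scripted. The sketch below assumes a single machine running all five Hadoop 1.1.2 services from the table and the usual jps output of one "pid name" pair per line. The function name `check_daemons` is hypothetical.

```shell
# Hypothetical helper: read jps output on stdin and report missing Hadoop 1.x services.
# Assumption: jps prints one "<pid> <class name>" pair per line.
check_daemons() {
    jps_out=$(cat)
    missing=""
    for d in NameNode SecondaryNameNode JobTracker DataNode TaskTracker; do
        # Compare the second field exactly, so NameNode does not match SecondaryNameNode
        if ! printf '%s\n' "$jps_out" |
            awk -v d="$d" '$2 == d { found = 1 } END { exit !found }'; then
            missing="$missing $d"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all daemons running"
}
```

Typical usage is `jps | check_daemons`. For Hadoop 2.2.0 or above, the service names in the loop would need to be replaced with those in the right-hand column of the table.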


  1. Install R and RStudio on your machine

In the R-Hadoop system, you manage the data in Hadoop and run the analysis on that data in R or RStudio, so please install R and RStudio on your machine.

2. Download and install GCC

GCC is an important component of R and Hadoop systems. If GCC is not installed in the Hadoop environment, you get the error “Make Command Not Found” when installing and using some R data mining packages.

Run the code given below to test that R and Hadoop are integrated properly and working together as a data management and data mining system.
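To see in advance whether the build tools are available, you can check the PATH before installing any R packages. This is a minimal sketch; the function name `missing_tools` is hypothetical, and checking for `gcc` and `make` is an assumption based on the error message quoted above.

```shell
# Hypothetical helper: print each named tool that is not found on the PATH.
# Returns 0 when every tool is present, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool"
            status=1
        fi
    done
    return $status
}
```

Running `missing_tools gcc make` prints nothing when both tools are installed.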




3. Install the R packages that you will use for analyzing the data stored in Hadoop:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"))

4. Set up Hadoop environment variables
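One way to set the variables is to export them in the shell before starting R, so that R inherits them. The sketch below is an assumption rather than the original listing: it sets HADOOP_CMD and HADOOP_STREAMING, the two variables the rmr2 package reads, and it assumes the Hadoop 1.1.2 layout with the streaming jar under contrib/streaming/. The function name `set_hadoop_env` is hypothetical, and the default prefix is taken from the path used later in this article.

```shell
# Hypothetical helper: export the variables that rmr2 looks up.
# Assumptions: Hadoop 1.1.2 directory layout; streaming jar under contrib/streaming/.
set_hadoop_env() {
    prefix="${1:-/Users/hadoop/hadoop-1.1.2}"
    export HADOOP_CMD="$prefix/bin/hadoop"
    export HADOOP_STREAMING="$prefix/contrib/streaming/hadoop-streaming-1.1.2.jar"
}
```

Call `set_hadoop_env` in the same shell session that launches R. If your installation lives elsewhere, pass its directory as the first argument and verify that the jar path exists.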



5. Run R on Hadoop

Below is an example that counts words in text files stored in HDFS. The R code is from Jeffrey Breen’s presentation on using R with Hadoop.

First, we copy some text files to the HDFS folder wordcount/data.

## copy local text file to hdfs

bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/

After that, we can use the R code below to run a Hadoop job for word counting.





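## load rmr2, which provides mapreduce(), keyval() and from.dfs()

library(rmr2)

## map function: split each line on whitespace and emit (word, 1)

map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

## reduce function: sum the counts for each word

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

## run the job on text input

wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}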

## delete previous result if any

system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")

## Submit job

hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS

results <- from.dfs(out)

## check top 30 frequent words

results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df[order(results.df$count, decreasing=T), ], 30)
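To sanity-check the Hadoop result, you can count the same words locally on the original text files and compare the top entries. This is a minimal sketch, not part of the original workflow, and the function name `local_wordcount` is hypothetical. It squeezes runs of whitespace, whereas the R map function splits on every single whitespace character, so the Hadoop output may contain an empty-string entry that does not appear here.

```shell
# Hypothetical helper: count words in local text files and print the N most frequent.
# Usage: local_wordcount N file...
# Pipeline: put one word per line (squeezing runs of whitespace), drop empty lines,
# count identical words, sort by count (highest first, ties by word),
# keep the first N rows and print them as "word count".
local_wordcount() {
    n="$1"
    shift
    cat "$@" |
        tr -s '[:space:]' '\n' |
        grep -v '^$' |
        sort |
        uniq -c |
        sort -k1,1nr -k2 |
        head -n "$n" |
        awk '{ print $2, $1 }'
}
```

With the files used above, the call would be `local_wordcount 30 /Users/hadoop/try-hadoop/wordcount/data/*.txt`.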
