Instant MapReduce Patterns - Hadoop Essentials How-to
By Liyanapathirannahelage H Perera (Srinath Perera)
Published by Packt in May 2013, 1st Edition. ISBN-13: 9781782167709

Writing a word count application using Java (Simple)


This recipe demonstrates how to write an analytics task using basic Java constructs, without a MapReduce framework. It also discusses the challenges of running applications across many machines, motivating the need for MapReduce-like frameworks such as Hadoop.

Specifically, it describes how to count the number of occurrences of words in a file.

Getting ready

This recipe assumes you have a computer with Java installed and the JAVA_HOME environment variable pointing to your Java installation. Download the sample code for the book and unzip it to a directory. We will refer to the unzipped directory as SAMPLE_DIR.
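
For example, you can verify the Java setup as follows (the path shown is illustrative):

    $ echo $JAVA_HOME
    /usr/lib/jvm/java-6-openjdk
    $ $JAVA_HOME/bin/java -version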

How to do it...

  1. Copy hadoop-microbook.jar from SAMPLE_DIR to HADOOP_HOME.

  2. Run the word count program by executing the following command from HADOOP_HOME:

    $ java -cp hadoop-microbook.jar microbook.wordcount.JavaWordCount SAMPLE_DIR/amazon-meta.txt results.txt
    
  3. The program will run and write the word count of the input file to a file called results.txt. You will see results of the following form:

    B00007ELF7=1
    Vincent[412370]=2
    35681=1
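
If you want to see the most frequent words, you can sort the results file with standard command-line tools, for example (an illustration, not part of the recipe):

    $ sort -t= -k2 -rn results.txt | head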
    

How it works...

You can find the source code for the recipe at src/microbook/JavaWordCount.java. The code will read the file line by line, tokenize each line, and count the number of occurrences of each word.

// Requires imports from java.io.* and java.util.* (BufferedReader,
// FileReader, BufferedWriter, FileWriter, Writer, Map, HashMap,
// StringTokenizer, Map.Entry).
Map<String, Integer> tokenMap = new HashMap<String, Integer>();

// Read the input file line by line and count each token
BufferedReader br = new BufferedReader(
    new FileReader(args[0]));
String line = br.readLine();
while (line != null) {
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        if (tokenMap.containsKey(token)) {
            Integer value = tokenMap.get(token);
            tokenMap.put(token, value + 1);
        } else {
            tokenMap.put(token, Integer.valueOf(1));
        }
    }
    line = br.readLine();
}
br.close();

// Write each word and its count to results.txt, one entry per line
Writer writer = new BufferedWriter(
    new FileWriter("results.txt"));
for (Entry<String, Integer> entry : tokenMap.entrySet()) {
    writer.write(entry.getKey() + "=" + entry.getValue() + "\n");
}
writer.close();

This program can only use one computer for processing. For a dataset of moderate size this is acceptable, but for a large dataset it will take too much time. Moreover, this solution keeps all the data in memory, so with a large dataset the program is likely to run out of memory. To avoid that, the program would have to move some of the data to disk as free memory runs low, which would slow it down even further.
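
As an illustration of that last point, the following is a minimal sketch (not the book's code; the threshold and spill file names are assumptions) of how a single-machine word count might spill partial counts to disk once the in-memory map grows too large:

import java.io.*;
import java.util.*;

// Sketch: word count that spills partial counts to disk when memory fills up.
public class SpillingWordCount {
    // Assumed threshold; in practice this would track actual free memory
    private static final int MAX_IN_MEMORY_WORDS = 1000000;
    private static int spillCount = 0;

    public static void main(String[] args) throws IOException {
        // TreeMap keeps entries sorted, so each spill file is a sorted run
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                Integer value = counts.get(token);
                counts.put(token, value == null ? 1 : value + 1);
            }
            // When the map grows too large, write it out and start over
            if (counts.size() > MAX_IN_MEMORY_WORDS) {
                spill(counts);
            }
        }
        br.close();
        spill(counts); // flush the remainder
        // A final pass (omitted) would merge the sorted spill-*.txt files,
        // summing the counts of words that appear in more than one file.
    }

    // Write the current partial counts to a new spill file and clear the map
    private static void spill(Map<String, Integer> counts) throws IOException {
        Writer w = new BufferedWriter(new FileWriter("spill-" + spillCount++ + ".txt"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + "=" + e.getValue() + "\n");
        }
        w.close();
        counts.clear();
    }
}

Even with spilling, the work remains bound to one machine's CPU and disk, which is what motivates the distributed approach discussed next.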

We solve problems involving large datasets by using many computers to process the dataset in parallel. However, writing a program that processes a dataset in a distributed setup is a heavy undertaking. Such a program faces the following challenges:

  • The distributed program has to find available machines and allocate work to those machines.

  • The program has to transfer data between machines using message passing or a shared filesystem. Such a framework needs to be integrated, configured, and maintained.

  • The program has to detect any failures and take corrective action.

  • The program has to make sure all nodes are given roughly the same amount of work, so that resources are used optimally.

  • The program has to detect the end of the execution, collect all the results, and transfer them to the final location.

Although it is possible to write such a program, it is a waste of effort to write it again and again. MapReduce-based frameworks like Hadoop let users write only the processing logic, while the framework takes care of the complexities of distributed execution.
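
For comparison, here is a minimal sketch of the same word count logic written against Hadoop's MapReduce API (the class names are illustrative, not the book's code). Only the map and reduce functions are supplied; the framework handles work allocation, data transfer, failure recovery, and load balancing, and a small driver class (omitted) would configure and submit the job:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: word count expressed as Hadoop map and reduce functions.
public class WordCountSketch {

    // Map phase: emit (word, 1) for every token in a line of input
    public static class TokenMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: Hadoop groups the emitted pairs by word,
    // so we only need to sum the 1s for each word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}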

About the author

Srinath Perera (Liyanapathirannahelage H Perera) is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. He is also a committer on the Apache open source projects Axis, Axis2, and Geronimo. He received his Ph.D. and M.Sc. in Computer Science from Indiana University, Bloomington, USA, and his Bachelor of Science in Computer Science and Engineering from the University of Moratuwa, Sri Lanka. He has authored many technical and peer-reviewed research articles; more details can be found on his website. He is a frequent speaker at technical venues, has worked with large-scale distributed systems for a long time, and works with Big Data technologies such as Hadoop and Cassandra daily. He also teaches a graduate class on parallel programming at the University of Moratuwa, which is primarily based on Hadoop.