Web Crawling and Data Mining with Apache Nutch
In this article by Abdulbasit Shaikh and Zakir Laliwala, the authors of Web Crawling and Data Mining with Apache Nutch, we will cover:
- Introduction of Apache Nutch
- Apache Solr Installation
- Apache Hadoop
- Use of Apache Gora
- Integration of Apache Nutch with Apache Accumulo
- Integration of Apache Nutch with MySQL
Introduction of Apache Nutch
Apache Nutch is a robust and scalable tool for web crawling, and it can be integrated with a scripting language such as Python. It is a good fit whenever your application has a large amount of data that you want to crawl.
Apache Nutch is open source web crawler software used for crawling websites. Once you understand it well, you can build your own search engine along the lines of Google. A search engine of your own lets you improve how your application's pages rank in search results and customize searching to your needs. Nutch is extensible and scalable. It supports parsing, indexing, customized search, and custom scoring through ScoringFilter. ScoringFilter is a Java interface that you implement when creating an Apache Nutch plugin. It is used for manipulating scoring variables.
Apache Nutch is written in Java. It can run on a single machine as well as in a distributed environment such as Apache Hadoop. With Apache Nutch we can find broken links, keep a copy of every visited page to search over (for example, by building indexes), and discover web page hyperlinks in an automated manner.
Apache Nutch integrates easily with Apache Solr: every web page crawled by Apache Nutch can be indexed into Apache Solr and then searched there. Apache Solr is a search platform built on top of Apache Lucene. It can search any type of data, for example web pages.
Crawling your first website
Crawling is driven by the Apache Nutch crawling tool and a set of related tools that build and maintain several data structures: the web database, the index, and a set of segments. Once Apache Nutch has indexed the web pages into Apache Solr, you can search for the required web page(s) in Apache Solr.
Apache Solr Installation
Apache Solr is a search platform built on top of Apache Lucene that can search any type of data, for example web pages. It is a powerful search engine that provides full-text search, dynamic clustering, database integration, rich document handling, and more. Here, Apache Solr indexes the URLs crawled by Apache Nutch so that the crawled content can then be searched in Apache Solr.
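As a minimal sketch of what searching the indexed pages might look like, the helper below builds a query URL for Solr's standard select handler. The function name solr_select_url, the base URL http://localhost:8983/solr, and the content field are assumptions for illustration; adjust them to your own Solr installation and schema. The query is assumed to be URL-safe already.

```shell
# Hypothetical helper: build a Solr select URL that returns JSON.
# $1 = Solr base URL (assumed default: local Solr on port 8983)
# $2 = query string, assumed to be URL-safe already (default: match all documents)
solr_select_url() {
    solr_base=${1:-http://localhost:8983/solr}
    query=${2:-*:*}
    echo "$solr_base/select?q=$query&wt=json"
}

# Usage, assuming Solr is running and curl is installed:
#   curl -s "$(solr_select_url http://localhost:8983/solr content:nutch)"
```

Keeping the URL construction separate from the HTTP call makes it easy to check the query before sending it to a live server.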
Crawling your website using the crawl script
Apache Nutch 2.2.1 ships with a crawl script that performs a complete crawl by executing one single script. In earlier versions, we had to perform each step manually, such as generating, fetching, and parsing data.
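The following is a small sketch of how the crawl script might be invoked. It assumes the Nutch 2.2.1 usage bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds> and that it is run from the Nutch runtime directory. The wrapper name run_crawl, the crawl ID firstcrawl, and the Solr URL are hypothetical values chosen for illustration.

```shell
# Hypothetical wrapper around the Nutch 2.2.1 crawl script.
# Assumed usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
# Set DRY_RUN=1 to print the command instead of executing it.
run_crawl() {
    seed_dir=${1:-urls}                          # directory holding the seed URL files
    crawl_id=${2:-firstcrawl}                    # hypothetical crawl identifier
    solr_url=${3:-http://localhost:8983/solr/}   # assumed local Solr endpoint
    rounds=${4:-2}                               # number of crawl rounds to run
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bin/crawl $seed_dir $crawl_id $solr_url $rounds"
    else
        bin/crawl "$seed_dir" "$crawl_id" "$solr_url" "$rounds"
    fi
}

# Usage:
#   DRY_RUN=1 run_crawl urls firstcrawl http://localhost:8983/solr/ 2
```

The dry-run switch is only a convenience for checking the arguments before starting a long crawl.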
Crawling the web, the CrawlDb, and URL filters
When a user invokes the crawl command in Apache Nutch 1.x, Apache Nutch generates the crawldb, which is simply a directory containing details about the crawl. In Apache Nutch 2.x there is no crawldb. Instead, Apache Nutch keeps all the crawling data directly in the database.
The injector adds the necessary URLs to the crawldb, the directory that Apache Nutch creates for storing crawl data. You need to provide URLs to InjectorJob, either by downloading a list of URLs from the internet or by writing your own file of URLs. Let's say you have created a directory called urls that contains all the URLs to be injected into the crawldb. The following command performs the InjectorJob:
#bin/nutch inject crawl/crawldb urls
Here urls is the directory containing the URLs to be injected, and crawl/crawldb is the directory in which the injected URLs are placed. After this job completes, your database, that is, the crawldb, holds a number of unfetched URLs.
Once the InjectorJob is done, it is time to fetch the injected URLs from the crawldb. Before fetching, you need to run GeneratorJob. The following command runs the GeneratorJob:
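The seed directory described above can be prepared with a few lines of shell. This is only a sketch: the function name make_seed_dir and the file name seed.txt are hypothetical, and the example URL is a placeholder for your own start pages.

```shell
# Hypothetical helper: create a seed directory with one URL per line.
# $1 = seed directory, remaining arguments = seed URLs
make_seed_dir() {
    seed_dir=$1
    shift
    mkdir -p "$seed_dir" || return 1
    # Write each remaining argument on its own line of seed.txt.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Usage (the inject line is the command shown in this section):
#   make_seed_dir urls http://nutch.apache.org/
#   bin/nutch inject crawl/crawldb urls
```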
#bin/nutch generate crawl/crawldb crawl/segments
crawl/crawldb is the directory from which the URLs are generated. crawl/segments is the directory that GeneratorJob uses for the information required for crawling. In Nutch 1.x it creates a new segment there that holds the list of URLs to fetch.
FetcherJob fetches the URLs generated by GeneratorJob, using the output of GeneratorJob as its input. The following command runs the FetcherJob:
#bin/nutch fetch -all
The -all parameter tells this job to fetch all the URLs generated by GeneratorJob. You can use different input parameters according to your needs.
After FetcherJob, ParserJob parses the URLs fetched by FetcherJob. The following command runs the ParserJob:
# bin/nutch parse -all
Again, -all parses all the URLs fetched by FetcherJob. You can use different input parameters according to your needs.
Once ParserJob has completed, we need to update the database with the results of FetcherJob. This updates the respective databases with the most recently fetched URLs. The following command runs the DbUpdaterJob:
# bin/nutch updatedb crawl/crawldb -all
After this job, the database contains updated entries for all the initial pages as well as new entries for the newly discovered pages that are linked from the initial set.
Before indexing, we first need to invert all the links so that incoming anchor text can be indexed with the pages. The following command runs Invertlinks:
# bin/nutch invertlinks crawl/linkdb -dir crawl/segments
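The individual jobs above can be chained into one crawl round. The sketch below assumes the Nutch 2.x argument forms, in which the data lives in the configured datastore and the commands take no crawldb or segments paths; these differ from the 1.x-style path arguments shown in some of the commands above. The function names nutch_step and crawl_round and the default -topN value of 50 are hypothetical. The invertlinks step is left out because it operates on the 1.x-style linkdb and segments directories.

```shell
# Run one bin/nutch subcommand, or just print it when DRY_RUN=1.
nutch_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bin/nutch $*"
    else
        bin/nutch "$@"
    fi
}

# One crawl round using assumed Nutch 2.x argument forms.
# $1 = seed directory, $2 = maximum number of URLs to generate
crawl_round() {
    seed_dir=${1:-urls}
    top_n=${2:-50}
    nutch_step inject "$seed_dir" &&        # InjectorJob: add seed URLs
    nutch_step generate -topN "$top_n" &&   # GeneratorJob: select URLs to fetch
    nutch_step fetch -all &&                # FetcherJob: fetch the generated URLs
    nutch_step parse -all &&                # ParserJob: parse the fetched pages
    nutch_step updatedb                     # DbUpdaterJob: record the results
}

# Usage:
#   DRY_RUN=1 crawl_round urls 50
```

Chaining with && stops the round at the first failing job, so a failed fetch is never followed by a parse of stale data.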
Apache Hadoop
Apache Hadoop is designed to run your application on a cluster of many computers, in which one computer is the master and the rest are slaves. In that sense it acts as a huge data warehouse. The master directs the slave computers, and the slaves do the actual data processing. Because the work is divided among the slaves, Apache Hadoop can process huge amounts of data with high throughput. As the data grows, you increase the number of slave computers.
Integration of Apache Nutch with Apache Hadoop
Apache Nutch integrates easily with Apache Hadoop, which makes crawling much faster than running Apache Nutch on a single machine. After the integration, crawling runs on the Apache Hadoop cluster, so throughput is much higher.
Apache Hadoop Setup with Cluster
This setup does not require purchasing expensive hardware to run Apache Nutch and Apache Hadoop. It is designed to make maximum use of the hardware available.
Formatting the HDFS filesystem using the NameNode
HDFS stands for Hadoop Distributed File System, the file system that Apache Hadoop uses for storage. It stores all the data related to Apache Hadoop. It has two components, the NameNode and the DataNodes: the NameNode manages the file system metadata, and the DataNodes actually store the data. HDFS is highly configurable, and its default configuration suits many installations. The configuration needs tuning only for very large clusters.
The first step in getting Apache Hadoop started is formatting the Hadoop file system, which is implemented on top of the local file systems of your cluster (which includes only your local machine in a single-node setup).
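A minimal sketch of the formatting step follows, assuming a Hadoop 1.x style installation in which the NameNode is formatted with hadoop namenode -format. The function name format_hdfs and the default install path /usr/local/hadoop are assumptions. Note that formatting erases any existing HDFS metadata, so it belongs only at initial setup.

```shell
# Hypothetical helper: format HDFS through the NameNode (Hadoop 1.x style).
# HADOOP_HOME is assumed to point at the Hadoop installation directory.
# Set DRY_RUN=1 to print the command instead of executing it.
format_hdfs() {
    hadoop_home=${HADOOP_HOME:-/usr/local/hadoop}   # assumed install location
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$hadoop_home/bin/hadoop namenode -format"
    else
        "$hadoop_home/bin/hadoop" namenode -format
    fi
}

# Usage:
#   HADOOP_HOME=/usr/local/hadoop format_hdfs
```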
Setting up the deployment architecture of Apache Nutch
We have to set up Apache Nutch on each machine that we are using. In this case, we are using a six-machine cluster, so we have to set up Apache Nutch on each of the six machines. For a small number of machines, we can do this manually. But with more machines, say 100 in the cluster, manual setup is impractical, and we need a deployment tool such as Chef or at least distributed ssh. You can refer to http://www.opscode.com/chef/ to get familiar with Chef and to http://www.ibm.com/developerworks/aix/library/au-satdistadmin/ to get familiar with distributed ssh.
I will only demonstrate running Apache Hadoop on Ubuntu as a single-node cluster. If you want to run Apache Hadoop on Ubuntu as a multi-node cluster, follow the reference links above and configure it in the same way. Once Apache Nutch is deployed to the single machine, we run the start-all.sh script, which starts the services on the master node and the data nodes. The script starts the Hadoop daemons on the master node and then logs in to all the slave nodes using the ssh command, as explained above, to start the daemons on the slave nodes.
The start-all.sh script expects Apache Nutch to be installed at the same location on each machine. It also expects Apache Hadoop to store its data at the same file path on each machine. To start the daemons on the master and slave nodes, start-all.sh uses password-less login over ssh.
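Password-less ssh login is usually arranged with a key pair that has an empty passphrase. The sketch below shows one way to do that for the local machine. The function name setup_passwordless_ssh is hypothetical, and the default key path is an assumption. For slave nodes the public key must also be authorized on each of them.

```shell
# Hypothetical helper: enable password-less ssh login to this machine.
# $1 = private key path (assumed default: $HOME/.ssh/id_rsa)
setup_passwordless_ssh() {
    key_file=${1:-$HOME/.ssh/id_rsa}
    auth_file=$(dirname "$key_file")/authorized_keys
    mkdir -p "$(dirname "$key_file")" || return 1
    # Generate an RSA key pair with an empty passphrase unless one already exists.
    [ -f "$key_file" ] || ssh-keygen -t rsa -P "" -f "$key_file" || return 1
    # Authorize the public key and restrict the file permissions.
    cat "$key_file.pub" >> "$auth_file" || return 1
    chmod 600 "$auth_file"
}

# Usage (run once; each call appends the key again):
#   setup_passwordless_ssh
#   ssh localhost          # should now log in without a password prompt
#   bin/start-all.sh       # run from the Hadoop installation directory
```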
Introduction of Apache Nutch configuration with Eclipse
Apache Nutch can be easily configured with Eclipse, after which we can crawl from within Eclipse with no need to use the command line. Every crawling operation that we run from the command line can be run from Eclipse. This section provides instructions for setting up a development environment for Apache Nutch with the Eclipse IDE. It is intended as a comprehensive starting resource for configuring, building, crawling, and debugging Apache Nutch in that context.
Following are the prerequisites for Apache Nutch integration with Eclipse:
- Get the latest version of Eclipse from http://www.eclipse.org/downloads/packages/release/juno/r
- All the plugins required below are available from the Eclipse Marketplace. If the Marketplace client is not installed, you can get it as described at http://marketplace.eclipse.org/marketplace-client-intro
- Once you've configured Eclipse, download Subclipse as described at http://subclipse.tigris.org/.
If you face a problem with the 1.8.x release, try 1.6.x. This may resolve compatibility issues.
- Download the IvyDE plugin for Eclipse from http://ant.apache.org/ivy/ivyde/download.cgi
- Download the m2e plugin for Eclipse from http://marketplace.eclipse.org/content/maven-integration-eclipse
Introduction of Apache Accumulo
Apache Accumulo is a datastore, used in the same way as databases such as MySQL or Oracle. Its key feature is that it runs on an Apache Hadoop cluster. Accumulo's sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Thrift, and ZooKeeper. It adds some novel improvements to the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
Introduction of Apache Gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Apache Gora supports persisting to column stores, key/value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Apache Gora presently supports the following datastores:
- Apache HBase
- Amazon DynamoDB
Use of Apache Gora
Although there are many excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence framework for big data, providing data-store-specific mappings and built-in Apache Hadoop support.
Integration of Apache Nutch with Apache Accumulo
In this section, we are going to cover the process of integrating Apache Nutch with Apache Accumulo. Apache Accumulo is used for storing huge amounts of data. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. A potential use of this integration is when our application has huge data to process and we want to run it in a cluster environment. In that case we can use Apache Accumulo for data storage. Because Apache Accumulo runs only with Apache Hadoop, it is used mostly in cluster-based environments. We will start with the configuration of Apache Gora with Apache Nutch. Then we will set up Apache Hadoop and ZooKeeper, followed by the installation and configuration of Apache Accumulo. Then we will test Apache Accumulo, and at the end we will see crawling with Apache Nutch on Apache Accumulo.
Setup Apache Hadoop and Apache Zookeeper for Apache Nutch
Apache ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. Distributed applications use all of these services in one way or another. Because ZooKeeper provides them, you don't have to write these services from scratch. You can use them to implement consensus, management, group, leader election, and presence protocols, and you can also build on them for your own requirements.
Apache Accumulo is built on top of Apache Hadoop and ZooKeeper, so we must configure Apache Accumulo together with Apache Hadoop and Apache ZooKeeper. You can refer to http://www.covert.io/post/18414889381/accumulo-nutch-and-gora for any queries related to the setup.
Integration of Apache Nutch with MySQL
In this section, we are going to integrate Apache Nutch with MySQL. After that, the web pages crawled by Apache Nutch will be stored in MySQL, so you can go to MySQL to check your crawled web pages and perform any necessary operations on them. We will start with the introduction of MySQL, then cover the need for integrating MySQL with Apache Nutch. After that we will see the configuration of MySQL with Apache Nutch, and at the end we will crawl with Apache Nutch on MySQL. So let's start with the introduction of MySQL.
We covered the following:
- Downloading Apache Hadoop and Apache Nutch
- Perform Crawling on Apache Hadoop Cluster in Apache Nutch
- Apache Nutch configuration with Eclipse
- Installation steps of building Apache Nutch with Eclipse
- Crawling in Eclipse
- Configuration of Apache GORA with Apache Nutch
- Installation and Configuration of Apache Accumulo
- Crawling with Apache Nutch on Apache Accumulo
- Need of integrating MySQL with Apache Nutch
About the Author :
Abdulbasit Shaikh has more than two years of experience in the IT industry. He completed his Masters' degree from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT). He has a lot of experience in open source technologies. He has worked on a number of open source technologies, such as Apache Hadoop, Apache Solr, Apache ZooKeeper, Apache Mahout, Apache Nutch, and Liferay. He has provided training on Apache Nutch, Apache Hadoop, Apache Mahout, and AWS architect. He is currently working on the OpenStack technology. He has also delivered projects and training on open source technologies. He has a very good knowledge of cloud computing, such as AWS and Microsoft Azure, as he has successfully delivered many projects in cloud computing.
He is a very enthusiastic and active person when he is working on a project or delivering a project. Currently, he is working as a Java developer at Attune Infocom Pvt. Ltd. He is totally focused on open source technologies, and he is very much interested in sharing his knowledge with the open source community.
Dr. Zakir Laliwala is an entrepreneur, an open source specialist, and a hands-on CTO at Attune Infocom. Attune Infocom provides enterprise open source solutions and services for SOA, BPM, ESB, Portal, cloud computing, and ECM. At Attune Infocom, he is responsible for product development and the delivery of solutions and services. He explores new enterprise open source technologies and defines architecture, roadmaps, and best practices. He has provided consultations and training to corporations around the world on various open source technologies such as Mule ESB, Activiti BPM, JBoss jBPM and Drools, Liferay Portal, Alfresco ECM, JBoss SOA, and cloud computing.
He received a Ph.D. in Information and Communication Technology from Dhirubhai Ambani Institute of Information and Communication Technology. He was an adjunct faculty at Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), and he taught Master's degree students at CEPT.
He has published many research papers on web services, SOA, grid computing, and the semantic web in IEEE, and has participated in ACM International Conferences. He serves as a reviewer at various international conferences and journals. He has also published book chapters and written books on open source technologies. He was a co-author of the books Mule ESB Cookbook and Activiti5 Business Process Management Beginner's Guide, Packt Publishing.