Introducing Apache Nutch
Apache Nutch is an open source web crawler that fetches pages from websites and extracts data from them. It is extensible and scalable, and its plugins give us the freedom to adapt it to our own needs. Like Apache Solr, Apache Nutch is written in Java, and the two tools combine well for building a search engine of our own.
Apache Nutch can run on a single node or in a distributed setup across multiple nodes. Let's see how we can combine Apache Solr and Apache Nutch to crawl a web page and index it. To do this, let's start by installing Apache Nutch.