You're reading from ElasticSearch Cookbook
There are two ways to put your data into ElasticSearch. In the previous chapters we saw the index API, which stores documents in ElasticSearch via PUT/POST calls or the bulk shortcut. The other way is to use a service that fetches the data from an external source (once or periodically) and loads it into the cluster.
ElasticSearch calls these services rivers, and the ElasticSearch community provides several rivers to connect to the following data sources:
CouchDB
MongoDB
RabbitMQ
SQL DBMS (Oracle, MySQL, PostgreSQL and so on)
Redis
Twitter
Wikipedia
The rivers are available as external plugins.
In this chapter we'll discuss how to manage a river (creating, checking, and deleting) and how to configure the most common ones.
In ElasticSearch, there are two main river management actions:
Creating a river
Deleting a river
To manage a river, we need to perform the following steps:
A river is uniquely defined by a name and a type. The type of the river is the type name defined in the loaded river plugins.
After the name and the type parameters, usually a river requires an extra configuration that can be passed in the _meta property.
To create a river, the HTTP method is PUT (POST also works):
curl -XPUT 'http://127.0.0.1:9200/_river/my_river/_meta' -d '{ "type" : "dummy" }'
The dummy type is a "fake" river always installed in ElasticSearch.
The result will be as follows:
{"ok":true,"_index":"_river","_type":"my_river","_id":"_meta","_version":1}
If you look at ElasticSearch logs, you'll see some new lines, which are as follows:
[2013-08-03 20:48:39,206][INFO ][cluster.metadata ] [Elsie-Dee] [_river] creating index...
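The recipe above only shows river creation. As a minimal sketch of the other management actions (checking and deleting), the following shell helpers assume that a started river exposes a _status document next to _meta, and that deleting the river's type from the _river index removes the river. The helper names (river_url, river_status, river_delete) are hypothetical and not part of ElasticSearch.

```shell
# Base address of the node, taken from the recipe above.
ES_URL="http://127.0.0.1:9200"

# Print the URL of a river document in the _river index.
# $1 = river name, $2 = optional document ID (_meta or _status).
river_url() {
  printf '%s/_river/%s/%s\n' "$ES_URL" "$1" "$2"
}

# Assumption: the node writes a _status document once the river has started.
river_status() {
  curl -s -XGET "$(river_url "$1" _status)"
}

# Assumption: deleting the river's type from _river removes the river.
river_delete() {
  curl -s -XDELETE "$(river_url "$1")"
}

# Usage against a running node:
#   river_status my_river
#   river_delete my_river
```

Defining the calls as functions keeps the URL layout in one place; nothing is sent to the cluster until one of the helpers is invoked.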
CouchDB is a NoSQL data store that stores data in the JSON format, similar to ElasticSearch. It can be queried with map/reduce tasks and it's RESTful, so every operation can be done via HTTP API calls.
Using ElasticSearch to search CouchDB data is very handy as it extends the CouchDB data store with Lucene search capabilities.
To use the CouchDB river, we need to perform the following steps:
Firstly, we need to install the CouchDB river plugin, which is available on GitHub and maintained by the ElasticSearch company. We can install the river plugin in the following way:
bin/plugin -install elasticsearch/elasticsearch-river-couchdb/1.2.0
After restarting the node, we are able to create a configuration (config.json) for our CouchDB...
MongoDB is a very common NoSQL tool used all over the world. One of its main drawbacks is that it was not designed for text searching.
Although the latest MongoDB version provides full text search, its completeness and functionality are far more limited than those of the current ElasticSearch version. So it's quite common to use MongoDB as the data store and ElasticSearch for searching. The MongoDB river, which was initially developed by me and is now maintained by Richard Louapre, helps to create a bridge between these two applications.
You need a working ElasticSearch cluster and a working MongoDB instance installed on the same machine as ElasticSearch and configured as a replica set (see http://docs.mongodb.org/manual/tutorial/deploy-replica-set/ and http://docs.mongodb.org/manual/tutorial/convert-standalone-to-replica-set/). You need to restore the sample data available in mongodb/data using the following command:
mongorestore -d escookbook escookbook
RabbitMQ is a fast message broker, which can handle thousands of messages per second. It is very handy in conjunction with ElasticSearch for bulk indexing records.
The RabbitMQ river plugin is designed to wait for messages that contain bulk operations and to index them.
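To make the message format more concrete, the following sketch builds a message body in the bulk format: an action line followed by the document source, one JSON object per line. The helper names (bulk_index_op, bulk_message) and the sample index, type, and field values are made up for illustration. Publishing the body to the queue is left to whichever AMQP client you use.

```shell
# Emit one "index" operation in the bulk format: an action line followed
# by the document source, each terminated by a newline.
# $1 = index, $2 = type, $3 = ID, $4 = document as single-line JSON.
bulk_index_op() {
  printf '{ "index" : { "_index" : "%s", "_type" : "%s", "_id" : "%s" } }\n' "$1" "$2" "$3"
  printf '%s\n' "$4"
}

# Build a message body holding two operations (sample values only).
bulk_message() {
  bulk_index_op test type1 1 '{ "field1" : "value1" }'
  bulk_index_op test type1 2 '{ "field1" : "value2" }'
}

# Usage: write the body to a file, then publish it with an AMQP client.
#   bulk_message > bulk_message.txt
```

Each document must stay on a single line, because the bulk format uses the newline as the separator between the action and the source.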
You need a working ElasticSearch cluster and a working RabbitMQ instance installed on the same machine as ElasticSearch.
To use the RabbitMQ river, we need to perform the following steps:
Firstly, we need to install the RabbitMQ river plugin, which is available on GitHub (https://github.com/elasticsearch/elasticsearch-river-rabbitmq). We can install the river plugin in the following way:
bin/plugin -install elasticsearch/elasticsearch-river-rabbitmq/1.6.0
The result should be as follows:
-> Installing elasticsearch/elasticsearch-river-rabbitmq/1.6.0... Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-river-rabbitmq/elasticsearch-river-rabbitmq...
Generally, application data is stored in a DBMS of some kind (Oracle, MySQL, PostgreSQL, Microsoft SQL Server, SQLite, and so on). To power up a traditional application with the advanced search capabilities of ElasticSearch and Lucene, all this data must be imported into ElasticSearch. The JDBC river by Jörg Prante connects to these DBMSs, executes queries, and indexes the results.
To use the JDBC river, we need to perform the following steps:
Firstly, we need to install the JDBC river plugin, which is available on GitHub (https://github.com/jprante/elasticsearch-river-jdbc). We can install the river plugin in the following way:
bin/plugin -url http://bit.ly/145e9Ly -install river-jdbc
The result should be as follows:
-> Installing river-jdbc... Trying http://bit.ly/145e9Ly... Downloading … .....DONE Installed river-jdbc into …/elasticsearch/plugins/river-jdbc
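Once the plugin is installed, the river is created like any other river, with its settings passed in _meta. The following is only a sketch: the key names under jdbc are assumed from the plugin's README and may differ between plugin versions, and the driver class, connection URL, credentials, SQL statement, and river name (my_jdbc_river) are placeholders to adapt. The JDBC driver JAR for your DBMS must also be available to the plugin.

```shell
# Print the river settings; every value below is a placeholder.
jdbc_river_meta() {
  cat <<'EOF'
{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://localhost:3306/test",
    "user" : "",
    "password" : "",
    "sql" : "select * from orders"
  }
}
EOF
}

# Usage against a running node (my_jdbc_river is a hypothetical name):
#   jdbc_river_meta | curl -XPUT 'http://127.0.0.1:9200/_river/my_jdbc_river/_meta' -d @-
```

Keeping the settings in one function makes it easy to review the query and the connection data before anything is sent to the cluster.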
In the previous recipes, we have seen rivers that fetch data from data stores, both SQL and NoSQL. In this recipe, we'll discuss how to use the Twitter river to collect tweets from Twitter and store them in ElasticSearch.
You need a working ElasticSearch cluster and a Twitter OAuth token. To obtain one, you need to log in to Twitter (https://dev.twitter.com/apps/) and create a new app at https://dev.twitter.com/apps/new.
To use the Twitter river, we need to perform the following steps:
Firstly, we need to install the Twitter river plugin, which is available on GitHub (https://github.com/elasticsearch/elasticsearch-river-twitter). We can install the river plugin in the usual way as follows:
bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0
The result should be as follows:
-> Installing elasticsearch/elasticsearch-river-twitter/1.4.0... Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-river-twitter/elasticsearch...
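With the plugin installed and the OAuth token at hand, the river can be created by passing the credentials in _meta. This is a minimal sketch, not the definitive configuration: the key names are assumed from the plugin's README, the four YOUR_... values stand for the credentials of your own Twitter app, and the river name, index name, and bulk size are illustrative.

```shell
# Print the river settings; replace the YOUR_... placeholders
# with the credentials of your own Twitter app.
twitter_river_meta() {
  cat <<'EOF'
{
  "type" : "twitter",
  "twitter" : {
    "oauth" : {
      "consumer_key" : "YOUR_CONSUMER_KEY",
      "consumer_secret" : "YOUR_CONSUMER_SECRET",
      "access_token" : "YOUR_ACCESS_TOKEN",
      "access_token_secret" : "YOUR_ACCESS_TOKEN_SECRET"
    }
  },
  "index" : {
    "index" : "my_twitter_river",
    "type" : "status",
    "bulk_size" : 100
  }
}
EOF
}

# Usage against a running node (my_twitter_river is a hypothetical name):
#   twitter_river_meta | curl -XPUT 'http://127.0.0.1:9200/_river/my_twitter_river/_meta' -d @-
```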