Elasticsearch Administration

Packt
03 Mar 2015
28 min read
In this article by Rafał Kuć and Marek Rogoziński, authors of the book Mastering Elasticsearch, Second Edition, we will talk more about the Elasticsearch configuration and the new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned about:

(For more resources related to this topic, see here.)

- Configuring the discovery and recovery modules
- Using the Cat API that allows a human-readable insight into the cluster status
- The backup and restore functionality
- Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node joins the already formed cluster. If no master is found, the node itself is elected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster.

After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present).

In this section, we will take a deeper look at these two modules and discuss the configuration possibilities Elasticsearch gives us and the consequences of changing them. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.

Discovery configuration

As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are.

Zen discovery

Zen discovery is the default mechanism responsible for discovery in Elasticsearch and is available out of the box. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works—the node will join the cluster if it has the same cluster name and is visible to the other nodes in that cluster. This discovery method is perfectly suited for development, because you don't need to care about the configuration; however, it is not advised for production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as nodes joining a cluster accidentally. Sometimes, multicast is not available for various reasons, or you don't want to use it for the reasons just mentioned.
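As a concrete illustration of this default behavior, a minimal elasticsearch.yml needs little more than a shared cluster name; this is only a sketch, and the cluster and node names are hypothetical:

cluster.name: my-books-cluster
node.name: node-1
# No discovery settings are needed here: with the default multicast Zen
# discovery, a second node started with the same cluster.name on the same
# network will find this node and join the cluster automatically.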
In the case of bigger clusters, multicast discovery may generate too much unnecessary traffic, which is another valid reason why it shouldn't be used in production. For these cases, Zen discovery allows us to use the unicast mode. When using unicast Zen discovery, a node that is not a part of the cluster sends a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can either join an existing cluster or form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only made to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes.

If you want to know more about the differences between the multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast.

If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

- discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication, given as an address or interface name.
- discovery.zen.ping.multicast.port (the default: 54328): This is the port used for communication.
- discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.
- discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.
- discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses a router, the TTL is decreased, which limits the area in which the transmission can be received. Note that routers can have threshold values compared against the TTL, which may cause the TTL value to not match exactly the number of routers a packet can cross.
- discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

- discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a comma-separated string or as an array of hosts. Every host can be given as a name or an IP address and can have a port or port range appended. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[9300-9400]"]. The hosts list for unicast discovery doesn't need to be a complete list of the Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.
- discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discovery will use. If there are a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.
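Putting these options together, a hedged elasticsearch.yml sketch for switching a node from multicast to unicast discovery might look like the following; the host names are hypothetical:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1:9300", "master2:9300", "master3:9300"]

Only a few well-known nodes need to be listed here; any node that can reach one of them will learn about the rest of the cluster from it.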
Master node

Apart from connecting to other nodes, one of the main purposes of discovery is to choose a master node—a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, one of them can be elected as the new master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be both a master node and a data node. However, in certain situations, you may want to have worker nodes that only hold data or process queries, and master nodes that are used only to manage the cluster. One such situation is handling a massive amount of data, where data nodes should be as performant as possible and there shouldn't be any delay in the master nodes' responses.

Configuring data-only nodes

To set up a node that only holds data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: false
node.data: true

Configuring master-only nodes

To set up a node that doesn't hold data and can only be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

node.master: true
node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

node.master: false
node.data: false

Please note that the node.master and node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it. Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they can still see each other. Because of the Zen discovery and master election process, the nodes that got disconnected elect a new master, and you end up with two clusters with the same name and two master nodes. Such a situation is called a split-brain, and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't rejoin each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and an unrecoverable situation when the nodes rejoin after the network split.

In order to prevent split-brain situations, or at least minimize the possibility of their occurrence, Elasticsearch provides the discovery.zen.minimum_master_nodes property. This property defines the minimum number of master eligible nodes that should be connected to each other in order to form a cluster.
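For the 10-node cluster used in this example, a hedged elasticsearch.yml sketch of this safeguard would be:

# 10 master eligible nodes in total; a partition needs a majority (6) to form a cluster
discovery.zen.minimum_master_nodes: 6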
So now, let's get back to our cluster. If we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we had 10 nodes, which is more than six, and these nodes formed a cluster. After the disconnection of the three nodes, we would still have the first cluster up and running. However, because only three nodes got disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and would wait for reconnection with the original cluster.

Zen discovery fault detection and configuration

Elasticsearch runs two detection processes while it is working. The first process sends ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is the reverse of that—each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

- discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.
- discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for a response to the ping request it sent. If your nodes are 100 percent utilized or your network is slow, you may consider increasing this value.
- discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping retries before the target node is considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper sequence of cluster state updates, the master node processes a single cluster state update request at a time, makes the changes locally, and sends the request to all the other nodes so that they can synchronize their state. The master node waits a given amount of time for the nodes to respond; once that time passes or all the nodes have responded with their acknowledgment information, it proceeds with processing the next cluster state update request. To change the time the master node waits for the other nodes to respond, modify the default of 30 seconds by setting the discovery.zen.publish_timeout property. Increasing this value may be needed for huge clusters working in an overloaded network.

The Amazon EC2 discovery

Amazon, in addition to selling goods, offers a few popular services, such as selling storage or computing power in a pay-as-you-go model. The Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient—you pay for the instances that are needed to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently.
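Before digging into the EC2 specifics, here is a hedged elasticsearch.yml sketch pulling together the fault detection and publish timeout settings just described; the values are illustrative for a slow or congested network, not recommendations:

discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
discovery.zen.publish_timeout: 60s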
One of the features that works slightly differently on EC2 is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes we want to be able to discover nodes automatically and, with unicast, we need to provide at least the initial list of hosts. However, there is an alternative—we can use the Amazon EC2 plugin, which combines the multicast and unicast discovery methods using the Amazon EC2 API.

Make sure that during the setup of the EC2 instances, you allow communication between them (on ports 9200 and 9300 by default). This is crucial for the Elasticsearch nodes to communicate with each other and, thus, for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

The installation of this plugin is as simple as that of most plugins. In order to install it, we should run the following command:

bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work:

- cloud.aws.access_key: The Amazon access key—one of the credential values you can find in the Amazon configuration panel.
- cloud.aws.secret_key: The Amazon secret key—similar to the previously mentioned access key, it can be found in the EC2 configuration panel.

The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to ec2 and turning off multicast.

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin's behavior, Elasticsearch exposes additional settings:

- cloud.aws.region: This is the region used to connect with the Amazon EC2 web services. Choose the region adequate for the one where your instances reside, for example, eu-west-1 for Ireland. The possible values include eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-northeast-1.
- cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.
- cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to the Amazon Web Services endpoints. By default, Elasticsearch uses the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol setting for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same—https and http).
- cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to the AWS endpoints. The cloud.aws.proxy_host property should be set to the address of the proxy that should be used.
- cloud.aws.proxy_port: The second property related to the AWS endpoint proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to that port.
- discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for a response to a ping message sent to another node. After this time, the nonresponsive node is considered dead and is removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.

The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster working inside the EC2 environment—the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

- discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with the other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).
- discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.
- discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes within the specified availability zones will be discovered and included in the cluster.
- discovery.ec2.any_group (this defaults to true): Setting this property to false forces the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.
- discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. You can then use these tags to limit the nodes that are discovered. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify discovery.ec2.tag.environment: qa, and only nodes running on instances with this tag will be considered for discovery.
- cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them when adjusting the Elasticsearch shard allocation and configuring shard placement.

Other discovery implementations

Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team:

- Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure
- Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly. This means that not only is the data in the Apache Lucene indices stored, but also all the metadata (for example, the index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module.
When the cluster is started up, its state will be loaded using the gateway module and applied. One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none.

The gateway recovery process

Let's say it explicitly: the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned—the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, the replicas are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync.

Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that while the cluster is not recovered, all the operations performed on it will not be allowed. This is done in order to avoid modification conflicts.

Configuration properties

Before we continue with the configuration, we would like to say one more thing. As you know, Elasticsearch nodes can play different roles—they can have the role of data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify:

- gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least five nodes (no matter whether they are data or master eligible nodes) must be present for the recovery process to start.
- gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.
- gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.
- gateway.recover_after_time: This allows us to set how long to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and don't hold data. What we would like to configure is for the recovery process to be delayed for 3 minutes after the four data nodes are present.
Our gateway configuration could look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are:

- gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) from which the cluster will be formed, because that will guarantee that the latest cluster state will be recovered.
- gateway.expected_data_nodes: This is the number of data eligible nodes expected to be present in the cluster for the recovery process to start immediately.
- gateway.expected_master_nodes: This is the number of master eligible nodes expected to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

gateway.expected_nodes: 6

So the whole configuration would look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m
gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (no matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the 0.19 releases), all the gateway types apart from the default local gateway type were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch. This hasn't happened yet, but if you want to avoid a full data reindexing, you should only use the local gateway type, and this is why we won't discuss all the other types.

The local gateway type uses the local storage available on a node to store the metadata, the mappings, and the indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching.

The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). Writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process.

In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default.

There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about—dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them. Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value, which results in importing all dangling indices into the cluster), close (which results in importing the dangling indices into the cluster state but keeping them closed by default), and no (which results in removing the dangling indices).
When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (it defaults to 2h) to specify how long Elasticsearch will wait before deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes and we don't want old indices to be included in the cluster.

Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We decided that it would be good to mention these properties in the section dedicated to the gateway and recovery.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

- indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.
- indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred per second during shard recovery. In order to disable the transfer limit, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.
- indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.
- indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB, and the chunks are compressed if the indices.recovery.compress property is set to true.
- indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process.
- indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB, and it is compressed if the indices.recovery.compress property is set to true.

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this.

All the previously mentioned settings can be updated using the Cluster Update Settings API, or they can be set in the elasticsearch.yml file.

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards.
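Before looking at that property more closely, here is a hedged sketch of the cluster-level recovery settings described above as they might appear in elasticsearch.yml; the values are illustrative only, not recommendations:

indices.recovery.concurrent_streams: 5
indices.recovery.max_bytes_per_sec: 50mb
indices.recovery.compress: true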
In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and that quorum can be allocated. A quorum is 50 percent of the shards for the given index, plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch treats as a quorum. This property can be set to one of the following values:

- quorum: 50 percent of the shards, plus one, need to be present and be allocable. This is the default value.
- quorum-1: 50 percent of the shards for the given index need to be present and be allocable.
- full: All of the shards for the given index need to be present and be allocable.
- full-1: 100 percent minus one of the shards for the given index need to be present and be allocable.
- integer value: Any integer, such as 1, 2, or 5, specifies the number of shards that need to be present and allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable.

It is good to know about this property, but the default value will be sufficient for most deployments.

Summary

In this article, we focused on the Elasticsearch configuration and the new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allows easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters while still using all the functionalities of Elasticsearch and being connected to a single node.

If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition and learn, in a simple step-by-step fashion, how to use Elasticsearch to enhance your knowledge further.

Resources for Article:

Further resources on this subject:

- Downloading and Setting Up ElasticSearch [Article]
- Indexing the Data [Article]
- Driving Visual Analyses with Automobile Data (Python) [Article]

Introducing Splunk

Packt
03 Mar 2015
14 min read
In this article by Betsy Page Sigman, author of the book Splunk Essentials, we introduce Splunk, whose name was inspired by the process of exploring caves, or spelunking. Splunk helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called Splunk Enterprise. It manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool.

(For more resources related to this topic, see here.)

Understanding events, event types, and fields in Splunk

An understanding of events and event types is important before going further.

Events

In Splunk, an event is not just one of the many local user meetings that are set up between developers to help each other out (although those can be very useful); it also refers to a record of one activity that is recorded in a log file. Each event usually has:

- A timestamp indicating the date and exact time the event was created
- Information about what happened on the system that is being tracked

Event types

An event type is a way to allow users to categorize similar events. It is a field defined by the user. You can define an event type in several ways; the easiest is by using the Splunk Web interface. One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type.

Why are events and event types so important in Splunk? Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later.

Sourcetypes

Sourcetypes are also important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the data to easily categorize it. Some of the common sourcetypes are listed as follows:

- access_combined, for NCSA combined format HTTP web server logs
- apache_error, for standard Apache web server error logs
- cisco_syslog, for the standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually sent via remote syslog to a central log host
- websphere_core, a core file export from WebSphere

(Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter)

Fields

Each event in Splunk is associated with a number of fields. The core fields of host, source, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are assigned because of where the event came from or what it is.
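As an illustration of these core fields, a single indexed web-access event might carry default fields like the following; the values here are hypothetical:

_time: 2015-03-03 10:15:42 UTC
host: web01
source: /var/log/apache/access.log
sourcetype: access_combined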
When data is processed by Splunk, and when it is indexed or searched, it uses these fields. For indexing, the default fields added include host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as whether an action results in a purchase (for a website event). Fields are essential for doing the basic work of Splunk—that is, indexing and searching.

Getting data into Splunk

It's time to spring into action now and input some data into Splunk. Adding data is simple, easy, and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data:

1. Firstly, to obtain your data, visit the tutorial data at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk, which is readily available on Splunk. Here, download the folder tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it, but don't extract the data from it just yet.
2. You then need to log in to Splunk, using admin as the username, followed by your password.
3. Once logged in, you will notice that toward the upper-right corner of your screen is the button Add Data, as shown in the following screenshot. Click on this button:

Button to Add Data

4. Once you have clicked on this button, you'll see a screen similar to the following screenshot:

Add Data to Splunk by Choosing a Data Type or Data Source

5. Notice here the different types of data that you can select, as well as the different data sources. Since the data we're going to use is a file, under Or Choose a Data Source, click on From files and directories.
6. Once you have clicked on this, you can then click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data now. You then need to click on Continue:

Preview data

You can download the tutorial files at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk.

7. As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot. (Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done:

Can specify source, additional settings, and source type

8. Following this, you should see a screen similar to the following screenshot. Click on Start Searching; we will look at the data now:

You should see this if your data has been successfully indexed into Splunk.

9. You will now see a screen similar to the following screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary:

The Search screen

10. You should see the Data Summary screen, like the one in the following screenshot. However, note that the Hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case, 127.0.0.1) and click on it:

Data Summary, where you can see Hosts, Sources, and Sourcetypes

11. After clicking on the most recent data, which in this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly.
Here, you will see a listing of events, similar to those shown in the following screenshot:

Events lists for the host value

You can click on the Splunk logo in the upper-left corner of the web page to return to the home page. Under Administrator at the top-right of the page, click on Logout.

Searching Twitter data

We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search.

A simple search

Imagine that we wanted to search for tweets containing the word coffee. We could use the code presented here and place it in the search bar:

index=twitter text=*coffee*

The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. You have to put the asterisks there; otherwise, you will only get the tweets that consist of just "coffee". (Note that the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.) The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted—a rather rare occurrence, we expect. In fact, when we searched our indexed Twitter data without the asterisks around coffee, we got no results.

Examining the Twitter event

Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event:

A Twitter event

There are several items to look at closely here:

- _time: Splunk assigns a timestamp to every event. This is done in the UTC (Coordinated Universal Time) time format.
- contributors: The value for this field is null, as are the values of many Twitter fields.
- retweeted_status: Notice the {+} here; in the following event list, you will see that there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields:

Various retweet fields

In addition to those shown previously, there are many other fields associated with a tweet. The 140-character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected.

The implied AND

If you want to search on more than one term, there is no need to add AND, as it is already implied. If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use:

index=twitter text=*coffee* text=*morning*

If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important.

The need to specify OR

Unlike AND, you must always specify the word OR.
For example, to obtain all events that mention either coffee or morning, enter:

index=twitter text=*coffee* OR text=*morning*

Finding other words used

Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search:

index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text

This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds. You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee when composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be worth considering:

Search of top 30 text fields found with *coffee*

When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this. You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search using the following code. "Coffee" (with a capital C) is listed separately (on the second page, which is not shown) from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), the process of putting the text fields through makemv and mvexpand ends up distinguishing on the basis of case. We could rerun the search, excluding some of the filler words, using the code shown here:

index=twitter text=*coffee* | makemv text | mvexpand text | search NOT text="RT" AND NOT text="a" AND NOT text="to" AND NOT text="the" | top 30 text

Using a lookup table

Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might often be found in a tweet just because of their frequent use in the language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like "dark", "flavorful", and "strong", but not words like "a", "the", and "then". We can do this using a lookup command. There are three types of lookup commands, which are described here:

- lookup: Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner, and what happens when the following code snippet is used after a preceding search (indicated by . . . |):

. . . | lookup lutable owner

Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event.

- inputlookup: All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched:

. . . | inputlookup lutable

- outputlookup: This command outputs search results to a lookup table. The following code snippet saves results from the preceding search directly into a table it creates:

. . . | outputlookup newtable.csv

The command we will use here is inputlookup, because we want to reference a .csv file we can create that will include the words we want to filter out as we look for possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code:

index=twitter text=*coffee* | makemv text | mvexpand text | search NOT [inputlookup filtered_words | fields text ] | top 30 text

Using the preceding code, Splunk will search our Twitter index for *coffee* and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those. As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup.

Summary

In this article, we have learned more about how to use Splunk to create reports and dashboards. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness. Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time.

Resources for Article:

Further resources on this subject:

- Lookups [article]
- Working with Apps in Splunk [article]
- Loading data, creating an app, and adding dashboards and reports in Splunk [article]

MapReduce functions

Packt
03 Mar 2015
11 min read
In this article by John Zablocki, author of the book Couchbase Essentials, you will get acquainted with MapReduce and learn how to use it to create secondary indexes for your documents. At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms, from MongoDB to Hadoop, and of course, Couchbase.

Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations.

(For more resources related to this topic, see here.)

Map functions

Consider the following Python snippet:

numbers = [1, 2, 3, 4, 5]
doubled = map(lambda n: n * 2, numbers)
#doubled == [2, 4, 6, 8, 10]

These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. In this case, the map() function is supplied as a Python lambda, which is just an inline, unnamed function. The body of the lambda multiplies each number by two.

This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code:

numbers = [1, 2, 3, 4, 5]

def double_odd(num):
    if num % 2 == 0:
        return num
    else:
        return num * 2

doubled = map(double_odd, numbers)
#doubled == [2, 2, 6, 4, 10]

Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function. Each item of the collection is then iterated over, with the map function being applied to that item. The final result is a new collection where each of the original items is transformed by the map.

Reduce functions

Like maps, reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's built-in reduce() function, we can see how to produce a sum of integers, as follows:

numbers = [1, 2, 3, 4, 5]
sum = reduce(lambda x, y: x + y, numbers)
#sum == 15

You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list:

- x = 1, y = 2
- x = 3, y = 3
- x = 6, y = 4
- x = 10, y = 5
- x = 15

As this list demonstrates, the value of x is the cumulative sum of the previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value.

Couchbase MapReduce

Creating an index (or view) in Couchbase requires creating a map function written in JavaScript.
When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce. You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column, or a set of columns, to be indexed by the server. Of course, these are not columns, but rather properties of the documents to be indexed.

Basic mapping

To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here:

var books = [
    { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
    { "id": 2, "title": "The Godfather", "author": "Mario Puzzo" },
    { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi" }
];

Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript:

books.map(function(book) {
  return book.author;
});

In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with the mapped objects. In this case, we'll have an array with each book's author, as follows:

["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"]

At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Title column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to include the key (the id property) of each document in our index:

books.map(function(book) {
  return [book.author, book.id];
});

In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its author:

[["Robert Ludlow", 1], ["Mario Puzzo", 2], ["Nicholas Pileggi", 3]]

We'll soon see how this structure more closely resembles the values stored in a Couchbase index.

Basic reducing

Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying only on those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding map example, let's imagine we have a few more documents in our set, as follows:

var books = [
    { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
    { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow" },
    { "id": 3, "title": "The Godfather", "author": "Mario Puzzo" },
    { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow" },
    { "id": 5, "title": "The Family", "author": "Mario Puzzo" },
    { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi" }
];

We'll still create our index using the same map function, because it provides a way of accessing a book by its author.
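Applied to this larger set with the [book.author, book.id] variant of the map function, the index entries would look something like the following sketch; note the repeated authors:

[["Robert Ludlow", 1], ["Robert Ludlow", 2], ["Mario Puzzo", 3],
 ["Robert Ludlow", 4], ["Mario Puzzo", 5], ["Nicholas Pileggi", 6]]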
Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author. Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript: mapped = books.map(function (book) {     return ([book.id, book.author]); });   counts = {} reduced = mapped.reduce(function(prev, cur, idx, arr) { var key = cur[1];     if (! counts[key]) counts[key] = 0;     ++counts[key] }, null); This code doesn't quite accurately reflect the way you would count books with Couchbase but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary. Couchbase views At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. In fact, there were two to choose from. The bucket we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket, as shown next: First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated for brevity): {  "name": "Sundog",  "type": "beer",  "brewery_id": "new_holland_brewing_company",  "description": "Sundog is an amber ale...",  "style": "American-Style Amber/Red Ale",  "category": "North American Ale" } As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this: SELECT Id FROM Beers WHERE Name = ? You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows: function(doc) {   emit(doc.name); } This body of the map function has only one line. It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names. The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated for brevity): {   "name": "Thomas Hooker Brewing",   "city": "Bloomfield",   "state": "Connecticut",   "website": "http://www.hookerbeer.com/",   "type": "brewery" } If we reexamine our map function, we'll see an obvious problem; both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index with documents from either the brewery or beer documents. The problem is that Couchbase documents exist in a single container—the bucket. 
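To see why that grouped format helps, here is a plain-Python sketch (an illustration only, not the server's implementation) of the two phases: the map step emits one (author, 1) pair per book, the grouping step collects the values per key, and the reduce step then only has to count them:

```python
# Plain-Python illustration of the two phases: map emits (key, value) pairs,
# the values are grouped by key, and reduce then works on one group at a time.
from collections import defaultdict

books = [
    {"id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow"},
    {"id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow"},
    {"id": 3, "title": "The Godfather", "author": "Mario Puzzo"},
    {"id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow"},
    {"id": 5, "title": "The Family", "author": "Mario Puzzo"},
    {"id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi"},
]

# Map step: emit the author as the key and 1 as the value, one pair per book
emitted = [(book["author"], 1) for book in books]

# Grouping step: collect the emitted values per key
grouped = defaultdict(list)
for key, value in emitted:
    grouped[key].append(value)

# Reduce step: with the values already grouped, counting is just len(values)
counts = {key: len(values) for key, values in grouped.items()}
print(counts)  # {'Robert Ludlow': 3, 'Mario Puzzo': 2, 'Nicholas Pileggi': 1}
```

This mirrors what Couchbase does for you: by the time your reduce function runs, the bookkeeping per key has already been handled.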
There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents: function(doc) {   if (doc.type == "beer") {     emit(doc.name);   } } The emit() function actually takes two arguments. The first, as we've seen, emits a value to be indexed. The second argument is an optional value and is used by the reduce function. Imagine that we want to count the number of beer types in a particular category. In SQL, we would write the following query: SELECT Category, COUNT(*) FROM Beers GROUP BY Category To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map. It will create an index on the category property: function(doc) {   if (doc.type == "beer") {     emit(doc.category, 1);   } } The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with that value is simply count them. This counting will be done in our reduce function: function(keys, values) {   return values.length; } In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example. Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console. Summary In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems. Resources for Article: Further resources on this subject: Map Reduce? [article] Introduction to Mapreduce [article] Working with Apps Splunk [article]

article-image-getting-started-postgresql
Packt
03 Mar 2015
11 min read
Save for later

Getting Started with PostgreSQL

Packt
03 Mar 2015
11 min read
In this article by Ibrar Ahmed, Asif Fayyaz, and Amjad Shahzad, authors of the book PostgreSQL Developer's Guide, we will come across the basic features and functions of PostgreSQL, such as writing queries using psql, data definition in tables, and data manipulation from tables. (For more resources related to this topic, see here.) PostgreSQL is widely considered to be one of the most stable database servers available today, with multiple features that include: A wide range of built-in types MVCC New SQL enhancements, including foreign keys, primary keys, and constraints Open source code, maintained by a team of developers Trigger and procedure support with multiple procedural languages Extensibility in the sense of adding new data types and the client language From the early releases of PostgreSQL (from version 6.0 that is), many changes have been made, with each new major version adding new and more advanced features. The current version is PostgreSQL 9.4 and is available from several sources and in various binary formats. Writing queries using psql Before proceeding, allow me to explain to you that throughout this article, we will use a warehouse database called warehouse_db. In this section, I will show you how you can create such a database, providing you with sample code for assistance. You will need to do the following: We are assuming here that you have successfully installed PostgreSQL and faced no issues. Now, you will need to connect with the default database that is created by the PostgreSQL installer. To do this, navigate to the default path of installation, which is /opt/PostgreSQL/9.4/bin from your command line, and execute the following command that will prompt for a postgres user password that you provided during the installation: /opt/PostgreSQL/9.4/bin$./psql -U postgres Password for user postgres: Using the following command, you can log in to the default database with the user postgres and you will be able to see the following on your command line: psql (9.4beta1) Type "help" for help postgres=# You can then create a new database called warehouse_db using the following statement in the terminal: postgres=# CREATE DATABASE warehouse_db; You can then connect with the warehouse_db database using the following command: postgres=# c warehouse_db You are now connected to the warehouse_db database as the user postgres, and you will have the following warehouse_db shell: warehouse_db=# Let's summarize what we have achieved so far. We are now able to connect with the default database postgres and created a warehouse_db database successfully. It's now time to actually write queries using psql and perform some Data Definition Language (DDL) and Data Manipulation Language (DML) operations, which we will cover in the following sections. In PostgreSQL, we can have multiple databases. Inside the databases, we can have multiple extensions and schemas. Inside each schema, we can have database objects such as tables, views, sequences, procedures, and functions. We are first going to create a schema named record and then we will create some tables in this schema. To create a schema named record in the warehouse_db database, use the following statement: warehouse_db=# CREATE SCHEMA record; Creating, altering, and truncating a table In this section, we will learn about creating a table, altering the table definition, and truncating the table. Creating tables Now, let's perform some DDL operations starting with creating tables. 
To create a table named warehouse_tbl, execute the following statements: warehouse_db=# CREATE TABLE warehouse_tbl ( warehouse_id INTEGER NOT NULL, warehouse_name TEXT NOT NULL, year_created INTEGER, street_address TEXT, city CHARACTER VARYING(100), state CHARACTER VARYING(2), zip CHARACTER VARYING(10), CONSTRAINT "PRIM_KEY" PRIMARY KEY (warehouse_id) ); The preceding statements created the table warehouse_tbl that has the primary key warehouse_id. Now, as you are familiar with the table creation syntax, let's create a sequence and use that in a table. You can create the hist_id_seq sequence using the following statement: warehouse_db=# CREATE SEQUENCE hist_id_seq; The preceding CREATE SEQUENCE command creates a new sequence number generator. This involves creating and initializing a new special single-row table with the name hist_id_seq. The user issuing the command will own the generator. You can now create the table that implements the hist_id_seq sequence using the following statement: warehouse_db=# CREATE TABLE history ( history_id INTEGER NOT NULL DEFAULT nextval('hist_id_seq'), date TIMESTAMP WITHOUT TIME ZONE, amount INTEGER, data TEXT, customer_id INTEGER, warehouse_id INTEGER, CONSTRAINT "PRM_KEY" PRIMARY KEY (history_id), CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) ); The preceding query will create a history table in the warehouse_db database, and the history_id column uses the sequence as the default input value. In this section, we successfully learned how to create a table and also learned how to use a sequence inside the table creation syntax. Altering tables Now that we have learned how to create multiple tables, we can practice some ALTER TABLE commands by following this section. With the ALTER TABLE command, we can add, remove, or rename table columns. 
Firstly, with the help of the following example, we will be able to add the phone_no column in the previously created table warehouse_tbl: warehouse_db=# ALTER TABLE warehouse_tbl ADD COLUMN phone_no INTEGER; We can then verify that a column is added in the table by describing the table as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id  | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | phone_no       | integer               | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY"FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) To drop a column from a table, we can use the following statement: warehouse_db=# ALTER TABLE warehouse_tbl DROP COLUMN phone_no; We can then finally verify that the column has been removed from the table by describing the table again as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id   | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) Truncating tables The TRUNCATE command is used to remove all rows from a table without providing any criteria. In the case of the DELETE command, the user has to provide the delete criteria using the WHERE clause. To truncate data from the table, we can use the following statement: warehouse_db=# TRUNCATE TABLE warehouse_tbl; We can then verify that the warehouse_tbl table has been truncated by performing a SELECT COUNT(*) query on it using the following statement: warehouse_db=# SELECT COUNT(*) FROM warehouse_tbl; count -------      0 (1 row) Inserting, updating, and deleting data from tables In this section, we will play around with data and learn how to insert, update, and delete data from a table. Inserting data So far, we have learned how to create and alter a table. Now it's time to play around with some data. 
Let's start by inserting records in the warehouse_tbl table using the following command snippet: warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 1, 'Mark Corp', 2009, '207-F Main Service Road East', 'New London', 'CT', 4321 ); We can then verify that the record has been inserted by performing a SELECT query on the warehouse_tbl table as follows: warehouse_db=# SELECT warehouse_id, warehouse_name, street_address               FROM warehouse_tbl; warehouse_id | warehouse_name |       street_address         ---------------+----------------+------------------------------- >             1 | Mark Corp     | 207-F Main Service Road East (1 row) Updating data Once we have inserted data in our table, we should know how to update it. This can be done using the following statement: warehouse_db=# UPDATE warehouse_tbl SET year_created=2010 WHERE year_created=2009; To verify that a record is updated, let's perform a SELECT query on the warehouse_tbl table as follows: warehouse_db=# SELECT warehouse_id, year_created FROM               warehouse_tbl; warehouse_id | year_created --------------+--------------            1 |         2010 (1 row) Deleting data To delete data from a table, we can use the DELETE command. Let's add a few records to the table and then later on delete data on the basis of certain conditions: warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 2, 'Bill & Co', 2014, 'Lilly Road', 'New London', 'CT', 4321 ); warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 3, 'West point', 2013, 'Down Town', 'New London', 'CT', 4321 ); We can then delete data from the warehouse.tbl table, where warehouse_name is Bill & Co, by executing the following statement: warehouse_db=# DELETE FROM warehouse_tbl WHERE warehouse_name='Bill & Co'; To verify that a record has been deleted, we will execute the following SELECT query: warehouse_db=# SELECT warehouse_id, warehouse_name FROM warehouse_tbl WHERE warehouse_name='Bill & Co'; warehouse_id | warehouse_name --------------+---------------- (0 rows) The DELETE command is used to drop a row from a table, whereas the DROP command is used to drop a complete table. The TRUNCATE command is used to empty the whole table. Summary In this article, we learned how to utilize the SQL language for a collection of everyday DBMS exercises in an easy-to-use practical way. We also figured out how to make a complete database that incorporates DDL (create, alter, and truncate) and DML (insert, update, and delete) operators. Resources for Article: Further resources on this subject: Indexes [Article] Improving proximity filtering with KNN [Article] Using Unrestricted Languages [Article]
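As a closing note on the DML statements shown above, the same operations can be driven from application code. The following is a minimal sketch using the psycopg2 driver; it assumes psycopg2 is installed, that the warehouse_db database and warehouse_tbl table created earlier exist, and that the connection credentials are placeholders you adjust for your installation. The inserted row values are made up for illustration:

```python
import psycopg2

# Placeholder credentials; adjust to match your local installation
conn = psycopg2.connect(dbname="warehouse_db", user="postgres",
                        password="postgres", host="localhost")
with conn:
    with conn.cursor() as cur:
        # Parameterized statements (%s placeholders) keep values out of the SQL text
        cur.execute(
            """INSERT INTO warehouse_tbl
               (warehouse_id, warehouse_name, year_created,
                street_address, city, state, zip)
               VALUES (%s, %s, %s, %s, %s, %s, %s)""",
            (4, "North Dock", 2015, "Harbour Road", "New London", "CT", "4321"))

        cur.execute("UPDATE warehouse_tbl SET year_created = %s "
                    "WHERE warehouse_id = %s", (2016, 4))

        cur.execute("SELECT warehouse_id, warehouse_name, year_created "
                    "FROM warehouse_tbl ORDER BY warehouse_id")
        for row in cur.fetchall():
            print(row)
conn.close()
```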

article-image-postgresql-extensible-rdbms
Packt
03 Mar 2015
18 min read
Save for later

PostgreSQL as an Extensible RDBMS

Packt
03 Mar 2015
18 min read
This article by Usama Dar, the author of the book PostgreSQL Server Programming - Second Edition, explains the process of creating a new operator, overloading it, optimizing it, creating index access methods, and much more. PostgreSQL is an extensible database. I hope you've learned this much by now. It is extensible by virtue of the design that it has. As discussed before, PostgreSQL uses a catalog-driven design. In fact, PostgreSQL is more catalog-driven than most of the traditional relational databases. The key benefit here is that the catalogs can be changed or added to, in order to modify or extend the database functionality. PostgreSQL also supports dynamic loading, that is, a user-written code can be provided as a shared library, and PostgreSQL will load it as required. (For more resources related to this topic, see here.) Extensibility is critical for many businesses, which have needs that are specific to that business or industry. Sometimes, the tools provided by the traditional database systems do not fulfill those needs. People in those businesses know best how to solve their particular problems, but they are not experts in database internals. It is often not possible for them to cook up their own database kernel or modify the core or customize it according to their needs. A truly extensible database will then allow you to do the following: Solve domain-specific problems in a seamless way, like a native solution Build complete features without modifying the core database engine Extend the database without interrupting availability PostgreSQL not only allows you to do all of the preceding things, but also does these, and more with utmost ease. In terms of extensibility, you can do the following things in a PostgreSQL database: Create your own data types Create your own functions Create your own aggregates Create your own operators Create your own index access methods (operator classes) Create your own server programming language Create foreign data wrappers (SQL/MED) and foreign tables What can't be extended? Although PostgreSQL is an extensible platform, there are certain things that you can't do or change without explicitly doing a fork, as follows: You can't change or plug in a new storage engine. If you are coming from the MySQL world, this might annoy you a little. However, PostgreSQL's storage engine is tightly coupled with its executor and the rest of the system, which has its own benefits. You can't plug in your own planner/parser. One can argue for and against the ability to do that, but at the moment, the planner, parser, optimizer, and so on are baked into the system and there is no possibility of replacing them. There has been some talk on this topic, and if you are of the curious kind, you can read some of the discussion at http://bit.ly/1yRMkK7. We will now briefly discuss some more of the extensibility capabilities of PostgreSQL. We will not dive deep into the topics, but we will point you to the appropriate link where more information can be found. Creating a new operator Now, let's take look at how we can add a new operator in PostgreSQL. Adding new operators is not too different from adding new functions. In fact, an operator is syntactically just a different way to use an existing function. For example, the + operator calls a built-in function called numeric_add and passes it the two arguments. When you define a new operator, you must define the data types that the operator expects as arguments and define which function is to be called. 
Let's take a look at how to define a simple operator. You have to use the CREATE OPERATOR command to create an operator. Let's use that function to create a new Fibonacci operator, ##, which will have an integer on its left-hand side: CREATE OPERATOR ## (PROCEDURE=fib, LEFTARG=integer); Now, you can use this operator in your SQL to calculate a Fibonacci number: testdb=# SELECT 12##;?column?----------144(1 row) Note that we defined that the operator will have an integer on the left-hand side. If you try to put a value on the right-hand side of the operator, you will get an error: postgres=# SELECT ##12;ERROR: operator does not exist: ## integer at character 8HINT: No operator matches the given name and argument type(s). Youmight need to add explicit type casts.STATEMENT: select ##12;ERROR: operator does not exist: ## integerLINE 1: select ##12;^HINT: No operator matches the given name and argument type(s). Youmight need to add explicit type casts. Overloading an operator Operators can be overloaded in the same way as functions. This means, that an operator can have the same name as an existing operator but with a different set of argument types. More than one operator can have the same name, but two operators can't share the same name if they accept the same types and positions of the arguments. As long as there is a function that accepts the same kind and number of arguments that an operator defines, it can be overloaded. Let's override the ## operator we defined in the last example, and also add the ability to provide an integer on the right-hand side of the operator: CREATE OPERATOR ## (PROCEDURE=fib, RIGHTARG=integer); Now, running the same SQL, which resulted in an error last time, should succeed, as shown here: testdb=# SELECT ##12;?column?----------144(1 row) You can drop the operator using the DROP OPERATOR command. You can read more about creating and overloading new operators in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/sql-createoperator.html and http://www.postgresql.org/docs/current/static/xoper.html. There are several optional clauses in the operator definition that can optimize the execution time of the operators by providing information about operator behavior. For example, you can specify the commutator and the negator of an operator that help the planner use the operators in index scans. You can read more about these optional clauses at http://www.postgresql.org/docs/current/static/xoper-optimization.html. Since this article is just an introduction to the additional extensibility capabilities of PostgreSQL, we will just introduce a couple of optimization options; any serious production quality operator definitions should include these optimization clauses, if applicable. Optimizing operators The optional clauses tell the PostgreSQL server about how the operators behave. These options can result in considerable speedups in the execution of queries that use the operator. However, if you provide these options incorrectly, it can result in a slowdown of the queries. Let's take a look at two optimization clauses called commutator and negator. COMMUTATOR This clause defines the commuter of the operator. An operator A is a commutator of operator B if it fulfils the following condition: x A y = y B x. It is important to provide this information for the operators that will be used in indexes and joins. As an example, the commutator for > is <, and the commutator of = is = itself. 
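Note that the examples above assume the fib function already exists; it is defined earlier in the book and is not shown in this excerpt. If you want to try the ## operator end to end, the following sketch creates a stand-in iterative Fibonacci function in PL/pgSQL (our own version, not necessarily the book's) along with the operator, driven from Python with psycopg2. The connection settings are placeholders, and re-running the script will fail once the operator exists:

```python
# End-to-end sketch of the ## operator example, driven from Python.
# The fib() body below is our own stand-in (an iterative Fibonacci in PL/pgSQL),
# not necessarily the book's original.
import psycopg2

FIB_FUNCTION = """
CREATE OR REPLACE FUNCTION fib(n integer) RETURNS integer AS $$
DECLARE
    a integer := 0;
    b integer := 1;
    t integer;
BEGIN
    FOR i IN 1..n LOOP
        t := a + b;
        a := b;
        b := t;
    END LOOP;
    RETURN a;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
"""

FIB_OPERATOR = "CREATE OPERATOR ## (PROCEDURE = fib, LEFTARG = integer);"

conn = psycopg2.connect(dbname="testdb", user="postgres", host="localhost")
with conn:
    with conn.cursor() as cur:
        cur.execute(FIB_FUNCTION)
        cur.execute(FIB_OPERATOR)
        cur.execute("SELECT 12 ##;")
        print(cur.fetchone()[0])  # 144
conn.close()
```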
This helps the optimizer to flip the operator in order to use an index. For example, consider the following query: SELECT * FROM employee WHERE new_salary > salary; If the index is defined on the salary column, then PostgreSQL can rewrite the preceding query as shown: SELECT * from employee WHERE salary < new_salary This allows PostgreSQL to use a range scan on the index column salary. For a user-defined operator, the optimizer can only do this flip around if the commutator of a user-defined operator is defined: CREATE OPERATOR > (LEFTARG=integer, RIGHTARG=integer, PROCEDURE=comp, COMMUTATOR = <) NEGATOR The negator clause defines the negator of the operator. For example, <> is a negator of =. Consider the following query: SELECT * FROM employee WHERE NOT (dept = 10); Since <> is defined as a negator of =, the optimizer can simplify the preceding query as follows: SELECT * FROM employee WHERE dept <> 10; You can even verify that using the EXPLAIN command: postgres=# EXPLAIN SELECT * FROM employee WHERE NOTdept = 'WATER MGMNT';QUERY PLAN---------------------------------------------------------Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)Filter: ((dept)::text <> 'WATER MGMNT'::text)Foreign File: /Users/usamadar/testdata.csvForeign File Size: 197(4 rows) Creating index access methods Let's discuss how to index new data types or user-defined types and operators. In PostgreSQL, an index is more of a framework that can be extended or customized for using different strategies. In order to create new index access methods, we have to create an operator class. Let's take a look at a simple example. Let's consider a scenario where you have to store some special data such as an ID or a social security number in the database. The number may contain non-numeric characters, so it is defined as a text type: CREATE TABLE test_ssn (ssn text);INSERT INTO test_ssn VALUES ('222-11-020878');INSERT INTO test_ssn VALUES ('111-11-020978'); Let's assume that the correct order for this data is such that it should be sorted on the last six digits and not the ASCII value of the string. The fact that these numbers need a unique sort order presents a challenge when it comes to indexing the data. This is where PostgreSQL operator classes are useful. An operator allows a user to create a custom indexing strategy. Creating an indexing strategy is about creating your own operators and using them alongside a normal B-tree. Let's start by writing a function that changes the order of digits in the value and also gets rid of the non-numeric characters in the string to be able to compare them better: CREATE OR REPLACE FUNCTION fix_ssn(text)RETURNS text AS $$BEGINRETURN substring($1,8) || replace(substring($1,1,7),'-','');END;$$LANGUAGE 'plpgsql' IMMUTABLE; Let's run the function and verify that it works: testdb=# SELECT fix_ssn(ssn) FROM test_ssn;fix_ssn-------------0208782221102097811111(2 rows) Before an index can be used with a new strategy, we may have to define some more functions depending on the type of index. 
In our case, we are planning to use a simple B-tree, so we need a comparison function: CREATE OR REPLACE FUNCTION ssn_compareTo(text, text)RETURNS int AS$$BEGINIF fix_ssn($1) < fix_ssn($2)THENRETURN -1;ELSIF fix_ssn($1) > fix_ssn($2)THENRETURN +1;ELSERETURN 0;END IF;END;$$ LANGUAGE 'plpgsql' IMMUTABLE; It's now time to create our operator class: CREATE OPERATOR CLASS ssn_opsFOR TYPE text USING btreeASOPERATOR 1 < ,OPERATOR 2 <= ,OPERATOR 3 = ,OPERATOR 4 >= ,OPERATOR 5 > ,FUNCTION 1 ssn_compareTo(text, text); You can also overload the comparison operators if you need to compare the values in a special way, and use the functions in the compareTo function as well as provide them in the CREATE OPERATOR CLASS command. We will now create our first index using our brand new operator class: CREATE INDEX idx_ssn ON test_ssn (ssn ssn_ops); We can check whether the optimizer is willing to use our special index, as follows: testdb=# SET enable_seqscan=off;testdb=# EXPLAIN SELECT * FROM test_ssn WHERE ssn = '02087822211';QUERY PLAN------------------------------------------------------------------Index Only Scan using idx_ssn on test_ssn (cost=0.13..8.14 rows=1width=32)Index Cond: (ssn = '02087822211'::text)(2 rows) Therefore, we can confirm that the optimizer is able to use our new index. You can read about index access methods in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xindex.html. Creating user-defined aggregates User-defined aggregate functions are probably a unique PostgreSQL feature, yet they are quite obscure and perhaps not many people know how to create them. However, once you are able to create this function, you will wonder how you have lived for so long without using this feature. This functionality can be incredibly useful, because it allows you to perform custom aggregates inside the database, instead of querying all the data from the client and doing a custom aggregate in your application code, that is, the number of hits on your website per minute from a specific country. PostgreSQL has a very simple process for defining aggregates. Aggregates can be defined using any functions and in any languages that are installed in the database. Here are the basic steps to building an aggregate function in PostgreSQL: Define a start function that will take in the values of a result set; this function can be defined in any PL language you want. Define an end function that will do something with the final output of the start function. This can be in any PL language you want. Define the aggregate using the CREATE AGGREGATE command, providing the start and end functions you just created. Let's steal an example from the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Aggregate_Median. In this example, we will calculate the statistical median of a set of data. For this purpose, we will define start and end aggregate functions. Let's define the end function first, which takes an array as a parameter and calculates the median. 
We are assuming here that our start function will pass an array to the following end function: CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$WITH q AS(SELECT valFROM unnest($1) valWHERE VAL IS NOT NULLORDER BY 1),cnt AS(SELECT COUNT(*) AS c FROM q)SELECT AVG(val)::float8FROM(SELECT val FROM qLIMIT 2 - MOD((SELECT c FROM cnt), 2)OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)) q2;$$ LANGUAGE sql IMMUTABLE; Now, we create the aggregate as shown in the following code: CREATE AGGREGATE median(anyelement) (SFUNC=array_append,STYPE=anyarray,FINALFUNC=_final_median,INITCOND='{}'); The array_append start function is already defined in PostgreSQL. This function appends an element to the end of an array. In our example, the start function takes all the column values and creates an intermediate array. This array is passed on to the end function, which calculates the median. Now, let's create a table and some test data to run our function: testdb=# CREATE TABLE median_test(t integer);CREATE TABLEtestdb=# INSERT INTO median_test SELECT generate_series(1,10);INSERT 0 10 The generate_series function is a set returning function that generates a series of values, from start to stop with a step size of one. Now, we are all set to test the function: testdb=# SELECT median(t) FROM median_test;median--------5.5(1 row) The mechanics of the preceding example are quite easy to understand. When you run the aggregate, the start function is used to append all the table data from column t into an array using the append_array PostgreSQL built-in. This array is passed on to the final function, _final_median, which calculates the median of the array and returns the result in the same data type as the input parameter. This process is done transparently to the user of the function who simply has a convenient aggregate function available to them. You can read more about the user-defined aggregates in the PostgreSQL documentation in much more detail at http://www.postgresql.org/docs/current/static/xaggr.html. Using foreign data wrappers PostgreSQL foreign data wrappers (FDW) are an implementation of SQL Management of External Data (SQL/MED), which is a standard added to SQL in 2013. FDWs are drivers that allow PostgreSQL database users to read and write data to other external data sources, such as other relational databases, NoSQL data sources, files, JSON, LDAP, and even Twitter. You can query the foreign data sources using SQL and create joins across different systems or even across different data sources. There are several different types of data wrappers developed by different developers and not all of them are production quality. You can see a select list of wrappers on the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Foreign_data_wrappers. Another list of FDWs can be found on PGXN at http://pgxn.org/tag/fdw/. Let's take look at a small example of using file_fdw to access data in a CSV file. First, you need to install the file_fdw extension. If you compiled PostgreSQL from the source, you will need to install the file_fdw contrib module that is distributed with the source. You can do this by going into the contrib/file_fdw folder and running make and make install. If you used an installer or a package for your platform, this module might have been installed automatically. 
Once the file_fdw module is installed, you will need to create the extension in the database: postgres=# CREATE EXTENSION file_fdw;CREATE EXTENSION Let's now create a sample CSV file that uses the pipe, |, as a separator and contains some employee data: $ cat testdata.csvAARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEETMANAGEMNT|77280.00|70174.00 Now, we should create a foreign server that is pretty much a formality because the file is on the same server. A foreign server normally contains the connection information that a foreign data wrapper uses to access an external data resource. The server needs to be unique within the database: CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw; The next step, is to create a foreign table that encapsulates our CSV file: CREATE FOREIGN TABLE employee (emp_name VARCHAR,job_title VARCHAR,dept VARCHAR,salary NUMERIC,sal_after_tax NUMERIC) SERVER file_serverOPTIONS (format 'csv',header 'false' , filename '/home/pgbook/14/testdata.csv', delimiter '|', null '');''); The CREATE FOREIGN TABLE command creates a foreign table and the specifications of the file are provided in the OPTIONS section of the preceding code. You can provide the format, and if the first line of the file is a header (header 'false'), in our case there is no file header. We then provide the name and path of the file and the delimiter used in the file, which in our case is the pipe symbol |. In this example, we also specify that the null values should be represented as an empty string. Let's run a SQL command on our foreign table: postgres=# select * from employee;-[ RECORD 1 ]-+-------------------------emp_name | AARON, ELVIA Jjob_title | WATER RATE TAKERdept | WATER MGMNTsalary | 81000.00sal_after_tax | 73862.00-[ RECORD 2 ]-+-------------------------emp_name | AARON, JEFFERY Mjob_title | POLICE OFFICERdept | POLICEsalary | 74628.00sal_after_tax | 74628.00-[ RECORD 3 ]-+-------------------------emp_name | AARON, KIMBERLEI Rjob_title | CHIEF CONTRACT EXPEDITERdept | FLEET MANAGEMNTsalary | 77280.00sal_after_tax | 70174.00 Great, looks like our data is successfully loaded from the file. You can also use the d meta command to see the structure of the employee table: postgres=# d employee;Foreign table "public.employee"Column | Type | Modifiers | FDW Options---------------+-------------------+-----------+-------------emp_name | character varying | |job_title | character varying | |dept | character varying | |salary | numeric | |sal_after_tax | numeric | |Server: file_serverFDW Options: (format 'csv', header 'false',filename '/home/pg_book/14/testdata.csv', delimiter '|',"null" '') You can run explain on the query to understand what is going on when you run a query on the foreign table: postgres=# EXPLAIN SELECT * FROM employee WHERE salary > 5000;QUERY PLAN---------------------------------------------------------Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)Filter: (salary > 5000::numeric)Foreign File: /home/pgbook/14/testdata.csvForeign File Size: 197(4 rows) The ALTER FOREIGN TABLE command can be used to modify the options. More information about the file_fdw is available at http://www.postgresql.org/docs/current/static/file-fdw.html. You can take a look at the CREATE SERVER and CREATE FOREIGN TABLE commands in the PostgreSQL documentation for more information on the many options available. 
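If you prefer to script the whole experiment, the following sketch generates the pipe-delimited file with Python's csv module and then queries the employee foreign table through psycopg2, just like an ordinary table. It assumes the file_server and employee objects created above, that the path matches the filename option of the foreign table, and that the connection settings are adjusted for your setup:

```python
# Sketch: write the pipe-delimited test file with Python's csv module and
# query the employee foreign table like an ordinary table.
import csv
import psycopg2

rows = [
    ["AARON, ELVIA J", "WATER RATE TAKER", "WATER MGMNT", "81000.00", "73862.00"],
    ["AARON, JEFFERY M", "POLICE OFFICER", "POLICE", "74628.00", "74628.00"],
]

with open("/home/pgbook/14/testdata.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")  # pipe-separated, no header row
    writer.writerows(rows)

conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
with conn:
    with conn.cursor() as cur:
        # The CSV file is read at query time, so new rows show up immediately
        cur.execute("SELECT emp_name, salary FROM employee WHERE salary > %s",
                    (75000,))
        for emp_name, salary in cur.fetchall():
            print(emp_name, salary)
conn.close()
```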
Each of the foreign data wrappers comes with its own documentation about how to use the wrapper. Make sure that an extension is stable enough before it is used in production. The PostgreSQL core development group does not support most of the FDW extensions. If you want to create your own data wrappers, you can find the documentation at http://www.postgresql.org/docs/current/static/fdwhandler.html, which is an excellent starting point. The best way to learn, however, is to read the code of other available extensions.
Summary
In this article, we saw how PostgreSQL can be extended without touching its core. This includes the ability to add new operators and new index access methods, and to create your own aggregates. You can access foreign data sources, such as other databases, files, and web services, using PostgreSQL foreign data wrappers. These wrappers are provided as extensions and should be used with caution, as most of them are not officially supported. Even though PostgreSQL is very extensible, you can't plug in a new storage engine or change the parser/planner and executor interfaces. These components are very tightly coupled with each other and are, therefore, highly optimized and mature.
Resources for Article:
Further resources on this subject: Load balancing MSSQL [Article] Advanced SOQL Statements [Article] Running a PostgreSQL Database Server [Article]

article-image-quick-start-guide-flume
Packt
02 Mar 2015
15 min read
Save for later

A Quick Start Guide to Flume

Packt
02 Mar 2015
15 min read
In this article by Steve Hoffman, the author of the book, Apache Flume: Distributed Log Collection for Hadoop - Second Edition, we will learn about the basics that are required to be known before we start working with Apache Flume. This article will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume. (For more resources related to this topic, see here.) Downloading Flume Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell. The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution. Go ahead, download and verify the binary distribution to save us some time in getting started. Flume in Hadoop distributions Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different. There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, ease installation as well as provide startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle. There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, such as any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version). So, things move slower in a Hadoop distribution world. You can see that as good or bad. 
Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition. If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework. Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links: Cloudera: http://cloudera.com/ Hortonworks: http://hortonworks.com/ MapR: http://mapr.com/ An overview of the Flume configuration file Now that we've downloaded Flume, let's spend some time going over how to configure an agent. A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent. By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps). If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up. Each agent is configured, starting with three parameters: agent.sources=<list of sources>agent.channels=<list of channels>agent.sinks=<list of sinks> Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be: agent.channels.access.type=memory Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this: agent.channels.access.type=memoryagent.channels.access.capacity=200 Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it: agent.channels=access Let's look at a complete example using the canonical "Hello, World!" example. 
Starting up with "Hello, World!" No technical article would be complete without a "Hello, World!" example. Here is the configuration file we'll be using: agent.sources=s1agent.channels=c1agent.sinks=k agent.sources.s1.type=netcatagent.sources.s1.channels=c1agent.sources.s1.bind=0.0.0.0agent.sources.s1.port=1234 agent.channels.c1.type=memory agent.sinks.k1.type=loggeragent.sinks.k1.channel=c1 Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1. The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example. The channel named c1 is a memory channel with a default configuration. The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel. Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event. First, explode the .tar archive of the binary distribution we downloaded earlier: $ tar -zxf apache-flume-1.5.2-bin.tar.gz$ cd apache-flume-1.5.2-bin Next, let's briefly look at the help. Run the flume-ng command with the help command: $ ./bin/flume-ng helpUsage: ./bin/flume-ng <command> [options]... commands:help                 display this help textagent                run a Flume agentavro-client           run an avro Flume clientversion               show Flume version info global options:--conf,-c <conf>     use configs in <conf> directory--classpath,-C <cp>   append to the classpath--dryrun,-d          do not actually start Flume, just print the command--plugins-path <dirs> colon-separated list of plugins.d directories. See the                       plugins.d section in the user guide for more details.                       Default: $FLUME_HOME/plugins.d-Dproperty=value     sets a Java system property value-Xproperty=value     sets a Java -X option agent options:--conf-file,-f <file> specify a config file (required)--name,-n <name>     the name of this agent (required)--help,-h             display help text avro-client options:--rpcProps,-P <file>   RPC client properties file with server connection params--host,-H <host>       hostname to which events will be sent--port,-p <port>       port of the avro source--dirname <dir>       directory to stream to avro source--filename,-F <file>   text file to stream to avro source (default: std input)--headerFile,-R <file> File containing event headers as key/value pairs on each new line--help,-h             display help text Either --rpcProps or both --host and --port must be specified. Note that if <conf> directory is specified, then it is always included first in the classpath. As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. 
The use of avro-client will be covered later. The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents). Let's take our sample configuration and open an editor (vi in my case, but use whatever you like): $ vi conf/hw.conf Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell. Now you can start the agent: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender. If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the class path. If you left the -c parameter off the command, you'll see this error: Warning: No configuration directory set! Use --conf <dir> to override.log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).log4j:WARN Please initialize the log4j system properly.log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info But you didn't do that so you should see these key log lines: 2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration foragents: [agent] This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file. 2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloadingconfiguration file:conf/hw.conf This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file. 2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)]Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }}channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} } Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. 
In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution. 2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] CreatedserverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345] Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it. Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event: % nc localhost 12345Hello WorldOK The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following: 2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64Hello World } This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message). If I send the following line and then press the Enter key, you'll get an OK message: The quick brown fox jumped over the lazy dog. You'll see this in the agent's log: 2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)]Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20The quick brown } The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem. Summary In this article, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out. Resources for Article: Further resources on this subject: About Cassandra [article] Introducing Kafka [article] Transformation [article]
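As a small follow-up to the walkthrough above, the netcat source will accept events from anything that can open a TCP socket, not just the nc utility. Here is a minimal Python sketch that sends a couple of newline-terminated events to the running agent and prints the OK acknowledgments; it assumes the agent from this article is still listening on localhost:12345:

```python
# Minimal alternative to the nc client used above: send newline-terminated
# events to the running agent's netcat source and read its OK responses.
import socket

events = ["Hello World", "The quick brown fox jumped over the lazy dog."]

with socket.create_connection(("localhost", 12345)) as sock:
    for event in events:
        sock.sendall((event + "\n").encode("utf-8"))
        # The netcat source acknowledges each accepted line with "OK"
        ack = sock.recv(16).decode("utf-8").strip()
        print(event, "->", ack)
```

Each line sent becomes one Flume event, exactly as it did with nc, and you should see the corresponding logger sink output in the agent's log.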
article-image-postgres-add
Packt
27 Feb 2015
7 min read
Save for later

Postgres Add-on

Packt
27 Feb 2015
7 min read
In this article by Patrick Espake, author of the book Learning Heroku Postgres, you will learn how to install and set up PostgreSQL and how to create an app using Postgres. (For more resources related to this topic, see here.) Local setup You need to install PostgreSQL on your computer; this installation is recommended because some commands of the Postgres add-on require PostgreSQL to be installed. Besides that, it's a good idea for your development database to be similar to your production database; this avoids problems between these environments. Next, you will learn how to set up PostgreSQL on Mac OS X, Windows, and Linux. In addition to pgAdmin, this is the most popular and rich feature in PostgreSQL's administration and development platform. The versions recommended for installation are PostgreSQL 9.4.0 and pgAdmin 1.20.0, or the latest available versions. Setting up PostgreSQL on Mac OS X The Postgres.app application is the simplest way to get started with PostgreSQL on Mac OS X, it contains many features in a single installation package: PostgreSQL 9.4.0 PostGIS 2.1.4 Procedural languages: PL/pgSQL, PL/Perl, PL/Python, and PLV8 (JavaScript) Popular extensions such as hstore, uuid-ossp, and others Many command-line tools for managing PostgreSQL and convenient tools for GIS The following screenshot displays the postgresapp website: For installation, visit the address http://postgresapp.com/, carry out the appropriate download, drag it to the applications directory, and then double-click to open. The other alternatives for installing PostgreSQL are to use the default graphic installer, Fink, MacPorts, or Homebrew. All of them are available at http://www.postgresql.org/download/macosx. To install pgAdmin, you should visit http://www.pgadmin.org/download/macosx.php, download the latest available version, and follow the installer instructions. Setting up PostgreSQL on Windows PostgreSQL on Windows is provided using a graphical installer that includes the PostgreSQL server, pgAdmin, and the package manager that is used to download and install additional applications and drivers for PostgreSQL. To install PostgreSQL, visit http://www.postgresql.org/download/windows, click on the download link, and select the the appropriate Windows version: 32 bit or 64 bit. Follow the instructions provided by the installer. After installing PostgreSQL on Windows, you need to set the PATH environment variable so that the psql, pg_dump and pg_restore commands can work through the Command Prompt. Perform the following steps: Open My Computer. Right-click on My Computer and select Properties. Click on Advanced System Settings. Click on the Environment Variables button. From the System variables box, select the Path variable. Click on Edit. At the end of the line, add the bin directory of PostgreSQL: c:Program FilesPostgreSQL9.4bin;c:Program FilesPostgreSQL9.4lib. Click on the OK button to save. The directory follows the pattern c:Program FilesPostgreSQLVERSION..., check your PostgreSQL version. Setting up PostgreSQL on Linux The great majority of Linux distributions already have PostgreSQL in their package manager. You can search the appropriate package for your distribution and install it. 
If your distribution is Debian or Ubuntu, you can install it with the following command:

$ sudo apt-get install postgresql

If your Linux distribution is Fedora, Red Hat, CentOS, Scientific Linux, or Oracle Enterprise Linux, you can use the YUM package manager to install PostgreSQL:

$ sudo yum install postgresql94-server
$ sudo service postgresql-9.4 initdb
$ sudo chkconfig postgresql-9.4 on
$ sudo service postgresql-9.4 start

If your Linux distribution doesn't have PostgreSQL in its package manager, you can install it using the Linux installer. Just visit http://www.postgresql.org/download/linux, choose the appropriate installer, 32 bit or 64 bit, and follow the installation instructions.

You can install pgAdmin through the package manager of your Linux distribution; for Debian or Ubuntu you can use the following command:

$ sudo apt-get install pgadmin3

For Linux distributions that use the YUM package manager, you can install it through the following command:

$ sudo yum install pgadmin3

If your Linux distribution doesn't have pgAdmin in its package manager, you can download and install it following the instructions provided at http://www.pgadmin.org/download/.

Creating a local database

For the examples in this article, you will need to have a local database created. You will create a new database called my_local_database through pgAdmin. To create the new database, perform the following steps:

Open pgAdmin.
Connect to the database server through the access credentials that you chose in the installation process.
Click on the Databases item in the tree view.
Click on the menu Edit -> New Object -> New database.
Type the name my_local_database for the database.
Click on the OK button to save.

Creating a new local database called my_local_database

Creating a new app

Many features in Heroku can be implemented in two different ways; the first is via the Heroku client, which is installed through the Heroku Toolbelt, and the other is through the web Heroku dashboard. In this section, you will learn how to use both of them.

Via the Heroku dashboard

Access the website https://dashboard.heroku.com and log in. After that, click on the plus sign at the top of the dashboard to create a new app and the following screen will be shown:

Creating an app

In this step, you should provide the name of your application. In the preceding example, it's learning-heroku-postgres-app. You can choose a name you prefer. Select which region you want to host it on; two options are available: United States or Europe. Heroku doesn't allow duplicated names for applications; each application name supplied is global and, after it has been used once, it will not be available for another person. It can happen that you choose a name that is already being used; in this case, you should choose another name. Choose the best option for you; it is usually recommended that you select the region closest to you to decrease server response time. Click on the Create App button.

Then Heroku will provide some information to perform the first deploy of your application. The website URL and Git repository are created using the following addresses: http://your-app-name.herokuapp.com and git@heroku.com/your-app-name.git.

learning-heroku-postgres-app created
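If you prefer the terminal, you can also confirm that the app was created before wiring up your local repository — a quick aside (not part of the original steps), assuming the Heroku Toolbelt is installed and you are logged in:

$ heroku apps:info --app your-app-name

This should print the application's web URL and Git URL, matching the values shown on the dashboard.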
Next, you will create a directory on your computer and link it with Heroku to perform future deployments of your source code. Open your terminal and type the following commands:

$ mkdir your-app-name
$ cd your-app-name
$ git init
$ heroku git:remote -a your-app-name
Git remote heroku added

Finally, you are able to deploy your source code at any time through these commands:

$ git add .
$ git commit -am "My updates"
$ git push heroku master

Via the Heroku client

Creating a new application via the Heroku client is very simple. The first step is to create the application directory on your computer. For that, open the terminal and type the following commands:

$ mkdir your-app-name
$ cd your-app-name
$ git init

After that you need to create a new Heroku application through the command:

$ heroku apps:create your-app-name
Creating your-app-name... done, stack is cedar-14
https://your-app-name.herokuapp.com/ | https://git.heroku.com/your-app-name.git
Git remote heroku added

Finally, you are able to deploy your source code at any time through these commands:

$ git add .
$ git commit -am "My updates"
$ git push heroku master

Another very common case is when you already have a Git repository on your computer with the application's source code and you want to deploy it on Heroku. In this case, you must run the heroku apps:create your-app-name command inside the application directory and the link with Heroku will be created.

Summary

In this article, you learned how to configure your local environment to work with PostgreSQL and pgAdmin. Besides that, you also saw how to install Heroku Postgres in your application, and that the first database is created automatically when the Heroku Postgres add-on is installed; an application can have several PostgreSQL databases as well. You also learned that the great majority of tasks can be performed in two ways: via the Heroku client and via the Heroku dashboard.

Resources for Article: Further resources on this subject: Building Mobile Apps [article] Managing Heroku from the Command Line [article] Securing the WAL Stream [article]

Getting Up and Running with Cassandra

Packt
27 Feb 2015
20 min read
As an application developer, you have almost certainly worked with databases extensively. You must have built products using relational databases like MySQL and PostgreSQL, and perhaps experimented with a document store like MongoDB or a key-value database like Redis. While each of these tools has its strengths, you will now consider whether a distributed database like Cassandra might be the best choice for the task at hand.

In this article by Mat Brown, author of the book Learning Apache Cassandra, we'll talk about the major reasons to choose Cassandra from among the many database options available to you. Having established that Cassandra is a great choice, we'll go through the nuts and bolts of getting a local Cassandra installation up and running. By the end of this article, you'll know:

When and why Cassandra is a good choice for your application
How to install Cassandra on your development machine
How to interact with Cassandra using cqlsh
How to create a keyspace

(For more resources related to this topic, see here.)

What Cassandra offers, and what it doesn't

Cassandra is a fully distributed, masterless database, offering superior scalability and fault tolerance to traditional single-master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category.

Horizontal scalability

Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn't available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.

Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs. The result is that Cassandra deployments have an almost limitless capacity to store and process data; when additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share.

Cassandra is one of the several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.

High availability

The simplest database deployments are run as a single instance on a single server.
This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.

A master-follower architecture improves this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability.

Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means in a typical configuration, two nodes must fail simultaneously for there to be any application-visible interruption in Cassandra's availability.

How many copies? When you create a keyspace - Cassandra's version of a database - you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common and good choice for many use cases.

Write optimization

Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.

Cassandra, on the other hand, is highly optimized for write throughput, and in fact never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra, and storing data in Cassandra, are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.

Because Cassandra is optimized for write volume, you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates. Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem.

Structured records

The first three database features we looked at are commonly found in distributed data stores. However, databases like Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that's stored at a particular key.
This means useful functions like updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible. Relational databases like PostgreSQL, document stores like MongoDB, and, to a limited extent, newer key-value stores like Redis do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture. In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records. Secondary indexes A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability: for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's a powerful feature in the right circumstances. Efficient result ordering It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index. In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way, and to retrieve ranges of records based on this ordering, is an unusually powerful capability for a distributed database. Immediate consistency When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it's a property of most common single-master databases like MySQL and PostgreSQL. Distributed systems like Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct tradeoff with high availability. In the case of Cassandra, that tradeoff is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. 
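To make the tradeoff concrete, here is a brief sketch (not from the original text) of how consistency is tuned per operation in the cqlsh shell; the table and data are hypothetical:

CONSISTENCY QUORUM
SELECT * FROM user_status_updates WHERE username = 'alice';

CONSISTENCY ONE
SELECT * FROM user_status_updates WHERE username = 'alice';

The first read waits for a majority of replicas to respond before returning, while the second is satisfied by a single replica; the drivers expose the same per-statement consistency setting.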
Discretely writable collections While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value like a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format like JSON, and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other. For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to, and removed from collections, without reading and rewriting the entire collection. Cassandra is no exception, offering list, set, and map collections, and supporting operations like "append the number 3 to the end of this list". Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient. Relational joins In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit, for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database, and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database. For data sets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but offer more flexibility. MapReduce MapReduce is a technique for performing aggregate processing on large amounts of data in parallel; it's a particularly common technique in data analytics applications. Cassandra does not offer built-in MapReduce capabilities, but it can be integrated with Hadoop in order to perform MapReduce operations across Cassandra data sets, or Spark for real-time data analysis. The DataStax Enterprise product provides integration with both of these tools out-of-the-box. Comparing Cassandra to the alternatives Now that you've got an in-depth understanding of the feature set that Cassandra offers, it's time to figure out which features are most important to you, and which database is the best fit. 
The following table lists a handful of commonly used databases, and key features that they do or don't have:

Feature                           Cassandra                     PostgreSQL   MongoDB   Redis     Riak
Structured records                Yes                           Yes          Yes       Limited   No
Secondary indexes                 Yes                           Yes          Yes       No        Yes
Discretely writable collections   Yes                           Yes          Yes       Yes       No
Relational joins                  No                            Yes          No        No        No
Built-in MapReduce                No                            No           Yes       No        Yes
Fast result ordering              Yes                           Yes          Yes       Yes       No
Immediate consistency             Configurable at query level   Yes          Yes       Yes       Configurable at cluster level
Transparent sharding              Yes                           No           Yes       No        Yes
No single point of failure        Yes                           No           No        No        Yes
High throughput writes            Yes                           No           No        Yes       Yes

As you can see, Cassandra offers a unique combination of scalability, availability, and a rich set of features for modeling and accessing data.

Installing Cassandra

Now that you're acquainted with Cassandra's substantial powers, you're no doubt chomping at the bit to try it out. Happily, Cassandra is free, open source, and very easy to get running.

Installing on Mac OS X

First, we need to make sure that we have an up-to-date installation of the Java Runtime Environment. Open the Terminal application, and type the following into the command prompt:

$ java -version

You will see an output that looks something like the following:

java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Pay particular attention to the java version listed: if it's lower than 1.7.0_25, you'll need to install a new version. If you have an older version of Java or if Java isn't installed at all, head to https://www.java.com/en/download/mac_download.jsp and follow the download instructions on the page.

You'll need to set up your environment so that Cassandra knows where to find the latest version of Java. To do this, set up your JAVA_HOME environment variable to the install location, and your PATH to include the executable in your new Java installation as follows:

$ export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home"
$ export PATH="$JAVA_HOME/bin":$PATH

You should put these two lines at the bottom of your .bashrc file to ensure that things still work when you open a new terminal. The installation instructions given earlier assume that you're using the latest version of Mac OS X (at the time of writing this, 10.10 Yosemite). If you're running a different version of OS X, installing Java might require different steps. Check out https://www.java.com/en/download/faq/java_mac.xml for detailed installation information.

Once you've got the right version of Java, you're ready to install Cassandra. It's very easy to install Cassandra using Homebrew; simply type the following:

$ brew install cassandra
$ pip install cassandra-driver cql
$ cassandra

Here's what we just did:

Installed Cassandra using the Homebrew package manager
Installed the CQL shell and its dependency, the Python Cassandra driver
Started the Cassandra server
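At this point, a quick sanity check can be reassuring — an aside rather than part of the original steps, assuming the Cassandra command-line tools installed above are on your PATH:

$ nodetool status

A single local node reporting a status of UN (Up/Normal) means the server started correctly; the same check works on the other platforms covered next.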
Installing on Ubuntu

First, we need to make sure that we have an up-to-date installation of the Java Runtime Environment. Open the Terminal application, and type the following into the command prompt:

$ java -version

You will see an output that looks similar to the following:

java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-bit Server VM (build 24.65-b04, mixed mode)

Pay particular attention to the java version listed: it should start with 1.7. If you have an older version of Java, or if Java isn't installed at all, you can install the correct version using the following command:

$ sudo apt-get install openjdk-7-jre-headless

Once you've got the right version of Java, you're ready to install Cassandra. First, you need to add Apache's Debian repositories to your sources list. Add the following lines to the file /etc/apt/sources.list:

deb http://www.apache.org/dist/cassandra/debian 21x main
deb-src http://www.apache.org/dist/cassandra/debian 21x main

In the Terminal application, type the following into the command prompt:

$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
$ gpg --export --armor 2B5C1B00 | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 0353B12C
$ gpg --export --armor 0353B12C | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install cassandra
$ cassandra

Here's what we just did:

Added the Apache repositories for Cassandra 2.1 to our sources list
Added the public keys for the Apache repo to our system and updated our repository cache
Installed Cassandra
Started the Cassandra server

Installing on Windows

The easiest way to install Cassandra on Windows is to use the DataStax Community Edition. DataStax is a company that provides enterprise-level support for Cassandra; they also release Cassandra packages at both free and paid tiers. DataStax Community Edition is free, and does not differ from the Apache package in any meaningful way. DataStax offers a graphical installer for Cassandra on Windows, which is available for download at planetcassandra.org/cassandra. On this page, locate Windows Server 2008/Windows 7 or Later (32-Bit) from the Operating System menu (you might also want to look for 64-bit if you run a 64-bit version of Windows), and choose MSI Installer (2.x) from the version columns. Download and run the MSI file, and follow the instructions, accepting the defaults. Once the installer completes this task, you should have an installation of Cassandra running on your machine.

Bootstrapping the project

We will build an application called MyStatus, which allows users to post status updates for their friends to read.

CQL – the Cassandra Query Language

Since this is about Cassandra and not targeted to users of any particular programming language or application framework, we will focus entirely on the database interactions that MyStatus will require. Code examples will be in Cassandra Query Language (CQL). Specifically, we'll use version 3.1.1 of CQL, which is available in Cassandra 2.0.6 and later versions. As the name implies, CQL is heavily inspired by SQL; in fact, many CQL statements are equally valid SQL statements. However, CQL and SQL are not interchangeable. CQL lacks a grammar for relational features such as JOIN statements, which are not possible in Cassandra. Conversely, CQL is not a subset of SQL; constructs for retrieving the update time of a given column, or performing an update in a lightweight transaction, which are available in CQL, do not have an SQL equivalent.

Interacting with Cassandra

Most common programming languages have drivers for interacting with Cassandra. When selecting a driver, you should look for libraries that support the CQL binary protocol, which is the latest and most efficient way to communicate with Cassandra. The CQL binary protocol is a relatively new introduction; older versions of Cassandra used the Thrift protocol as a transport layer.
Although Cassandra continues to support Thrift, avoid Thrift-based drivers, as they are less performant than the binary protocol. Here are CQL binary drivers available for some popular programming languages:

Language                Driver                    Available at
Java                    DataStax Java Driver      github.com/datastax/java-driver
Python                  DataStax Python Driver    github.com/datastax/python-driver
Ruby                    DataStax Ruby Driver      github.com/datastax/ruby-driver
C++                     DataStax C++ Driver       github.com/datastax/cpp-driver
C#                      DataStax C# Driver        github.com/datastax/csharp-driver
JavaScript (Node.js)    node-cassandra-cql        github.com/jorgebay/node-cassandra-cql
PHP                     phpbinarycql              github.com/rmcfrazier/phpbinarycql

While you will likely use one of these drivers in your applications, you can simply use the cqlsh tool, which is a command-line interface for executing CQL queries and viewing the results. To start cqlsh on OS X or Linux, simply type cqlsh into your command line; you should see something like this:

$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>

On Windows, you can start cqlsh by finding the Cassandra CQL Shell application in the DataStax Community Edition group in your applications. Once you open it, you should see the same output we just saw.

Creating a keyspace

A keyspace is a collection of related tables, equivalent to a database in a relational system. To create the keyspace for our MyStatus application, issue the following statement in the CQL shell:

CREATE KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy', 'replication_factor': 1
};

Here we created a keyspace called my_status. When we create a keyspace, we have to specify replication options. Cassandra provides several strategies for managing replication of data; SimpleStrategy is the best strategy as long as your Cassandra deployment does not span multiple data centers. The replication_factor value tells Cassandra how many copies of each piece of data are to be kept in the cluster; since we are only running a single instance of Cassandra, there is no point in keeping more than one copy of the data. In a production deployment, you would certainly want a higher replication factor; 3 is a good place to start.

A few things at this point are worth noting about CQL's syntax:

It's syntactically very similar to SQL; as we further explore CQL, the impression of similarity will not diminish.
Double quotes are used for identifiers such as keyspace, table, and column names. As in SQL, quoting identifier names is usually optional, unless the identifier is a keyword or contains a space or another character that will trip up the parser.
Single quotes are used for string literals; the key-value structure we use for replication is a map literal, which is syntactically similar to an object literal in JSON.
As in SQL, CQL statements in the CQL shell must terminate with a semicolon.

Selecting a keyspace

Once you've created a keyspace, you would want to use it. In order to do this, employ the USE command:

USE "my_status";

This tells Cassandra that all future commands will implicitly refer to tables inside the my_status keyspace. If you close the CQL shell and reopen it, you'll need to reissue this command.

Summary

In this article, you explored the reasons to choose Cassandra from among the many databases available, and having determined that Cassandra is a great choice, you installed it on your development machine.
You had your first taste of the Cassandra Query Language when you issued your first command via the CQL shell in order to create a keyspace. You're now poised to begin working with Cassandra in earnest. Resources for Article: Further resources on this subject: Getting Started with Apache Cassandra [article] Basic Concepts and Architecture of Cassandra [article] About Cassandra [article]

Agile data modeling with Neo4j

Packt
25 Feb 2015
18 min read
In this article by Sumit Gupta, author of the book Neo4j Essentials, we will discuss data modeling in Neo4j, which is evolving and flexible enough to adapt to changing business requirements. It captures the new data sources, entities, and their relationships as they naturally occur, allowing the database to easily adapt to the changes, which in turn results in extremely agile development and provides quick responsiveness to changing business requirements. (For more resources related to this topic, see here.)

Data modeling is a multistep process and involves the following steps:

Define requirements or goals in the form of questions that need to be executed and answered by the domain-specific model.
Once we have our goals ready, we can dig deep into the data and identify entities and their associations/relationships.
Now, as we have our graph structure ready, the next step is to form patterns from our initial questions/goals and execute them against the graph model.

This whole process is applied in an iterative and incremental manner, similar to what we do in agile, and has to be repeated again whenever we change our goals or add new goals/questions, which need to be answered by your graph model.

Let's see in detail how data is organized/structured and implemented in Neo4j to bring in the agility of graph models. Based on the principles of graph data structure available at http://en.wikipedia.org/wiki/Graph_(abstract_data_type), Neo4j implements the property graph data model at storage level, which is efficient, flexible, adaptive, and capable of effectively storing/representing any kind of data in the form of nodes, properties, and relationships. Neo4j not only implements the property graph model, but has also evolved the traditional model and added the feature of tagging nodes with labels, which is now referred to as the labeled property graph.

Essentially, in Neo4j, everything needs to be defined in either of the following forms:

Nodes: A node is the fundamental unit of a graph, which can also be viewed as an entity, but based on a domain model, it can be something else too.
Relationships: These define the connections between two nodes. Relationships also have types, which further differentiate relationships from one another.
Properties: Properties are viewed as attributes and do not have their own existence. They are related either to nodes or to relationships. Nodes and relationships can both have their own set of properties.
Labels: Labels are used to group nodes into sets. Nodes that are labeled with the same label belong to the same set, which can be further used to create indexes for faster retrieval, to mark temporary states of certain nodes, and for many more purposes based on the domain model.

Let's see how all four of these forms are related to each other and represented within Neo4j. A graph essentially consists of nodes, which can also have properties. Nodes are linked to other nodes. The link between two nodes is known as a relationship, which also can have properties. Nodes can also have labels, which are used for grouping the nodes.

Let's take up a use case to understand data modeling in Neo4j. John is a male and his age is 24. He is married to a female named Mary whose age is 20. John and Mary got married in 2012. Now, let's develop the data model for the preceding use case in Neo4j:

John and Mary are two different nodes.
Marriage is the relationship between John and Mary.
Age of John, age of Mary, and the year of their marriage become the properties.
Male and Female become the labels.
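Expressed in Cypher, this tiny model might look like the following — a minimal sketch (not from the original text), with illustrative property and relationship names:

CREATE (john:Male {Name : 'John', Age : 24}),
       (mary:Female {Name : 'Mary', Age : 20}),
       (john)-[:MARRIED_TO {Year : 2012}]->(mary);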
Easy, simple, flexible, and natural… isn't it? The data structure in Neo4j is adaptive and can effectively model everything that is not fixed and evolves over a period of time.

The next step in data modeling is fetching the data from the data model, which is done through traversals. Traversals are another important aspect of graphs, where you need to follow paths within the graph, starting from a given node and then following its relationships with other nodes. Neo4j provides two kinds of traversals: breadth first, described at http://en.wikipedia.org/wiki/Breadth-first_search, and depth first, described at http://en.wikipedia.org/wiki/Depth-first_search.

If you are from the RDBMS world, then you must now be wondering, "What about the schema?" and you will be surprised to know that Neo4j is a schemaless or schema-optional graph database. We do not have to define the schema unless we are at a stage where we want to provide some structure to our data for performance gains. Once performance becomes a focus area, then you can define a schema and create indexes/constraints/rules over data. Unlike the traditional models where we freeze requirements and then draw our models, Neo4j embraces data modeling in an agile way so that it can be evolved over a period of time and is highly responsive to dynamic and changing business requirements.

Read-only Cypher queries

In this section, we will discuss one of the most important aspects of Neo4j, that is, read-only Cypher queries. Read-only Cypher queries are not only the core component of Cypher but also help us in exploring and leveraging various patterns and pattern matching constructs. A read-only query begins with MATCH, OPTIONAL MATCH, or START, which can be used in conjunction with the WHERE clause, can be further followed by WITH, and ends with RETURN. Constructs such as ORDER BY, SKIP, and LIMIT can also be used with WITH and RETURN. We will discuss read-only constructs in detail, but before that, let's create a sample dataset and then we will discuss the constructs/syntax of read-only Cypher queries with illustrations.

Creating a sample dataset – movie dataset

Let's perform the following steps to clean up our Neo4j database and insert some data which will help us in exploring various constructs of Cypher queries:

Open your Command Prompt or Linux shell and open the Neo4j shell by typing <$NEO4J_HOME>/bin/neo4j-shell.
Execute the following commands on your Neo4j shell for cleaning all the previous data:

//Delete all relationships between Nodes
MATCH ()-[r]-() delete r;

//Delete all Nodes
MATCH (n) delete n;
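As a quick check (an aside, not part of the original steps), you can confirm that the database is now empty before loading the sample data:

MATCH (n) RETURN count(n);

This should return 0.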
Now we will create a sample dataset, which will contain movies, artists, directors, and their associations. Execute the following set of Cypher queries in your Neo4j shell to create the list of movies:

CREATE (:Movie {Title : 'Rocky', Year : '1976'});
CREATE (:Movie {Title : 'Rocky II', Year : '1979'});
CREATE (:Movie {Title : 'Rocky III', Year : '1982'});
CREATE (:Movie {Title : 'Rocky IV', Year : '1985'});
CREATE (:Movie {Title : 'Rocky V', Year : '1990'});
CREATE (:Movie {Title : 'The Expendables', Year : '2010'});
CREATE (:Movie {Title : 'The Expendables II', Year : '2012'});
CREATE (:Movie {Title : 'The Karate Kid', Year : '1984'});
CREATE (:Movie {Title : 'The Karate Kid II', Year : '1986'});

Execute the following set of Cypher queries in your Neo4j shell to create the list of artists:

CREATE (:Artist {Name : 'Sylvester Stallone', WorkedAs : ["Actor", "Director"]});
CREATE (:Artist {Name : 'John G. Avildsen', WorkedAs : ["Director"]});
CREATE (:Artist {Name : 'Ralph Macchio', WorkedAs : ["Actor"]});
CREATE (:Artist {Name : 'Simon West', WorkedAs : ["Director"]});

Execute the following set of Cypher queries in your Neo4j shell to create the relationships between artists and movies:

Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky II"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky III"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky IV"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky V"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables"}) CREATE (artist)-[:ACTED_IN {Role : "Barney Ross"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables II"}) CREATE (artist)-[:ACTED_IN {Role : "Barney Ross"}]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky II"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky III"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky IV"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "Rocky"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "Rocky V"}) CREATE (artist)-[:DIRECTED]->(movie);
Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "The Karate Kid"}) CREATE (artist)-[:DIRECTED]->(movie);
Avildsen"}), (movie:Movie {Title: "The Karate Kid II"}) CREATE (artist)-[:DIRECTED]->(movie); Match (artist:Artist {Name : "Ralph Macchio"}), (movie:Movie {Title: "The Karate Kid"}) CREATE (artist)-[:ACTED_IN {Role:"Daniel LaRusso"}]->(movie); Match (artist:Artist {Name : "Ralph Macchio"}), (movie:Movie {Title: "The Karate Kid II"}) CREATE (artist)-[:ACTED_IN {Role:"Daniel LaRusso"}]->(movie); Match (artist:Artist {Name : "Simon West"}), (movie:Movie {Title: "The Expendables II"}) CREATE (artist)-[:DIRECTED]->(movie); Next, browse your data through the Neo4j browser. Click on Get some data from the left navigation pane and then execute the query by clicking on the right arrow sign that will appear on the extreme right corner just below the browser navigation bar, and it should look something like this: Now, let's understand the different pieces of read-only queries and execute those against our movie dataset. Working with the MATCH clause MATCH is the most important clause used to fetch data from the database. It accepts a pattern, which defines "What to search?" and "From where to search?". If the latter is not provided, then Cypher will scan the whole tree and use indexes (if defined) in order to make searching faster and performance more efficient. Working with nodes Let's start asking questions from our movie dataset and then form Cypher queries, execute them on <$NEO4J_HOME>/bin/neo4j-shell against the movie dataset and get the results that will produce answers to our questions: How do we get all nodes and their properties? Answer: MATCH (n) RETURN n; Explanation: We are instructing Cypher to scan the complete database and capture all nodes in a variable n and then return the results, which in our case will be printed on our Neo4j shell. How do we get nodes with specific properties or labels? Answer: Match with label, MATCH (n:Artist) RETURN n; or MATCH (n:Movies) RETURN n; Explanation: We are instructing Cypher to scan the complete database and capture all nodes, which contain the value of label as Artist or Movies. Answer: Match with a specific property MATCH (n:Artist {WorkedAs:["Actor"]}) RETURN n; Explanation: We are instructing Cypher to scan the complete database and capture all nodes that contain the value of label as Artist and the value of property WorkedAs is ["Actor"]. Since we have defined the WorkedAs collection, we need to use square brackets, but in all other cases, we should not use square brackets. We can also return specific columns (similar to SQL). For example, the preceding statement can also be formed as MATCH (n:Artist {WorkedAs:["Actor"]}) RETURN n.name as Name;. Working with relationships Let's understand the process of defining relationships in the form of Cypher queries in the same way as you did in the previous section while working with nodes: How do we get nodes that are associated or have relationships with other nodes? Answer: MATCH (n)-[r]-(n1) RETURN n,r,n1; Explanation: We are instructing Cypher to scan the complete database and capture all nodes, their relationships, and nodes with which they have relationships in variables n, r, and n1 and then further return/print the results on your Neo4j shell. Also, in the preceding query, we have used - and not -> as we do not care about the direction of relationships that we retrieve. How do we get nodes, their associated properties that have some specific type of relationship, or the specific property of a relationship? 
Answer: MATCH (n)-[r:ACTED_IN {Role : "Rocky Balboa"}]->(n1) RETURN n,r,n1; Explanation: We are instructing Cypher to scan the complete database and capture all nodes, their relationships, and nodes, which have a relationship as ACTED_IN and with the property of Role as Rocky Balboa. Also, in the preceding query, we do care about the direction (incoming/outgoing) of a relationship, so we are using ->. For matching multiple relations replace [r:ACTED_IN] with [r:ACTED_IN | DIRECTED] and use single quotes or escape characters wherever there are special characters in the name of relationships. How do we get a coartist? Answer: MATCH (n {Name : "Sylvester Stallone"})-[r]->(x)<-[r1]-(n1) return n.Name as Artist,type(r),x.Title as Movie, type(r1), n1.Name as Artist2; Explanation: We are trying to find out all artists that are related to Sylvester Stallone in some manner or the other. Once you run the preceding query, you will see something like the following image, which should be self-explanatory. Also, see the usage of as and type. as is similar to the SQL construct and is used to define a meaningful name to the column presenting the results, and type is a special keyword that gives the type of relationship between two nodes. How do we get the path and number of hops between two nodes? Answer: MATCH p = (:Movie{Title:"The Karate Kid"})-[:DIRECTED*0..4]-(:Movie{Title:"Rocky V"}) return p; Explanation: Paths are the distance between two nodes. In the preceding statement, we are trying to find out all paths between two nodes, which are between 0 (minimum) and 4 (maximum) hops away from each other and are only connected through the relationship DIRECTED. You can also find the path, originating only from a particular node and re-write your query as MATCH p = (:Movie{Title:"The Karate Kid"})-[:DIRECTED*0..4]-() return p;. Integration of the BI tool – QlikView In this section, we will talk about the integration of the BI tool—QlikView with Neo4j. QlikView is available only on the Windows platform, so this section is only applicable for Windows users. Neo4j as an open source database exposes its core APIs for developers to write plugins and extends its intrinsic capabilities. Neo4j JDBC is one such plugin that enables the integration of Neo4j with various BI / visualization and ETL tools such as QlikView, Jaspersoft, Talend, Hive, Hbase, and many more. Let's perform the following steps for integrating Neo4j with QlikView on Windows: Download, install, and configure the following required software: Download the Neo4j JDBC driver directly from https://github.com/neo4j-contrib/neo4j-jdbc as the source code. You need to compile and create a JAR file or you can also directly download the compiled sources from http://dist.neo4j.org/neo4j-jdbc/neo4j-jdbc-2.0.1-SNAPSHOT-jar-with-dependencies.jar. Depending upon your Windows platform (32 bit or 64 bit), download the QlikView Desktop Personal Edition. In this article, we will be using QlikView Desktop Personal Edition 11.20.12577.0 SR8. Install QlikView and follow the instructions as they appear on your screen. After installation, you will see the QlikView icon in your Windows start menu. QlikView leverages QlikView JDBC Connector for integration with JDBC data sources, so our next step would be to install QlikView JDBC Connector. 
Let's perform the following steps to install and configure the QlikView JDBC Connector:

Download QlikView JDBC Connector either from http://www.tiq-solutions.de/display/enghome/ENJDBC or from https://s3-eu-west-1.amazonaws.com/tiq-solutions/JDBCConnector/JDBCConnector_Setup.zip.
Open the downloaded JDBCConnector_Setup.zip file and install the provided connector.
Once the installation is complete, open JDBC Connector from your Windows Start menu and click on Activate for a 30-day trial (if you haven't already done so during installation).
Create a new profile with the name Qlikview-Neo4j and make it your default profile.
Open the JVM VM Options tab and provide the location of jvm.dll, which can be located at <$JAVA_HOME>/jre/bin/client/jvm.dll.
Click on Open Log-Folder to check the logs related to the database connections. You can configure the Logging Level and also define the JVM runtime options, such as -Xmx and -Xms, in the textbox provided for Option in the preceding screenshot.
Browse through the JDBC Driver tab, click on Add Library, provide the location of your <$NEO4J-JDBC driver>.jar, and add the dependent JAR files. Instead of adding individual libraries, we can also add a folder containing the same list of libraries by clicking on the Add Folder option. We can also use non JDBC-4 compliant drivers by mentioning the name of the driver class in the Advanced Settings tab. There is no need to do that, however, if you are setting up a configuration profile that uses a JDBC-4 compliant driver.
Open the License Registration tab and request an Online Trial license, which will be valid for 30 days. Assuming that you are connected to the Internet, the trial license will be applied immediately.
Save your settings and close the QlikView JDBC Connector configuration.
Open <$NEO4J_HOME>/bin/neo4j-shell and execute the following set of Cypher statements one by one to create sample data in your Neo4j database; then, in further steps, we will visualize this data in QlikView:

CREATE (movies1:Movies {Title : 'Rocky', Year : '1976'});
CREATE (movies2:Movies {Title : 'Rocky II', Year : '1979'});
CREATE (movies3:Movies {Title : 'Rocky III', Year : '1982'});
CREATE (movies4:Movies {Title : 'Rocky IV', Year : '1985'});
CREATE (movies5:Movies {Title : 'Rocky V', Year : '1990'});
CREATE (movies6:Movies {Title : 'The Expendables', Year : '2010'});
CREATE (movies7:Movies {Title : 'The Expendables II', Year : '2012'});
CREATE (movies8:Movies {Title : 'The Karate Kid', Year : '1984'});
CREATE (movies9:Movies {Title : 'The Karate Kid II', Year : '1986'});

Open QlikView Desktop Personal Edition and create a new view by navigating to File | New. The Getting Started wizard may appear as we are creating a new view; close this wizard.
Navigate to File | EditScript and change your database to JDBCConnector.dll (32).
In the same window, click on Connect and enter "jdbc:neo4j://localhost:7474/" in the "url" box. Leave the username and password empty and click on OK. You will see that a CUSTOM CONNECT TO statement is added in the box provided.
Next, insert the highlighted Cypher statements in the provided window just below the CUSTOM CONNECT TO statement.
Save the script and close the EditScript window.
Now, on your QlikView sheet, execute the script by pressing Ctrl + R on your keyboard.
Next, add a new Table Object on your QlikView sheet, select "MovieTitle" from the provided fields, and click on OK.

And we are done!!!! You will see the data appearing in the listbox in the newly created Table Object.
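Under the hood, the connector simply runs Cypher over JDBC. The following is a minimal Java sketch (not from the original text) of the same idea, assuming the Neo4j JDBC driver JAR is on the classpath and that it accepts Cypher as the query text:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Neo4jJdbcExample {
    public static void main(String[] args) throws Exception {
        // the URL matches the one entered in the QlikView connector above
        try (Connection con = DriverManager.getConnection("jdbc:neo4j://localhost:7474/");
             Statement stmt = con.createStatement();
             // a Cypher query over the sample data; the article's highlighted
             // load script is not shown here, so this query is an assumption
             ResultSet rs = stmt.executeQuery("MATCH (m:Movies) RETURN m.Title as MovieTitle")) {
            while (rs.next()) {
                System.out.println(rs.getString("MovieTitle"));
            }
        }
    }
}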
The data is fetched from the Neo4j database and QlikView is used to render this data. The same process is used for connecting to other JDBC-compliant BI / visualization / ETL tools such as Jasper, Talend, Hive, Hbase, and so on. We just need to define appropriate JDBC Type-4 drivers in JDBC Connector. We can also use the ODBC-JDBC Bridge provided by EasySoft at http://www.easysoft.com/products/data_access/odbc_jdbc_gateway/index.html. EasySoft provides the ODBC-JDBC Gateway, which facilitates ODBC access from applications such as MS Access, MS Excel, Delphi, and C++ to Java databases. It is a fully functional ODBC 3.5 driver that allows you to access any JDBC data source from any ODBC-compatible application.

Summary

In this article, you have learned the basic concepts of data modeling in Neo4j, and we have walked you through the process of BI integration with Neo4j.

Resources for Article: Further resources on this subject: Working Neo4j Embedded Database [article] Introducing Kafka [article] First Steps With R [article]

The Spark Programming Model

Packt
20 Feb 2015
13 min read
In this article by Nick Pentreath, author of the book Machine Learning with Spark, we will delve into a high-level overview of Spark's design, and we will introduce the SparkContext object as well as the Spark shell, which we will use to interactively explore the basics of the Spark programming model.

While this section provides a brief overview and examples of using Spark, we recommend that you read the following documentation to get a detailed understanding:

Spark Quick Start: http://spark.apache.org/docs/latest/quick-start.html
Spark Programming guide, which covers Scala, Java, and Python: http://spark.apache.org/docs/latest/programming-guide.html

(For more resources related to this topic, see here.)

SparkContext and SparkConf

The starting point of writing any Spark program is SparkContext (or JavaSparkContext in Java). SparkContext is initialized with an instance of a SparkConf object, which contains various Spark cluster-configuration settings (for example, the URL of the master node). Once initialized, we will use the various methods found in the SparkContext object to create and manipulate distributed datasets and shared variables. The Spark shell (available in both Scala and Python, but unfortunately not in Java) takes care of this context initialization for us, but the following lines of code show an example of creating a context running in the local mode in Scala:

val conf = new SparkConf().setAppName("Test Spark App").setMaster("local[4]")
val sc = new SparkContext(conf)

This creates a context running in the local mode with four threads, with the name of the application set to Test Spark App. If we wish to use default configuration values, we could also call the following simple constructor for our SparkContext object, which works in exactly the same way:

val sc = new SparkContext("local[4]", "Test Spark App")

The Spark shell

Spark supports writing programs interactively using either the Scala or Python REPL (that is, the Read-Eval-Print-Loop, or interactive shell). The shell provides instant feedback as we enter code, as this code is immediately evaluated. In the Scala shell, the return result and type are also displayed after a piece of code is run.

To use the Spark shell with Scala, simply run ./bin/spark-shell from the Spark base directory. This will launch the Scala shell and initialize SparkContext, which is available to us as the Scala value, sc. Your console output should look similar to the following screenshot:

To use the Python shell with Spark, simply run the ./bin/pyspark command. Like the Scala shell, the Python SparkContext object should be available as the Python variable sc. You should see an output similar to the one shown in this screenshot:

Resilient Distributed Datasets

The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of "records" (strictly speaking, objects of some type) that is distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still complete.
Creating RDDs

RDDs can be created from existing collections, for example, in the Scala Spark shell that you launched earlier:

val collection = List("a", "b", "c", "d", "e")
val rddFromCollection = sc.parallelize(collection)

RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input format that implements the Hadoop InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and many more. The following code is an example of creating an RDD from a text file located on the local filesystem:

val rddFromTextFile = sc.textFile("LICENSE")

The preceding textFile method returns an RDD where each record is a String object that represents one line of the text file.

Spark operations

Once we have created an RDD, we have a distributed collection of records that we can manipulate. In Spark's programming model, operations are split into transformations and actions. Generally speaking, a transformation operation applies some function to all the records in the dataset, changing the records in some way. An action typically runs some computation or aggregation operation and returns the result to the driver program where SparkContext is running.

Spark operations are functional in style. For programmers familiar with functional programming in Scala or Python, these operations should seem natural. For those without experience in functional programming, don't worry; the Spark API is relatively easy to learn.

One of the most common transformations that you will use in Spark programs is the map operator. This applies a function to each record of an RDD, thus mapping the input to some new output. For example, the following code fragment takes the RDD we created from a local text file and applies the size function to each record in the RDD. Remember that we created an RDD of Strings. Using map, we can transform each string to an integer, thus returning an RDD of Ints:

val intsFromStringsRDD = rddFromTextFile.map(line => line.size)

You should see output similar to the following line in your shell; this indicates the type of the RDD:

intsFromStringsRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[5] at map at <console>:14

In the preceding code, we saw the => syntax used. This is the Scala syntax for an anonymous function, which is a function that is not a named method (that is, one defined using the def keyword in Scala or Python, for example). The line => line.size syntax means that we are applying a function where the input variable is to the left of the => operator, and the output is the result of the code to the right of the => operator. In this case, the input is line, and the output is the result of calling line.size. In Scala, this function that maps a string to an integer is expressed as String => Int. This syntax saves us from having to separately define functions every time we use methods such as map; this is useful when the function is simple and will only be used once, as in this example.

Now, we can apply a common action operation, count, to return the number of records in our RDD:

intsFromStringsRDD.count

The result should look something like the following console output:

14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17
...
14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17, took 0.019227 s
res4: Long = 398
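As a small aside (not from the original text), the anonymous function passed to map is interchangeable with a named method; the following sketch is equivalent to the map call above:

// a named method with the same String => Int shape as line => line.size
def lineLength(line: String): Int = line.size

val intsFromNamedFunction = rddFromTextFile.map(lineLength)
intsFromNamedFunction.count

The anonymous form is simply more convenient for one-off functions like this.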
We can first use the sum function to add up all the lengths of all the records and then divide the sum by the number of records: val sumOfRecords = intsFromStringsRDD.sumval numRecords = intsFromStringsRDD.countval aveLengthOfRecord = sumOfRecords / numRecords The result will be as follows: aveLengthOfRecord: Double = 52.06030150753769 Spark operations, in most cases, return a new RDD, with the exception of most actions, which return the result of a computation (such as Long for count and Double for sum in the preceding example). This means that we can naturally chain together operations to make our program flow more concise and expressive. For example, the same result as the one in the preceding line of code can be achieved using the following code: val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count An important point to note is that Spark transformations are lazy. That is, invoking a transformation on an RDD does not immediately trigger a computation. Instead, transformations are chained together and are effectively only computed when an action is called. This allows Spark to be more efficient by only returning results to the driver when necessary so that the majority of operations are performed in parallel on the cluster. This means that if your Spark program never uses an action operation, it will never trigger an actual computation, and you will not get any results. For example, the following code will simply return a new RDD that represents the chain of transformations: val transformedRDD = rddFromTextFile.map(line => line.size).filter(size => size > 10).map(size => size * 2) This returns the following result in the console: transformedRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[8] at map at <console>:14 Notice that no actual computation happens and no result is returned. If we now call an action, such as sum, on the resulting RDD, the computation will be triggered: val computation = transformedRDD.sum You will now see that a Spark job is run, and it results in the following console output: ...14/11/27 21:48:21 INFO SparkContext: Job finished: sum at <console>:16, took 0.193513 scomputation: Double = 60468.0 The complete list of transformations and actions possible on RDDs as well as a set of more detailed examples are available in the Spark programming guide (located at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations), and the API documentation (the Scala API documentation) is located at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). Caching RDDs One of the most powerful features of Spark is the ability to cache data in memory across a cluster. This is achieved through use of the cache method on an RDD: rddFromTextFile.cache Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action is called on the RDD that initiates a computation, the data is read from its source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor. 
If we now call the count or sum function on our cached RDD, we will see that the RDD is loaded into memory: val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count Indeed, in the following output, we see that the dataset was cached in memory on the first call, taking up approximately 62 KB and leaving us with around 270 MB of memory free: ...14/01/30 06:59:27 INFO MemoryStore: ensureFreeSpace(63454) called with curMem=32960, maxMem=31138775014/01/30 06:59:27 INFO MemoryStore: Block rdd_2_0 stored as values to memory (estimated size 62.0 KB, free 296.9 MB)14/01/30 06:59:27 INFO BlockManagerMasterActor$BlockManagerInfo:Added rdd_2_0 in memory on 10.0.0.3:55089 (size: 62.0 KB, free: 296.9 MB)... Now, we will call the same function again: val aveLengthOfRecordChainedFromCached = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count We will see from the console output that the cached data is read directly from memory: ...14/01/30 06:59:34 INFO BlockManager: Found block rdd_2_0 locally... Spark also allows more fine-grained control over caching behavior. You can use the persist method to specify what approach Spark uses to cache data. More information on RDD caching can be found here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. Broadcast variables and accumulators Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators. A broadcast variable is a read-only variable that is made available from the driver program that runs the SparkContext object to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms. Spark makes creating broadcast variables as simple as calling a method on SparkContext as follows: val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e")) The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and it also shows that we still have 270 MB available to us: 14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=31138775014/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory(estimated size 488.0 B, free 296.9 MB)broadCastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1) A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable: sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++ x).collect This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable. Notice that we used the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection. We will often use collect when we wish to apply further processing to our results locally within the driver program. Note that collect should generally only be used in cases where we really want to return the full result set to the driver and perform further processing. 
If we try to call collect on a very large dataset, we might run out of memory on the driver and crash our program.It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, collecting results to the driver is necessary, such as during iterations in many machine learning models. On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcasted List, with the new element appended to it (that is, there is now either "1", "2", or "3" at the end): ...14/01/31 10:15:39 INFO SparkContext: Job finished: collect at <console>:15, took 0.025806 sres6: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d, e, 2), List(a, b, c, d, e, 3)) An accumulator is also a variable that is broadcasted to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this, that is, in particular, the addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method. For more details on broadcast variables and accumulators, see the Shared Variables section of the Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#shared-variables. Summary In this article, we learned the basics of Spark's programming model and API using the interactive Scala console. Resources for Article: Further resources on this subject: Ridge Regression [article] Clustering with K-Means [article] Machine Learning Examples Applicable to Businesses [article]

Visualize This!

Packt
19 Feb 2015
17 min read
This article is written by Michael Phillips, the author of the book TIBCO Spotfire: A Comprehensive Primer, discusses that human beings are fundamentally visual in the way they process information. The invention of writing was as much about visually representing our thoughts to others as it was about record keeping and accountancy. In the modern world, we are bombarded with formalized visual representations of information, from the ubiquitous opinion poll pie chart to clever and sophisticated infographics. The website http://data-art.net/resources/history_of_vis.php provides an informative and entertaining quick history of data visualization. If you want a truly breathtaking demonstration of the power of data visualization, seek out Hans Rosling's The best stats you've ever seen at http://ted.com. (For more resources related to this topic, see here.) We will spend time getting to know some of Spotfire's data capabilities. It's important that you continue to think about data; how it's structured, how it's related, and where it comes from. Building good visualizations requires visual imagination, but it also requires data literacy. This article is all about getting you to think about the visualization of information and empowering you to use Spotfire to do so. Apart from learning the basic features and properties of the various Spotfire visualization types, there is much more to learn about the seamless interactivity that Spotfire allows you to build in to your analyses. We will be taking a close look at 7 of the 16 visualization types provided by Spotfire, but these 7 visualization types are the most commonly used. We will cover the following topics: Displaying information quickly in tabular form Enriching your visualizations with color categorization Visualizing categorical information using bar charts Dividing a visualization across a trellis grid Key Spotfire concept—marking Visualizing trends using line charts Visualizing proportions using pie charts Visualizing relationships using scatter plots Visualizing hierarchical relationships using treemaps Key Spotfire concept—filters Enhancing tabular presentations using graphical tables Now let's have some fun! Displaying information quickly in tabular form While working through the data examples, we used the Spotfire Table visualization, but now we're going to take a closer look. People will nearly always want to see the "underlying data", the details behind any visualization you create. The Table visualization meets this need. It's very important not to confuse a table in the general data sense with the Spotfire Table visualization; the underlying data table remains immutable and complete in the background. The Table visualization is a highly manipulatable view of the underlying data table and should be treated as a visualization, not a data table. The data used here is BaseballPlayerData.xls There is always more than one way to do the same thing in Spotfire, and this is particularly true for the manipulation of visualizations. Let's start with some very quick manipulations: First, insert a table visualization by going to the Insert menu, selecting New Visualization, and then Table. To move a column, left-click on the column name, hold, and drag it. To sort by a column, left-click on the column name. To sort by more than one column, left-click on the first column name and then press Shift + left-click on the subsequent columns in order of sort precedence. 
To widen or narrow a column, hover the mouse over the right-hand edge of the column title until you see the cursor change to a two-way arrow, and then click and drag it. These and other properties of the Table visualization are also accessed via visualization properties. As you work through the various Spotfire visualizations, you'll notice that some types have more options than others, but there are common trends and an overall consistency in conventions. Visualization properties can be opened in a number of ways: By right-clicking on the visualization, a table in this case, and selecting Properties. By going to the Edit menu and selecting Visualization Properties. By clicking on the Visualization Properties icon, as shown in the following screenshot, in the icon tray below the main menu bar. It's beyond the scope of this book to explore every property and option. The context-sensitive help provided by Spotfire is excellent and explains all the options in glorious detail. I'd like to highlight four important properties of the Table visualization: The General property allows you to change the table visualization title, not the name of the underlying data table. It also allows you to hide the title altogether. The Data property allows you to switch the underlying data table, if you have more than one table loaded into your analysis. The Columns property allows you to hide columns and order the columns you do want to show. The Show/Hide Items property allows you to limit what is shown by a rule you define, such as top five hitters. After clicking on the Add button, you select the relevant column from a dropdown list, choose Rule type (Top), and finally, choose Value for the rule (5). The resulting visualization will only show the rows of data that meet the rule you defined. Enriching your visualizations with color categorization Color is a strong feature in Spotfire and an important visualization tool, often underestimated by report creators. It can be seen as merely a nice-to-have customization, but paying attention to color can be the difference between creating a stimulating and intuitive data visualization rather than an uninspiring and even confusing corporate report. Take some pride and care in the visual aesthetics of your analytics creations! Let's take a look at the color properties of the Table visualization. Open the Table visualization properties again, select Colors, and then Add the column Runs. Now, you can play with a color gradient, adding points by clicking on the Add Point button and customizing the colors. It's as easy as left-clicking on any color box and then selecting from a prebuilt palette or going into a full RGB selection dialog by choosing More Colors…. The result is a heatmap type effect for runs scored, with yellow representing low run totals, transitioning to green as the run total approaches the average value in the data, and becoming blue for the highest run totals. Visualizing categorical information using bar charts We saw how the Table visualization is perfect for showing and ordering detailed information. It's quite similar to a spreadsheet. The Bar Chart visualization is very good for visualizing categorical information, that is, where you have categories with supporting hard numbers—sales by region, for example. The region is the category, whereas the sales is the hard number or fact. Bar charts are typically used to show a distribution. 
Depending on your data or your analytic requirement, the bars can be ordered by value, placed side by side, stacked on top of each other, or arranged vertically or horizontally. There is a special case of the category and value combination and that is where you want to plot the frequencies of a set of numerical values. This type of bar chart is referred to as a histogram, and although it is number against number, it is still, in essence, a distribution plot. It is very common in fact to transform the continuous number range in such cases into a set of discrete bins or categories for the plot. For example, you could take some demographic data and plot age as the category and the number of people at that age as the value (the frequency) on a bar chart. The result, for a general population, would approach a bell-shaped curve. Let's create a bar chart using the baseball data. The data we will use is BaseballPlayerData.xls, which you can download from http://www.insidespotfire.com. Create a new page by right-clicking on any page tab and selecting New Page. You can also select New Page from the Insert menu or click on the new page icon in the icon bar below the main menu. Create a Bar Chart visualization by left-clicking on the bar chart icon or by selecting New Visualization and then Bar Chart from the Insert menu. Spotfire will automatically create a default chart, that is, rarely exactly what you want, so the next step is to configure the chart. Two distributions might be interesting to look at: the distribution of home runs across all the teams and the distribution of player salaries across all the teams. The axes are easy to change; simply use the axes selectors.   If the bars are vertical, it means that the category—Team, in our case—should be on the horizontal axis, with the value—Home Runs or Salary—on the vertical axis, representing the height of the bars.   We're going to pick Home Runs from the vertical axis selector and then an appropriate aggregation dropdown, which is highlighted in red in the screenshot. Sum would be a valid option, but let's go with Avg (Average). Similarly, select Team from the horizontal axis dropdown selector. The vertical, or value, axis must be an aggregation because there is more than one home run value for each category. You must decide if you want a sum, an average, a minimum, and so on. You can modify the visualization properties just as you did for the Table visualization. Some of the options are the same; some are specific to the bar chart. We're going to select the Sort bars by value option in the Appearance property. This will order the bars in descending order of value. We're also going to check the option Vertically under Scale labels | Show labels for the Category Axis property. There are two more actions to perform: create an identical bar chart except with average salary as the value axis, and give each bar chart an appropriate title (Visualization Properties|General|Title:). To copy an existing visualization, simply right-click on it and select Duplicate Visualization. We can now compare the distribution of home run average and salary average across all the baseball teams, but there's a better way to do this in a single visualization using color. Close the salary distribution bar chart by left-clicking on X in the upper right-hand corner of the visualization (X appears when you hover the mouse) or right-clicking on the visualization and selecting Close. 
Now, open the home run bar chart visualization properties, go to the Colors property, and color by Avg(Salary). Select a Gradient color mode, and add a median point by clicking on the Add Point button and selecting Median from the dropdown list of options on the added point. Finally, choose a suitable heat map range of colors; something like blue (min) through pale yellow (median) through red (max). You will still see the distribution of home runs across the baseball teams, but now you will have a superimposed salary heat map. Texas and Cleveland appear to be getting much more bang for their buck than the NY Yankees. Dividing a visualization across a trellis grid Trellising, whereby you divide a series of visualizations into individual panels, is a useful technique when you want to subdivide your analysis. In the example we've been working with, we might, for instance, want to split the visualization by league. Open the visualization properties for the home runs distribution bar chart colored by salary and select the Trellis property. Go to Panels and split by League (use the dropdown column selector). Spotfire allows you to build layers of information with even basic visualizations such as the bar chart. In one chart, we see the home run distribution by team, salary distribution by team, and breakdown by league. Key Spotfire concept – marking It's time to introduce one of the most important Spotfire concepts, called marking, which is central to the interactivity that makes Spotfire such a powerful analysis tool. Marking refers to the action of selecting data in a visualization. Every element you see is selectable, or markable, that is, a single row or multiple rows in a table, a single bar or multiple bars in a bar chart. You need to understand two aspects to marking. First, there is the visual effect, or color(s) you see, when you mark (select) visualization elements. Second, there is the behavior that follows marking: what happens to data and the display of data when you mark something. How to change the marking color From Spotfire v5.5 onward, you can choose, on a visualization-by-visualization basis, two distinct visual effects for marking: Use a separate color for marked items: all marked items are uniformly colored with the marking color, and all unmarked items retain their existing color. Keep existing color attributes and fade out unmarked items: all marked items keep their existing color, and all unmarked items also keep their existing color but with a high degree of color fade applied, leaving the marked items strongly highlighted. The second option is not available in versions older than v5.5 but is the default option in Versions 5.5 onward. The setting is made in the visualization's Appearance property by checking or unchecking the option Use separate color for marked items. The default color when using a separate color for marked items is dark green, but this can be changed by going to Edit|Document Properties|Markings|Edit. The new option has the advantage of retaining any underlying coloring you defined, but you might not like how the rest of the chart is washed out. Which approach you choose depends on what information you think is critical for your particular situation. When you create a new analysis, a default marking is created and applied to every visualization you create by default. You can change the color of the marking in Document Properties, which is found in the Edit menu. 
Just open Document Properties, click on the Markings tab, select the marking, click on the Edit button, and change the color. You can also create as many markings as you need, giving them convenient names for reference purposes, but we'll just focus on using one for now. How to set the marking behavior of a visualization Marking behavior depends fundamentally on data relationships. The data within a single data table is intrinsically related; the data in separate data tables must be explicitly related before you configure marking behavior for visualizations based on separate datasets. When you mark something in a visualization, five things can happen depending on the data involved and how you configured your visualizations: Conditions Behavior Two visualizations with the same underlying data table (they can be on different pages in the analysis file) and the same marking scheme applied. Marking data on one visualization will automatically mark the same data on the other. Two visualizations with related underlying data tables and the same marking scheme applied. The same as the previous condition's behavior, but subject to differences in data granularity. For example, marking a baseball team in one visualization will mark all the team's players in another visualization that is based on a more detailed table related by team. Two visualizations with the same or related data tables where one has been configured with data dependency on the marking in the other. Nothing will display in the marking-dependent visualization other than what is marked in the reference visualization. Visualizations with unrelated underlying data tables. No marking interaction will occur, and the visualizations will mark completely independently of one another. Two visualizations with the same underlying data table or related data tables and with different marking schemes applied. Marking data on one visualization will not show on the other because the marking schemes are different. Here's how we set these behaviors: Open the visualization properties of the bar chart we have been working with and navigate to the Data property.   You'll notice that two settings refer to marking: Marking and Limit data using markings. Use the dropdown under Marking to select the marking to be used for the visualization. Having no marking is an option. Visualizations with the same marking will display synchronous selection, subject to the data relation conditions described earlier. The options under Limit data using markings determine how the visualization will be limited to marking elsewhere in the analysis. The default here is no dependency. If you select a marking, then the visualization will only display data selected elsewhere with that marking. It's not good to have the same marking for Marking and Limit data using markings. If you are using the limit data setting, select no marking, or create a second marking and select it under Marking. You're possibly a bit confused by now. Fortunately, marking is much harder to describe than to use! Let's build a tangible example. We'll start a new analysis, so close any analysis you have open and create a new one, loading the player-level baseball data (BaseballPlayerData.xls). Add two bar charts and a table. You can rearrange the layout by left-clicking on the title bar of a visualization, holding, and dragging it. Position the visualizations any way you wish, but you can place the two bar charts side by side, with the table below them spanning both. 
Save your analysis file at this point and at regular intervals. It's good behavior to save regularly as you build an analysis. It will save you a lot of grief if your PC fails in any way. There is no autosave function in Spotfire. For the first bar chart, set the following visualization properties: Property Value General | Title Home Runs Data | Marking Marking Data | Limit data using markings Nothing checked Appearance | Orientation Vertical bars Appearance | Sort bars by value Check Category Axis | Columns Team Value Axis | Columns Avg(Home Runs) Colors | Columns Avg(Salary) Colors | Color mode Gradient Add Point for median Max = strong red; Median = pale yellow; Min = strong blue Labels | Show labels for Marked Rows Labels | Types of labels | Complete bar Check For the second bar chart, set the following visualization properties: Property Value General | Title Roster Data | Marking Marking Data | Limit data using markings Nothing checked Appearance | Orientation Horizontal bars Appearance | Sort bars by value Check Category Axis | Columns Team Value Axis | Columns Count(Player Name) Colors | Columns Position Colors | Color mode Categorical For the table, set the following visualization properties: Property Value General | Title Details Data | Marking (None) Data | Limit data using markings Check Marking   Columns Team, Player Name, Games Played, Home Runs, Salary, Position Now start selecting visualization elements with your mouse. You can click on elements such as bars or segments of bars, or you can click and drag a rectangular block around multiple elements. When you select a bar on the Home Runs bar chart, the corresponding team bar automatically selects the Roster bar chart, and details for all the players in that team display in the Details table. When you select a bar segment on the Roster bar chart, the corresponding team bar automatically selects on the Home Runs bar chart and only players in the selected position for the team selected appear in the details. There are some very useful additional functions associated with marking, and you can access these by right-clicking on a marked item. They are Unmark, Invert, Delete, Filter To, and Filer Out. You can also unmark by left-clicking on any blank space in the visualization. Play with this analysis file until you are comfortable with the marking concept and functionality. Summary This article is a small taste of the book TIBCO Spotfire: A comprehensive primer. You've seen how the Table visualization is an easy and traditional way to display detailed information in tabular form and how the Bar Chart visualization is excellent for visualizing categorical information, such as distributions. You've learned how to enrich visualizations with color categorization and how to divide a visualization across a trellis grid. You've also been introduced to the key Spotfire concept of marking. Apart from gaining a functional understanding of these Spotfire concepts and techniques, you should have gained some insight into the science and art of data visualization. Resources for Article: Further resources on this subject: The Spotfire Architecture Overview [article] Interacting with Data for Dashboards [article] Setting Up and Managing E-mails and Batch Processing [article]

Using NLP APIs

Packt
12 Feb 2015
22 min read
In this article by Richard M Reese, author of the book, Natural Language Processing with Java, we will demonstrate the NER process using OpenNLP, Stanford API, and LingPipe. They each provide alternate techniques that can often do a good job identifying entities in text. The following declaration will serve as the sample text to demonstrate the APIs: String sentences[] = {"Joe was the last person to see Fred. ", "He saw him in Boston at McKenzie's pub at 3:00 where he paid "    + "$2.45 for an ale. ",    "Joe wanted to go to Vermont for the day to visit a cousin who "    + "works at IBM, but Sally and he had to look for Fred"}; Using OpenNLP for NER We will demonstrate the use of the TokenNameFinderModel class to perform NLP using the OpenNLP API. In addition, we will demonstrate how to determine the probability that the entity identified is correct. The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, and then use the find method to identify the entities in the text. The next example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially and then use multiple sentences. The sentence is defined here:    String sentence = "He was the last person to see Fred."; We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models respectively. InputStream for these files is opened using a try-with-resources block as shown here:    try (InputStream tokenStream = new FileInputStream(            new File(getModelDir(), "en-token.bin"));            InputStream modelStream = new FileInputStream(                new File(getModelDir(), "en-ner-person.bin"));) {        ...      } catch (Exception ex) {        // Handle exceptions    } Within the try block, the TokenizerModel and Tokenizer objects are created:    TokenizerModel tokenModel = new TokenizerModel(tokenStream);    Tokenizer tokenizer = new TokenizerME(tokenModel); Next, an instance of the NameFinderME class is created using the person model:    TokenNameFinderModel entityModel =        new TokenNameFinderModel(modelStream);    NameFinderME nameFinder = new NameFinderME(entityModel); We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method will use the tokenized String array as input and return an array of the Span objects as follows:    String tokens[] = tokenizer.tokenize(sentence);    Span nameSpans[] = nameFinder.find(tokens); The following for statement displays the person found in the sentence. Its positional information and the person are displayed on separate lines:    for (int i = 0; i < nameSpans.length; i++) {        System.out.println("Span: " + nameSpans[i].toString());        System.out.println("Entity: "            + tokens[nameSpans[i].getStart()]);    } The output is as follows: Span: [7..9) person Entity: Fred Often we will work with multiple sentences. To demonstrate this we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. 
The tokenize method is invoked against each sentence and then the entity information is displayed as before:    for (String sentence : sentences) {        String tokens[] = tokenizer.tokenize(sentence);        Span nameSpans[] = nameFinder.find(tokens);        for (int i = 0; i < nameSpans.length; i++) {            System.out.println("Span: " + nameSpans[i].toString());            System.out.println("Entity: "                + tokens[nameSpans[i].getStart()]);        }        System.out.println();    } The output is as follows. There is an extra blank line between the two people detected because the second sentence did not contain a person. Span: [0..1) person Entity: Joe Span: [7..9) person Entity: Fred     Span: [0..1) person Entity: Joe Span: [19..20) person Entity: Sally Span: [26..27) person Entity: Fred Determining the accuracy of the entity When TokenNameFinderModel identifies entities in text, it computes a probability for that entity. We can access this information using the probs method as shown next. The method returns an array of doubles, which corresponds to the elements of the nameSpans array:    double[] spanProbs = nameFinder.probs(nameSpans); Add this statement to the previous example immediately after the use of the find method. Then add the next statement at the end of the nested for statement:    System.out.println("Probability: " + spanProbs[i]); When the example is executed, you will get the following output. The probability fields reflect the confidence level of the entity assignment. For the first entity, the model is 80.529 percent confident that Joe is a person: Span: [0..1) person Entity: Joe Probability: 0.8052914774025202 Span: [7..9) person Entity: Fred Probability: 0.9042160889302772     Span: [0..1) person Entity: Joe Probability: 0.9620970782763985 Span: [19..20) person Entity: Sally Probability: 0.964568603518126 Span: [26..27) person Entity: Fred Probability: 0.990383039618594 Using other entity types OpenNLP supports different libraries as listed in the following table. These models can be downloaded from http://opennlp.sourceforge.net/models-1.5/. The prefix, en, specifies English as the language while ner indicates that the model is for NER. English finder models File name Location name finder model en-ner-location.bin Money name finder model en-ner-money.bin Organization name finder model en-ner-organization.bin Percentage name finder model en-ner-percentage.bin Person name finder model en-ner-person.bin Time name finder model en-ner-time.bin If we modify the statement to use a different model file, we can see how they work against the sample sentences:    InputStream modelStream = new FileInputStream(        new File(getModelDir(), "en-ner-time.bin"));) { When the en-ner-money.bin model is used, the index into the tokens array in the earlier code sequence has to be increased by one. Otherwise, all that is returned is the dollar sign. The various outputs are shown in the following table. Model Output en-ner-location.bin Span: [4..5) location Entity: Boston Probability: 0.8656908776583051 Span: [5..6) location Entity: Vermont Probability: 0.9732488014011262 en-ner-money.bin Span: [14..16) money Entity: 2.45 Probability: 0.7200919701507937 en-ner-organization.bin Span: [16..17) organization Entity: IBM Probability: 0.9256970736336729 en-ner-time.bin The model was not able to detect time in this text sequence The model failed to find the time entities in the sample text. 
This illustrates that the model did not have enough confidence that it found any time entities in the text. Processing multiple entities types We can also handle multiple entity types at the same time. This involves creating instances of NameFinderME, based on each model within a loop and applying the model against each sentence, keeping track of the entities as they are found. We will illustrate this process with the next example. It requires rewriting the previous try block to create the InputStream within the block as shown here:    try {        InputStream tokenStream = new FileInputStream(            new File(getModelDir(), "en-token.bin"));        TokenizerModel tokenModel = new TokenizerModel(tokenStream);        Tokenizer tokenizer = new TokenizerME(tokenModel);        ...    } catch (Exception ex) {        // Handle exceptions    } Within the try block, we will define a string array to hold the names of the model files. As shown here, we will use models for people, locations, and organizations:    String modelNames[] = {"en-ner-person.bin",        "en-ner-location.bin", "en-ner-organization.bin"}; ArrayList is created to hold the entities as they are discovered:    ArrayList<String> list = new ArrayList(); A for-each statement is used to load one model at a time and then to create an instance of the NameFinderME class:    for(String name : modelNames) {        TokenNameFinderModel entityModel = new TokenNameFinderModel(            new FileInputStream(new File(getModelDir(), name)));        NameFinderME nameFinder = new NameFinderME(entityModel);        ...    } Previously, we did not try to identify which sentences the entities were found in. This is not hard to do, but we need to use a simple for statement instead of a for-each statement to keep track of the sentence indexes. This is shown next where the previous example has been modified to use the integer variable index to keep the sentences. Otherwise, the code works the same way as before:    for (int index = 0; index < sentences.length; index++) {        String tokens[] = tokenizer.tokenize(sentences[index]);        Span nameSpans[] = nameFinder.find(tokens);        for(Span span : nameSpans) {            list.add("Sentence: " + index                + " Span: " + span.toString() + " Entity: "                + tokens[span.getStart()]);        }    } The entities discovered are then displayed:    for(String element : list) {        System.out.println(element);    } The output is as follows: Sentence: 0 Span: [0..1) person Entity: Joe Sentence: 0 Span: [7..9) person Entity: Fred Sentence: 2 Span: [0..1) person Entity: Joe Sentence: 2 Span: [19..20) person Entity: Sally Sentence: 2 Span: [26..27) person Entity: Fred Sentence: 1 Span: [4..5) location Entity: Boston Sentence: 2 Span: [5..6) location Entity: Vermont Sentence: 2 Span: [16..17) organization Entity: IBM Using the Stanford API for NER We will demonstrate the CRFClassifier class as used to perform NER. This class implements what is known as a linear chain Conditional Random Field (CRF) sequence model. To demonstrate the use of the CRFClassifier class we will start with a declaration of the classifier file string as shown here:    String model = getModelDir() +        "\english.conll.4class.distsim.crf.ser.gz"; The classifier is then created using the model:    CRFClassifier<CoreLabel> classifier =        CRFClassifier.getClassifierNoExceptions(model); The classify method takes a single string representing the text to be processed. 
To use the sentences text we need to convert it to a simple string:    String sentence = "";    for (String element : sentences) {        sentence += element;    } The classify method is then applied to the text:    List<List<CoreLabel>> entityList = classifier.classify(sentence); A List of CoreLabel is returned. The object returned is a list that contains another list. The contained list is a list of CoreLabel. The CoreLabel class represents a word with additional information attached to it. The internal list contains a list of these words. In the outer for-each statement in the next code sequence, the internalList variable represents one sentence of the text. In the inner for-each statement, each word in that inner list is displayed. The word method returns the word and the get method returns the type of the word. The words and their types are then displayed:    for (List<CoreLabel> internalList: entityList) {        for (CoreLabel coreLabel : internalList) {            String word = coreLabel.word();            String category = coreLabel.get(                CoreAnnotations.AnswerAnnotation.class);            System.out.println(word + ":" + category);      }    } Part of the output is as follows. It has been truncated because every word is displayed. The O represents the Other category: Joe:PERSON was:O the:O last:O person:O to:O see:O Fred:PERSON .:O He:O ... look:O for:O Fred:PERSON To filter out those words that are not relevant, replace the println statement with the following statements. This will eliminate the other categories:    if (!"O".equals(category)) {        System.out.println(word + ":" + category);    } The output is simpler now: Joe:PERSON Fred:PERSON Boston:LOCATION McKenzie:PERSON Joe:PERSON Vermont:LOCATION IBM:ORGANIZATION Sally:PERSON Fred:PERSON Using LingPipe for NER Here we will demonstrate how name entity models and the ExactDictionaryChunker class are used to perform NER analysis. Using LingPipe's name entity models LingPipe has a few named entity models that we can use with chunking. These files consist of a serialized object that can be read from a file and then applied to text. These objects implement the Chunker interface. The chunking process results in a series of Chunking objects that identify the entities of interest. A list of the NER models is in found in the following table. These models can be downloaded from http://alias-i.com/lingpipe/web/models.html. Genre Corpus File English News MUC-6 ne-en-news-muc6.AbstractCharLmRescoringChunker English Genes GeneTag ne-en-bio-genetag.HmmChunker English Genomics GENIA ne-en-bio-genia.TokenShapeChunker We will use the model found in the file, ne-en-news-muc6.AbstractCharLmRescoringChunker, to demonstrate how this class is used. We start with a try-catch block to deal with exceptions as shown next. The file is opened and used with the AbstractExternalizable class' static readObject method to create an instance of a Chunker class. This method will read in the serialized model:    try {        File modelFile = new File(getModelDir(),            "ne-en-news-muc6.AbstractCharLmRescoringChunker");          Chunker chunker = (Chunker)            AbstractExternalizable.readObject(modelFile);        ...    } catch (IOException | ClassNotFoundException ex) {        // Handle exception    } The Chunker and Chunking interfaces provide methods that work with a set of chunks of text. Its chunk method returns an object that implements the Chunking instance. 
The following sequence displays the chunks found in each sentence of the text as shown here:    for (int i = 0; i < sentences.length; ++i) {        Chunking chunking = chunker.chunk(sentences[i]);        System.out.println("Chunking=" + chunking);    } The output of this sequence is as follows: Chunking=Joe was the last person to see Fred. : [0-3:PERSON@-Infinity, 31-35:ORGANIZATION@-Infinity] Chunking=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. : [14-20:LOCATION@-Infinity, 24-32:PERSON@-Infinity] Chunking=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred : [0-3:PERSON@-Infinity, 20-27:ORGANIZATION@-Infinity, 71-74:ORGANIZATION@-Infinity, 109-113:ORGANIZATION@-Infinity] Instead, we can use methods of the Chunk class to extract specific pieces of information as illustrated next. We will replace the previous for statement with the following for-each statement. This calls a displayChunkSet method:    for (String sentence : sentences) {        displayChunkSet(chunker, sentence);    } The output that follows shows the result. However, it did not always match the entity type correctly: Type: PERSON Entity: [Joe] Score: -Infinity Type: ORGANIZATION Entity: [Fred] Score: -Infinity Type: LOCATION Entity: [Boston] Score: -Infinity Type: PERSON Entity: [McKenzie] Score: -Infinity Type: PERSON Entity: [Joe] Score: -Infinity Type: ORGANIZATION Entity: [Vermont] Score: -Infinity Type: ORGANIZATION Entity: [IBM] Score: -Infinity Type: ORGANIZATION Entity: [Fred] Score: -Infinity Using the ExactDictionaryChunker class The ExactDictionaryChunker class provides an easy way to create a dictionary of entities and their types, which can be used to find them later in text. It uses a MapDictionary object to store entries and then the ExactDictionaryChunker class is used to extract chunks based on the dictionary. The AbstractDictionary interface supports basic operations for entities, category, and score. The score is used in the matching process. The MapDictionary and TrieDictionary classes implement the AbstractDictionary interface. The TrieDictionary class stores information using a character trie structure. This approach uses less memory when that is a concern. We will use the MapDictionary class for our example. To illustrate this approach we start with a declaration of the MapDictionary:    private MapDictionary<String> dictionary; The dictionary will contain the entities that we are interested in finding. We need to initialize the model as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments: String: It gives the name of the entity String: It gives the category of the entity Double: It represents a score for the entity The score is used when determining matches. 
A few entities are declared and added to the dictionary:    private static void initializeDictionary() {         dictionary = new MapDictionary<String>();        dictionary.addEntry(            new DictionaryEntry<String>("Joe","PERSON",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("Fred","PERSON",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("Boston","PLACE",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("pub","PLACE",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("Vermont","PLACE",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("IBM","ORGANIZATION",1.0));        dictionary.addEntry(            new DictionaryEntry<String>("Sally","PERSON",1.0));    } An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker class are detailed here: Dictionary<String>: It is a dictionary containing the entities TokenizerFactory: It is a tokenizer used by the chunker boolean: If true, the chunker should return all matches boolean: If true, the matches are case sensitive Matches can be overlapping. For example, in the phrase, "The First National Bank", the entity "bank" could be used by itself or in conjunction with the rest of the phrase. The third parameter determines if all of the matches are returned. In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer where we return all matches and ignore the case of the tokens:    initializeDictionary();    ExactDictionaryChunker dictionaryChunker        = new ExactDictionaryChunker(dictionary,            IndoEuropeanTokenizerFactory.INSTANCE, true, false); The dictionaryChunker object is used with each sentence as shown next. We will use displayChunkSet:    for (String sentence : sentences) {        System.out.println("nTEXT=" + sentence);        displayChunkSet(dictionaryChunker, sentence);    } When executed, we get the following output: TEXT=Joe was the last person to see Fred. Type: PERSON Entity: [Joe] Score: 1.0 Type: PERSON Entity: [Fred] Score: 1.0   TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. Type: PLACE Entity: [Boston] Score: 1.0 Type: PLACE Entity: [pub] Score: 1.0   TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred Type: PERSON Entity: [Joe] Score: 1.0 Type: PLACE Entity: [Vermont] Score: 1.0 Type: ORGANIZATION Entity: [IBM] Score: 1.0 Type: PERSON Entity: [Sally] Score: 1.0 Type: PERSON Entity: [Fred] Score: 1.0 This does a pretty good job, but it requires a lot of effort to create the dictionary or a large vocabulary. Training a model We will use the OpenNLP to demonstrate how a model is trained. The training file used must have the following: Marks to demarcate the entities One sentence per line We will use the following model file named en-ner-person.train: <START:person> Joe <END> was the last person to see <START:person> Fred <END>. He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. <START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>. Several methods of this example are capable of throwing exceptions. 
These statements will be placed in try-with-resource block as shown here where the model's output stream is created:    try (OutputStream modelOutputStream = new BufferedOutputStream(            new FileOutputStream(new File("modelFile")));) {        ...    } catch (IOException ex) {        // Handle exception    } Within the block, we create an OutputStream<String> object using the PlainTextByLineStream class. This class' constructor takes FileInputStream and returns each line as a String object. The en-ner-person.train file is used as the input file as shown here. UTF-8 refers to the encoding sequence used:    ObjectStream<String> lineStream = new PlainTextByLineStream(        new FileInputStream("en-ner-person.train"), "UTF-8"); The lineStream object contains stream that are annotated with tags delineating the entities in the text. These need to be converted to the NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class as shown next. A NameSample object holds the names for the entities found in the text:    ObjectStream<NameSample> sampleStream =        new NameSampleDataStream(lineStream); The train method can now be executed as shown next:    TokenNameFinderModel model = NameFinderME.train(        "en", "person", sampleStream,        Collections.<String, Object>emptyMap(), 100, 5); The arguments of the method are as detailed in the following table. Parameter Meaning "en" Language Code "person" Entity type sampleStream Sample data null Resources 100 The number of iterations 5 The cutoff The model is then serialized to the file:    model.serialize(modelOutputStream); The output of this sequence is as follows. It has been shortened to conserve space. Basic information about the model creation is detailed: Indexing events using cutoff of 5      Computing event counts... done. 53 events    Indexing... done. Sorting and merging events... done. Reduced 53 events to 46. Done indexing. Incorporating indexed data for training...    Number of Event Tokens: 46    Number of Outcomes: 2     Number of Predicates: 34 ...done. Computing model parameters ... Performing 100 iterations. 1: ... loglikelihood=-36.73680056967707 0.05660377358490566 2: ... loglikelihood=-17.499660626361216 0.9433962264150944 3: ... loglikelihood=-13.216835449617108 0.9433962264150944 4: ... loglikelihood=-11.461783667999262 0.9433962264150944 5: ... loglikelihood=-10.380239416084963 0.9433962264150944 6: ... loglikelihood=-9.570622475692486 0.9433962264150944 7: ... loglikelihood=-8.919945779143012 0.9433962264150944 ... 99: ... loglikelihood=-3.513810438211968 0.9622641509433962 100: ... loglikelihood=-3.507213816708068 0.9622641509433962 Evaluating the model The model can be evaluated using the TokenNameFinderEvaluator class. The evaluation process uses marked up sample text to perform the evaluation. For this simple example, a file called en-ner-person.eval was created that contained the following text: <START:person> Bill <END> went to the farm to see <START:person> Sally <END>. Unable to find <START:person> Sally <END> he went to town. There he saw <START:person> Fred <END> who had seen <START:person> Sally <END> at the book store with <START:person> Mary <END>. The following code is used to perform the evaluation. The previous model is used as the argument of the TokenNameFinderEvaluator constructor. A NameSampleDataStream instance is created based on the evaluation file. 
The TokenNameFinderEvaluator class' evaluate method performs the evaluation:    TokenNameFinderEvaluator evaluator =        new TokenNameFinderEvaluator(new NameFinderME(model));      lineStream = new PlainTextByLineStream(        new FileInputStream("en-ner-person.eval"), "UTF-8");    sampleStream = new NameSampleDataStream(lineStream);    evaluator.evaluate(sampleStream); To determine how well the model worked with the evaluation data, the getFMeasure method is executed. The results are then displayed:    FMeasure result = evaluator.getFMeasure();    System.out.println(result.toString()); The following output displays the precision, recall, and F-Measure. It indicates that 50 percent of the entities found exactly match the evaluation data. The recall is the percentage of entities defined in the corpus that were found in the same location. The performance measure is the harmonic mean, defined as F1 = 2 * Precision * Recall / (Recall + Precision): Precision: 0.5 Recall: 0.25 F-Measure: 0.3333333333333333 The data and evaluation sets should be much larger to create a better model. The intent here was to demonstrate the basic approach used to train and evaluate a POS model. Summary The NER involves detecting entities and then classifying them. Common categories include names, locations, and things. This is an important task that many applications use to support searching, resolving references, and finding meaning in text. The process is frequently used in downstream tasks. We investigated several techniques for performing NER. Regular expression is one approach that is supported both by core Java classes and NLP APIs. This technique is useful for many applications and there are a large number of regular expression libraries available. Dictionary-based approaches are also possible and work well for some applications. However, they require considerable effort to populate at times. We used LingPipe's MapDictionary class to illustrate this approach. Trained models can also be used to perform NER. We examine several of these and demonstrated how to train a model using the Open NLP NameFinderME class.

Hive in Hadoop

Packt
10 Feb 2015
36 min read
In this article by Garry Turkington and Gabriele Modena, the author of the book Learning Hadoop 2. explain how MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. It does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL. In this article, we will cover the following topics: What the use cases for SQL on Hadoop are and why it is so popular HiveQL, the SQL dialect introduced by Apache Hive Using HiveQL to perform SQL-like analysis of the Twitter dataset How HiveQL can approximate common features of relational databases such as joins and views (For more resources related to this topic, see here.) Why SQL on Hadoop So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008 Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality. Other SQL-on-Hadoop solutions Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. 
The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate    (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,    (chararray)$0#'id_str', (chararray)$0#'text' as text,    (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,      (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate    (chararray)place#'id' as place_id,    (chararray)place#'country_code' as co,    (chararray)place#'country' as country,    (chararray)place#'name' as place_name,    (chararray)place#'full_name' as place_full_name,    (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001');   -- Users users = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users {    generate    (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count;   } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Run this script as follows: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. 
When importing any new data into Hive, there is generally a three-stage process: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweets defined by a list of names for columns in the dataset and their data types. We specify that fields are delimited by the Unicode U0001 character and that the format used to store data is TEXTFILE. Data can be imported into Hive from a location on HDFS (here, the tweets directory produced by the Pig script) using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive moves the data file into its working directory, it does not actually process the input data into rows at that point. Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. 
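To make the schema-on-read behavior concrete, here is a minimal sketch of our own; the tweets_retyped table name, the warehouse path, and the deliberately wrong column type are illustrative assumptions and not part of the original example. It defines a second table over the same files that now back the tweets table, and because neither table creation nor data loading validated the data, the type mismatch only surfaces as NULL values at query time:

-- Hypothetical second table over the files already loaded for tweets;
-- the path assumes the default warehouse location described above.
-- The retweeted column is deliberately declared with the wrong type.
CREATE EXTERNAL TABLE tweets_retyped (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted int,
  user_id string,
  place_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'  -- the same Ctrl+A separator, written in octal form
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/tweets';

-- Values that cannot be parsed as int come back as NULL; nothing fails
-- when the table is defined or when the files are read.
SELECT retweeted, COUNT(*) FROM tweets_retyped GROUP BY retweeted;

Dropping this external table afterwards removes only its metadata, so the files backing the original tweets table are left untouched.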
Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises. Hive architecture Until version 2, Hadoop was primarily a batch system. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded version with the following command: $ beeline -u jdbc:hive2:// Data types HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. 
We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. 
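As a hedged sketch of what such alterations look like in practice, the following statements work against the tweets table and retweets view defined earlier; the added column names are our own illustrative choices and are not part of the Twitter dataset:

-- Add columns, including a collection-typed one; only the metastore entry changes.
ALTER TABLE tweets ADD COLUMNS (
  lang string COMMENT 'ISO 639-1 language code (illustrative)',
  hashtags array<string> COMMENT 'illustrative collection-typed column'
);

-- Change a column's comment; a FIRST or AFTER clause could also reposition it.
ALTER TABLE tweets CHANGE COLUMN text text string COMMENT 'tweet body';

-- Redefine the retweets view created earlier.
ALTER VIEW retweets AS SELECT * FROM tweets WHERE retweeted = true AND user_id IS NOT NULL;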
When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view. File formats and storage The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files: TextInputFormat and HiveIgnoreKeyTextOutputFormat: These classes read/write data in the plain text file format SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format Additionally, the following SerDe classes can be used to serialize and deserialize data: MetadataTypedColumnsetSerDe: This class reads/writes delimited records such as CSV or tab-separated records ThriftSerDe and DynamicSerDe: These classes read/write Thrift objects JSON As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JAR into Hive with the following code: ADD JAR json-serde-1.3-jar-with-dependencies.jar; Then, we issue the usual create statement, as follows: CREATE EXTERNAL TABLE tweets (    contributors string,    coordinates struct <      coordinates: array <float>,      type: string>,    created_at string,    entities struct <      hashtags: array <struct <            indices: array <tinyint>,            text: string>>, … ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE LOCATION 'tweets'; With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. 
In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [    {"name":"topic","type":["null","int"]},    {"name":"source","type":["null","int"]},    {"name":"rank","type":["null","float"]} ] } The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and it is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use this schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{    "type":"record",    "name":"record",    "fields": [        {"name":"topic","type":["null","int"]},        {"name":"source","type":["null","int"]},        {"name":"rank","type":["null","float"]}    ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the resulting table definition from within Hive: DESCRIBE tweets_pagerank; OK topic                 int                   from deserializer   source               int                   from deserializer   rank                 float                 from deserializer In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (this is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property, WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. 
Within Hive, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank: SELECT source, topic from tweets_pagerank WHERE rank >= 0.9; Columnar stores Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries. Queries Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset: SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10 The following are the top 10 most prolific users in the dataset: NULL 7091 1332188053 4 959468857 3 1367752118 3 362562944 3 58646041 3 2375296688 3 1468188529 3 37114209 3 2385040940 3 We can improve the readability of the hive output by setting the following: SET hive.cli.print.header=true; This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into hive using external tables. 
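Before doing that, a brief aside on the columnar formats just described: the following is a minimal hedged sketch of how a columnar copy of the tweets table could be created; the tweets_orc name and the choice of ORC are our own illustration and not part of the original walkthrough:

-- Columnar (ORC) copy of the text-backed tweets table; STORED AS PARQUET
-- would give a Parquet-backed table instead.
CREATE TABLE tweets_orc (
  created_at string,
  tweet_id string,
  text string,
  in_reply_to string,
  retweeted boolean,
  user_id string,
  place_id string
)
STORED AS ORC;

-- Populate it from the existing table; queries that later touch only a few
-- columns will read only those columns' data from the ORC files.
INSERT INTO TABLE tweets_orc SELECT * FROM tweets;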
We first create a user table to store user data, as follows: CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/users'; We then create a place table to store location data, as follows: CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/places'; We can use the JOIN operator to display the names of the 10 most prolific users, as follows: SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10; Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets tables. We can rewrite the previous query as follows: SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets join (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10; Instead of directly joining the user table, we execute a subquery, as follows: SELECT user_id, name FROM user GROUP BY user_id, name; The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and ddl capabilities can be found at  https://cwiki.apache.org/confluence/display/Hive/LanguageManual. Structuring Hive tables for given workloads Often Hive isn't used in isolation, instead tables are created with particular workloads in mind or needs invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios. Partitioning a table With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. 
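Before building the partitioned table, a quick cardinality check along the following lines (our own addition, reusing the to_date function that appears in the INSERT statements below) shows how many distinct day-level values the candidate partition key would produce:

-- Roughly how many partitions would a daily created_at key generate?
SELECT COUNT(DISTINCT to_date(created_at)) AS distinct_days FROM user;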
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows: CREATE TABLE partitioned_user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows: INSERT INTO TABLE partitioned_user PARTITION( created_at_date = '2014-01-01') SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count FROM user; This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables: SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict; SET hive.exec.max.dynamic.partitions.pernode=5000; The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row: INSERT INTO TABLE partitioned_user PARTITION( created_at_date ) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user; Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at=2014-04-01. If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition: ALTER TABLE <table_name> ADD PARTITION <location>; To add metadata for all partitions not currently present in the metastore we can use: MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement: ALTER TABLE <table_name> RECOVER PARTITIONS; Notice that both statements will work also with EXTERNAL tables. Overwriting and updating data Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table: INSERT OVERWRITE INTO <table>… If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. 
Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month: INSERT INTO TABLE partitioned_user PARTITION (created_at_date) SELECT created_at , user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user WHERE to_date(created_at) BETWEEN '2014-03-01' and '2014-03-31'; Bucketing and sorting Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the following additional CLUSTER BY and SORT BY statements in the CREATE TABLE statements: CREATE table bucketed_tweets ( tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) PARTITIONED BY (created_at string) CLUSTERED BY(user_ID) into 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE;   CREATE TABLE bucketed_user ( user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at string) CLUSTERED BY(user_ID) SORTED BY(name) into 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table: SET hive.enforce.bucketing=true; Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved: SET hive.optimize.bucketmapjoin=true; SELECT … FROM bucketed_user u JOIN bucketed_tweet t ON u.user_id = t.user_id; With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id columns in both tables. 
While determining which rows against which to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second. Sampling data Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table: SELECT max(friends_count) FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name); In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example: SELECT MAX(friends_count) FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 on user_id); In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition. A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data or 100 rows per split, respectively: TABLESAMPLE(0.5 PERCENT) TABLESAMPLE(1G) TABLESAMPLE(100 ROWS) If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported. Writing scripts We can place Hive commands in a file and run them with the -f option in the hive CLI utility: $ cat show_tables.hql show tables; $ hive -f show_tables.hql We can parameterize HiveQL statements by means of the hiveconf mechanism. 
This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example: $ cat show_tables2.hql show tables like '${hiveconf:TABLENAME}'; $ hive -hiveconf TABLENAME=user -f show_tables2.hql The variable can also be set within the Hive script or an interactive session: SET TABLE_NAME='user'; The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows: $ cat show_tables3.hql show tables like '${hivevar:TABLENAME}'; $ hive -hivevar TABLENAME=user –f show_tables3.hql Or we can write the command interactively: SET hivevar_TABLE_NAME='user'; Summary In this article, we learned that in its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics: How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez Resources for Article: Further resources on this subject: Big Data Analysis [Article] Understanding MapReduce [Article] Amazon DynamoDB - Modelling relationships, Error handling [Article]

Three.js - Materials and Texture

Packt
06 Feb 2015
11 min read
In this article by Jos Dirksen author of the book Three.js Cookbook, we will learn how Three.js offers a large number of different materials and supports many different types of textures. These textures provide a great way to create interesting effects and graphics. In this article, we'll show you recipes that allow you to get the most out of these components provided by Three.js. (For more resources related to this topic, see here.) Using HTML canvas as a texture Most often when you use textures, you use static images. With Three.js, however, it is also possible to create interactive textures. In this recipe, we will show you how you can use an HTML5 canvas element as an input for your texture. Any change to this canvas is automatically reflected after you inform Three.js about this change in the texture used on the geometry. Getting ready For this recipe, we need an HTML5 canvas element that can be displayed as a texture. We can create one ourselves and add some output, but for this recipe, we've chosen something else. We will use a simple JavaScript library, which outputs a clock to a canvas element. The resulting mesh will look like this (see the 04.03-use-html-canvas-as-texture.html example): The JavaScript used to render the clock was based on the code from this site: http://saturnboy.com/2013/10/html5-canvas-clock/. To include the code that renders the clock in our page, we need to add the following to the head element: <script src="../libs/clock.js"></script> How to do it... To use a canvas as a texture, we need to perform a couple of steps: The first thing we need to do is create the canvas element: var canvas = document.createElement('canvas'); canvas.width=512; canvas.height=512; Here, we create an HTML canvas element programmatically and define a fixed width. Now that we've got a canvas, we need to render the clock that we use as the input for this recipe on it. The library is very easy to use; all you have to do is pass in the canvas element we just created: clock(canvas); At this point, we've got a canvas that renders and updates an image of a clock. What we need to do now is create a geometry and a material and use this canvas element as a texture for this material: var cubeGeometry = new THREE.BoxGeometry(10, 10, 10); var cubeMaterial = new THREE.MeshLambertMaterial(); cubeMaterial.map = new THREE.Texture(canvas); var cube = new THREE.Mesh(cubeGeometry, cubeMaterial); To create a texture from a canvas element, all we need to do is create a new instance of THREE.Texture and pass in the canvas element we created in step 1. We assign this texture to the cubeMaterial.map property, and that's it. If you run the recipe at this step, you might see the clock rendered on the sides of the cubes. However, the clock won't update itself. We need to tell Three.js that the canvas element has been changed. We do this by adding the following to the rendering loop: cubeMaterial.map.needsUpdate = true; This informs Three.js that our canvas texture has changed and needs to be updated the next time the scene is rendered. With these four simple steps, you can easily create interactive textures and use everything you can create on a canvas element as a texture in Three.js. How it works... How this works is actually pretty simple. Three.js uses WebGL to render scenes and apply textures. WebGL has native support for using HTML canvas element as textures, so Three.js just passes on the provided canvas element to WebGL and it is processed as any other texture. 
Making part of an object transparent You can create a lot of interesting visualizations using the various materials available with Three.js. In this recipe, we'll look at how you can use the materials available with Three.js to make part of an object transparent. This will allow you to create complex-looking geometries with relative ease. Getting ready Before we dive into the required steps in Three.js, we first need to have the texture that we will use to make an object partially transparent. For this recipe, we will use the following texture, which was created in Photoshop: You don't have to use Photoshop; the only thing you need to keep in mind is that you use an image with a transparent background. Using this texture, in this recipe, we'll show you how you can create the following (04.08-make-part-of-object-transparent.html): As you can see in the preceeding, only part of the sphere is visible, and you can look through the sphere to see the back at the other side of the sphere. How to do it... Let's look at the steps you need to take to accomplish this: The first thing we do is create the geometry. For this recipe, we use THREE.SphereGeometry: var sphereGeometry = new THREE.SphereGeometry(6, 20, 20); Just like all the other recipes, you can use whatever geometry you want. In the second step, we create the material: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.side = THREE.DoubleSide; mat.depthWrite = false; mat.color = new THREE.Color(0xff0000); As you can see in this fragment, we create THREE.MeshPhongMaterial and load the texture we saw in the Getting ready section of this recipe. To render this correctly, we also need to set the side property to THREE.DoubleSide so that the inside of the sphere is also rendered, and we need to set the depthWrite property to false. This will tell WebGL that we still want to test our vertices against the WebGL depth buffer, but we don't write to it. Often, you need to set this to false when working with more complex transparent objects or particles. Finally, add the sphere to the scene: var sphere = new THREE.Mesh(sphereGeometry, mat); scene.add(sphere); With these simple steps, you can create really interesting effects by just experimenting with textures and geometries. There's more With Three.js, it is possible to repeat textures (refer to the Setup repeating textures recipe). You can use this to create interesting-looking objects such as this: The code required to set a texture to repeat is the following: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.map.wrapS = mat.map.wrapT = THREE.RepeatWrapping; mat.map.repeat.set( 4, 4 ); mat.depthWrite = false; mat.color = new THREE.Color(0x00ff00); By changing the mat.map.repeat.set values, you define how often the texture is repeated. Using a cubemap to create reflective materials With the approach Three.js uses to render scenes in real time, it is difficult and very computationally intensive to create reflective materials. Three.js, however, provides a way you can cheat and approximate reflectivity. For this, Three.js uses cubemaps. In this recipe, we'll explain how to create cubemaps and use them to create reflective materials. Getting ready A cubemap is a set of six images that can be mapped to the inside of a cube. 
They can be created from a panorama picture and look something like this: In Three.js, we map such a map on the inside of a cube or sphere and use that information to calculate reflections. The following screenshot (example 04.10-use-reflections.html) shows what this looks like when rendered in Three.js: As you can see in the preceeding screenshot, the objects in the center of the scene reflect the environment they are in. This is something often called a skybox. To get ready, the first thing we need to do is get a cubemap. If you search on the Internet, you can find some ready-to-use cubemaps, but it is also very easy to create one yourself. For this, go to http://gonchar.me/panorama/. On this page, you can upload a panoramic picture and it will be converted to a set of pictures you can use as a cubemap. For this, perform the following steps: First, get a 360 degrees panoramic picture. Once you have one, upload it to the http://gonchar.me/panorama/ website by clicking on the large OPEN button:  Once uploaded, the tool will convert the panorama picture to a cubemap as shown in the following screenshot:  When the conversion is done, you can download the various cube map sites. The recipe in this book uses the naming convention provided by Cube map sides option, so download them. You'll end up with six images with names such as right.png, left.png, top.png, bottom.png, front.png, and back.png. Once you've got the sides of the cubemap, you're ready to perform the steps in the recipe. How to do it... To use the cubemap we created in the previous section and create reflecting material,we need to perform a fair number of steps, but it isn't that complex: The first thing you need to do is create an array from the cubemap images you downloaded: var urls = [ '../assets/cubemap/flowers/right.png', '../assets/cubemap/flowers/left.png', '../assets/cubemap/flowers/top.png', '../assets/cubemap/flowers/bottom.png', '../assets/cubemap/flowers/front.png', '../assets/cubemap/flowers/back.png' ]; With this array, we can create a cubemap texture like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls); cubemap.format = THREE.RGBFormat; From this cubemap, we can use THREE.BoxGeometry and a custom THREE.ShaderMaterial object to create a skybox (the environment surrounding our meshes): var shader = THREE.ShaderLib[ "cube" ]; shader.uniforms[ "tCube" ].value = cubemap; var material = new THREE.ShaderMaterial( { fragmentShader: shader.fragmentShader, vertexShader: shader.vertexShader, uniforms: shader.uniforms, depthWrite: false, side: THREE.DoubleSide }); // create the skybox var skybox = new THREE.Mesh( new THREE.BoxGeometry( 10000, 10000, 10000 ), material ); scene.add(skybox); Three.js provides a custom shader (a piece of WebGL code) that we can use for this. As you can see in the code snippet, to use this WebGL code, we need to define a THREE.ShaderMaterial object. With this material, we create a giant THREE.BoxGeometry object that we add to scene. Now that we've created the skybox, we can define the reflecting objects: var sphereGeometry = new THREE.SphereGeometry(4,15,15); var envMaterial = new THREE.MeshBasicMaterial( {envMap:cubemap}); var sphere = new THREE.Mesh(sphereGeometry, envMaterial); As you can see, we also pass in the cubemap we created as a property (envmap) to the material. This informs Three.js that this object is positioned inside a skybox, defined by the images that make up cubemap. 
The last step is to add the object to the scene, and that's it: scene.add(sphere); In the example in the beginning of this recipe, you saw three geometries. You can use this approach with all different types of geometries. Three.js will determine how to render the reflective area. How it works... Three.js itself doesn't really do that much to render the cubemap object. It relies on a standard functionality provided by WebGL. In WebGL, there is a construct called samplerCube. With samplerCube, you can sample, based on a specific direction, which color matches the cubemap object. Three.js uses this to determine the color value for each part of the geometry. The result is that on each mesh, you can see a reflection of the surrounding cubemap using the WebGL textureCube function. In Three.js, this results in the following call (taken from the WebGL shader in GLSL): vec4 cubeColor = textureCube( tCube, vec3( -vReflect.x, vReflect.yz ) ); A more in-depth explanation on how this works can be found at http://codeflow.org/entries/2011/apr/18/advanced-webgl-part-3-irradiance-environment-map/#cubemap-lookup. There's more... In this recipe, we created the cubemap object by providing six separate images. There is, however, an alternative way to create the cubemap object. If you've got a 360 degrees panoramic image, you can use the following code to directly create a cubemap object from that image: var texture = THREE.ImageUtils.loadTexture( 360-degrees.png', new THREE.UVMapping()); Normally when you create a cubemap object, you use the code shown in this recipe to map it to a skybox. This usually gives the best results but requires some extra code. You can also use THREE.SphereGeometry to create a skybox like this: var mesh = new THREE.Mesh( new THREE.SphereGeometry( 500, 60, 40 ), new THREE.MeshBasicMaterial( { map: texture })); mesh.scale.x = -1; This applies the texture to a sphere and with mesh.scale, turns this sphere inside out. Besides reflection, you can also use a cubemap object for refraction (think about light bending through water drops or glass objects): All you have to do to make a refractive material is load the cubemap object like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls, new THREE.CubeRefractionMapping()); And define the material in the following way: var envMaterial = new THREE.MeshBasicMaterial({envMap:cubemap}); envMaterial.refractionRatio = 0.95; Summary In this article, we learned about the different textures and materials supported by Three.js Resources for Article:  Further resources on this subject: Creating the maze and animating the cube [article] Working with the Basic Components That Make Up a Three.js Scene [article] Mesh animation [article]

Extending ElasticSearch with Scripting

Packt
06 Feb 2015
21 min read
In this article by Alberto Paro, the author of ElasticSearch Cookbook Second Edition, we will cover the following recipes: (For more resources related to this topic, see here.) Installing additional script plugins Managing scripts Sorting data using scripts Computing return fields with scripting Filtering a search via scripting Introduction ElasticSearch has a powerful way of extending its capabilities with custom scripts, which can be written in several programming languages. The most common ones are Groovy, MVEL, JavaScript, and Python. In this article, we will see how it's possible to create custom scoring algorithms, special processed return fields, custom sorting, and complex update operations on records. The scripting concept of ElasticSearch can be seen as an advanced stored procedures system in the NoSQL world; so, for an advanced usage of ElasticSearch, it is very important to master it. Installing additional script plugins ElasticSearch provides native scripting (Java code compiled into a JAR) and Groovy, but a lot of interesting languages are also available, such as JavaScript and Python. In older ElasticSearch releases, prior to version 1.4, the official scripting language was MVEL, but because it was not well maintained by the MVEL developers, and because it could not be sandboxed to prevent security issues, it was replaced with Groovy. Groovy scripting is now provided by default in ElasticSearch. The other scripting languages can be installed as plugins. Getting ready You will need a working ElasticSearch cluster. How to do it... In order to install JavaScript language support for ElasticSearch (1.3.x), perform the following steps: From the command line, simply enter the following command: bin/plugin --install elasticsearch/elasticsearch-lang-javascript/2.3.0 This will print the following result: -> Installing elasticsearch/elasticsearch-lang-javascript/2.3.0... Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-lang-javascript/ elasticsearch-lang-javascript-2.3.0.zip... Downloading ....DONE Installed lang-javascript If the installation is successful, the output will end with Installed; otherwise, an error is returned. To install Python language support for ElasticSearch, just enter the following command: bin/plugin -install elasticsearch/elasticsearch-lang-python/2.3.0 The version number depends on the ElasticSearch version. Take a look at the plugin's web page to choose the correct version. How it works... Language plugins allow you to extend the set of languages that can be used in scripting. During the ElasticSearch startup, an internal ElasticSearch service called PluginService loads all the installed language plugins. In order to install or upgrade a plugin, you need to restart the node. The ElasticSearch community provides common scripting languages (a list of the supported scripting languages is available on the ElasticSearch site plugin page at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html), and others are available in GitHub repositories (a simple search on GitHub allows you to find them). The following are the most commonly used languages for scripting: Groovy (http://groovy.codehaus.org/): This language is embedded in ElasticSearch by default. It is a simple language that provides scripting functionalities. This is one of the fastest available language extensions. 
Groovy is a dynamic, object-oriented programming language with features similar to those of Python, Ruby, Perl, and Smalltalk. It also provides support to write a functional code. JavaScript (https://github.com/elasticsearch/elasticsearch-lang-javascript): This is available as an external plugin. The JavaScript implementation is based on Java Rhino (https://developer.mozilla.org/en-US/docs/Rhino) and is really fast. Python (https://github.com/elasticsearch/elasticsearch-lang-python): This is available as an external plugin, based on Jython (http://jython.org). It allows Python to be used as a script engine. Considering several benchmark results, it's slower than other languages. There's more... Groovy is preferred if the script is not too complex; otherwise, a native plugin provides a better environment to implement complex logic and data management. The performance of every language is different; the fastest one is the native Java. In the case of dynamic scripting languages, Groovy is faster, as compared to JavaScript and Python. In order to access document properties in Groovy scripts, the same approach will work as in other scripting languages: doc.score: This stores the document's score. doc['field_name'].value: This extracts the value of the field_name field from the document. If the value is an array or if you want to extract the value as an array, you can use doc['field_name'].values. doc['field_name'].empty: This returns true if the field_name field has no value in the document. doc['field_name'].multivalue: This returns true if the field_name field contains multiple values. If the field contains a geopoint value, additional methods are available, as follows: doc['field_name'].lat: This returns the latitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lats method. doc['field_name'].lon: This returns the longitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lons method. doc['field_name'].distance(lat,lon): This returns the plane distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].distanceInKm(lat,lon) method. doc['field_name'].arcDistance(lat,lon): This returns the arc distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].arcDistanceInKm(lat,lon) method. doc['field_name'].geohashDistance(geohash): This returns the distance, in miles, from a geohash value. If you need to calculate the same distance in kilometers, you should use doc['field_name'] and the geohashDistanceInKm(lat,lon) method. By using these helper methods, it is possible to create advanced scripts in order to boost a document by a distance that can be very handy in developing geolocalized centered applications. Managing scripts Depending on your scripting usage, there are several ways to customize ElasticSearch to use your script extensions. In this recipe, we will see how to provide scripts to ElasticSearch via files, indexes, or inline. Getting ready You will need a working ElasticSearch cluster populated with the populate script (chapter_06/populate_aggregations.sh), available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... 
Managing scripts
Depending on your scripting usage, there are several ways to customize ElasticSearch to use your script extensions. In this recipe, we will see how to provide scripts to ElasticSearch via files, via an index, or inline.
Getting ready
You will need a working ElasticSearch cluster populated with the populate script (chapter_06/populate_aggregations.sh), available at https://github.com/aparo/elasticsearch-cookbook-second-edition.
How to do it...
To manage scripting, perform the following steps:
Dynamic scripting is disabled by default for security reasons; we need to activate it in order to use dynamic scripting languages such as JavaScript or Python. To do this, set script.disable_dynamic: false in the ElasticSearch configuration file (config/elasticsearch.yml) and restart the cluster. To increase security, ElasticSearch does not allow you to specify inline scripts for non-sandboxed languages.
Scripts can be placed in the scripts directory inside the configuration directory. To provide a script in a file, we'll put a my_script.groovy script in the config/scripts location with the following content:
doc["price"].value * factor
If dynamic scripting is enabled (as done in the first step), ElasticSearch allows you to store scripts in a special index, .scripts. To put my_script in the index, execute the following command in the command terminal:
curl -XPOST localhost:9200/_scripts/groovy/my_script -d '{
  "script" : "doc[\"price\"].value * factor"
}'
The script can then be used by simply referencing it in the script_id field; use the following command:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script_id" : "my_script",
      "lang" : "groovy",
      "type" : "number",
      "ignore_unmapped" : true,
      "params" : {
        "factor" : 1.1
      },
      "order" : "asc"
    }
  }
}'
How it works...
ElasticSearch allows you to load your scripts in different ways; each of these methods has its pros and cons.
The most secure way to load or import scripts is to provide them as files in the config/scripts directory. This directory is continuously scanned for new files (by default, every 60 seconds). The scripting language is automatically detected from the file extension, and the script name depends on the filename. If the file is placed in subdirectories, the directory path becomes part of the script name; for example, config/scripts/mysub1/mysub2/my_script.groovy is registered as mysub1_mysub2_my_script. A script provided via the filesystem is referenced in the code via the "script": "script_name" parameter.
Scripts can also be stored in the special .scripts index. These are the REST endpoints:
To retrieve a script: GET http://<server>/_scripts/<language>/<id>
To store a script: PUT http://<server>/_scripts/<language>/<id>
To delete a script: DELETE http://<server>/_scripts/<language>/<id>
An indexed script is referenced in the code via the "script_id": "id_of_the_script" parameter.
The recipes that follow use inline scripting because it is easier to work with during the development and testing phases. Generally, a good practice is to develop using inline dynamic scripting in the request, because it is faster to prototype. Once the script is ready and no further changes are needed, it can be stored in the index, since it is simpler to call and manage. In production, a best practice is to disable dynamic scripting and store the scripts on disk (generally, by dumping the indexed scripts to disk).
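For completeness, here is a sketch of how the indexed script stored above can be inspected and removed again through the same REST endpoints (the exact response body may vary between ElasticSearch versions):
curl -XGET 'http://localhost:9200/_scripts/groovy/my_script?pretty'
curl -XDELETE 'http://localhost:9200/_scripts/groovy/my_script'
Deleting the indexed script does not affect a file-based script with the same name under config/scripts, because the two stores are independent.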
See also
The scripting page on the ElasticSearch website at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Sorting data using scripts
ElasticSearch provides scripting support for the sorting functionality. In real-world applications, there is often a need to modify the default sort (by match score) with an algorithm that depends on the context and on some external variables. Some common scenarios are as follows:
Sorting places near a point
Sorting by most-read articles
Sorting items by custom user logic
Sorting items by revenue
Getting ready
You will need a working ElasticSearch cluster and an index populated with the populate script (chapter_06/populate_aggregations.sh), which is available at https://github.com/aparo/elasticsearch-cookbook-second-edition.
How to do it...
In order to sort using scripting, perform the following steps:
If you want to order your documents by the price field multiplied by a factor parameter (that is, sales tax), the search will be as shown in the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script" : "doc[\"price\"].value * factor",
      "lang" : "groovy",
      "type" : "number",
      "ignore_unmapped" : true,
      "params" : {
        "factor" : 1.1
      },
      "order" : "asc"
    }
  }
}'
In this case, we have used a match_all query and a sort script. If everything is correct, the result returned by ElasticSearch should be as shown in the following code:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "161",
      "_score" : null, "_source" : … truncated …,
      "sort" : [ 0.0278578661440021 ]
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "634",
      "_score" : null, "_source" : … truncated …,
      "sort" : [ 0.08131364254827411 ]
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "465",
      "_score" : null, "_source" : … truncated …,
      "sort" : [ 0.1094966959069832 ]
    } ]
  }
}
How it works...
The sort scripting allows you to define several parameters, as follows:
order (default "asc") ("asc" or "desc"): This determines whether the order must be ascending or descending.
script: This contains the code to be executed.
type: This defines the type to which the computed sort value is converted.
params (optional, a JSON object): This defines the parameters that are passed to the script.
lang (by default, groovy): This defines the scripting language to be used.
ignore_unmapped (optional): This ignores unmapped fields in a sort. This flag allows you to avoid errors due to missing fields in shards.
Extending the sort with scripting allows a much broader approach to ordering your hits. ElasticSearch scripting permits the use of any code that you want; you can create custom, complex algorithms to sort your documents.
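Because a script that references a missing field can fail at evaluation time, a slightly more defensive variant of the sort above can reuse the doc['field_name'].empty helper described earlier to fall back to a default value. This is only a sketch of the idea:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script" : "doc[\"price\"].empty ? 0 : doc[\"price\"].value * factor",
      "lang" : "groovy",
      "type" : "number",
      "params" : {
        "factor" : 1.1
      },
      "order" : "asc"
    }
  }
}'
Documents without a price value are sorted as if their computed value were 0.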
There's more...
Groovy provides a lot of built-in functions (mainly taken from Java's Math class) that can be used in scripts, as shown in the following list:
time(): The current time in milliseconds
sin(a): Returns the trigonometric sine of an angle
cos(a): Returns the trigonometric cosine of an angle
tan(a): Returns the trigonometric tangent of an angle
asin(a): Returns the arc sine of a value
acos(a): Returns the arc cosine of a value
atan(a): Returns the arc tangent of a value
toRadians(angdeg): Converts an angle measured in degrees to an approximately equivalent angle measured in radians
toDegrees(angrad): Converts an angle measured in radians to an approximately equivalent angle measured in degrees
exp(a): Returns Euler's number raised to the power of a value
log(a): Returns the natural logarithm (base e) of a value
log10(a): Returns the base 10 logarithm of a value
sqrt(a): Returns the correctly rounded positive square root of a value
cbrt(a): Returns the cube root of a double value
IEEEremainder(f1, f2): Computes the remainder operation on two arguments, as prescribed by the IEEE 754 standard
ceil(a): Returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer
floor(a): Returns the largest (closest to positive infinity) value that is less than or equal to the argument and is equal to a mathematical integer
rint(a): Returns the value that is closest to the argument and is equal to a mathematical integer
atan2(y, x): Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta)
pow(a, b): Returns the value of the first argument raised to the power of the second argument
round(a): Returns the closest integer to the argument
random(): Returns a random double value
abs(a): Returns the absolute value of a value
max(a, b): Returns the greater of the two values
min(a, b): Returns the smaller of the two values
ulp(d): Returns the size of the unit in the last place of the argument
signum(d): Returns the signum function of the argument
sinh(x): Returns the hyperbolic sine of a value
cosh(x): Returns the hyperbolic cosine of a value
tanh(x): Returns the hyperbolic tangent of a value
hypot(x, y): Returns sqrt(x^2+y^2) without an intermediate overflow or underflow
If you want to retrieve records in a random order, you can use a script with a random method, as shown in the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script" : "Math.random()",
      "lang" : "groovy",
      "type" : "number",
      "params" : {}
    }
  }
}'
In this example, for every hit, the new sort value is computed by executing the Math.random() scripting function.
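As another sketch that combines one of these built-ins with a document field, the time() helper can be used for a simple recency-based sort. This assumes the documents carry a date field named timestamp whose value is exposed to the script as epoch milliseconds; both the field and this mapping detail are assumptions and not part of the sample dataset:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script" : "(time() - doc[\"timestamp\"].value) / 3600000",
      "lang" : "groovy",
      "type" : "number",
      "order" : "asc"
    }
  }
}'
The script returns the age of each document in hours, so ascending order puts the most recent documents first.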
See also
The official ElasticSearch documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Computing return fields with scripting
ElasticSearch allows you to define complex expressions that can be used to return a new calculated field value. These special fields are called script_fields, and they can be expressed with a script in every available ElasticSearch scripting language.
Getting ready
You will need a working ElasticSearch cluster and an index populated with the populate script (chapter_06/populate_aggregations.sh), which is available at https://github.com/aparo/elasticsearch-cookbook-second-edition.
How to do it...
In order to compute return fields with scripting, perform the following steps:
Return the following script fields:
"my_calc_field": This concatenates the text of the "name" and "description" fields
"my_calc_field2": This multiplies the "price" value by the "discount" parameter
From the command line, execute the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "script_fields" : {
    "my_calc_field" : {
      "script" : "doc[\"name\"].value + \" -- \" + doc[\"description\"].value"
    },
    "my_calc_field2" : {
      "script" : "doc[\"price\"].value * discount",
      "params" : {
        "discount" : 0.8
      }
    }
  }
}'
If everything works correctly, this is how the result returned by ElasticSearch should look:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "4",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "entropic -- accusantium",
        "my_calc_field2" : 5.480038242170081
      }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "9",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "frankie -- accusantium",
        "my_calc_field2" : 34.79852410178313
      }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "11",
      "_score" : 1.0,
      "fields" : {
        "my_calc_field" : "johansson -- accusamus",
        "my_calc_field2" : 11.824173084636591
      }
    } ]
  }
}
How it works...
The script fields are similar to executing an SQL function on a field during a select operation. In ElasticSearch, after the search phase is executed and the hits to be returned are calculated, if some fields (standard or script) are defined, they are calculated and returned. A script field, which can be defined in any supported language, is processed by passing it the source of the document and, if other parameters are defined for the script (as with the discount factor in this example), they are passed to the script function as well. The script function is a code snippet; it can contain everything that the language allows you to write, but it must evaluate to a value (or a list of values).
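As a small additional sketch, the Groovy math helpers can also be used inside script_fields, for example, to return the discounted price rounded to two decimal places (the rounding logic and the rounded_price field name are illustrative additions, not part of the original recipe):
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "script_fields" : {
    "rounded_price" : {
      "script" : "Math.round(doc[\"price\"].value * discount * 100) / 100.0",
      "params" : {
        "discount" : 0.8
      }
    }
  }
}'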
See also
The Installing additional script plugins recipe in this article to install additional languages for scripting
The Sorting data using scripts recipe in this article for a reference of the extra built-in functions available in Groovy scripts
Filtering a search via scripting
ElasticSearch scripting allows you to extend the traditional filter with custom scripts. Using scripting to create a custom filter is a convenient way to write scripting rules that are not provided by Lucene or ElasticSearch, and to implement business logic that is not available in the query DSL.
Getting ready
You will need a working ElasticSearch cluster and an index populated with the populate script (chapter_06/populate_aggregations.sh), which is available at https://github.com/aparo/elasticsearch-cookbook-second-edition.
How to do it...
In order to filter a search using a script, perform the following steps:
Write a search with a filter that keeps only the documents whose age value is greater than the parameter value:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  }
}'
In this example, all the documents in which the value of age is greater than param1 qualify to be returned. If everything works correctly, the result returned by ElasticSearch should be as shown here:
{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 237,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "9",
      "_score" : 1.0, "_source" : { … "age": 83, … }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "23",
      "_score" : 1.0, "_source" : { … "age": 87, … }
    }, {
      "_index" : "test-index",
      "_type" : "test-type",
      "_id" : "47",
      "_score" : 1.0, "_source" : { … "age": 98, … }
    } ]
  }
}
How it works...
The script filter is a language script that returns a Boolean value (true/false). For every hit, the script is evaluated, and if it returns true, the hit passes the filter. This type of scripting can only be used as a Lucene filter, not as a query, because it does not contribute to the score (the exceptions are the constant_score and custom_filters_score queries).
These are the scripting fields:
script: This contains the code to be executed
params: These are optional parameters to be passed to the script
lang (defaults to groovy): This defines the language of the script
The script code can be any code in your preferred, supported scripting language that returns a Boolean value.
There's more...
Other languages are used in the same way as Groovy. For the current example, I have chosen a standard comparison that works in several languages. To execute the same script using the JavaScript language, use the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "lang": "javascript",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  }
}'
For Python, use the following code:
curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc[\"age\"].value > param1",
          "lang": "python",
          "params" : {
            "param1" : 80
          }
        }
      },
      "query": {
        "match_all": {}
      }
    }
  }
}'
See also
The Installing additional script plugins recipe in this article to install additional languages for scripting
The Sorting data using scripts recipe in this article for a reference of the extra built-in functions available in Groovy scripts
Summary
In this article, you learned the ways in which you can use scripting to extend the functional capabilities of ElasticSearch using different programming languages.