Data | Tech News, Tutorials & Expert Insights

03 Mar 2017

17 min read

Data Pipelines

03 Mar 2017

In this article by Andrew Morgan, Antoine Amend, Matthew Hallett, David George, the author of the book Mastering Spark for Data Science, readers will learn how to construct a content registerand use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process. Readers will learn how to construct a content registerand use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process. In this article we will cover the following topics: Welcome the GDELT Dataset Data Pipelines Universal Ingestion Framework Real-time monitoring for new data Receiving Streaming Data via Kafka Registering new content and vaulting for tracking purposes Visualization of content metrics in Kibana - to monitor ingestion processes & data health (For more resources related to this topic, see here.) Data Pipelines Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that’s a whole topic for another book!). We have already seen that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision into two distinct areas: Ad-hoc and scheduled. Ad-hoc data acquisition is the most common method during prototyping and small scale analytics as it usually doesn’t require any additional software to implement - the user requires some data and simply downloads it from source as and when required. This method is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secure. Scheduled data acquisition is used in more controlled environments for large scale and production analytics, there is also an excellent case for ingesting a dataset into a data lake for possible future use. With Internet of Things (IoT) on the increase, huge volumes of data are being produced, in many cases if the data is not ingested now it is lost forever. Much of this data may not have an immediate or apparent use today, but could do in the future; so the mind-set is to gather all of the data in case it is needed and delete it later when sure it is not. It’s clear we need a flexible approach to data acquisition that supports a variety of procurement options. Universal Ingestion Framework There are many ways to approach data acquisition ranging from home grown bash scripts through to high-end commercial tools. The aim of this section is to introduce a highly flexible framework that we can use for small scale data ingest, and then grow as our requirements change - all the way through to a full corporately managed workflow if needed - that framework will be build using Apache NiFi. NiFi enables us to build large-scale integrated data pipelines that move data around the planet. In addition, it’s also incredibly flexible and easy to build simple pipelines - usually quicker even than using Bash or any other traditional scripting method. If an ad-hoc approach is taken to source the same dataset on a number of occasions, then some serious thought should be given as to whether it falls into the scheduled category, or at least whether a more robust storage and versioning setup should be introduced. We have chosen to use Apache NiFi as it offers a solution that provides the ability to create many, varied complexity pipelines that can be scaled to truly Big Data and IoT levels, and it also provides a great drag & drop interface (using what’s known as flow-based programming[1]). With patterns, templates and modules for workflow production, it automatically takes care of many of the complex features that traditionally plague developers such as multi-threading, connection management and scalable processing. For our purposes it will enable us to quickly build simple pipelines for prototyping, and scale these to full production where required. It’s pretty well documented and easy to get running https://nifi.apache.org/download.html, it runs in a browser and looks like this: https://en.wikipedia.org/wiki/Flow-based_programming We leave the installation of NiFi as an exercise for the reader - which we would encourage you to do - as we will be using it in the following section. Introducing the GDELT News Stream Hopefully, we have NiFi up and running now and can start to ingest some data. So let’s start with some global news media data from GDELT. Here’s our brief, taken from the GDELT website http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/: “Within 15 minutes of GDELT monitoring a news report breaking anywhere the world, it has translated it, processed it to identify all events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts, placed it into global context, and made all of this available via a live open metadata firehose enabling open research on the planet itself. [As] the single largest deployment in the world of sentiment analysis, we hope that by bringing together so many emotional and thematic dimensions crossing so many languages and disciplines, and applying all of it in realtime to breaking news from across the planet, that this will spur an entirely new era in how we think about emotion and the ways in which it can help us better understand how we contextualize, interpret, respond to, and understand global events.” In order to start consuming this open data, we’ll need to hook into that metadata firehose and ingest the news streams onto our platform. How do we do this? Let’s start by finding out what data is available. Discover GDELT Real-time GDELT publish a list of the latest files on their website - this list is updated every 15 minutes. In NiFi, we can setup a dataflow that will poll the GDELT website, source a file from this list and save it to HDFS so we can use it later. Inside the NiFi dataflow designer, create a HTTP connector by dragging a processor onto the canvas and selecting GetHTTP. To configure this processor, you’ll need to enter the URL of the file list as: http://data.gdeltproject.org/gdeltv2/lastupdate.txt And also provide a temporary filename for the file list you will download. In the example below, we’ve used the NiFi’s expression language to generate a universally unique key so that files are not overwritten (UUID()). It’s worth noting that with this type of processor (GetHTTP), NiFi supports a number of scheduling and timing options for the polling and retrieval. For now, we’re just going to use the default options and let NiFi manage the polling intervals for us. An example of latest file list from GDELT is shown below. Next, we will parse the URL of the GKG news stream so that we can fetch it in a moment. Create a Regular Expression parser by dragging a processor onto the canvas and selecting ExtractText. Now position the new processor underneath the existing one and drag a line from the top processor to the bottom one. Finish by selecting the success relationship in the connection dialog that pops up. This is shown in the example below. Next, let’s configure the ExtractText processor to use a regular expression that matches only the relevant text of the file list, for example: ([^ ]*gkg.csv.*) From this regular expression, NiFi will create a new property (in this case, called url) associated with the flow design, which will take on a new value as each particular instance goes through the flow. It can even be configured to support multiple threads. Again, this is example is shown below. It’s worth noting here that while this is a fairly specific example, the technique is deliberately general purpose and can be used in many situations. Our First GDELT Feed Now that we have the URL of the GKG feed, we fetch it by configuring an InvokeHTTP processor to use the url property we previously created as it’s remote endpoint, and dragging the line as before. All that remains is to decompress the zipped content with a UnpackContent processor (using the basic zip format) and save to HDFS using a PutHDFS processor, like so: Improving with Publish and Subscribe So far, this flow looks very “point-to-point”, meaning that if we were to introduce a new consumer of data, for example, a Spark-streaming job, the flow must be changed. For example, the flow design might have to change to look like this: If we add yet another, the flow must change again. In fact, each time we add a new consumer, the flow gets a little more complicated, particularly when all the error handling is added. This is clearly not always desirable, as introducing or removing consumers (or producers) of data, might be something we want to do often, even frequently. Plus, it’s also a good idea to try to keep your flows as simple and reusable as possible. Therefore, for a more flexible pattern, instead of writing directly to HDFS, we can publish to Apache Kafka. This gives us the ability to add and remove consumers at any time without changing the data ingestion pipeline. We can also still write to HDFS from Kafka if needed, possibly even by designing a separate NiFi flow, or connect directly to Kafka using Spark-streaming. To do this, we create a Kafka writer by dragging a processor onto the canvas and selecting PutKafka. We now have a simple flow that continuously polls for an available file list, routinely retrieving the latest copy of a new stream over the web as it becomes available, decompressing the content and streaming it record-by-record into Kafka, a durable, fault-tolerant, distributed message queue, for processing by spark-streaming or storage in HDFS. And what’s more, without writing a single line of bash! Content Registry We have seen in this article that data ingestion is an area that is often overlooked, and that its importance cannot be underestimated. At this point we have a pipeline that enables us to ingest data from a source, schedule that ingest and direct the data to our repository of choice. But the story does not end there. Now we have the data, we need to fulfil our data management responsibilities. Enter the content registry. We’re going to build an index of metadata related to that data we have ingested. The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we’ve received and understand basic information about it, such as, when we received it, where it came from, how big it is, what type it is, etc. Choices and More Choices The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing, we will require at least the following attributes: Easily searchable Scalable Parallel write ability Redundancy There are many ways to meet these requirements, for example we could write the metadata to Parquet, store in HDFS and search using Spark SQL. However, here we will use Elasticsearch as it meets the requirements a little better, most notably because it facilitates low latency queries of our metadata over a REST API - very useful for creating dashboards. In fact, Elasticsearch has the advantage of integrating directly with Kibana, meaning it can quickly produce rich visualizations of our content registry. For this reason, we will proceed with Elasticsearch in mind. Going with the Flow Using our current NiFi pipeline flow, let’s fork the output from “Fetch GKG files from URL” to add an additional set of steps to allow us to capture and store this metadata in Elasticsearch. These are: Replace the flow content with our metadata model Capture the metadata Store directly in Elasticsearch Here’s what this looks like in NiFi: Metadata Model So, the first step here is to define our metadata model. And there are many areas we could consider, but let’s select a set that helps tackle a few key points from earlier discussions. This will provide a good basis upon which further data can be added in the future, if required. So, let’s keep it simple and use the following three attributes: File size Date ingested File name These will provide basic registration of received files. Next, inside the NiFi flow, we’ll need to replace the actual data content with this new metadata model. An easy way to do this, is to create a JSON template file from our model. We’ll save it to local disk and use it inside a FetchFile processor to replace the flow’s content with this skeleton object. This template will look something like: { "FileSize": SIZE, "FileName": "FILENAME", "IngestedDate": "DATE" } Note the use of placeholder names (SIZE, FILENAME, DATE) in place of the attribute values. These will be substituted, one-by-one, by a sequence of ReplaceText processors, that swap the placeholder names for an appropriate flow attribute using regular expressions provided by the NiFi Expression Language, for example DATE becomes ${now()}. The last step is to output the new metadata payload to Elasticsearch. Once again, NiFi comes ready with a processor for this; the PutElasticsearch processor. An example metadata entry in Elasticsearch: { "_index": "gkg", "_type": "files", "_id": "AVZHCvGIV6x-JwdgvCzW", "_score": 1, "source": { "FileSize": 11279827, "FileName": "20150218233000.gkg.csv.zip", "IngestedDate": "2016-08-01T17:43:00+01:00" } } Now that we have added the ability to collect and interrogate metadata, we now have access to more statistics that can be used for analysis. This includes: Time based analysis e.g. file sizes over time Loss of data, for example are there data “holes” in the timeline? If there is a particular analytic that is required, the NIFI metadata component can be adjusted to provide the relevant data points. Indeed, an analytic could be built to look at historical data and update the index accordingly if the metadata does not exist in current data. Kibana Dashboard We have mentioned Kibana a number of times in this article, now that we have an index of metadata in Elasticsearch, we can use the tool to visualize some analytics. The purpose of this brief section is to demonstrate that we can immediately start to model and visualize our data. In this simple example we have completed the following steps: Added the Elasticsearch index for our GDELT metadata to the “Settings” tab Selected “file size” under the “Discover” tab Selected Visualize for “file size” Changed the Aggregation field to “Range” Entered values for the ranges The resultant graph displays the file size distribution: From here we are free to create new visualizations or even a fully featured dashboard that can be used to monitor the status of our file ingest. By increasing the variety of metadata written to Elasticsearch from NiFi, we can make more fields available in Kibana and even start our data science journey right here with some ingest based actionable insights. Now that we have a fully-functioning data pipeline delivering us real-time feeds of data, how do we ensure data quality of the payload we are receiving? Let’s take a look at the options. Quality Assurance With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the front door. It’s perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with. For example, basic checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, etc. You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it’s not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time. Here are some examples of the type of rough capacity planning calculation you can perform: Example 1: Basic Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population) takes 4 minutes There are no other users on the compute cluster There are 10 minutes of resources available for other tasks. As there are no other users on the cluster, this is satisfactory - no action needs to be taken. Example 2: Advanced Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes There are no other users on the compute cluster There is only 1 minute of resource available for other tasks. We probably need to consider, either: Configuring a resource scheduling policy Reducing the amount of data ingested Reducing the amount of processing we undertake Adding additional compute resources to the cluster Example 3: Basic Quality Checking, 50% Utility Due to Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population) takes 4 minutes (100% utility) There are other users on the compute cluster There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50))). Since there are other users there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur. When you run into timing issues, you have a number of options available to you in order to circumvent any backlog: Negotiating sole use of the resources at certain times Configuring a resource scheduling policy, including: YARN Fair Scheduler: allows you to define queues with differing priorities and target your Spark jobs by setting the spark.yarn.queue property on start-up so your job always takes precedence Dynamicandr Resource Allocation: allows concurrently running jobs to automatically scale to match their utilization Spark Scheduler Pool: allows you to define queues when sharing a SparkContext using multithreading model, and target your Spark job by setting the spark.scheduler.pool property per execution thread so your thread takes precedence Running processing jobs overnight when the cluster is quiet In any case, you will eventually get a good idea of how the various parts to your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There’s always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper and builds data expertise. Summary In this article we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resultant data. This section is particularly important as GDELT is used extensively throughout the book and the NiFi method is a highly effective way to source data in a scalable and modular way. Resources for Article: Further resources on this subject: Integration with Continuous Delivery [article] Amazon Web Services [article] AWS Fundamentals [article]

0
1
16359

Packt

01 Mar 2017

17 min read

Understanding Spark RDD

Packt

01 Mar 2017

17 min read

In this article by Asif Abbasi author of the book Learning Apache Spark 2.0, we will understand Spark RDD along with that we will learn, how to construct RDDs, Operations on RDDs, Passing functions to Spark in Scala, Java, and Python and Transformations such as map, filter, flatMap, and sample. (For more resources related to this topic, see here.) What is an RDD? What’s in a name might be true for a rose, but perhaps not for an Resilient Distributed Datasets (RDD), and in essence describes what an RDD is. They are basically datasets, which are distributed across a cluster (remember Spark framework is inherently based on an MPP architecture), and provide resilience (automatic failover) by nature. Before we go into any further detail, let’s try to understand this a little bit, and again we are trying to be as abstract as possible. Let us assume that you have a sensor data from aircraft sensors and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst and a data scientist point of view, your major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD concept, and any transformation/action that you perform on the RDD applies across the entire dataset. Six month's worth of dataset for an A350 would be approximately 450 TBs of data, and would need to sit across multiple machines. For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows: Figure 2-1: RDD split across a cluster The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of nuisances including recovering from node failures. RDD’s are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of the entire parent RDDs of the RDD. In addition to the resilience, distribution, and representing a data set, an RDD has various other distinguishing qualities: In Memory: An RDD is a memory resident collection of objects. We’ll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation. Partitioned: A partition is a division of a logical dataset or constituent elements into independent parts. Partitioning is a defacto performance optimization technique in distributed systems to achieve minimal network traffic, a killer for high performance workloads. The objective of partitioning in key-value oriented data is to collocate similar range of keys and in effect minimize shuffling. Data inside RDD is split into partitions and across various nodes of the cluster. Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type. Lazy evaluation: The transformations in Spark are lazy, which means data inside RDD is not available until you perform an action. You can, however, make the data available at any time using a count() action on the RDD. We’ll discuss this later and the benefits associated with it. Immutable: An RDD once created cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it. Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel. Cacheable: Since RDD’s are lazily evaluated, any action on an RDD will cause the RDD to revaluate all transformations that led to the creation of RDD. This is generally not a desirable behavior on large datasets, and hence Spark allows the option to persist the data on memory or disk. A typical Spark program flow with an RDD includes: Creation of an RDD from a data source. A set of transformations, for example, filter, map, join, and so on. Persisting the RDD to avoid re-execution. Calling actions on the RDD to start performing parallel operations across the cluster. This is depicted in the following figure: Figure 2-2: Typical Spark RDD flow Operations on RDD Two major operation types can be performed on an RDD. They are called: Transformations Actions Transformations Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and hence lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function that will pass through each element of the RDD and return a totally new RDD representing the results of application of the function on the original dataset. Actions Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only a final dataset, which might be few MBs rather than the entire 1 TB dataset of mapped intermediate result. You should, however, remember to persist intermediate results otherwise Spark will recompute the entire RDD graph each time an Action is called. The persist() method on an RDD should help you avoid recomputation and saving intermediate results. We’ll look at this in more detail later. Let’s illustrate the work of transformations and actions by a simple example. In this specific example, we’ll be using flatmap() transformations and a count action. We’ll use the README.md file from the local filesystem as an example. We’ll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you must try this example with your own piece of text and investigate the results: //Loading the README.md file val dataFile = sc.textFile(“README.md”) Now that the data has been loaded, we’ll need to run a transformation. Since we know that each line of the text is loaded as a separate element, we’ll need to run a flatMap transformation and separate out individual words as separate elements, for which we’ll use the split function and use space as a delimiter: //Separate out a list of words from individual RDD elements val words = dataFile.flatMap(line => line.split(“ “)) Remember that until this point, while you seem to have applied a transformation function, nothing has been executed and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD, to perform the computation, which then results in fetching of data from the file to create an RDD, before applying the transformation function specified. You might note that we have actually passed a function to Spark: //Separate out a list of words from individual RDD elements Words.count() Upon calling the count() action the RDD is evaluated, and the results are sent back to the driver program. This is very neat and especially useful during big data applications. If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework: //Loading data file, applying transformations and action dataFile = sc.textFile("README.md") words = dataFile.flatMap(lambda line: line.split(" ")) words.count() Programming the same functionality in Java is also quite straight forward and will look pretty similar to the program in Scala: JavaRDD<String> lines = sc.textFile("README.md"); JavaRDD<String> words = lines.map(line -> line.split(“ “)); int wordCount = words.count(); This might look like a simple program, but behind the scenes it is taking the line.split(“ ”) function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination to schedule it across the cluster, and get the results back. Passing functions to Spark (Scala) As you have seen in the previous example, passing functions is a critical functionality provided by Spark. From a user’s point of view you would pass the function in your driver program, and Spark would figure out the location of the data partitions across the cluster memory, running it in parallel. The exact syntax of passing functions differs by the programming language. Since Spark has been written in Scala, we’ll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows: Anonymous functions Static singleton methods Anonymous functions Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the programming language. The reason they are called anonymous functions is because you can give any name to the input argument and the result would be the same. For example, the following code examples would produce the same output: val words = dataFile.map(line => line.split(“ “)) val words = dataFile.map(anyline => anyline.split(“ “)) val words = dataFile.map(_.split(“ “)) Figure 2-11: Passing anonymous functions to Spark in Scala Static singleton functions While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to request the framework for a complex data manipulation. Static singleton functions come to the rescue with their own nuances, which we will discuss in this section. In software engineering, the Singleton pattern is a design pattern that restricts instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class and not an instance of it. They usually take input from the parameters, perform actions on it, and return the result. Figure 2-12: Passing static singleton functions to Spark in Scala Static singleton is the preferred way to pass functions, as technically you can create a class and call a method in the class instance. For example: class UtilFunctions{ def split(inputParam: String): Array[String] = {inputParam.split(“ “)} def operate(rdd: RDD[String]): RDD[String] ={ rdd.map(split)} } You can send a method in a class, but that has performance implications as the entire object would be sent along the method. Passing functions to Spark (Java) In Java, to create a function you will have to implement the interfaces available in the org.apache.spark.api.java function package. There are two popular ways to create such functions: Implement the interface in your own class, and pass the instance to Spark. Starting Java 8, you can use lambda expressions to pass off the functions to the Spark framework. Let’s reimplement the preceding word count examples in Java: Figure 2-13: Code example of Java implementation of word count (inline functions) If you belong to a group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree to that assertion), you may want to create separate functions and call them as follows: Figure 2-14: Code example of Java implementation of word count Passing functions to Spark (Python) Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this: Lambda expressions: The ideal way for short functions that can be written inside a single expression Local defs inside the function calling into Spark for longer code Top-level functions in a module While we have already looked at the lambda functions in some of the previous examples, let’s look at local definitions of the functions. Our example stays the same, which is we are trying to count the total number of words in a text file in Spark: def splitter(lineOfText): words = lineOfText.split(" ") return len(words) def aggregate(numWordsLine1, numWordsLineNext): totalWords = numWordsLine1 + numWordsLineNext return totalWords Let’s see the working code example: Figure 2-15: Code example of Python word count (local definition of functions) Here’s another way to implement this by defining the functions as a part of a UtilFunctions class, and referencing them within your map and reduce functions: Figure 2-16: Code example of Python word count (Utility class) You may want to be a bit cheeky here and try to add a countWords() to the UtilFunctions, so that it takes an RDD as input, and returns the total number of words. This method has potential performance implications as the whole object will need to be sent to the cluster. Let’s see how this can be implemented and the results in the following screenshot: Figure 2-17: Code example of Python word count (Utility class - 2) This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally. Now that we have had a look at how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let’s look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark documentation in the programming guide. If you would like to get a comprehensive list of all the available functions, you might want to check the following API docs: RDD PairRDD Scala http://bit.ly/2bfyoTo http://bit.ly/2bfzgah Python http://bit.ly/2bfyURl N/A Java http://bit.ly/2bfyRov http://bit.ly/2bfyOsH R http://bit.ly/2bfyrOZ N/A Table 2.1 – RDD and PairRDD API references Transformations The following table shows the most common transformations: map(func) coalesce(numPartitions) filter(func) repartition(numPartitions) flatMap(func) repartitionAndSortWithinPartitions(partitioner) mapPartitions(func) join(otherDataset, [numTasks]) mapPartitionsWithIndex(func) cogroup(otherDataset, [numTasks]) sample(withReplacement, fraction, seed) cartesian(otherDataset) Map(func) The map transformation is the most commonly used and the simplest of transformations on an RDD. The map transformation applies the function passed in the arguments to each of the elements of the source RDD. In the previous examples, we have seen the usage of map() transformation where we have passed the split() function to the input RDD. Figure 2-18: Operation of a map() function We’ll not give examples of map() functions as we have already seen plenty of examples of map functions previously. Filter (func) Filter, as the name implies, filters the input RDD, and creates a new dataset that satisfies the predicate passed as arguments. Example 2-1: Scala filtering example: val dataFile = sc.textFile(“README.md”) val linesWithApache = dataFile.filter(line => line.contains(“Apache”)) Example 2-2: Python filtering example: dataFile = sc.textFile(“README.md”) linesWithApache = dataFile.filter(lambda line: “Apache” in line) Example 2-3: Java filtering example: JavaRDD<String> dataFile = sc.textFile(“README.md”) JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains(“Apache”)); flatMap(func) The flatMap transformation is similar to map, but it offers a bit more flexibility. From the perspective of similarity to a map function, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we had used flatMap to flatten the result of the split(“”) function, which returns a flattened structure rather than an RDD of string arrays. Figure 2-19: Operational details of the flatMap() transformation Let’s look at the flatMap example in Scala. Example 2-4: The flatmap() example in Scala: val favMovies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange")); movies.flatMap(movieTitle=>movieTitle.split(" ")).collect() A flatMap in Python API would produce similar results. Example 2-5: The flatmap() example in Python: movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"]) movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect() The flatMap example in Java is a bit long-winded, but it essentially produces the same results. Example 2-6: The flatmap() example in Java: JavaRDD<String> movies = sc.parallelize (Arrays.asList("Pulp Fiction","Requiem for a dream" ,"A clockwork Orange") ); JavaRDD<String> movieName = movies.flatMap( new FlatMapFunction<String,String>(){ public Iterator<String> call(String movie){ return Arrays.asList(movie.split(" ")) .iterator(); } } ); Sample(withReplacement, fraction, seed) Sampling is an important component of any data analysis and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDD’s for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on a full dataset. But here is a quick overview of the parameters that are passed onto the method: withReplacement: Is a Boolean (True/False), and it indicates if elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent. In practical terms this means that if we draw two samples with replacement, what we get on the first one doesn’t affect what we get on the second draw, and hence the covariance between the two samples is zero. If we are sampling without replacement, the two samples aren’t independent. Practically this means what we got on the first draw affects what we get on the second one and hence the covariance between the two isn’t zero. fraction: Fraction indicates the expected size of the sample as a fraction of the RDD’s size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as a fraction. seed: The seed used for the random number generator. Let’s look at the sampling example in Scala. Example 2-7: The sample() example in Scala: val data = sc.parallelize( List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)); data.sample(true,0.1,12345).collect() The sampling example in Python looks similar to the one in Scala. Example 2-8: The sample() example in Python: data = sc.parallelize( [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]) data.sample(1,0.1,12345).collect() In Java, our sampling example returns an RDD of integers. Example 2-9: The sample() example in Java: JavaRDD<Integer> nums = sc.parallelize(Arrays.asList( 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)); nums.sample(true,0.1,12345).collect(); References https://spark.apache.org/docs/latest/programming-guide.html http://www.purplemath.com/modules/numbprop.htm Summary We have gone through the concept of creating an RDD, to manipulating data within the RDD. We’ve looked at the transformations and actions available to an RDD, and walked you through various code examples to explain the differences between transformations and actions Resources for Article: Further resources on this subject: Getting Started with Apache Spark [article] Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article]

0
0
13009

article-image-review-sql-server-features-developers

Packt

13 Feb 2017

43 min read

Review of SQL Server Features for Developers

Packt

13 Feb 2017

43 min read

0
0
1780

Packt

13 Feb 2017

14 min read

CNN architecture

Packt

13 Feb 2017

14 min read

0
0
29872

Packt

10 Feb 2017

13 min read

Bitcoin

Packt

10 Feb 2017

13 min read

In this article by Imran Bashir, the author of the book Mastering Blockchain, will see about bitcoin and it's importance in electronic cash system. (For more resources related to this topic, see here.) Bitcoin is the first application of blockchain technology. In this article readers will be introduced to the bitcoin technology in detail. Bitcoin has started a revolution with the introduction of very first fully decentralized digital currency that has been proven to be extremely secure and stable. This has also sparked great interest in academic and industrial research and introduced many new research areas. Since its introduction in 2008 bitcoin has gained much popularity and currently is the most successful digital currency in the world with billions of dollars invested in it. It is built on decades of research in the field of cryptography, digital cash and distributed computing. In the following section brief history is presented in order to provide background required to understand the foundations behind the invention of bitcoin. Digital currencies have always been an active area of research for many decades. Early proposals to create digital cash goes as far back as the early 1980s. In 1982 David Chaum proposed a scheme that used blind signatures to build untraceable digital currency. In this scheme a bank would issue digital money by signing a blinded and random serial number presented to it by the user. The user can then use the digital token signed by the bank as currency. The limitation in this scheme is that the bank has to keep track of all used serial numbers. This is a central system by design and requires to be trusted by the users. Later on in 1990 David Chaum proposed a refined version named ecash that not only used blinded signature but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it. If the same token is used at two different location then the identity of the double spender would be revealed. ecash could only represent fixed amount of money. Adam Back's hashcash introduced in 1997 was originally proposed to thwart the email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but is comparatively difficult to compute. The idea is that for a single user and single email extra computational effort it not noticeable but someone sending large number of spam emails would be discouraged as the time and resources required to run the spam campaign will increase substantially. B-money was proposed by Wei Dai in 1998 which introduced the idea of using proof of work to create money. Major weakness in the system was that some adversary with higher computational power could generate unsolicited money without giving the chance to the network to adjust to an appropriate difficulty level. The system was lacking details on the consensus mechanism between nodes and some security issues like Sybil attacks were also not addressed. At the same time Nick Szabo introduced the concept of bit gold which was also based on proof of work mechanism but had same problems as b-money had with one exception that network difficulty level was adjustable. Tomas Sander and Ammon TaShama introduced an ecash scheme in 1999 that for the first time used merkle trees to represent coins and zero knowledge proofs to prove possession of coins. In the scheme a central bank was required who kept record of all used serial numbers. This scheme allowed users to be fully anonymous albeit at some computational cost. RPOW (Reusable Proof of Work) was introduced by Hal Finney in 2004 that used hash cash scheme by Adam Back as a proof of computational resources spent to create the money. This was also a central system that kept a central database to keep track of all used PoW tokens. This was an online system that used remote attestation made possible by trusted computing platform (TPM hardware). All the above mentioned schemes are intelligently designed but were weak from one aspect or another. Especially all these schemes rely on a central server which is required to be trusted by the users. Bitcoin In 2008 bitcoin paper Bitcoin: A Peer-to-Peer Electronic Cash System was written by Satoshi Nakamoto. First key idea introduced in the paper is that it is a purely peer to peer electronic cash that does need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of Cryptographic research like merkle trees, hash functions, public key cryptography and digital signatures. Moreover ideas like bit gold, b-money, hashcash and cryptographic time stamping have provided the foundations for bitcoin invention. All these technologies are cleverly combined in bitcoin to create world's first decentralized currency. Key issue that has been addressed in bitcoin is an elegant solution to Byzantine Generals problem along with a practical solution of double spend problem. Value of bitcoin has increased significantly since 2011 as shown in the graph below: Bitcoin price and volume since 2012 (on logarithmic scale) Regulation of bitcoin is a controversial subject and as much as it is a libertarian's dream law enforcement agencies and governments are proposing various regulations to control it such as bitlicense issued by NewYorks state department of financial services. This is a license issued to businesses which perform activities related to virtual currencies. Growth of bitcoin is also due to so called Network Effect. Also called demand-side economies of scale, it is a concept which basically means that more users who use the network the more valuable it becomes. Over time exponential increase has been seen in bitcoin network growth. Even though the price of bitcoin is quite volatile it has increased significantly over a period of last few years. Currently (at the time of writing) bitcoin price is 815 GBP. Bitcoin definition Bitcoin can be defined in various ways, it's a protocol, a digital currency and a platform. It is a combination of peer to peer network, protocols and software that facilitate the creation and usage of digital currency named bitcoin. Note that Bitcoin with capital B is used to refer to Bitcoin protocol whereas bitcoin with lower case b is used to refer to bitcoin, the currency. Nodes in this peer to peer to network talk to each other using the Bitcoin protocol. Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover double spending problem was solved in an elegant and ingenious way in bitcoin. Double spending problem arises when for example a user sends coins to two different users at the same time and they will be verified independently as valid transactions. Keys and addresses Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the corresponding public key of a private key and hashing it twice, first with SHA256 algorithm and then with RIPEMD160. The resultant 160-bit hash is then prefixed with a version number and finally encoded with Base58Check encoding scheme. The bitcoin addresses are 26 to 35 characters long and begin with digit 1 or 3. A typical bitcoin address looks like a string shown as follows: 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding shown address is as follows: QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt There are currently two types of addresses, commonly used P2PKH and another P2SH type starting with 1 and 3 respectively. In early days bitcoin used direct Pay-to-Pubkey which is now superseded by P2PKH. However direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used for more than once otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, bitcoin has some other security issues also, such as transaction malleability which requires different approach to resolve. from bitaddress.org private key and bitcoin address in a paper wallet Public keys in bitcoin In Public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on SECP256K1 standard. A private key is randomly selected and is 256-bit in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve and in uncompressed format are presented with a prefix of 04 in hexadecimal format. X and Y co-ordinates are both 32-bit in length. In total the compressed public key is 33 bytes long as compared to 65 bytes in uncompressed format. Compressed version of public keys basically include only X part, since Y part can be derived from it. The reason why compressed version of public keys works is that bitcoin client initially used uncompressed keys, but starting from bitcoin core client 0.6 compressed keys are used as standard. Keys are identified by various prefixes described as follows: Uncompressed public keys used 0x04 as prefix. Compressed public key starts with 0x03 if the y 32-bit part of the public key is odd. Compressed public key starts with 0x02 if the y 32-bit part of the public key is even. More mathematical description and reason why it works is described later. If the ECC graph is visualized it reveals that the y co-ordinate can be either below the x-axis or above the x-axis and as the curve is symmetric only the location in the prime field is required to be stored. Private keys in bitcoin Private keys are basically 256-bit numbers chosen in the range specified by SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to private key and vice versa. Steps are described later. Also Mini Private Key Format is used sometimes to encode the key in under 30 characters to allow storage where physical space is limited, for example, etching on physical coins or damage resistant QR codes. Bitcoin core client also allows encryption of the wallet which contains the private keys. Bitcoin currency units Bitcoin currency units are described as follows. Smallest bitcoin denomination is Satoshi. Base58Check encoding This encoding is used to limit the confusion between various characters such as 0OIl as they can look same in different fonts. The encoding basically takes the binary byte arrays and converts them into human readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in base58.h source file in bitcoin source code. Explanation from bitcoin source code Bitcoin addresses are encoded using Base58check encoding. Vanity addresses As the bitcoin addresses are based on base 58 encoding, it is possible to generate addresses that contain human readable messages. An example is shown as follows: Public address encoded in QR Vanity addresses are generated by using a purely brute force method. An example is shown as follows: Vanity address generated from https://bitcoinvanitygen.com/ Transactions Transactions are at the core of bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address or it can be quite complex depending on the requirements. Each transaction is composed of at least one input and output. Inputs can be thought of as coins being spent that have been created in a previous transaction and outputs as coins being created. If a transaction is minting new coins then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key and also a reference is required to the previous transaction to show the origin of the coins. Coins are in fact unspent transactions outputs represented in Satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions and these can be viewed by using any online blockchain explorer. Transaction life cycle A user/sender sends a transaction using wallet software or some other interface. Wallet software signs the transaction using the sender's private key. Transaction is broadcasted to the Bitcoin network using a flooding algorithm. Mining nodes include this transaction in the next block to be mined. Mining starts and once a miner who solves the Proof of Work problem broadcasts the newly mined block to the network. Proof of Work is explained in detail later. The nodes verify the block and propagate the block further and confirmation start to generate. Finally the confirmations start to appear in the receiver's wallet and after approximately 6 confirmations the transaction is considered finalized and confirmed. However 6 is just a recommended number , the transaction can be considered final even after first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending virtually eliminates after 6 confirmations. Transaction structure A transaction at a high level contains metadata, inputs and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table: Field Size Description Version Number 4 bytes Used to specify rules to be used by the miners and nodes for transaction processing. Input counter 1 to 9 bytes Number of inputs included in the transaction. list of inputs variable Each input is composed of several fields including Previous Transaction hash, Previous Txout-index, Txin-script length, Txin-script and optional sequence number. The first transaction in a block is also called coinbase transaction. Specifies on or more transaction inputs. Out-counter 1 to 9 bytes positive integer representing the number of outputs. list of outputs variable Outputs included in the transaction. lock_time 4 bytes It defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or block number. MetaData: This part of the transaction contains some values like size of transaction, number of inputs and outputs, hash of the transaction and a lock_time field. Every transaction has a prefix specifying the version number. Inputs: Generally each input spends a previous output. Each output is considered a UTXO, Unspent transaction output until an input consumes it. Outputs: Outputs have only two fields and it contains instructions for sending bitcoins. First field contains the amount of Satoshis where as second field is a locking script which contains the conditions that needs to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later. Verification: Verification is performed by using Bitcoin's scripting language Summary In this article, we learned the importance of bitcoin in digital currency and how bitcoins are encoded using various private keys and encoding techniques. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Protecting Your Bitcoins [article] FPGA Mining [article]

0
0
26924

Packt

07 Feb 2017

32 min read

Context – Understanding your Data using R

Packt

07 Feb 2017

32 min read

0
2
30180

article-image-classification-using-convolutional-neural-networks

Mohammad Pezeshki

07 Feb 2017

5 min read

Classification using Convolutional Neural Networks

Mohammad Pezeshki

07 Feb 2017

5 min read

In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25000 images of cats and dogs, divided into 20000 training, 2500 validation, and 2500 testing images. It seems reasonable to use the most promising model for object recognition, which is convolutional neural network (CNN). As a result, we use CNN as the baseline for the experiments, and along with this post, we will try to improve its performance using different techniques. So, in the next sections, we will first introduce CNN and its architecture and then we will explore three techniques to boost the performance and speed. These three techniques are using Parametric ReLU and a method of Batch Normalization. In this post, we will show the experimental results as we go through each technique. The complete code for CNN is available online in the author’s GitHub repository. Convolutional Neural Networks Convolutional neural networks can be seen as feedforward neural networks that multiple copies of the same neuron are applied to in different places. It means applying the same function to different patches of an image. Doing this means that we are explicitly imposing our knowledge about data (images) into the model structure. That's because we already know that natural image data is translation invariant, meaning that probability distribution of pixels are the same across all images. This structure, which is followed by a non-linearity and a pooling and subsampling layer, makes CNN’s powerful models, especially, when dealing with images. Here's a graphical illustration of CNN from Prof. Hugo Larochelle's course of Neural Networks, which is originally from Prof. YannLecun's paper on ConvNets. Implementation of a CNN in a GPU-based language of Theano is so straightforward as well. So, we can create a layer like this: And then we can stack them on top of each other like this: CNN Experiments Armed with CNN, we attacked the task using two baseline models. A relatively big, and a relatively small model. In the figures below, you can see the number for layer, filter size, pooling size, stride, and a number of fully connected layers. We trained both networks with a learning rate of 0.01, and a momentum of 0.9 on a GTX580 GPU. We also used early stopping. The small model can be trained in two hours and results in 81 percent accuracy on validation sets. The big model can be trained in 24 hours and results in 92 percent accuracy on validation sets. Parametric ReLU Parametric ReLU (aka Leaky ReLU) is an extension to Rectified Linear Unitthat allows the neuron to learn the slope of activation function in the negative region. Unlike the actual paper of Parametric ReLU by Microsoft Research, I used a different parameterizationthat forces the slope to be between 0 and 1. As shown in the figure below, when alpha is 0, the activation function is just linear. On the other hand, if alpha is 1, then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters is increased using Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed. Using Parametric ReLU makes the training time 3/4 and increases the accuracy around 1 percent. In Parametric ReLU,to make sure that alpha remains between 0 and 1, we will set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we will set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8. That means that the model enjoys having a small gradient in the negative region. “Basically, even a small slope in negative region of activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs.” Batch Normalization Batch Normalization simply means normalizing preactivations for each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariance Shift and consequently makes the learning much faster. The equations are as follows: Personally, during this post, I found this as one of the most interesting and simplest techniques I've ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time to have a more accurate (less biased) estimation of mean and variance. “Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed.” Conclusion All in all, we will conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other model is a relatively large model. In this model, we did not use LDNN, but the two other techniques are used, and we achieved 94.5 percent accuracy. About the Author Mohammad Pezeshki is a PhD student in the MILA lab at University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014. He then obtained his Master’s in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically,Deep Learning.

0
0
26196

Packt

07 Feb 2017

6 min read

Hierarchical Clustering

Packt

07 Feb 2017

6 min read

In this article by Atul Tripathi, author of the book Machine Learning Cookbook, we will cover hierarchical clustering with a World Bank sample dataset. (For more resources related to this topic, see here.) Introduction Hierarchical clustering is one of the most important methods in unsupervised learning is hierarchical clustering. In hierarchical clustering for a given set of data points, the output is produced in the form of a binary tree (dendrogram). In the binary tree, the leaves represent the data points while internal nodes represent nested clusters of various sizes. Each object is assigned to a separate cluster. Evaluation of all the clusters shall take place based on a pairwise distance matrix. The distance matrix shall be constructed using distance values. The pair of clusters with the shortest distance must be considered. The pair identified should then be removed from the matrix and merged together. The merged clusters must be distance must be evaluated with the other clusters and the distance matrix must be updated. The process must be repeated until the distance matrix is reduced to a single element. Hierarchical clustering - World Bank sample dataset One of the main goals for establishing the World Bank has been to fight and eliminate poverty. Continuous evolution and fine tuning its policies in the ever-evolving world has been helping the institution to achieve the goal of poverty elimination. The barometer of success in elimination of poverty is measured in terms of improvement of each of the parameters in health, education, sanitation, infrastructure, and other services needed to improve the lives of poor. The development gains which will ensure the goals must be pursued in an environmentally, socially, and economically sustainable manner. Getting ready In order to perform hierarchical clustering, we shall be using a dataset collected from the World Bank dataset. Step 1 - Collecting and describing data The dataset titled WBClust2013 shall be used. This is available in the CSV format titled WBClust2013.csv. The dataset is in standard format. There are 80 rows of data. There are 14 variables. The numeric variables are: new.forest Rural log.CO2 log.GNI log.Energy.2011 LifeExp Fertility InfMort log.Exports log.Imports CellPhone RuralWater Pop The non-numeric variables are: Country How to do it Step 2 - exploring data Version info: Code for this page was tested in R version 3.2.3 (2015-12-10) Let's explore the data and understand the relationships among the variables. We'll begin by importing the CSV file named WBClust2013.csv. We will be saving the data to the wbclust data frame: > wbclust=read.csv("d:/WBClust2013.csv",header=T) Next, we shall print the wbclust data frame. The head() function returns the wbclust data frame. The wbclust data frame is passed as an input parameter: > head(wbclust) The results are as follows: Step 3 - transforming data Centering variables and creating z-scores are two common data analysis activities to standardize data. The numeric variables mentioned above need to create z-scores. The scale()function is a generic function whose default method centers and/or scales the columns of a numeric matrix. The data frame, wbclust is passed to the scale function. All the numeric fields are only considered. The result is then stored in another data frame, wbnorm. > wbnorm<-scale(wbclust[,2:13]) > wbnorm The results are as follows: All data frames have a row names attribute. In order to retrieve or set the row or column names of a matrix-like object, the rownames()function is used. The data frame, wbclust with the first column is passed to the rownames()function. > rownames(wbnorm)=wbclust[,1] > rownames(wbnorm) The call to the function rownames(wbnorm)results in display of the values from the first column. The results are as follows: Step 4 - training and evaluating the model performance The next step is about training the model. The first step is to calculate the distance matrix. The dist()function is used. Using the specified distance measure, distances between the rows of a data matrix are computed. The distance measure used can be Euclidean, maximum, Manhattan, Canberra, binary, or Minkowski. The distance measure used is Euclidean. The Euclidean distance calculates the distance between two vectors as sqrt(sum((x_i - y_i)^2)).The result is then stored in a new data frame, dist1. > dist1<-dist(wbnorm, method="euclidean") The next step is to perform clustering using Ward's method. The hclust() function is used. In order to perform cluster analysis on a set of dissimilarities of the n objects, the hclust()function is used. At the first stage, each of the objects is assigned to its own cluster. After which, at each stage the algorithm iterates and joins two of the most similar clusters. This process will continue till there is just a single cluster left. The hclust() function requires that we provide the data in the form of a distance matrix. The dist1 data frame is passed. By default, the complete linkage method is used. There are multiple agglomeration methods which can be used. Some of the agglomeration methods could be ward.D, ward.D2, single, complete, average. > clust1<-hclust(dist1,method="ward.D") > clust1 The call to the function, clust1results in display of the agglomeration methods used, the manner in which the distance is calculated, and the number of objects. The results are as follows: Step 5 - plotting the model The plot()function is a generic function for plotting of R objects. Here, the plot() function is used to draw the dendrogram: > plot(clust1,labels= wbclust$Country, cex=0.7, xlab="",ylab="Distance",main="Clustering for 80 Most Populous Countries") The result is as follows: The rect.hclust() function highlights the clusters and draws the rectangles around the branches of the dendrogram. The dendrogram is first cut at a certain level followed by drawing a rectangle around the selected branches. The object, clust1 is passed as an object to the function along with the number of clusters to be formed: > rect.hclust(clust1,k=5) The result is as follows: The cuts()function shall cut the tree into multiple groups on the basis of the desired number of groups or the cut height. Here, clust1 is passed as an object to the function along with the number of the desired group: > cuts=cutree(clust1,k=5) > cuts The result is as follows: Getting the list of countries in each group. The result is as follows: Summary In this article we covered hierarchical clustering by collecting, exploring its contents, transforming the data. We trained and evaluated it by using distance matrix and finally plotted the data as a dendrogram. Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Specialized Machine Learning Topics [article] Machine Learning Using Spark MLlib [article]

0
0
5811

article-image-building-search-geo-locator-elasticsearch-and-spark

Packt

31 Jan 2017

12 min read

Building A Search Geo Locator with Elasticsearch and Spark

Packt

31 Jan 2017

12 min read

0
0
6282

Packt

23 Jan 2017

42 min read

The Storage - Apache Cassandra

Packt

23 Jan 2017

42 min read

0
0
6431

article-image-installing-quicksight-application

Packt

20 Jan 2017

4 min read

Installing QuickSight Application

Packt

20 Jan 2017

4 min read

In this article by Rajesh Nadipalli, the author of the book Effective Business Intelligence with QuickSight, we will see how you can install the Amazon QuickSight app from the Apple iTunes store for no cost. You can search for the app from the iTunes store and then proceed to download and install or alternatively you can follow this link to download the app. (For more resources related to this topic, see here.) Amazon QuickSight app is certified to work with iOS devices running iOS v9.0 and above. Once you have the app installed, you can then proceed to login to your QuickSight account as shown in the following screenshot: Figure 1.1: QuickSight sign in The Amazon QuickSight app is designed to access dashboards and analyses on your mobile device. All interactions on the app are read-only and changes you make on your device are not applied to the original visuals so that you can explore without any worry. Dashboards on the go After you login to the QuickSight app, you will first see the list of dashboards associated to your QuickSight account for easy access. If you don't see dashboards, then click on Dashboards icon from the menu at the bottom of your mobile device as shown in the following screenshot: Figure 1.2: Accessing dashboards You will now see the list of dashboards associated to your user ID. Dashboard detailed view From the dashboard listing, select the USA Census Dashboard, which will then redirect you to the detailed dashboard view. In the detailed dashboard view you will see all visuals that are part of that dashboard. You can click on the arrow to the extreme top right of each visual to open the specific chart in full screen mode as shown in the following screenshot. In the scatter plot analysis shown in the following screenshot, you can further click on any of the dots to get specific values about that bubble. In the following screenshot the selected circle is for zip code 94027 which has PopulationCount of 7,089 and MedianIncome of $216,905 and MeanIncome of $336,888: Figure 1.3: Dashboard visual Dashboard search QuickSight mobile app also provides a search feature, which is handy if you know only partial name of the dashboard. Follow the following steps to search for a dashboard: First ensure you are in the dashboards tab by clicking on the Dashboards icon from the bottom menu. Next click on the search icon seen on the top right corner. Next type the partial name. In the following example, i have typed Usa. QuickSight now searches for all dashboards that have the word Usa in it and lists them out. You can next click on the dashboard to get details about that specific dashboard as shown in the following screenshot: Figure 1.4: Dashboard search Favorite a dashboard QuickSight provides a convenient way to bookmark your dashboards by setting them as favorites. To use this feature, first identify which dashboards you often use and click on the star icon to it's right side as shown in the following screenshot. Next to access all of your favorites, click on the Favorites tab and the list is then refined to only those dashboards you had previously identified as favorite: Figure 1.5: Dashboard favorites Limitations of mobile app While dashboards are fairly easy to interact with on the mobile app, there are key limitations when compared to the standard browser version, which I am listing as follows: You cannot create share dashboards to others using the mobile app. You cannot zoom in/out from the visual, which would be really good in scenarios where the charts are dense. Chart legends are not shown. Summary We have seen how to install Amazon QuickSight app and using this app you can browse, search, and view dashboards. We have covered how to access dashboards, search, favorite, and its detailed view. We have also seen some limitations of mobile app. Resources for Article: Further resources on this subject: Introduction to Practical Business Intelligence [article] MicroStrategy 10 [article] Making Your Data Everything It Can Be [article]

0
0
3037

Packt

19 Jan 2017

7 min read

Clustering Model with Spark

Packt

19 Jan 2017

7 min read

0
0
2123

article-image-using-firebase-real-time-database

Oliver Blumanski

18 Jan 2017

5 min read

Using the Firebase Real-Time Database

Oliver Blumanski

18 Jan 2017

5 min read

In this post, we are going to look at how to use the Firebase real-time database, along with an example. Here we are writing and reading data from the database using multiple platforms. To do this, we first need a server script that is adding data, and secondly we need a component that pulls the data from the Firebase database. Step 1 - Server Script to collect data Digest an XML feed and transfer the data into the Firebase real-time database. The script runs as cronjob frequently to refresh the data. Step 2 - App Component Subscribe to the data from a JavaScript component, in this case, React-Native. About Firebase Now that those two steps are complete, let's take a step back and talk about Google Firebase. Firebase offers a range of services such as a real-time database, authentication, cloud notifications, storage, and much more. You can find the full feature list here. Firebase covers three platforms: iOS, Android, and Web. The server script uses the Firebases JavaScript Web API. Having data in this real-time database allows us to query the data from all three platforms (iOS, Android, Web), and in addition, the real-time database allows us to subscribe (listen) to a database path (query), or to query a path once. Step 1 - Digest XML feed and transfer into Firebase Firebase Set UpThe first thing you need to do is to set up a Google Firebase project here In the app, click on "Add another App" and choose Web, a pop-up will show you the configuration. You can copy paste your config into the example script. Now you need to set the rules for your Firebase database. You should make yourself familiar with the database access rules. In my example, the path latestMarkets/ is open for write and read. In a real-world production app, you would have to secure this, having authentication for the write permissions. Here are the database rules to get started: { "rules": { "users": { "$uid": { ".read": "$uid === auth.uid", ".write": "$uid === auth.uid" } }, "latestMarkets": { ".read": true, ".write": true } } } The Server Script Code The XML feed contains stock market data and is frequently changing, except on the weekend. To build the server script, some NPM packages are needed: Firebase Request xml2json babel-preset-es2015 Require modules and configure Firebase web api: const Firebase = require('firebase'); const request = require('request'); const parser = require('xml2json'); // firebase access config const config = { apiKey: "apikey", authDomain: "authdomain", databaseURL: "dburl", storageBucket: "optional", messagingSenderId: "optional" } // init firebase Firebase.initializeApp(config) [/Code] I write JavaScript code in ES6. It is much more fun. It is a simple script, so let's have a look at the code that is relevant to Firebase. The code below is inserting or overwriting data in the database. For this script, I am happy to overwrite data: Firebase.database().ref('latestMarkets/'+value.Symbol).set({ Symbol: value.Symbol, Bid: value.Bid, Ask: value.Ask, High: value.High, Low: value.Low, Direction: value.Direction, Last: value.Last }) .then((response) => { // callback callback(true) }) .catch((error) => { // callback callback(error) }) Firebase Db first references the path: Firebase.database().ref('latestMarkets/'+value.Symbol) And then the action you want to do: // insert/overwrite (promise) Firebase.database().ref('latestMarkets/'+value.Symbol).set({}).then((result)) // get data once (promise) Firebase.database().ref('latestMarkets/'+value.Symbol).once('value').then((snapshot)) // listen to db path, get data on change (callback) Firebase.database().ref('latestMarkets/'+value.Symbol).on('value', ((snapshot) => {}) // ...... Here is the Github repository: Displaying the data in a React-Native app This code below will listen to a database path, on data change, all connected devices will synchronise the data: Firebase.database().ref('latestMarkets/').on('value', snapshot => { // do something with snapshot.val() }) To close the listener, or unsubscribe the path, one can use "off": Firebase.database().ref('latestMarkets/').off() I’ve created an example react-native app to display the data: The Github repository Conclusion In mobile app development, one big question is: "What database and cache solution can I use to provide online and offline capabilities?" One way to look at this question is like you are starting a project from scratch. If so, you can fit your data into Firebase, and then this would be a great solution for you. Additionally, you can use it for both web and mobile apps. The great thing is that you don't need to write a particular API, and you can access data straight from JavaScript. On the other hand, if you have a project that uses MySQL for example, the Firebase real-time database won't help you much. You would need to have a remote API to connect to your database in this case. But even if using the Firebase database isn't a good fit for your project, there are still other features, such as Firebase Storage or Cloud Messaging, which are very easy to use, and even though they are beyond the scope of this post, they are worth checking out. About the author Oliver Blumanski is a developer based out of Townsville, Australia. He has been a software developer since 2000, and can be found on GitHub at @blumanski.

0
0
16942

Packt

16 Jan 2017

15 min read

Tabular Models

Packt

16 Jan 2017

15 min read

In this article by Derek Wilson, the author of the book Tabular Modeling with SQL Server 2016 Analysis Services Cookbook, you will learn the following recipes: Opening an existing model Importing data Modifying model relationships Modifying model measures Modifying model columns Modifying model hierarchies Creating a calculated table Creating key performance indicators (KPIs) Modifying key performance indicators (KPIs) Deploying a modified model (For more resources related to this topic, see here.) Once the new data is loaded into the model, we will modify various pieces of the model, including adding a new Key Performance Indicator. Next, we will perform calculations to see how to create and modify measures and columns. Opening an existing model We will open the model. To make modifications to your deployed models, we will need to open the model in the Visual Studio designer. How to do it… Open your solution, by navigating to File | Open | Project/Solution. Then select the folder and solution Chapter3_Model and select Open. Your solution is now open and ready for modification. How it works… Visual Studio stores the model as a project inside of a solution. In Chapter 3 we created a new project and saved it as Chapter3_Model. To make modifications to the model we open it in Visual Studio. Importing data The crash data has many columns that store the data in codes. In order to make this data useful for reporting, we need to add description columns. In this section, we will create four code tables by importing data into a SQL Server database. Then, we will add the tables to your existing model. Getting ready In the database on your SQL Server, run the following scripts to create the four tables and populate them with the reference data: Create the Major Cause of Accident Reference Data table: CREATE TABLE [dbo].[MAJCSE_T]( [MAJCSE] [int] NULL, [MAJOR_CAUSE] [varchar](50) NULL ) ON [PRIMARY] Then, populate the table with data: INSERT INTO MAJCSE_T VALUES (20, 'Overall/rollover'), (21, 'Jackknife'), (31, 'Animal'), (32, 'Non-motorist'), (33, 'Vehicle in Traffic'), (35, 'Parked motor vehicle'), (37, 'Railway vehicle'), (40, 'Collision with bridge'), (41, 'Collision with bridge pier'), (43, 'Collision with curb'), (44, 'Collision with ditch'), (47, 'Collision culvert'), (48, 'Collision Guardrail - face'), (50, 'Collision traffic barrier'), (53, 'impact with Attenuator'), (54, 'Collision with utility pole'), (55, 'Collision with traffic sign'), (59, 'Collision with mailbox'), (60, 'Collision with Tree'), (70, 'Fire'), (71, 'Immersion'), (72, 'Hit and Run'), (99, 'Unknown') Create the table to store the lighting conditions at the time of the crash: CREATE TABLE [dbo].[LIGHT_T]( [LIGHT] [int] NULL, [LIGHT_CONDITION] [varchar](30) NULL ) ON [PRIMARY] Now, populate the data that shows the descriptions for the codes: INSERT INTO LIGHT_T VALUES (1, 'Daylight'), (2, 'Dusk'), (3, 'Dawn'), (4, 'Dark, roadway lighted'), (5, 'Dark, roadway not lighted'), (6, 'Dark, unknown lighting'), (9, 'Unknown') Create the table to store the road conditions: CREATE TABLE [dbo].[CSRFCND_T]( [CSRFCND] [int] NULL, [SURFACE_CONDITION] [varchar](50) NULL ) ON [PRIMARY] Now populate the road condition descriptions: INSERT INTO CSRFCND_T VALUES (1, 'Dry'), (2, 'Wet'), (3, 'Ice'), (4, 'Snow'), (5, 'Slush'), (6, 'Sand, Mud'), (7, 'Water'), (99, 'Unknown') Finally, create the weather table: CREATE TABLE [dbo].[WEATHER_T]( [WEATHER] [int] NULL, [WEATHER_CONDITION] [varchar](30) NULL ) ON [PRIMARY] Then populate the weather condition descriptions. INSERT INTO WEATHER_T VALUES (1, 'Clear'), (2, 'Partly Cloudy'), (3, 'Cloudy'), (5, 'Mist'), (6, 'Rain'), (7, 'Sleet, hail, freezing rain'), (9, 'Severe winds'), (10, 'Blowing Sand'), (99, 'Unknown') You now have the tables and data required to complete the recipes in this chapter. How to do it… From your open model, change to the Diagram view in model.bim. Navigate to Model | Import from Data Source then select Microsoft SQL Server on the Table Import Wizard and click on Next. Set your Server Name to Localhost and change the Database name to Chapter3 and click on Next. Enter your admin account username and password and click on Next. You want to select from a list of tables the four tables that were created at the beginning. Click on Finish to import the data. How it works… This recipe opens the table import wizard and allows us to select the four new tables that are to be added to the existing model. The data is then imported into your Tabular Model workspace. Once imported, the data is now ready to be used to enhance the model. Modifying model relationships We will create the necessary relationships for the new tables. These relationships will be used in the model in order for the SSAS engine to perform correct calculations. How to do it… Open your model to the diagram view and you will see the four tables that you imported from the previous recipe. Select the CSRFCND field in the CSRFCND_T table and drag the CSRFCND table in the Crash_Data table. Select the LIGHT field in the LIGHT_T table and drag to the LIGHT table in the Crash_Data table. Select the MAJCSE field in the MAJCSE_T table and drag to the MAJCSE table in the Crash_Data table. Select the WEATHER field in the WEATHER_T table and drag to the WEATHER table in the Crash_Data table. How it works… Each table in this section has a relationship built between the code columns and the Crash_Data table corresponding columns. These relationships allow for DAX calculations to be applied across the data tables. Modifying model measures Now that there are more tables in the model, we are going to add an additional measure to perform quick calculations on data. The measure will use a simple DAX calculation since it is focused on how to add or modify the model measures. How to do it… Open the Chapter 3 model project to the Model.bim folder and make sure you are in grid view. Select the cell under Count_of_Crashes and in the fx bar add the following DAX formula to create Sum_of_Fatalities: Sum_of_Fatalities:=SUM(Crash_Data[FATALITIES]) Then, hit Enter to create the calculation: In the properties window, enter Injury_Calculations in the Display Folder. Then, change the Format to Whole Number and change the Show Thousand Separator to True. Finally, add to Description Total Number of Fatalities Recorded: How it works… In this recipe, we added a new measure to the existing model that calculates the total number of fatalities on the Crash_Data table. Then we added a new folder for the users to see the calculation. We also modified the default behavior of the calculation to display as a whole number and show commas to make the numbers easier to interpret. Finally, we added a description to the calculation that users will be able to see in the reporting tools. If we did not make these changes in the model, each user will be required to make the changes each time they accessed the model. By placing the changes in the model, everyone will see the data in the same format. Modifying model columns We will modify the properties of the columns on the WEATHER table. Modifications to the columns in a table make the information easier for your users to understand in the reporting tools. Some properties determine how the SSAS engine uses the fields when creating the model on the server. How to do it… In Model.bim, make sure you are in the grid view and change to the WEATHER_T tab. Select WEATHER column to view the available Properties and make the following changes: Hiddenproperty to True Uniqueproperty to True Sort By ColumnselectWEATHER_CONDITION Summarize By to Count Next, select the WEATHER_CONDITION column and modify the following properties. Description add Weather at time of crash Default Labelproperty to True How it works… This recipe modified the properties of the measure to make it better for your report users to access the data. The WEATHER code column was hidden so it will not be visible in the reporting tools and the WEATHER_CONDITION was sorted in alphabetical order. You set the default aggregation to Count and then added a description for the column. Now, when this dimension is added to a report only the WEATHER_CONDITION column will be seen and pre-sorted based on the WEATHER_CONDITION field. It will also use count as the aggregation type to provide the number of each type of weather conditions. If you were to add another new description to the table, it would automatically be sorted correctly. Modifying model hierarchies Once you have created a hierarchy, you may want to remove or modify the hierarchy from your model. We will make modifications to the Calendar_YQMD hierarchy. How to do it… Open Model.bim to the diagram view and find the Master_Calendar_T table. Review the Calendar_YQMD hierarchy and included columns. Select the Quarter_Name column and right-click on it to bring up the menu. Select Remove from Hierarchy to delete Quarter_Name from the hierarchy and confirm on the next screen by selecting Remove from Hierarchy. Select the Calendar_YQMD hierarchy and right-click on it and select Rename. Change the name to Calendar_YMD and hit on Enter. How it works… In this recipe, we opened the diagram view and selected the Master_Calendar_T table to find the existing hierarchy. After selecting the Quarter_Name column in the hierarchy, we used the menus to view the available options for modifications. Then we selected the option to remove the column from the hierarchy. Finally, we updated the name of the hierarchy to let users know that the quarter column is not included. There’s more… Another option to remove fields from the hierarchy is to select the column and then press the delete key. Likewise, you can double-click on the Calendar_YQMD hierarchy to bring up the edit window for the name. Then edit the name and hit Enter to save the change in the designer. Creating a calculated table Calculated tables are created dynamically using functions or DAX queries. They are very useful if you need to create a new table based on information in another table. For example, you could have a date table with 30 years of data. However, most of your users only look at the last five years of information when running most of their analysis. Instead of creating a new table you can dynamically make a new table that only stores the last five years of dates. You will use a single DAX query to filter the Master_Calendar_T table to the last 5 years of data. How to do it… OpenModel.bim to the grid view and then select the Table menu and New Calculated Table. A new data tab is created. In the function box, enter this DAX formula to create a date calendar for the last 5 years: FILTER(MasterCalendar_T, MasterCalendar_T[Date]>=DATEADD(MasterCalendar_T[Date],6,YEAR)) Double-click on the CalculatedTable 1 tab and rename to Last_5_Years_T. How it works… It works by creating a new table in the model that is built from a DAX formula. In order to limit the number of years shown, the DAX formula reduces the total number of dates available for the last 5 years of dates. There’s more… After you create a calculated table, you will need to create the necessary relationships and hierarchies just like a regular table: Switch to the diagram view in the model.bim and you will be able to see the new table. Create a new hierarchy and name it Last_5_Years_YQM and include Year, Quarter_Name, Month_Name, and Date Replace the Master_Calendar_T relationship with the Date column from the Last_5_Years_T date column to the Crash_Date.Crash_Date column. Now, the model will only display the last 5 years of crash data when using the Last_5_Years_T table in the reporting tools. The Crash_Data table still contains all of the records if you need to view more than 5 years of data. Creating key performance indicators (KPIs) Key performance indicators are business metrics that show the effectiveness of a business objective. They are used to track actual performance against budgeted or planned value such as Service Level Agreements or On-Time performance. The advantage of creating a KPI is the ability to quickly see the actual value compared to the target value. To add a KPI, you will need to have a measure to use as the actual and another measure that returns the target value. In this recipe, we will create a KPI that tracks the number of fatalities and compares them to the prior year with the goal of having fewer fatalities each year. How to do it… Open the Model.bim to the grid view and select an empty cell and create a new measure named Last_Year_Fatalities:Last_Year_Fatalities:=CALCULATE(SUM(Crash_Data[FATALITIES]),DATEADD(MasterCalendar_T[Date],-1, YEAR)) Select the already existing Sum_of_measure then right-click and select Create KPI…. On the Key Performance Indicator (KPI) window, select Last_Year_Fatalities as the Target Measure. Then, select the second set of icons that have red, yellow, and green with symbols. Finally, change the KPI color scheme to green, yellow, and red and make the scores 90 and 97, and then click on OK. The Sum_of_Fatalites measure will now have a small graph next to it in the measure grid to show that there is a KPI on that measure. How it works… You created a new calculation that compared the actual count of fatalities compared to the same number for the prior year. Then you created a new KPI that used the actual and Last_Year_Fatalities measure. In the KPI window, you setup thresholds to determine when a KPI is red, yellow, or green. For this example, you want to show that having less fatalities year over year is better. Therefore, when the KPI is 97% or higher the KPI will show red. For values that are in the range of 90% to 97% the KPI is yellow and anything below 90% is green. By selecting the icons with both color and symbols, users that are color-blind can still determine the appropriate symbol of the KPI. Modifying key performance indicators (KPIs) Once you have created a KPI, you may want to remove or modify the KPI from your model. You will make modifications to the Last_Year_Fatalities hierarchy. How to do it… Open Model.bim to the Grid view and select the Sum_of_Fatalities measure then right-click to bring up Edit KPI settings…. Edit the appropriate settings to modify an existing KPI. How it works… Just like models, KPIs will need to be modified after being initially designed. The icon next to a measure denotes that a KPI is defined on the measure. Right-clicking on the measure brings up the menu that allows you to enter the Edit KPI setting. Deploying a modified model Once you have completed the changes to your model, you have two options for deployment. First, you can deploy the model and replace the existing model. Alternatively, you can change the name of your model and deploy it as a new model. This is often useful when you need to test changes and maintain the existing model as is. How to do it… Open the Chapter3_model project in Visual Studio. Select the Project menu and select Chapter3_Model Properties… to bring up the Properties menu and review the Server and Database properties. To overwrite an existing model make no changes and click on OK. Select the Build menu from the Chapter3_Model project and select the Deploy Chapter3_Model option. On the following screens, enter the impersonation credentials for your data and hit OK to deploy the changes. How it works… the model that is on your local machine and submits the changes to the server. By not making any changes to the existing model properties, a new deployment will overwrite the old model. All of your changes are now published on the server and users can begin to leverage the changes. There’s more… Sometimes you might want to deploy your model to a different database without overwriting the existing environment. This could be to try out a new model or test different functionality with users that you might want to implement. You can modify the properties of the project to deploy to a different server such as development, UAT, or production. Likewise, you can also change the database name to deploy the model to the same server or different servers for testing. Open the Project menu and then select Chapter3_Model Properties. Change the name of the Database to Chapter4_Model and click on OK. Next, on the Build menu, select Deploy Chapter3_Model to deploy the model to the same server under the new name of Chapter4_Model. When you review the Analysis Services databases in SQL Server Management Studio, you will now see a database for Chapter3_Model and Chapter4_Model. Summary After building a model, we will need to maintain and enhance the model as the business users update or change their requirements. We will begin by adding additional tables to the model that contain the descriptive data columns for several code columns. Then we will create relationships between these new tables and the existing data tables. Resources for Article: Further resources on this subject: Say Hi to Tableau [article] Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article] Data Science with R [article]

0
0
3000

article-image-basic-operations-elasticsearch

Packt

16 Jan 2017

10 min read

Basic Operations of Elasticsearch

Packt

16 Jan 2017

10 min read

In this article by Alberto Maria Angelo Paro, the author of the book ElasticSearch 5.0 Cookbook - Third Edition, you will learn the following recipes: Creating an index Deleting an index Opening/closing an index Putting a mapping in an index Getting a mapping (For more resources related to this topic, see here.) Creating an index The first operation to do before starting indexing data in Elasticsearch is to create an index--the main container of our data. An index is similar to the concept of database in SQL, a container for types (tables in SQL) and documents (records in SQL). Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... The HTTP method to create an index is PUT (but also POST works); the REST URL contains the index name: http://<server>/<index_name> For creating an index, we will perform the following steps: From the command line, we can execute a PUT call: curl -XPUT http://127.0.0.1:9200/myindex -d '{ "settings" : { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } } }' The result returned by Elasticsearch should be: {"acknowledged":true,"shards_acknowledged":true} If the index already exists, a 400 error is returned: { "error" : { "root_cause" : [ { "type" : "index_already_exists_exception", "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists", "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g", "index" : "myindex" } ], "type" : "index_already_exists_exception", "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists", "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g", "index" : "myindex" }, "status" : 400 } How it works... Because the index name will be mapped to a directory on your storage, there are some limitations to the index name, and the only accepted characters are: ASCII letters [a-z] Numbers [0-9] point ".", minus "-", "&" and "_" During index creation, the replication can be set with two parameters in the settings/index object: number_of_shards, which controls the number of shards that compose the index (every shard can store up to 2^32 documents) number_of_replicas, which controls the number of replica (how many times your data is replicated in the cluster for high availability)A good practice is to set this value at least to 1. The API call initializes a new index, which means: The index is created in a primary node first and then its status is propagated to all nodes of the cluster level A default mapping (empty) is created All the shards required by the index are initialized and ready to accept data The index creation API allows defining the mapping during creation time. The parameter required to define a mapping is mapping and accepts multi mappings. So in a single call it is possible to create an index and put the required mappings. There's more... The create index command allows passing also the mappings section, which contains the mapping definitions. It is a shortcut to create an index with mappings, without executing an extra PUT mapping call: curl -XPOST localhost:9200/myindex -d '{ "settings" : { "number_of_shards" : 2, "number_of_replicas" : 1 }, "mappings" : { "order" : { "properties" : { "id" : {"type" : "keyword", "store" : "yes"}, "date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"}, "customer_id" : {"type" : "keyword", "store" : "yes"}, "sent" : {"type" : "boolea+n", "index":"not_analyzed"}, "name" : {"type" : "text", "index":"analyzed"}, "quantity" : {"type" : "integer", "index":"not_analyzed"}, "vat" : {"type" : "double", "index":"no"} } } } }' Deleting an index The counterpart of creating an index is deleting one. Deleting an index means deleting its shards, mappings, and data. There are many common scenarios when we need to delete an index, such as: Removing the index to clean unwanted/obsolete data (for example, old Logstash indices). Resetting an index for a scratch restart. Deleting an index that has some missing shard, mainly due to some failures, to bring back the cluster in a valid state (if a node dies and it's storing a single replica shard of an index, this index is missing a shard so the cluster state becomes red. In this case, you'll bring back the cluster to a green status, but you lose the data contained in the deleted index). Getting ready To execute curl via command line you need to install curl for your operative system. The index created is required to be deleted. How to do it... The HTTP method used to delete an index is DELETE. The following URL contains only the index name: http://<server>/<index_name> For deleting an index, we will perform the steps given as follows: Execute a DELETE call, by writing the following command: curl -XDELETE http://127.0.0.1:9200/myindex We check the result returned by Elasticsearch. If everything is all right, it should be: {"acknowledged":true} If the index doesn't exist, a 404 error is returned: { "error" : { "root_cause" : [ { "type" : "index_not_found_exception", "reason" : "no such index", "resource.type" : "index_or_alias", "resource.id" : "myindex", "index_uuid" : "_na_", "index" : "myindex" } ], "type" : "index_not_found_exception", "reason" : "no such index", "resource.type" : "index_or_alias", "resource.id" : "myindex", "index_uuid" : "_na_", "index" : "myindex" }, "status" : 404 } How it works... When an index is deleted, all the data related to the index is removed from disk and is lost. During the delete processing, first the cluster is updated, and then the shards are deleted from the storage. This operation is very fast; in a traditional filesystem it is implemented as a recursive delete. It's not possible restore a deleted index, if there is no backup. Also calling using the special _all index_name can be used to remove all the indices. In production it is good practice to disable the all indices deletion by adding the following line to Elasticsearch.yml: action.destructive_requires_name:true Opening/closing an index If you want to keep your data, but save resources (memory/CPU), a good alternative to delete indexes is to close them. Elasticsearch allows you to open/close an index to put it into online/offline mode. Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... For opening/closing an index, we will perform the following steps: From the command line, we can execute a POST call to close an index using: curl -XPOST http://127.0.0.1:9200/myindex/_close If the call is successful, the result returned by Elasticsearch should be: {,"acknowledged":true} To open an index, from the command line, type the following command: curl -XPOST http://127.0.0.1:9200/myindex/_open If the call is successful, the result returned by Elasticsearch should be: {"acknowledged":true} How it works... When an index is closed, there is no overhead on the cluster (except for metadata state): the index shards are switched off and they don't use file descriptors, memory, and threads. There are many use cases when closing an index: Disabling date-based indices (indices that store their records by date), for example, when you keep an index for a week, month, or day and you want to keep online a fixed number of old indices (that is, two months) and some offline (that is, from two months to six months). When you do searches on all the active indices of a cluster and don't want search in some indices (in this case, using alias is the best solution, but you can achieve the same concept of alias with closed indices). An alias cannot have the same name as an index When an index is closed, calling the open restores its state. Putting a mapping in an index We saw how to build mapping by indexing documents. This recipe shows how to put a type mapping in an index. This kind of operation can be considered as the Elasticsearch version of an SQL created table. Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... The HTTP method to put a mapping is PUT (also POST works). The URL format for putting a mapping is: http://<server>/<index_name>/<type_name>/_mapping For putting a mapping in an index, we will perform the steps given as follows: If we consider the type order, the call will be: curl -XPUT 'http://localhost:9200/myindex/order/_mapping' -d '{ "order" : { "properties" : { "id" : {"type" : "keyword", "store" : "yes"}, "date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"}, "customer_id" : {"type" : "keyword", "store" : "yes"}, "sent" : {"type" : "boolean", "index":"not_analyzed"}, "name" : {"type" : "text", "index":"analyzed"}, "quantity" : {"type" : "integer", "index":"not_analyzed"}, "vat" : {"type" : "double", "index":"no"} } } }' In case of success, the result returned by Elasticsearch should be: {"acknowledged":true} How it works... This call checks if the index exists and then it creates one or more type mapping as described in the definition. During mapping insert if there is an existing mapping for this type, it is merged with the new one. If there is a field with a different type and the type could not be updated, an exception expanding fields property is raised. To prevent an exception during the merging mapping phase, it's possible to specify the ignore_conflicts parameter to true (default is false). The put mapping call allows you to set the type for several indices in one shot; list the indices separated by commas or to apply all indexes using the _all alias. There's more… There is not a delete operation for mapping. It's not possible to delete a single mapping from an index. To remove or change a mapping you need to manage the following steps: Create a new index with the new/modified mapping Reindex all the records Delete the old index with incorrect mapping Getting a mapping After having set our mappings for processing types, we sometimes need to control or analyze the mapping to prevent issues. The action to get the mapping for a type helps us to understand structure or its evolution due to some merge and implicit type guessing. Getting ready To execute curl via command-line you need to install curl for your operative system. How to do it… The HTTP method to get a mapping is GET. The URL formats for getting mappings are: http://<server>/_mapping http://<server>/<index_name>/_mapping http://<server>/<index_name>/<type_name>/_mapping To get a mapping from the type of an index, we will perform the following steps: If we consider the type order of the previous chapter, the call will be: curl -XGET 'http://localhost:9200/myindex/order/_mapping?pretty=true' The pretty argument in the URL is optional, but very handy to pretty print the response output. The result returned by Elasticsearch should be: { "myindex" : { "mappings" : { "order" : { "properties" : { "customer_id" : { "type" : "keyword", "store" : true }, … truncated } } } } } How it works... The mapping is stored at the cluster level in Elasticsearch. The call checks both index and type existence and then it returns the stored mapping. The returned mapping is in a reduced form, which means that the default values for a field are not returned. Elasticsearch stores only not default field values to reduce network and memory consumption. Retrieving a mapping is very useful for several purposes: Debugging template level mapping Checking if implicit mapping was derivated correctly by guessing fields Retrieving the mapping metadata, which can be used to store type-related information Simply checking if the mapping is correct If you need to fetch several mappings, it is better to do it at index level or cluster level to reduce the numbers of API calls. Summary We learned how to manage indices and perform operations on documents. We'll discuss different operations on indices such as create, delete, update, open, and close. These operations are very important because they allow better define the container (index) that will store your documents. The index create/delete actions are similar to the SQL create/delete database commands. Resources for Article: Further resources on this subject: Elastic Stack Overview [article] Elasticsearch – Spicing Up a Search Using Geo [article] Downloading and Setting Up ElasticSearch [article]

0
0
5843

How-To Tutorials - Data

Data Pipelines

Understanding Spark RDD

Review of SQL Server Features for Developers

CNN architecture

Bitcoin

Context – Understanding your Data using R

Classification using Convolutional Neural Networks

Hierarchical Clustering

Building A Search Geo Locator with Elasticsearch and Spark

The Storage - Apache Cassandra

Trending Topics

Installing QuickSight Application

Clustering Model with Spark

Using the Firebase Real-Time Database

Tabular Models

Basic Operations of Elasticsearch

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access