
How-To Tutorials - Data


Getting started with Machine Learning in H2O

Sugandha Lahoti
01 Dec 2017
7 min read
Note: We present to you an excerpt from the book Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. The book aims to give you an array of advanced techniques in Machine Learning.

The article below talks about using H2O as a Machine Learning platform for Big Data applications. H2O is a leading open source platform for Machine Learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages.

H2O architecture

The following figure gives a high-level architecture of H2O with its important components. H2O can access data from various data stores such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployment of H2O is to use one of the deployment stacks with Spark or to run it in an H2O cluster itself. The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data repeatedly can be handled efficiently and achieve good performance. Important Machine Learning algorithms in supervised and unsupervised learning are implemented specifically to handle horizontal scalability across multiple nodes and JVMs. H2O provides not only its own user interface, known as Flow, to manage and run modeling tasks, but also different language bindings and connector APIs for Java, R, Python, and Scala.

Most Machine Learning algorithms, optimization algorithms, and utilities use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered as a Data Frame in H2O and comprises vectors, which are the features or columns of the dataset. The rows or instances are made up of one element from each vector arranged side by side. Rows are grouped together into a processing unit known as a chunk, and several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in its chunks to carry out the task, and finally the results flow back in the reduce operation.

Machine learning in H2O

The following figure shows all the Machine Learning algorithms supported in H2O v3 for supervised and unsupervised learning.

Tools and usage

H2O Flow is an interactive web application that helps data scientists perform various tasks, from importing data to running complex models, using point-and-click and wizard-based concepts. H2O is run in local mode as:

    java -Xmx6g -jar h2o.jar

The default way to start Flow is to point your browser to the following URL: http://192.168.1.7:54321/. The right side of Flow captures every user action performed, under the OUTLINE tab. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below.

The figure below shows the interface for importing files from the local filesystem or HDFS; it displays detailed summary statistics as well as the next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension .hex. The summary statistics are useful for understanding the characteristics of the data, such as missing values, mean, max, min, and so on.
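Flow is point-and-click, but the same import-and-inspect steps can also be scripted against a running H2O instance. Here is a minimal sketch using the h2o Python client (the file path is an illustrative assumption, not part of the original excerpt):

    import h2o

    # Connect to a running H2O instance, or start a local one
    h2o.init()

    # Import a CSV into an H2OFrame; H2O parses it into vectors (columns) and chunks
    frame = h2o.import_file("data/covtype.csv")   # illustrative path

    # Summary statistics: missing values, mean, max, min, and so on
    frame.describe()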
Flow also has an easy way to transform features from one type to another, for example, numeric features with a few unique values to categorical/nominal types, known as enum in H2O. The actions that can be performed on a dataset are:

- Visualize the data.
- Split the data into different sets, such as training, validation, and testing.
- Build supervised and unsupervised models.
- Use the models to predict.
- Download and export the files in various formats.

Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter search for tuning the model has a checkbox grid next to it, and more than one parameter value can be used. Some basic parameters, such as training_frame, validation_frame, and response_column, are common to every supervised algorithm; others are specific to model types, such as the choice of solver for GLM, the activation function for deep learning, and so on. All such common parameters are available in the basic section. Advanced parameters are settings that afford greater flexibility and control to the modeler if the default behavior must be overridden. Several of these parameters are also common across some algorithms; two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section) and selecting the column containing weights (if each example is weighted separately). Expert parameters define more complex elements, such as how to handle missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables.

In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, the efficient LBFGS optimization algorithm, and stratified sampling for the cross-validation split.

The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are possible actions that can be taken, such as running the model on unseen data for prediction, downloading the model in POJO format, exporting the results, and so on. Some of the charts are algorithm-specific, like the scoring history that shows how the training loss or the objective function changes over the iterations in GLM; this gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We see the ROC curve and the Area Under Curve metric on the validation data, in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample, respectively. The figure below shows the SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM with 10-fold cross-validation on the CoverType dataset.

The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, F1 measure, MCC (Matthews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation, as well as the mean and standard deviation computed across all folds. The prediction action runs the model on unseen held-out data to estimate the out-of-sample performance. Important measures such as errors, accuracy, area under curve, ROC plots, and so on, are given as the output of predictions and can be saved or exported.
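The same kind of GLM configuration described above (binomial family, 10-fold cross-validation, LBFGS solver, stratified fold assignment) can also be set up programmatically. A hedged sketch with the h2o Python client, continuing from the frame imported earlier; the response column name is an illustrative assumption and would need to be a two-level factor:

    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    # Numeric column with few unique values -> categorical (enum), as done in Flow
    frame["label"] = frame["label"].asfactor()   # assumed two-level response column

    train, test = frame.split_frame(ratios=[0.8])

    glm = H2OGeneralizedLinearEstimator(
        family="binomial",            # two-class classification
        solver="L_BFGS",              # efficient LBFGS optimization
        nfolds=10,                    # 10-fold cross-validation
        fold_assignment="Stratified"  # stratified sampling for the CV split
    )
    glm.train(y="label", training_frame=train)

    print(glm.auc(xval=True))         # cross-validated AUC
    predictions = glm.predict(test)   # out-of-sample predictions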
H2O is a rich visualization and analysis framework that is accessible from multiple programming environments and can read data from a variety of stores (HDFS, SQL, NoSQL, S3, and others). It also supports a number of Machine Learning algorithms that can be run across a cluster. All these factors make it one of the major Machine Learning frameworks for Big Data. If you found this post useful, do check out the book Mastering Java Machine Learning to learn more about predictive models for batch- and stream-based Big Data learning using the latest tools and methodologies.


Integration with Spark SQL

Packt
24 Sep 2015
11 min read
In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with other advanced Spark libraries such as Spark SQL.

No single piece of software in today's world can fulfill the varied, versatile, and complex demands of enterprises, and to be honest, neither should it! Software is made to fulfill specific needs arising in an enterprise at a particular point in time, and those needs may change in the future due to many other factors. These factors, such as government policies and business/market dynamics, may or may not be within the enterprise's control. Considering all this, the integration and interoperability of any software system with internal and external systems and software is pivotal in fulfilling enterprise needs. Integration and interoperability are categorized as non-functional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over time, architects have realized the importance of these implicit requirements in modern enterprises, and now all enterprise architectures provide due diligence, support, and provisions for fulfilling them. Enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) even define specific procedures and guidelines for defining and establishing the interoperability and integration requirements of modern enterprises.

The Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with different systems and libraries; for example, data consumed and processed via Spark Streaming can also be loaded into a structured (table: rows/columns) format and further queried using SQL. The data can even be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL.

Querying streaming data in real time

Spark Streaming is developed on the principle of integration and interoperability: it not only provides a framework for consuming data in near real time from varied data sources, but also provides integration with Spark SQL, where existing DStreams can be converted into a structured data format for querying using standard SQL constructs. There are many use cases where SQL on streaming data is a much-needed feature; for example, in our distributed log analysis use case, we may need to combine precomputed datasets with the streaming data to perform exploratory analysis using interactive SQL queries. This is difficult to implement with streaming operators alone, as they are not designed for introducing new datasets and performing ad hoc queries. Moreover, SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives that do filtering, merging, correlation, and aggregation, which are not available in low-level programming languages such as Java/C++ and may result in long development cycles and high maintenance costs.

Let's move forward and first understand a few things about Spark SQL, and then we will also see the process of converting existing DStreams into structured formats.
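To preview where this is heading, the sketch below shows the basic pattern of converting each batch of a DStream into a DataFrame and querying it with SQL. This is a hedged illustration in pyspark rather than the Scala used in the rest of the article, and the socket source, port, and table name are assumptions for demonstration only:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingSQLPreview")
    ssc = StreamingContext(sc, 10)                    # 10-second batches
    sqlCtx = SQLContext(sc)

    lines = ssc.socketTextStream("localhost", 9999)   # assumed streaming source

    def query_batch(time, rdd):
        if rdd.isEmpty():
            return
        # Convert the batch RDD into a DataFrame and query it with standard SQL
        df = sqlCtx.createDataFrame(rdd.map(lambda l: Row(line=l)))
        df.registerTempTable("batch")
        sqlCtx.sql("SELECT count(*) AS events FROM batch").show()

    lines.foreachRDD(query_batch)
    ssc.start()
    ssc.awaitTermination()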
Understanding Spark SQL

Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to data residing in an RDBMS in the form of rows and columns, with SQL queries executed to perform analysis, but Spark SQL is much more versatile and flexible than an RDBMS. Spark SQL provides distributed processing of SQL queries and can be compared to frameworks such as Hive, Impala, or Drill. Here are a few notable features of Spark SQL:

- Spark SQL is capable of loading data from a variety of data sources, such as text files, JSON, Hive, HDFS, Parquet format, and of course RDBMS, so that we can consume, join, and process datasets from different and varied data sources.
- It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining the schema for known data structures/types, and also for datasets where the columns and their types are not known until runtime.
- It can work as a distributed query engine using the Thrift JDBC/ODBC server or the command-line interface, where end users or applications can interact with Spark SQL directly to run SQL queries.
- Spark SQL provides integration with Spark Streaming, where DStreams can be transformed into a structured format against which SQL queries can be executed.
- It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing.
- It supports schema evolution, so that new columns can be added to or deleted from the existing schema while Spark SQL maintains compatibility between all versions of the schema.

Spark SQL defines a higher level of programming abstraction called DataFrames, which is an extension of the existing RDD API. DataFrames are distributed collections of objects in the form of rows and named columns, similar to tables in an RDBMS, but with much richer functionality encompassing all the previously listed features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

Let's move ahead and understand how Spark SQL works with the help of an example. As a first step, let's create sample JSON data containing basic information about a company's departments, such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this:

    [
      { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 },
      { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 },
      { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 },
      { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 },
      { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 }
    ]

You can use any online JSON editor, such as http://codebeautify.org/online-json-editor, to see and edit the data defined in the preceding JSON code. Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it ScalaFirstSparkSQL.scala.
Next, add the following import statements just below the package declaration:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

Further, in your main method, add the following set of statements to create an SQLContext from the SparkContext:

    //Creating Spark Configuration
    val conf = new SparkConf()
    //Setting Application/Job Name
    conf.setAppName("My First Spark SQL")
    // Define Spark Context which we will use to initialize our SQL Context
    val sparkCtx = new SparkContext(conf)
    //Creating SQL Context
    val sqlCtx = new SQLContext(sparkCtx)

SQLContext, or any of its descendants such as HiveContext (for working with Hive tables) or CassandraSQLContext (for working with Cassandra tables), is the main entry point for accessing all functionality of Spark SQL. It allows the creation of data frames and also provides the functionality to fire SQL queries over data frames.

Next, we will define the following code to load the JSON file (company.json) using the SQLContext and create a data frame:

    //Define path of your JSON File (company.json) which needs to be processed
    val path = "/home/softwares/spark/data/company.json";
    //Use SQLContext and load the JSON file.
    //This will return the DataFrame which can be further queried using SQL queries.
    val dataFrame = sqlCtx.jsonFile(path)

In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility methods defined by SQLContext for reading raw data from the filesystem, creating data frames from existing RDDs, and many more.

Spark SQL supports two different methods for converting existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing the Spark application; we have used this approach in our example. The second method is through a programmatic interface that allows you to construct a schema, apply it to an existing RDD, and finally generate a data frame. This method is more verbose, but it provides flexibility and helps in those instances where the columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext.

Once the DataFrame is created, we need to register it as a temporary table within the SQL context so that we can execute SQL queries over it. Let's add the following piece of code for registering our DataFrame with our SQL context under the name company:

    //Register the data as a temporary table within SQL Context
    //Temporary table is destroyed as soon as SQL Context is destroyed.
    dataFrame.registerTempTable("company");

And we are done! Our JSON data is automatically organized into a table (rows/columns) and is ready to accept SQL queries. Even the data types are inferred from the type of data entered within the JSON file itself. Now we will start executing SQL queries on our table, but before that, let's see the schema created by SQLContext:

    //Printing the Schema of the Data loaded in the Data Frame
    dataFrame.printSchema();

The execution of the preceding statement prints the schema that Spark SQL inferred for the JSON data. Pretty simple and straightforward, isn't it?
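For completeness, here is what the second, programmatic approach mentioned above can look like. This is a hedged pyspark sketch rather than the article's Scala, it assumes a SparkContext sc and SQLContext sqlContext already exist, and the CSV path and parsing logic are illustrative assumptions; the point is simply that the schema is built by hand at runtime instead of being inferred:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Build the schema explicitly, as you would when column types are only known at runtime
    schema = StructType([
        StructField("Name", StringType(), True),
        StructField("No_Of_Emp", IntegerType(), True),
        StructField("No_Of_Supervisors", IntegerType(), True),
    ])

    # Parse raw text into tuples that match the schema (illustrative path and format)
    rows = sc.textFile("/home/softwares/spark/data/company.csv") \
             .map(lambda line: line.split(",")) \
             .map(lambda t: (t[0], int(t[1]), int(t[2])))

    companyDf = sqlContext.createDataFrame(rows, schema)
    companyDf.registerTempTable("company_programmatic")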
Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or define it programmatically (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection).

Next, let's execute some SQL queries to see the data stored in the DataFrame. The first SQL query prints all records:

    //Executing SQL Queries to Print all records in the DataFrame
    println("Printing All records")
    sqlCtx.sql("Select * from company").collect().foreach(print)

The execution of the preceding statement produces the results on the console where the driver is executed. Next, let's select only a few columns instead of all records and print them on the console:

    //Executing SQL Queries to Print Name and Employees
    //in each Department
    println("\n Printing Number of Employees in All Departments")
    sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println)

The execution of the preceding statement again produces the results on the console where the driver is executed. Now, finally, let's do some aggregation and count the total number of employees across the departments:

    //Using the aggregate function (agg) to print the
    //total number of employees in the Company
    println("\n Printing Total Number of Employees in Company_X")
    val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum"))
    allRec.collect.foreach ( println )

In the preceding piece of code, we used the agg(…) function and computed the sum of employees across the departments, where sum can be replaced by avg, max, min, or count. Executing these statements shows the result of the aggregation on our company.json data on the driver console. Refer to the DataFrame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available aggregation functions.

As a last step, we will stop our Spark SQL context by invoking the stop() function on the SparkContext: sparkCtx.stop(). This is required so that your application can notify the master or resource manager to release all resources allocated to the Spark job. It also ensures a graceful shutdown of the job and avoids any resource leakage that may happen otherwise. Also, as of now there can be only one Spark context active per JVM, and we need to stop() the active SparkContext before creating a new one.

Summary

In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we considered JSON files as an example, we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md), MongoDB (https://github.com/Stratio/spark-mongodb), or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html).

Further resources on this subject:
- Getting Started with Apache Spark DataFrames [article]
- Sabermetrics with Apache Spark [article]
- Getting Started with Apache Spark [article]


Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into indexed data. The Hadoop ecosystem does a great job of making Hadoop easily usable for different users by providing comfortable interfaces; Hive and Pig are two examples. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading.

Pigging out Elasticsearch

For many use cases, Pig is one of the easiest ways to fiddle around with data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move data to and from Elasticsearch. If you don't know Pig yet, never mind: you can still carry on with the steps, and by the end of the article you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch

Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up this version. First, download the Pig distribution using the following command:

    $ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

Then, extract Pig to the desired location and rename it to a convenient name:

    $ cd /usr/local
    $ sudo tar -xvf pig.tar.gz
    $ sudo mv pig-0.15.0 pig

Now, export the required environment variables by appending the following two lines to the /home/eshadoop/.bashrc file:

    export PIG_HOME=/usr/local/pig
    export PATH=$PATH:$PIG_HOME/bin

You can either log out and log back in to see the newly set environment variables or source the environment configuration with the following command:

    $ source ~/.bashrc

Now, start the job history server daemon with the following command:

    $ mr-jobhistory-daemon.sh start historyserver

You should then see the Pig console with the following command:

    $ pig
    grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You can configure this daemon to run on startup, or you need to ensure it is started manually.

Now we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it to HDFS using the following steps. First, download the ES-Hadoop JAR used to develop the examples in this article:

    $ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

Then, create a convenient location for the downloaded JAR:

    $ sudo mkdir /opt/lib

Now, import the JAR to HDFS:

    $ hadoop fs -mkdir /lib
    $ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it to HDFS at /ch07/crime_data.csv.

Importing data to Elasticsearch

Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
ES-Hadoop provides the EsStorage class to use as the Pig storage handler. In order to use the EsStorage class, you need to have the ES-Hadoop JAR registered with Pig. You can register a JAR located in the local filesystem, HDFS, or another shared filesystem. The REGISTER command registers a JAR file that contains UDFs (user-defined functions) with Pig, as shown in the following code:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar;

Then, load the CSV data file as a relation with the following code:

    grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double, lon:double);

This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, uses Pig's Bag data structure and contains multiple Tuples. Now, generate the target Pig relation with a structure that closely matches the target Elasticsearch index mapping, as shown in the following code:

    grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation;

Here, we need a nested object named geoLocation in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields; TOTUPLE() helps us create this tuple, and we assign it the alias geoLocation. Let's store the TARGET relation to the Elasticsearch index with the following code:

    grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id');

We can specify the target index and type in which to store the indexed documents. The EsStorage class can accept multiple Elasticsearch configurations. es.mapping.names maps a Pig field name to the Elasticsearch document's field name. You can use Pig's id field to assign a custom _id value to the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well.

Pig uses just one reducer in the default configuration. It is recommended to change this behavior to a degree of parallelism that matches the number of shards available, as shown in the following command:

    grunt> SET default_parallel 5;

Pig also combines input splits, irrespective of their size. This makes it efficient for small files by reducing the number of mappers, but it can cause performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command:

    grunt> SET pig.splitCombination FALSE;

Executing the preceding commands will create the Elasticsearch index and import the crime data documents. If you observe the created documents in Elasticsearch, you can see that the geoLocation value is an array in the [-87.74274476, 41.87404405] format. This is because, by default, ES-Hadoop ignores the tuple field names and simply converts tuples to an ordered array.
If you wish to make your geoLocation field look like a key/value-based object with lat/lon keys instead, you can do so by including the following configuration in EsStorage:

    es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source

If you have input as a well-formed JSON file, you can avoid conversions and transformations and directly pass the JSON documents to Elasticsearch for indexing. You may have the JSON data in Pig as chararray, bytearray, or in any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:

    grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
    grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions

Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is because Elasticsearch inferred the double type from the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity; after that, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times without any surprises in type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other (the table doesn't list the obvious, intuitive type mappings):

    Pig type      Elasticsearch type
    chararray     string
    bytearray     binary
    tuple         array (default) or object
    bag           array
    map           object
    bigdecimal    Not supported
    biginteger    Not supported

Reading data from Elasticsearch

Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet that prints the tuples for crimes related to theft:

    grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
    grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
    grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch

Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on an abstraction called RDD (Resilient Distributed Dataset), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data, makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
Setting up Spark

In order to set up Apache Spark so that you can execute a job, perform the following steps. First, download the Apache Spark distribution with the following command:

    $ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following commands:

    $ cd /usr/local
    $ sudo tar -xvf spark.tgz
    $ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch

To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

    SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
    conf.set("es.index.auto.create", "true");
    JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object. Next, read the crime data CSV file as a JavaRDD; here, the RDD is still of the type String, with each element representing a line:

    JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Then map each input line to a Crime object:

    JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
        @Override
        public Crime call(String line) throws Exception {
            CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
            Crime c = new Crime();
            CSVRecord record = parser.getRecords().get(0);
            c.setId(record.get(0));
            ..
            ..
            String lat = record.get(10);
            String lon = record.get(11);
            Map<String, Double> geoLocation = new HashMap<>();
            geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
            geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
            c.setGeoLocation(geoLocation);
            return c;
        }
    });

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object; this map is populated with the lat and lon fields. The map() method returns another JavaRDD that contains the Crime objects. Finally, save the JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by ES-Hadoop, as shown in the following code:

    JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multi-resource writes with a pattern similar to esh_spark/{primaryType}, as well as using JSON strings to directly import the data to Elasticsearch. To control the Elasticsearch document metadata being indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map of the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.

Using SparkSQL

ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ versions provide the DataFrame abstraction, which represents a collection of Row objects.
We will not discuss the details of DataFrames here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

    SQLContext sqlContext = new SQLContext(context);
    DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that the passed class instance is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually; the article source code contains implementations of both approaches for your reference. Take a look at the following code:

    JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch

Here is the snippet of SparkEsReader that finds crimes related to theft:

    JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
    for (Map<String, Object> item : esRDD.collect()) {
        System.out.println(item);
    }

We used the same JavaEsSpark class to create an RDD with the documents that match the Elasticsearch query.

Using SparkSQL

ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read data from Elasticsearch using SparkSQL, as shown in the following code:

    Map<String, String> options = new HashMap<>();
    options.put("pushdown", "true");
    options.put("es.nodes", "localhost");
    DataFrame df = sqlContext.read()
        .options(options)
        .format("org.elasticsearch.spark.sql")
        .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases efficiency, as the query execution is collocated where the data resides, as shown in the following code:

    df.registerTempTable("crimes");
    DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
    for (Row row : theftCrimes.javaRDD().collect()) {
        System.out.println(row);
    }

We registered a temporary table for the data frame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally in order to print them in the driver class.

Summary

In this article, we looked at various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed a script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SparkSQL engine.

Further resources on this subject:
- Extending ElasticSearch with Scripting [Article]
- Elasticsearch Administration [Article]
- Downloading and Setting Up ElasticSearch [Article]


What are SSAS 2012 dimensions and cube?

Packt
29 Aug 2013
7 min read
What is SSAS?

SQL Server Analysis Services (SSAS) is an online analytical processing tool that greatly speeds up the kinds of SQL queries and calculations that are common in a business intelligence environment. It looks like a relational database, but it has differences. SSAS does not replace the need for relational databases, but combining the two helps in developing business intelligence solutions.

Why do we need SSAS?

SSAS provides a very clear graphical interface for end users to build queries, and it acts as a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse, and all reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting, but SSAS is a database service specifically designed for this type of workload, and in most cases it achieves much better query performance.

SSAS 2012 architecture

In this article we will explain the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. The new flavor of SSAS, which closely resembles PowerPivot, uses the tabular model. When installing SSAS, we must select either the tabular model or the multidimensional model for the instance that runs inside the server; both data models are developed under the same code base, but they are treated separately. The concepts involved in designing the two data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa, without rebuilding everything from the start. From the end users' point of view, both data models do almost the same things and appear almost identical when used through a client tool such as Excel.

The tabular model

The concept of building a database using the tabular model is very similar to building one in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we have probably made a design mistake somewhere, because everything we need should be contained within an individual database. Tabular models are designed using SQL Server Data Tools (SSDT), with a data project in SSDT mapping onto a database in Analysis Services.

The multidimensional model

This data model is very similar to the tabular model: data is managed in databases, and databases are designed in SSDT and managed using SQL Server Management Studio. The differences appear below the database level, where multidimensional rather than relational concepts apply. In the multidimensional model, data is modeled as a series of cubes and dimensions, not tables.

The future of Analysis Services

Having two data models inside SSAS, along with two query and calculation languages, is clearly not an ideal state of affairs. It means we have to select a data model at the start of our project, when we might not even know enough about our needs to gauge which one is appropriate.
It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement; new features for the multidimensional data model will continue to be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously.

What's new in SSAS 2012?

As we know, there is no easy way of converting a multidimensional data model into a tabular data model. There may be tools on the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful.

Ease of use

If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find the tabular model much easier to learn than the multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but the development process is also much more straightforward and there are far fewer features to learn.

Compatibility with PowerPivot

The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces used are practically the same, as both interfaces use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around: a tabular model cannot be converted into a PowerPivot model.

Processing performance characteristics

Comparing the processing performance of the multidimensional and tabular data models is difficult. It may be slower to process a large table in the tabular data model than the equivalent measure group in a multidimensional one, because the tabular data model can't process partitions in the same table at the same time, whereas the multidimensional model can process partitions in the same measure group simultaneously.

What is an SSAS dimension?

A database dimension is a collection of related objects, in other words attributes, which provide information about the fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide paths to assist users as they browse through the data in a cube. By default, these attributes are visible as attribute hierarchies, and they can be used to understand the fact data in a cube.

What is an SSAS cube?

A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of the cube that you use to slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve the values of measures as if they were contained in cells in the cube, with cells defined for every possible summarized value.
A cell in the cube is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection.

Summary

We talked about the special new features and services present, what you can do with them, and why they're so great.

Further resources on this subject:
- Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article]
- Performing Common MDX-related Tasks [Article]
- How to Perform Iteration on Sets in MDX [Article]


Dimensionality Reduction

Packt
03 Jan 2017
15 min read
In this article by Ashish Kumar and Avinash Paul, the authors of the book Mastering Text Mining with R, we look at dimensionality reduction. Data volume and high dimensions pose an astounding challenge in text mining tasks. Inherent noise and the computational cost of processing huge datasets make it even more arduous. The science of dimensionality reduction lies in the art of losing only a commensurately small amount of information while still reducing the high-dimensional space to a manageable proportion.

For classification and clustering techniques to be applied to text data in different natural language processing activities, we need to reduce the dimensions and noise in the data so that each document can be represented using fewer dimensions, significantly reducing the noise that can hinder performance.

The curse of dimensionality

Topic modeling and document clustering are common text mining activities, but text data can be very high-dimensional, which can cause a phenomenon called the curse of dimensionality. Some literature also calls it concentration of measure:

- Distance is attributed to all the dimensions, and each of them is assumed to have the same effect on the distance. The higher the number of dimensions, the more similar things appear to each other.
- The similarity measures do not take into account the association of attributes, which may result in inaccurate distance estimation.
- The number of samples required per attribute increases exponentially with the number of dimensions.
- Many dimensions might be highly correlated with each other, causing multi-collinearity.
- Extra dimensions cause a rapid volume increase that can result in high sparsity, which is a major issue in any method that requires statistical significance. It also causes huge variance in estimates, near duplicates, and poor predictors.

Distance concentration and computational infeasibility

Distance concentration is a phenomenon associated with high-dimensional space wherein pairwise distances or dissimilarities between points appear indistinguishable. All the vectors in high dimensions appear to be orthogonal to each other, and the distances between each data point and its neighbors, farthest or nearest, become equal. This totally jeopardizes the utility of methods that use distance-based measures.

Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove to be insufficient to accurately estimate the parameters. For a dataset with d dimensions, the number of parameters in the covariance matrix will be d^2. In an ideal scenario, n should be much larger than d^2 to avoid overfitting. While it may feel like a good idea to engineer more features when we cannot solve a problem with fewer features, the computational cost and model complexity increase with the number of dimensions. For instance, if n samples are dense enough for a one-dimensional feature space, then n^k samples would be needed to achieve the same density in a k-dimensional feature space.

Dimensionality reduction

Complex and noisy characteristics of high-dimensional textual data can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics.
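Before moving on to the individual techniques, the distance-concentration effect described above is easy to see numerically. The short sketch below (in Python with numpy, purely as an illustration alongside the R used in this article) draws random points and shows how the relative contrast between the nearest and farthest neighbor shrinks as the number of dimensions grows:

    import numpy as np

    rng = np.random.default_rng(0)

    for d in (2, 10, 100, 1000):
        points = rng.random((500, d))                    # 500 random points in d dimensions
        query = rng.random(d)                            # a query point
        dist = np.linalg.norm(points - query, axis=1)    # Euclidean distances
        contrast = (dist.max() - dist.min()) / dist.min()
        print(d, round(contrast, 3))                     # contrast shrinks as d grows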
Though the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to have the minimum number of dimensions that can preserve the intrinsic dimensionality of the data. A textual collection is mostly represented in the form of a term-document matrix, wherein we have the importance of each term in a document. The dimensionality of such a collection increases with the number of unique terms. The simplest possible dimensionality reduction method would be to specify a limit or boundary on the distribution of different terms in the collection: any term that occurs with a significantly high frequency is not going to be informative for us, and barely present terms can undoubtedly be ignored and considered as noise. Words that generally occur with high frequency and have no particular meaning are referred to as stop words; some examples are is, was, then, and the. Words that occur just once or twice are more likely to be spelling errors or complicated words, and hence both these and stop words should not be considered when modeling the document in the Term Document Matrix (TDM). We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R.

Principal component analysis

Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, called principal components; these principal components are uncorrelated linear combinations of the original features, projected in the direction of higher variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of the eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a given data distribution. The first principal component has the greatest possible variance, that is, the largest eigenvalue; each subsequent principal component is the linear combination with the maximum variance that is uncorrelated with all previous principal components.

PCA comprises the following steps:

1. Compute the n-dimensional mean of the given dataset.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix.
4. Rank/sort the eigenvectors by descending eigenvalue.
5. Choose x eigenvectors with the largest eigenvalues.

Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. It also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.
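The steps listed above map almost line for line onto code. As a point of reference before the R functions discussed next, here is a minimal from-scratch sketch in Python/numpy (the toy data and the choice of k are illustrative only):

    import numpy as np

    def pca(X, k):
        # 1. Compute the mean and center the data
        Xc = X - X.mean(axis=0)
        # 2. Covariance matrix of the features
        C = np.cov(Xc, rowvar=False)
        # 3. Eigenvectors and eigenvalues of the covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)
        # 4. Rank the eigenvectors by descending eigenvalue
        order = np.argsort(eigvals)[::-1]
        # 5. Keep the k eigenvectors with the largest eigenvalues and project
        W = eigvecs[:, order[:k]]
        return Xc @ W, eigvals[order[:k]]

    X = np.random.randn(1000, 9)              # toy data, mirroring the R example below
    scores, top_eigenvalues = pca(X, k=3)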
Using R for PCA

R has two inbuilt functions for performing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows, and to have a structure like a data frame. They also return the new data in the form of a data frame, with the principal components given in columns. prcomp() and princomp() are similar functions with slightly different implementations of PCA. Internally, the princomp() function performs PCA using eigenvectors, while the prcomp() function uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. princomp() fails in situations where the number of variables is larger than the number of observations. Each function returns a list whose class is prcomp or princomp. The information returned and the terminology are summarized in the following table:

    prcomp()    princomp()    Explanation
    sdev        sdev          Standard deviation of each column
    rotation    loadings      Principal components
    center      center        Value subtracted from each row or column to center the data
    scale       scale         Scale factors used
    x           scores        The rotated data
                n.obs         Number of observations of each variable
                call          The call to the function that created the object

Here's a list of the functions available in different R packages for performing PCA:

- PCA(): FactoMineR package
- acp(): amap package
- prcomp(): stats package
- princomp(): stats package
- dudi.pca(): ade4 package
- pcaMethods: this package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package:

    library(FactoMineR)
    data <- replicate(10, rnorm(1000))
    result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
    print(result.pca)

The output describes the results of the principal component analysis: the analysis was performed on 1,000 individuals, described by nine variables, and the results are available in the following objects:

    Name               Description
    $eig               Eigenvalues
    $var               Results for the variables
    $var$coord         coord. for the variables
    $var$cor           Correlations variables - dimensions
    $var$cos2          cos2 for the variables
    $var$contrib       Contributions of the variables
    $ind               Results for the individuals
    $ind$coord         coord. for the individuals
    $ind$cos2          cos2 for the individuals
    $ind$contrib       Contributions of the individuals
    $call              Summary statistics
    $call$centre       Mean of the variables
    $call$ecart.type   Standard error of the variables
    $call$row.w        Weights for the individuals
    $call$col.w        Weights for the variables

The eigenvalue, percentage of variance, and cumulative percentage of variance for each component are:

    comp 1   1.1573559   12.859510    12.85951
    comp 2   1.0991481   12.212757    25.07227
    comp 3   1.0553160   11.725734    36.79800
    comp 4   1.0076069   11.195632    47.99363
    comp 5   0.9841510   10.935011    58.92864
    comp 6   0.9782554   10.869505    69.79815
    comp 7   0.9466867   10.518741    80.31689
    comp 8   0.9172075   10.191194    90.50808
    comp 9   0.8542724    9.491916   100.00000

Amap package

amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.
This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation. For more details, refer to the package documentation on CRAN: https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")

Proportion of variance

We look to construct components and then choose, from them, the minimum number of components that explains the variance of the data with high confidence. R provides the prcomp() function, in the stats package loaded by default, to estimate principal components. Let's learn how to use this function to estimate the proportion of variance explained by each component:

pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance <- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
[1] 11.678126 11.301480 10.846161 10.482861 10.176036 9.605907 9.498072
[8] 9.218186 8.762572 8.430598

pr_variance signifies the proportion of variance explained by each component, in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
[1] 11.67813 22.97961 33.82577 44.30863 54.48467 64.09057 73.58864
[8] 82.80683 91.56940 100.00000

Components 1-8 explain around 82% of the variance in the data.

Singular value decomposition

Singular value decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity after the famous Netflix movie recommendation challenge. Since its inception, it has found use in many applications in statistics, mathematics, and signal processing. It is primarily a technique to factorize any matrix, real or complex: a rectangular matrix can be factorized into two orthonormal matrices and a diagonal matrix of positive real values. An m*n matrix can be viewed as m points in n-dimensional space; SVD attempts to find the best k-dimensional subspace that fits the data.

SVD in R can be used to compute approximations of the singular values and singular vectors of large-scale data matrices. These approximations are made using different memory-efficient algorithms, and IRLBA (the implicitly restarted Lanczos bidiagonalization algorithm) is one of them. We shall be using the irlba package here in order to implement SVD.
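Before building a randomized SVD by hand in the next section, note that the irlba package can also be called directly. The following is a hedged sketch on an illustrative random matrix (the data and the choice of nv = 5 are assumptions); it returns exactly the d, u, v, iter, and mprod values described below:

# Truncated SVD with the irlba package (illustrative data)
# install.packages("irlba")   # if not already installed
library(irlba)
set.seed(1)
data <- matrix(rnorm(1000 * 50), nrow = 1000)

svd_approx <- irlba(data, nv = 5)   # only the 5 largest singular triplets

svd_approx$d        # approximate singular values
dim(svd_approx$u)   # 1000 x 5 approximate left singular vectors
dim(svd_approx$v)   # 50 x 5 approximate right singular vectors
svd_approx$iter     # number of IRLBA iterations performed
svd_approx$mprod    # number of matrix-vector products performed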
Implementation of SVD using R

The following code shows the implementation of SVD using R:

# List of packages for the session
packages = c("foreach", "doParallel", "irlba")

# Install CRAN packages (if not already installed)
inst <- packages %in% installed.packages()
if(length(packages[!inst]) > 0) install.packages(packages[!inst])

# Load packages into session
lapply(packages, require, character.only=TRUE)

# Register the parallel backend
registerDoParallel(cores=detectCores(all.tests=TRUE))

std_svd <- function(x, k, p=25, iter=1) {
  m1 <- as.matrix(x)
  r <- nrow(m1)
  c <- ncol(m1)
  p <- min(min(r,c)-k, p)
  z <- k+p
  m2 <- matrix(rnorm(z*c), nrow=c, ncol=z)
  y <- m1 %*% m2
  q <- qr.Q(qr(y))
  b <- t(q) %*% m1

  # Power iterations to refine the projection
  b1 <- foreach(i=1:iter) %dopar% {
    y1 <- m1 %*% t(b)
    q1 <- qr.Q(qr(y1))
    b1 <- t(q1) %*% m1
  }
  b1 <- b1[[iter]]
  b2 <- b1 %*% t(b1)
  eigens <- eigen(b2, symmetric=T)

  result <- list()
  result$svalues <- sqrt(eigens$values)[1:k]
  result$u <- (q %*% eigens$vectors)[,1:k]
  # right singular vectors: scale by 1/singular value = 1/sqrt(eigenvalue)
  result$v <- (t(b) %*% eigens$vectors %*% diag(1/sqrt(eigens$values)))[,1:k]
  return(result)
}

svd <- std_svd(x=data, k=5)

# singular values
svd$svalues
[1] 35.37645 33.76244 32.93265 32.72369 31.46702

When SVD is run through the irlba package itself (as sketched earlier), the returned object contains the following values:

- d: approximate singular values
- u: nu approximate left singular vectors
- v: nv approximate right singular vectors
- iter: the number of IRLBA algorithm iterations
- mprod: the number of matrix-vector products performed

These values can be used to obtain the results of the SVD and to understand how the algorithm performed overall.

Latent factors

# svd$u, svd$v
dim(svd$u)   # u from the randomized SVD
[1] 1000 5
dim(svd$v)   # v from the randomized SVD
[1] 10 5

A modified version of the previous function can be obtained by altering the power iterations for a more robust implementation:

foreach(i = 1:iter) %dopar% {
  y1 <- m1 %*% t(b)
  y2 <- t(y1) %*% y1
  r2 <- chol(y2, pivot = T)
  q1 <- y1 %*% solve(r2)   # orthonormal basis via the Cholesky factor
  b1 <- t(q1) %*% m1
}
b2 <- b1 %*% t(b1)

Some other functions available in R packages are as follows:

Function     Package
svd()        svd
irlba()      irlba
svdImpute    bcv

ISOMAP – moving towards non-linearity

ISOMAP is a nonlinear dimensionality reduction method and is representative of isometric mapping methods. ISOMAP is one of the approaches to manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic inter-point distances on the manifold. Like multi-dimensional scaling, ISOMAP creates a visual presentation of the distances between a number of objects. A geodesic is the shortest curve along the manifold connecting two points, induced by a neighborhood graph. Multi-dimensional scaling uses the Euclidean distance measure; since the data is nonlinear, ISOMAP uses the geodesic distance instead. ISOMAP can be viewed as an extension of metric multi-dimensional scaling.
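Since ISOMAP is described here as an extension of metric multi-dimensional scaling, it can help to see what plain classical MDS looks like first. The sketch below uses only base R and an illustrative random matrix (both assumptions); ISOMAP essentially replaces the Euclidean distances used here with geodesic distances computed over a neighborhood graph:

# Classical (metric) MDS on Euclidean distances, for contrast with ISOMAP
set.seed(1)
data <- matrix(rnorm(200 * 5), nrow = 200)

d_euclid <- dist(data)                 # Euclidean distances between rows
mds_fit  <- cmdscale(d_euclid, k = 2)  # embed the points into 2 dimensions

dim(mds_fit)   # 200 x 2 coordinates that approximately preserve the distances
plot(mds_fit, xlab = "Dim 1", ylab = "Dim 2", main = "Classical MDS")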
At a very high level, ISOMAP can be described in four steps:
1. Determine the neighbors of each point.
2. Construct a neighborhood graph.
3. Compute the shortest path between all pairs of points.
4. Construct the k-dimensional coordinate vectors by applying MDS.

The geodesic distance approximation is calculated in three ways:
- Neighboring points: the input-space distance
- Faraway points: a sequence of short hops between neighboring points
- Method: finding shortest paths in a graph with edges connecting neighboring data points

source("http://bioconductor.org/biocLite.R")
biocLite("RDRToolbox")
library('RDRToolbox')
library(rgl)
swiss_Data = SwissRoll(N = 1000, Plot=TRUE)
x = SwissRoll()
open3d()
plot3d(x, col=rainbow(1050)[-c(1:50)], box=FALSE, type="s", size=1)
simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10, plotResiduals=TRUE)

library(vegan)
data(BCI)
distance <- vegdist(BCI)
tree <- spantree(distance)
pl1 <- ordiplot(cmdscale(distance), main="cmdscale")
lines(tree, pl1, col="red")
z <- isomap(distance, k=3)
rgl.isomap(z, size=4, color="red")
pl2 <- plot(isomap(distance, epsilon=0.5), main="isomap epsilon=0.5")
pl3 <- plot(isomap(distance, k=5), main="isomap k=5")
pl4 <- plot(z, main="isomap k=3")

Summary

The idea of this article was to get you familiar with some of the generic dimensionality reduction methods and their implementation using the R language. We discussed a few packages that provide functions to perform these tasks. We also covered a few custom functions that can be utilized to perform them. Kudos, you have completed the basics of text mining with R. You should now feel confident about various data mining methods, text mining algorithms (related to natural language processing of texts) and, after reading this article, dimensionality reduction. If you feel a little low on confidence, do not be upset. Turn a few pages back and try implementing those tiny code snippets on your own dataset, and figure out how they help you understand your data. Remember this - to mine something, you have to get into it by yourself. This holds true for text as well.

Resources for Article:

Further resources on this subject:
- Data Science with R [Article]
- Machine Learning with R [Article]
- Data mining [Article]
Oracle Business Intelligence: Drilling Data Up and Down

Packt
21 Oct 2010
5 min read
What is data drilling? In terms of Oracle Discoverer, drilling is a technique that enables you to quickly navigate through worksheet data, finding the answers to the questions facing your business. As mentioned, depending on your needs, you can use drilling to view the data you're working with in deeper detail or, in contrast, drill it up to a higher level. The drilling to detail technique enables you to look at the values making up a particular summary value. Also, you can drill to related items, adding related information that is not currently included in the worksheet. So, Discoverer supports a set of drilling tools, including the following: Drilling up and down Drilling to a related item Drilling to detail Drilling out The following sections cover the above tools in detail, providing examples on how you might use them. Drilling to a related item Let's begin with a discussion on how to drill to a related item, adding the detailed information for a certain item. As usual, this is best understood by example. Suppose you want to drill from the Maya Silver item, which can be found on the left axis of the worksheet, to the Orddate:Day item. Here are the steps to follow: Let's first create a copy of the worksheet to work with in this example. To do this, move to the worksheet discussed in the preceding example and select the Edit | Duplicate Worksheet | As Crosstab menu of Discoverer. In the Duplicate as Crosstab dialog, just click OK. As a result a copied worksheet should appear in the workbook. On the worksheet, right-click the Maya Silver item and select Drill… in the pop-up menu: As a result, the Drill dialog should appear. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Orddate:Day item, as shown in the following screenshot: Then, click OK to close the dialog and rearrange the data on the worksheet. The reorganized worksheet should now look like the following one: As you can see, this shows the Maya Silver item broken down into day sales per product. Now suppose you want to see a more detailed view of the Maya Silver item and break it out further into product category. Right-click the Maya Silver item and select Drill… in the pop-up menu. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Category item. Next, click OK.The resulting worksheet should look now like this: As you can see, the result of the drilling operations you just performed is that you can see the dollar amount for Maya Silver detailed by category, by day, by product. You may be asking yourself if it's possible to change the order in which the Maya Silver record is detailed. Say, you want to see it detailed in the following order: by day, by category, and finally by product. The answer is sure. On the left axis of the worksheet, drag the Orddate:Day item (the third from the left) to the second position within the same left axis, just before the Category item, as shown in the following screenshot: As a result, you should see that the data on the worksheet has been rearranged as shown in the following screenshot: Having just a few rows in the underlying tables, as we have here, is OK for demonstration purposes, since it results in compact screenshots. To see more meaningful figures on the worksheet though, you might insert more rows into the orderitems, orders, and products underlying tables. 
Once you're done with it, you can click the Refresh button on the Discoverer toolbar to see an updated worksheet. Select the File | Save menu option of Discoverer to save the worksheet discussed here. Drilling up and down As the name implies, drilling down is a technique you can use to float down a drill hierarchy to see data in more detail. And drilling up is the reverse operation, which you can use to slide up a drill hierarchy to see consolidated data. But what is a drill hierarchy? Working with drill hierarchies A drill hierarchy represents a set of items related to each other according to the foreign key relationships in the underlying tables. If a worksheet item is associated with a drill hierarchy, you can look at that hierarchy by clicking the drill icon located at the left of the heading of the worksheet item. Suppose you want to look at the hierarchy associated with the Orddate item located on our worksheet at the top axis. To do this, click the Orddate drill icon. As a result, you should see the menu shown in the following screeenshot: As you can see, you can drill up here from Orddate to Year, Quarter, or Month. The next screenshot illustrates what you would have if you chose Month. It's important to note that you may have more than one hierarchy associated with a worksheet item. In this case, you can move on to the hierarchy you want to use through the All Hierarchies option on the drill menu.
Logistic Regression Using TensorFlow

Packt
06 Mar 2018
9 min read
In this article, by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook, we will learn how to perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup.

(For more resources related to this topic, see here.)

What is TensorFlow

TensorFlow is another open source library, developed by the Google Brain Team, for building numerical computation models using data flow graphs. The core of TensorFlow was developed in C++, with a wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API, composed of Python modules, to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python TensorFlow API for execution, so it is essential to install TensorFlow in both R and Python to make it work. The following are the dependencies for tensorflow:

- Python 2.7 / 3.x
- R (> 3.2)
- The devtools package in R, for installing TensorFlow from GitHub
- TensorFlow in Python
- pip

Getting ready

The code for this section was created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python into the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...

The data is imported using a standard function from R, as shown in the following code. The data is imported using read.csv and transformed into matrix format, followed by selecting the features used for modeling, as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run the optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <- as.matrix(read.csv("datatraining.txt", stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt", stringsAsFactors = T))

# Subset features for modeling and transform to numeric values
occupancy_train <- apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test <- apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)

# Data dimensions
nFeatures <- length(xFeatures)
nRow <- nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session, as it will allow us to execute variables without referring back to the session object each time:

# Starting session as interactive session
sess <- tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant, as it will be an input to the system. The weight W and bias b are defined as Variables that will be optimized during the optimization process. y is set up as a symbolic representation combining x, W, and b. The weight W is initialized from a random uniform distribution and b is assigned the value zero.
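As a small, purely illustrative aside (the demo tensors below are assumptions, not part of the recipe), the difference between a constant and a Variable can be seen directly in the session: constants are fixed once defined, while Variables hold state that must be initialized explicitly and that the optimizer is later allowed to change:

# Constants are fixed; Variables hold trainable state (illustrative only)
const_demo <- tf$constant(matrix(1, 2, 2), dtype=np$float32)
var_demo <- tf$Variable(tf$zeros(shape(2L, 2L)))
sess$run(tf$global_variables_initializer())  # Variables need explicit initialization
sess$run(const_demo)  # 2 x 2 matrix of ones
sess$run(var_demo)    # 2 x 2 matrix of zeros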
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)

# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of the weights, using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20 == 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...

The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))

# Performance on test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

The AUC can be visualized using the plot.roc function from the pROC package, as shown in the screenshot following this command. The performance for training and testing (holdout) is very similar.

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

Performance of logistic regression using TensorFlow

Visualizing TensorFlow graphs

TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready

TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:

- --logdir: To map to the directory from which to load TensorFlow events
- --debug: To increase log verbosity
- --host: To define the host to listen on; localhost (127.0.0.1) by default
- --port: To define the port on which TensorBoard will serve

The preceding command will launch the TensorBoard service on localhost at port 6006, as shown in the following screenshot:

TensorBoard

The tabs in TensorBoard capture the relevant data generated during graph execution.

How to do it...

This section covers how to visualize TensorFlow models and output in TensorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module.
A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

The graph for the logistic regression developed using the preceding code is shown in the following screenshot:

Visualization of the logistic regression graph in TensorBoard

Similarly, other variable summaries can be added to TensorBoard using the appropriate summary operations, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$histogram_summary("weights", W)
b_hist = tf$histogram_summary("biases", b)

Create a cross entropy evaluation for the test data. An example script to generate the cross entropy cost function for test and train is shown in the following command:

# Set-up cross entropy for test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=ypredt, name="cross_entropy_tst"))

Add the summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary <- tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary <- tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writer object, log_writer. It writes the default graph to the location c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50 == 0){
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries into a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

Summary

In this article, we learned how to perform logistic regression using TensorFlow, and covered the application of TensorFlow in setting up a logistic regression model.
Top 5 Machine Learning Movies

Chris Key
17 Oct 2017
3 min read
Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning. Short Circuit 2 [imdb] "Hey laser-lips, your momma was a snow blower!" A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence, however the plot of the film centres around his naivety and lack of ability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or right if you like stealing car stereos for "Los Locos". The Matrix Revolutions [imdb] The robots learn to balance an equation. Bet you wish you had them in your math high-school class. Also kudos to the Wachowski brothers who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy. Blade Runner [imdb] “I've seen things you people wouldn't believe.” In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand if Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend. Wall-E [imdb] Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, we see a small utility robot display a full range of emotions that we can relate to. He also demonstrates other signs of life – his need for energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet, and his passion for collecting interesting objects. Terminator 2 [imdb] “I know now why you cry, but it is something I can never do” Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby. Footnote - replicants aren't really machines. The replicants are genetic engineered and created by the Tyrell corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.
Deep Learning and Image generation: Get Started with Generative Adversarial Networks

Mohammad Pezeshki
27 Sep 2016
5 min read
In machine learning, a generative model is one that captures the observable data distribution. The objective of deep neural generative models is to disentangle different factors of variation in data and be able to generate new or similar-looking samples of the data. For example, an ideal generative model on face images disentangles all the different factors of variation, such as illumination, pose, gender, skin color, and so on, and is also able to generate a new face by combining those factors in a very non-linear way. Figure 1 shows a trained generative model that has learned different factors, including pose and the degree of smiling. On the x-axis, as we go to the right, the pose changes, and on the y-axis, as we move upwards, smiles turn to frowns. Usually these factors are orthogonal to one another, meaning that changing one while keeping the others fixed leads to a single change in data space; for example, in the first row of Figure 1, only the pose changes, with no change in the degree of smiling. The figure is adapted from here.

Based on the assumption that these underlying factors of variation have a very simple distribution (unlike the data itself), to generate a new face we can simply sample a random number from the assumed simple distribution (such as a Gaussian). In other words, if there are k different factors, we randomly sample from a k-dimensional Gaussian distribution (aka noise).

In this post, we will take a look at one of the recent models in the area of deep learning and generative models, called the generative adversarial network (GAN). This model can be seen as a game between two agents: the Generator and the Discriminator. The generator generates images from noise, and the discriminator discriminates between real images and those generated by the generator. The objective is then to train the model such that, while the discriminator tries to distinguish generated images from real images, the generator tries to fool the discriminator.

To train the model, we need to define a cost. In the case of the GAN, the errors made by the discriminator are considered as the cost. Consequently, the objective of the discriminator is to minimize the cost, while the objective of the generator is to fool the discriminator by maximizing the cost. A graphical illustration of the model is shown in Figure 2.

Formally, we define the discriminator as a binary classifier D : R^m → {0, 1} and the generator as the mapping G : R^k → R^m, in which k is the dimension of the latent space that represents all of the factors of variation. Denoting the data by x and a point in the latent space by z, the model can be trained by playing the following minimax game:

min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]

Note that the first term encourages the discriminator to discriminate between generated images and real ones, while the second term encourages the generator to come up with images that would fool the discriminator. In practice, the log in the second term can saturate, which would hurt the flow of the gradient. As a result, the generator's cost may be reformulated equivalently as maximizing log D(G(z)) rather than minimizing log(1 − D(G(z))):

max_G  E_z[log D(G(z))]

At generation time, we can sample from a simple k-dimensional Gaussian distribution with zero mean and unit variance and pass it to the generator. Among the different models that can be used as the discriminator and generator, we use deep neural networks, with parameters θ_D and θ_G, for the discriminator and generator respectively.
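Before moving to deep networks, the alternating minimax updates can be seen on a deliberately tiny example. The sketch below (in R) is purely illustrative and is an assumption on my part rather than anything from the original post: the generator is a 1-D linear map and the discriminator a single logistic unit, not the convolutional networks discussed here, but the update structure — one gradient step for D, one for G, using the non-saturating generator objective — is the same:

# Toy 1-D GAN to illustrate the alternating updates (illustrative only)
sigmoid <- function(u) 1 / (1 + exp(-u))
set.seed(42)

sample_real <- function(n) rnorm(n, mean = 4, sd = 1)  # real data: N(4, 1)

a <- 1; b <- 0        # generator g(z) = a*z + b
w <- 0.1; cd <- 0     # discriminator d(x) = sigmoid(w*x + cd)
lr <- 0.05; batch <- 128

for (step in 1:3000) {
  x_real <- sample_real(batch)
  z <- rnorm(batch)              # k = 1 dimensional Gaussian noise
  x_fake <- a * z + b

  # Discriminator step: ascend mean(log d(x_real)) + mean(log(1 - d(x_fake)))
  d_real <- sigmoid(w * x_real + cd)
  d_fake <- sigmoid(w * x_fake + cd)
  w  <- w  + lr * (mean((1 - d_real) * x_real) - mean(d_fake * x_fake))
  cd <- cd + lr * (mean(1 - d_real)            - mean(d_fake))

  # Generator step: ascend mean(log d(g(z))), the non-saturating objective
  d_fake <- sigmoid(w * (a * z + b) + cd)
  a <- a + lr * mean((1 - d_fake) * w * z)
  b <- b + lr * mean((1 - d_fake) * w)
}

b   # the generated mean should have drifted towards the real mean of 4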
Since the training boils down to updating the parameters using the backpropagation algorithm, each training step alternates between a gradient update of the discriminator's parameters and a gradient update of the generator's parameters.

If we use a convolutional network as the discriminator and another convolutional network with fractionally strided convolution layers as the generator, the model is called a DCGAN (Deep Convolutional Generative Adversarial Network). Some samples of bedroom image generation from this model are shown in Figure 3.

The generator can also be a sequential model, meaning that it can generate an image using a sequence of images with lower resolution or detail. A few examples of the images generated by such a model are shown in Figure 4. The GAN and later variants such as the DCGAN are currently considered to be among the best when it comes to the quality of the generated samples. The images look so realistic that you might assume that the model has simply memorized instances of the training set, but a quick KNN search reveals this not to be the case.

About the author

Mohammad Pezeshki is a master's student in the LISA lab of Universite de Montreal, working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.
Navigation Mesh Generation

Packt
19 Dec 2014
9 min read
In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity. Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options of customizing our navigation meshes better. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about: How navigation mesh generation works and the algorithm behind it Advanced options for customizing navigation meshes Creating advanced navigation meshes with RAIN (For more resources related to this topic, see here.) An overview of a navigation mesh To use navigation meshes, also referred to as NavMeshes, effectively the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player, instead it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did we wouldn't need one) since it's just the area a character can walk. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the areas would just be a handful of very large polys giving a simplified view of the level. The purpose of navigation mesh is to provide this simplified representation to the rest of the AI system a way to find a path between two points on a level for a character. This is its purpose; let's discuss how they are created. It used to be a common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry and create one using standard polygon mesh modelling tools and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research in automatic navigation mesh generation. There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Monomen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big of steps it can take, and then does a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an inputted cell size. This means the level geometry is divided into voxels (cubes) creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes and is culled based on things such as the slope of the geometry or how big a step height is between geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. 
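The voxelization pass described above is where most of the generation cost goes, and it is driven directly by the cell size parameter covered in the next section. As a rough, purely illustrative back-of-the-envelope calculation (written in R only because that is convenient for quick arithmetic; the 20 x 20 unit walkable area matches the Size value used in the demo, but is otherwise an assumption), shrinking the cell size by a factor of ten multiplies the number of grid cells the generator has to rasterize by roughly a hundred:

# Illustrative only: horizontal voxel grid sizes for a 20 x 20 unit level
level_extent <- c(x = 20, z = 20)
cell_sizes <- c(1, 0.1, 0.01)

cells <- sapply(cell_sizes, function(cs) prod(ceiling(level_extent / cs)))
data.frame(cell_size = cell_sizes, grid_cells = cells)
# roughly 400, 40,000, and 4,000,000 cells respectively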
The source code and more information on the original C++ implementation of Recast is available at https://github.com/memononen/recastnavigation. Advanced NavMesh parameters Now that we understand how navigation mesh generations works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do these with RAIN: Open Unity and create a new scene and a floor and some blocks for walls. Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot: Now let's look at some of the important parameters: Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level and use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20. Walkable Radius: This is an important parameter to define the character size of the mesh. Remember, each mesh will be matched to the size of a particular character, and this is the radius of the character. You can visualize the radius for a character by adding a Unity Sphere Collider script to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider. Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes to inspect the geometry. The smaller the size, the more detailed and finer mesh, but longer the processing time for Recast. A large cell size makes computation fast but loses detail. For example, here is a NavMesh from our demo with a cell size of 0.01: You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1: Note the difference between the two screenshots. In the former, walking through the two walls lower down in our picture is possible, but in the latter with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1: As you can see, the detail becomes jumbled and the mesh itself becomes unusable. With such differing results, the big question is how large should a cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that as the processing time to generate one is done during development and not at runtime even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is a good practice to save the scene before attempting to generate a complex navigation mesh. The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further: Max Slope: This is the largest slope that a character can walk on. This is how much a piece of geometry that is tilted can still be walked on. If you take the wall and rotate it, you can see it is walkable: The preceding is a screenshot of a walkable object with slope. Step Height: This is how high a character can step from one object to another. 
For example, if you have steps between two blocks, as shown in the following screenshot, this would define how far in height the blocks can be apart and whether the area is still considered walkable: This is a screenshot of the navigation mesh with step height set to connect adjacent blocks. Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to a least one unit off the ground and set the walkable height to 1, the area underneath would become walkable:   You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block. These are the most important parameters. There are some other parameters related to the visualization and to cull objects. We will look at culling more in the next section. Culling areas Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo and duplicate the floor and pull it down. Then transform one of the walls to a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup: This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation meshes' RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh areas under bridge culled: Using layers and the walkable tag you can customize navigation meshes. Summary Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games. Resources for Article: Further resources on this subject: Components in Unity [article] Enemy and Friendly AIs [article] Introduction to AI [article]
MicroStrategy 10

Packt
15 Jul 2016
13 min read
In this article by Dmitry Anoshin, Himani Rana, and Ning Ma, the authors of the book, Mastering Business Intelligence with MicroStrategy, we are going to talk about MicroStrategy 10 which is one of the leading platforms on the market, can handle all data analytics demands, and offers a powerful solution. We will be discussing the different concepts of MicroStrategy such as its history, deployment, and so on. (For more resources related to this topic, see here.) Meet MicroStrategy 10 MicroStrategy is a market leader in Business Intelligence (BI) products. It has rich functionality in order to meet the requirements of modern businesses. In 2015, MicroStrategy provided a new release of MicroStrategy, version 10. It offers both agility and governance like no other BI product. In addition, it is easy to use and enterprise ready. At the same time, it is great for both IT and business. In other words, MicroStrategy 10 offers an analytics platform that combines an easy and empowering user experience, together with enterprise-grade performance, management, and security capabilities. It is true bimodal BI and moves seamlessly between styles: Data discovery and visualization Enterprise reporting and dashboards In-memory high performance BI Scales from departments to enterprises Administration and security MicroStrategy 10 consists of three main products: MicroStrategy Desktop, MicroStrategy Mobile, and MicroStrategy Web. MicroStrategy Desktop lets users start discovering and visualizing data instantly. It is available for Mac and PC. It allows users to connect, prepare, discover, and visualize data. In addition, we can easily promote to a MicroStrategy Server. Moreover, MicroStrategy Desktop has a brand new HTML5 interface and includes all connection drivers. It allows us to use data blending, data preparation, and data enrichment. Finally, it has powerful advanced analytics and can be integrated with R. To cut a long story short, we want to notice main changes of new BI platform. All developers keep the same functionality, the looks as well as architect the same. All changes are about Web interface and Intelligence Server. Let's look closer at what MicroStrategy 10 can show us. MicroStrategy 10 expands the analytical ecosystem by using third-party toolkits such as: Data visualization libraries: We can easily plug in and use any visualization from the expanding range of Java libraries Statistical toolkits: R, SAS, SPSS, KXEN, and others Geolocation data visualization: Uses mapping capabilities to visualize and interact with location data MicroStrategy 10 has more than 25 new data sources that we can connect to quickly and simply. In addition, it allows us build reports on top of other BI tools, such as SAP Business Objects, Cognos, and Oracle BI. It has a new connector to Hadoop, which uses the native connector. Moreover, it allows us to blend multiple data sources in-memory. We want to notice that MicroStrategy 10 got reach functionality for work with data such as: Streamlined workflows to parse and prepare data Multi-table in-memory support from different sources Automatically parse and prepare data with every refresh 100+ inbuilt functions to profile and clean data Create custom groups on the fly without coding In terms of connection to Hadoop, most BI products use Hive or Impala ODBC drivers in order to use SQL to get data from Hadoop. However, this method is bad in terms of performance. MicroStrategy 10 queries directly against Hadoop. As a result, it is up to 50 times faster than via ODBC. 
Let's look at some of the main technical changes that have significantly improved MicroStrategy. The platform is now faster than ever before, because it doesn't have a two-billion-row limit on in-memory datasets and allows us to create analytical cubes up to 16 times bigger in size. It publishes cubes dramatically faster. Moreover, MicroStrategy 10 has higher data throughput and cubes can be loaded in parallel 4 times faster with multi-threaded parallel loading. In addition, the in-memory engine allows us to create cubes 80 times larger than before, and we can access data from cubes 50% faster, by using up to 8 parallel threads. Look at the following table, where we compare in-memory cube functionality in version 9 versus version 10: Feature Ver. 9 Ver. 10 Data volume 100 GB ~2TB Number of rows 2 billion 200 billion Load rate 8 GB/hour ~200 GB/hour Data model Star schema Any schema, tabular or multiple sets   In order to make the administration of MicroStrategy more effective in the new version, MicroStrategy Operation Manager was released. It gives MicroStrategy administrators powerful development tools to monitor, automate, and control systems. Operations Manager gives us: Centralized management in a web browser Enterprise Manager Console within Tool Triggers and 24/7 alerts System health monitors Server management Multiple environment administration MicroStrategy 10 education and certification MicroStrategy 10 offers new training courses that can be conducted offline in a training center, or online at http://www.microstrategy.com/us/services/education. We believe that certification is a good thing on your journey. The following certifications now exist for version 10: MicroStrategy 10 Certified Associated Analyst MicroStrategy 10 Certified Application Designer MicroStrategy 10 Certified Application Developer MicroStrategy 10 Certified Administrator After passing all of these exams, you will become a MicroStrategy 10 Application Engineer. More details can be found here: http://www.microstrategy.com/Strategy/media/downloads/training-events/MicroStrategy-certification-matrix_v10.pdf. History ofMicroStrategy Let us briefly look at the history of MicroStrategy, which began in 1991: 1991: Released first BI product, which allowed users to create graphical views and analyses of information data 2000: Released MicroStrategy 7 with a web interface 2003: First to release a fully integrated reporting tool, combining list reports, BI-style dashboards, and interface analyses in a single module. 2005: Released MicroStrategy 8, including one-click actions and drag-and-drop dashboard creation 2009: Released MicroStrategy 9, delivering a seamless consolidated path from department to enterprise BI 2010: Unveiled new mobile BI capabilities for iPad and iPhone, and was featured on the iTunes Bestseller List 2011: Released MicroStrategy Cloud, the first SaaS offering from a major BI vendor 2012: Released Visual Data Discovery and groundbreaking new security platform, Usher 2013: Released expanded Analytics Platform and free Analytics Desktop client 2014: Announced availability of MicroStrategy Analytics via Amazon Web Services (AWS) 2015: MicroStrategy 10 was released, the first ever enterprise analytics solution for centralized and decentralized BI DeployingMicroStrategy 10 We know only one way to master MicroStrategy, through practical exercises. Let's start by downloading and deploying MicroStrategy 10.2. 
Overview of training architecture In order to master MicroStrategy and learn about some BI considerations, we need to download the all-important software, deploy it, and connect to a network. During the preparation of the training environment, we will cover the installation of MicroStrategy on a Linux operating system. This is very good practice, because many people work with Windows and are not familiar with Linux, so this chapter will provide additional knowledge of working with Linux, as well as installing MicroStrategy and a web server. Look at the training architecture: There are three main components: Red Hat Linux 6.4: Used for deploying the web server and Intelligence Server. Windows machine: Uses MicroStrategy Client and Oracle database. Virtual machine with Hadoop: Ready virtual machine with Hadoop, which will connect to MicroStrategy using a brand new connection. In the real world, we should use separate machines for every component, and sometimes several machines in order to run one component. This is called clustering. Let's create a virtual machine. Creating of Red Hat Linux virtual machine Let's create a virtual machine with Red Hat Linux, which will host our Intelligence Server: Go to http://www.redhat.com/ and create an account Go to the software download center: https://access.redhat.com/downloads Download RHEL: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.2/x86_64/product-software Choose Red Hat Enterprise Linux Server Download Red Hat Enterprise Linux 6.4 x86_64 Choose Binary DVD Now we can create a virtual machine with RHEL 6.4. We have several options in order to choose the software for deploying virtual machine. In our case, we will use a VMware workstation. Before starting to deploy a new VM, we should adjust the default settings, such as increasing RAM and HDD, and adding one more network card in order to connect the external environment with the MicroStrategyClient and sample database. In addition, we should create a new network. When the deployment of the RHEL virtual machine is complete, we should activate a subscription in order to install the required packages. Let us do this with one command in the terminal: # subscription-manager register --username <username> --password <password> --auto-attach Performing prerequisites for MicroStrategy 10 According to the installation and configuration guide, we should deploy all necessary packages. In order to install them, we should execute them under the root: # su # yum install compat-libstdc++-33.i686 # yum install libXp.x86_64 # yum install elfutils-devel.x86_64 # yum install libstdc++-4.4.7-3.el6.i686 # yum install krb5-libs.i686 # yum install nss-pam-ldapd.i686 # yum install ksh.x86_64 The project design process Project design is not just about creating a project in MicroStrategy architect; it involves several steps and thorough analysis, such as how data is stored in the data warehouse, what reports the user wants based on the data, and so on. The following are the steps involved in our project design process: Logical data model design Once the user have business requirements documented, the user must create a fact qualifier matrix to identify the attributes, facts, and hierarchies, which are the building blocks of any logical data model. An example of a fact qualifier is as follows: A logical data model is created based on the source systems and designed before defining a data warehouse. 
So, it's good for seeing which objects the users want and checking whether the objects are in the source systems. It represents the definition, characteristics, and relationships of the data. This graphical representation of information is easily understandable by business users too. A logical data model graphically represents the following concepts: Attributes: Provides a detailed description of the data Facts: Provide numerical information about the data Hierarchies: Provide relationships between data Data warehouse schema design Physical data warehouse design is based on the logical data model and represents the storage and retrieval of data from the data warehouse. Here, we determine the optimal schema design, which ensures reporting performance and maintenance. The key components of a physical data warehouse schema are columns and tables: Columns: These store attribute and fact data. The following are the three types of columns: ID column: Stores the ID for an attribute Description column: Stores text description of the attribute Fact column: Stores fact data Tables: Physical grouping of related data. Following are the types of tables: Lookup tables: Store information about attributes such as IDs and descriptions Relationship tables: Store information about relationship between two or more attributes Fact tables: Store factual data and the level of aggregation, which is defined based on the attributes of the fact table. They contain base fact columns or derived fact columns: Base fact: Stores the data at the lowest possible level of detail. Aggregate fact: Stores data at a higher or summarized level of detail. Mobile server installation and configuration While mobile client is easy to install, mobile server is not. Here we provide a step-by-step guide on how to install mobile server: Download MicroStrategyMobile.war. Mobile server is packed in a WAR file, just like Operation Manager or Web: Copy MicroStrategyMobile.war from <Microstrategy Installation folder>/Mobile/MobileServer to /usr/local/tomcat7/webapps. Then restart Tomcat, by issuing the ./shutdown.sh and ./startup.sh commands: Connect to the mobile server. Go to http://192.168.81.134:8080/MicroStrategyMobile/servlet/mstrWebAdmin. Then add the server name localhost.localdomain and click connect: Configure mobile server. You can configure (1) Authentication settings for the mobile server application; (2) Privileges and permissions; (3) SSL encryption; (4) Client authentication with a certificate server; (5) Destination folder for the photo uploader widget and signature capture input control. Performing Pareto analysis One good thing about data discovery tools is their agile approach to the data. We can connect any data source and easily slice and dice data. Let's try to use the Pareto principle in order to answer the question: How are sales distributed among the different products? The Pareto principle states that, for many events, roughly 80% of results come from 20% of the causes. For example, 80% of profits come from 20% of the products offered. This type of analysis is very popular in product analytics. In MicroStrategy Desktop, we can use shortcut metrics in order to quickly make complex calculations such as running sums or a percent of the total. Let's build a visualization in order to see the 20% of products that bring us 80% of the money: Choose Combo Chart. Drag and drop Salesamount to the vertical and Englishproductname to the horizontal. Add Orderdate to the filters and restrict to 60 days. 
Right-click on Sales amountand choose Descending Sort. Right-click on Salesamount | ShortcutMetrics | Percent Running Total. Drag and drop Metric Names to the Color By. Change the color of Salesamount and Percent Running Total. Change the shape of Percent Running Total. As a result, we get this chart: From this chart we can quickly understand our top 20% of products which bring us 80% of revenue. Splunk and MicroStrategy MicroStrategy 10 has announced a new connection to Splunk. I suppose that Splunk is not very popular in the world of Business Intelligence. Most people who have heard about Splunk think that it is just a platform for processing logs. The answers is both true and false. Splunk was derived from the world of spelunking, because searching for root causes in logs is a kind of spelunking without light, and Splunk solves this problem by indexing machine data from a tremendous number of data sources, starting from applications, hardware, sensors, and so on. What is Splunk Splunk's goal is making machine data accessible, usable, and valuable for everyone, and turning machine data into business value. It can: Collect data from anywhere Search and analyze everything Gain real-time Operational Intelligence In the BI world, everyone knows what a data warehouse is. Creating reports from Splunk Now we are ready to build reports using MicroStrategy Desktop and Splunk. Let's do it: Go to MicroStrategy Desktop, click add data, and choose Splunk Create a connection using the existing DNS based on Splunk ODBC: Choose one of tables (Splunk reports): Add other tables as new data sources. Now we can build a dashboard using data from Splunk by dragging and dropping attributes and metrics: Summary In this article we looked at MicroStrategy 10 and its features. We learned about its history and deployment. We also learnt about the project design process, the Pareto analysis and about the connection of Splunk and MicroStrategy. Resources for Article: Further resources on this subject: Stacked Denoising Autoencoders [article] Creating external tables in your Oracle 10g/11g Database [article] Clustering Methods [article]

SAP HANA integration with Microsoft Excel

Packt
03 Jan 2013
4 min read
(For more resources related to this topic, see here.)

Once your application is finished inside SAP HANA, and you can see that it performs as expected inside the Studio, you need to be able to deploy it to your users. Asking them to use the Studio is not really practical, and you don't necessarily want to put the modeling software in the hands of all your users.

Reporting on SAP HANA can be done in most of SAP's BusinessObjects suite of applications, or in tools which can create and consume MDX queries and data. The simplest of these tools to start with is probably Microsoft Excel. Excel can connect to SAP HANA using the MDX language (a kind of multidimensional SQL) in the form of pivot tables. These in turn allow users to "slice and dice" data as they require, to extract the metrics they need.

There are (at the time of writing) limitations to the integration of SAP HANA with external reporting tools. These limitations are due to the relative youth of the HANA product, and are being addressed with each successive update to the software. Those listed here are valid for SAP HANA SP04; they may or may not be valid for your version:

Hierarchies can only be visualized in Microsoft Excel, not in BusinessObjects.
Prompts can only be used in BusinessObjects BI4.
Views which use variables can be used in other tools, but only if the variable has a default value (if you don't have a default value on the variable, then Excel, notably, will complain that the view has been "changed on the server").

In order to make MDX connections to SAP HANA, the SAP HANA Client software is needed. This is separate from the Studio, and must be installed on the client workstation. Like the Studio itself, it can be found on the SAP HANA DVD set, or in the SWDC. Additionally, like the Studio, SAP provides a developer download of the client software on SDN, at the following link:

http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/webcontent/uuid/402aa158-6a7a-2f10-0195-f43595f6fe5f

Just download the appropriate version for your Microsoft Office installation. Even if your PC has a 64-bit installation of Windows, you most likely have a 32-bit installation of Office, and you'll need the 32-bit version of the SAP HANA Client software. If you're not sure, you can find the information in the Help | About dialog box. In Excel 2010, for example, click on the File tab, then the Help menu entry. The version is specified on the right of the page.

Just install the client software like you installed the Studio, usually to the default location. Once the software is installed, there is no shortcut created on your desktop, and no entry will be created in your "Start" menu, so don't be surprised to not see anything to run.

We're going to incorporate our sales simulator in Microsoft Excel, so launch Excel now:

1. Go to the Data tab, and click on From Other Sources, then From Data Connection Wizard, as shown.
2. Next, select Other/Advanced, then SAP HANA MDX provider, and then click Next.
3. The SAP HANA Logon dialog will appear, so enter your Host, Instance, and login information (the same information you use to connect to SAP HANA with the Studio). Click on Test Connection to validate the connection.
4. If the test succeeds, click on OK to choose the CUBE to which you want to connect. In Excel, all your Analytic and Calculation Views are considered to be cubes. Choose your Analytic or Calculation View and click Next.
5. On this screen there's a checkbox, Save password in file. This avoids having to type in the SAP HANA password every time the Excel file is opened, but the password is stored in the Excel file, which is a little less secure.
6. Click on the Finish button to create the connection to SAP HANA and your View.
7. On the next screen you'll be asked where you want to insert the pivot table; just click on OK to see the results.

Congratulations! You now have your reporting application available in Microsoft Excel, showing the same information you could see using the Data Preview feature of the SAP HANA Studio.

Resources for Article:
Further resources on SAP HANA Starter:
SAP NetWeaver: MDM Scenarios and Fundamentals [Article]
SAP BusinessObjects: Customizing the Dashboard [Article]
SQL Query Basics in SAP Business One [Article]

Creating Interactive Spreadsheets using Tables and Slicers

Packt
06 Oct 2015
10 min read
In this article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, we introduce additional material for the advanced Excel developer in the presentation stage of the data analysis life cycle. What you will learn in this article is how to leverage Excel's new features to add interactivity to your spreadsheets.

(For more resources related to this topic, see here.)

What are slicers?
Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps:

1. First, fire up Excel 2013 and create a new spreadsheet. Manually enter the data, as shown in the following figure.
2. Highlight cells A1 through E11, and press Ctrl + T to convert our data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet.
3. Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure. Click on the OK button to apply the filter to the table.
4. You will now be left with four rows that have the Type column equal to SUV.

Using a typical Excel filter, we were able to filter our data and only show the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price. How can we accomplish the same results using slicers? Continue reading this article to find out.

How to create slicers?
In this section, we will go through the simple but powerful steps that are required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering versus the new way of filtering data:

1. Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure.
2. With your Excel table selected, click on the TABLE TOOLS tab.
3. Click on the Insert Slicer button.
4. In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot.
5. You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet.
6. Click on the Sedan filter in the slicer that we built in the previous step.
7. Wow! The data is filtered and only the rows where the Type column is equal to Sedan are shown in the results.
8. Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter.

What if we want to filter the data by both Sport and Sedan? We can just highlight both the filters with our mouse, or click on Sedan, press Ctrl, and then click on the Sport filter. The end result will look like this:

To clear the filter, just click on the Clear Filter button. Do you see the advantage of slicers over filters? Yes, of course, they are simply better. Filtering between Sedan, Sport, or SUV is very easy and convenient. It will certainly take fewer keystrokes and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard.
Styling slicers
There are not many options to style slicers, but Excel does give you a decent number of color schemes that you can experiment with:

1. With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure.
2. Click on the various slicer styles available to get a feel for what Excel offers.

Adding multiple slicers
You are able to add multiple slicers and multiple charts in one spreadsheet. Why would we do this? Well, this is the beginning of a dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard:

1. Let's start by creating slicers for # of Passengers and MPG, as shown in the following figure.
2. Rename Sheet1 as Data, and create a new sheet called Dashboard, as shown here.
3. Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet.
4. Create a line chart using the columns Company and MPG, as shown in the following figure.
5. Create a bar chart using the columns Type and MPG.
6. Create another bar chart with the columns Company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away with calling them bar charts.
7. Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option.
8. In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button.
9. Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab.
10. Rearrange the charts and slicers so that they look as close as possible to the ones in the following figure. As you can see, this tab is starting to look like a dashboard.
11. The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option.
12. Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer.
13. Resize the Type slicer so that it looks like the one in the following figure.

Clearing filters
You can click on one or more filters in the dashboard that we just created. Very cool! Every time we select a filter, all three charts that we created get updated. This again is called adding interactivity to your spreadsheets. It allows the end users of your dashboard to interact with your data and perform their own analysis. If you notice, there is really no good way of removing multiple filters at once. For example, if you select Sedans that have an MPG greater than or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious for your end user, and you will want to avoid this at any cost. The next steps will show you how to create a button using VBA that will clear all of our filters in a flash:

1. Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code will basically find all of the filters that you have selected and then clear them for you one at a time:

Sub Clear_Slicer()
    ' Declare Variables
    Dim cache As SlicerCache
    ' Loop through each filter
    For Each cache In ActiveWorkbook.SlicerCaches
        ' clear filter
        cache.ClearManualFilter
    Next cache
End Sub

2. The next step is to bind this code to a button. Select the DEVELOPER tab, and click on the Insert button.
3. In the pop-up menu called Form Controls, select the Button option.
4. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, it makes sense to select this macro, and then click on the OK button.
5. Change the text of the button to Clear All Filters and resize it so that it looks like this:
6. Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option. Here, you can change the font size and the color of your button label.
7. Now, select a bunch of filters, and click on our new shiny button.

Yes, that was pretty cool. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis. What do I mean by starting a brand new analysis? In general, when a user initially starts using your dashboard, he/she will click on the filters aimlessly. The users do this just to figure out how to mechanically use the dashboard. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how to clear every slicer one at a time to start over. The worst-case scenario is when the user does not realize when the filters are turned on and when they are turned off. Now, do not laugh at this situation, or assume that your end user is not as smart as you are. This just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what separate you from the average Excel developer.

Summary
The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, sometimes a better-looking analysis will trump a better but less attractive one. Excel gives you the tools to not be on the short end of the stick and to always present a visually stunning analysis. You now have your Excel slicers, and you learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination. In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes to your dashboard will make it stand out and help you get your message across. Excel, as always, is an invaluable tool that gives you everything necessary to overcome any data challenges you might come across. As I tell all my students, the key to becoming better is simply to practice, practice, and practice.

Resources for Article:
Further resources on this subject:
Introduction to Stata and Data Analytics [article]
Getting Started with Apache Spark DataFrames [article]
Big Data Analysis (R and Hadoop) [article]

Creating Line Graphs in R

Packt
17 Jan 2011
7 min read
Adding customized legends for multiple line graphs
Line graphs with more than one line, representing more than one variable, are quite common in any kind of data analysis. In this recipe we will learn how to create and customize legends for such graphs.

Getting ready
We will use the base graphics library for this recipe, so all you need to do is run the recipe at the R prompt. It is good practice to save your code as a script to use again later.

How to do it...
First we need to load the cityrain.csv example data file, which contains monthly rainfall data for four major cities across the world. You can download this file from here. We will use the cityrain.csv example dataset.

rain<-read.csv("cityrain.csv")
plot(rain$Tokyo,type="b",lwd=2,
xaxt="n",ylim=c(0,300),col="black",
xlab="Month",ylab="Rainfall (mm)",
main="Monthly Rainfall in major cities")
axis(1,at=1:length(rain$Month),labels=rain$Month)
lines(rain$Berlin,col="red",type="b",lwd=2)
lines(rain$NewYork,col="orange",type="b",lwd=2)
lines(rain$London,col="purple",type="b",lwd=2)
legend("topright",legend=c("Tokyo","Berlin","New York","London"),
lty=1,lwd=2,pch=21,col=c("black","red","orange","purple"),
ncol=2,bty="n",cex=0.8,
text.col=c("black","red","orange","purple"),
inset=0.01)

How it works...
We used the legend() function. It is quite a flexible function and allows us to adjust the placement and styling of the legend in many ways. The first argument we passed to legend() specifies the position of the legend within the plot region. We used "topright"; other possible values are "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "right", and "center". We can also specify the location of the legend with x and y co-ordinates, as we will soon see.

The other important arguments specific to lines are lwd and lty, which specify the line width and type drawn in the legend box respectively. It is important to keep these the same as the corresponding values in the plot() and lines() commands. We also set pch to 21 to replicate the type="b" argument in the plot() command. cex and text.col set the size and colors of the legend text. Note that we set the text colors to the same colors as the lines they represent. Setting bty (box type) to "n" ensures no box is drawn around the legend. This is good practice as it keeps the look of the graph clean. ncol sets the number of columns over which the legend labels are spread, and inset sets the inset distance from the margins as a fraction of the plot region.

There's more...
Let's experiment by changing some of the arguments discussed:

legend(1,300,legend=c("Tokyo","Berlin","New York","London"),
lty=1,lwd=2,pch=21,col=c("black","red","orange","purple"),
horiz=TRUE,bty="n",bg="yellow",cex=1,
text.col=c("black","red","orange","purple"))

This time we used x and y co-ordinates instead of a keyword to position the legend. We also set the horiz argument to TRUE. As the name suggests, horiz makes the legend labels horizontal instead of the default vertical. Specifying horiz overrides the ncol argument. Finally, we made the legend text bigger by setting cex to 1 and did not use the inset argument.

An alternative way of creating the previous plot without having to call plot() and lines() multiple times is to use the matplot() function. To see details on how to use this function, please see the help file by running ?matplot or help(matplot) at the R prompt.
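For reference, here is a minimal sketch of that matplot() alternative. It is not part of the original recipe; it assumes the same cityrain.csv file with the columns Month, Tokyo, Berlin, NewYork, and London used above.

rain<-read.csv("cityrain.csv")
# Plot all four city columns at once; matplot() recycles col, lty, and pch across columns
matplot(rain[,c("Tokyo","Berlin","NewYork","London")],
type="b",lwd=2,lty=1,pch=21,
col=c("black","red","orange","purple"),
xaxt="n",ylim=c(0,300),
xlab="Month",ylab="Rainfall (mm)",
main="Monthly Rainfall in major cities")
axis(1,at=1:length(rain$Month),labels=rain$Month)
legend("topright",legend=c("Tokyo","Berlin","New York","London"),
lty=1,lwd=2,pch=21,col=c("black","red","orange","purple"),
ncol=2,bty="n",cex=0.8,inset=0.01)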
Using margin labels instead of legends for multiple line graphs
While legends are the most commonly used method of providing a key to read multiple variable graphs, they are often not the easiest to read. Labelling lines directly is one way of getting around that problem.

Getting ready
We will use the base graphics library for this recipe, so all you need to do is run the recipe at the R prompt. It is good practice to save your code as a script to use again later.

How to do it...
Let's use the gdp.txt example dataset to look at the trends in the annual GDP of five countries:

gdp<-read.table("gdp_long.txt",header=T)
library(RColorBrewer)
pal<-brewer.pal(5,"Set1")
par(mar=par()$mar+c(0,0,0,2),bty="l")
plot(Canada~Year,data=gdp,type="l",lwd=2,lty=1,ylim=c(30,60),
col=pal[1],main="Percentage change in GDP",ylab="")
mtext(side=4,at=gdp$Canada[length(gdp$Canada)],text="Canada",
col=pal[1],line=0.3,las=2)
lines(gdp$France~gdp$Year,col=pal[2],lwd=2)
mtext(side=4,at=gdp$France[length(gdp$France)],text="France",
col=pal[2],line=0.3,las=2)
lines(gdp$Germany~gdp$Year,col=pal[3],lwd=2)
mtext(side=4,at=gdp$Germany[length(gdp$Germany)],text="Germany",
col=pal[3],line=0.3,las=2)
lines(gdp$Britain~gdp$Year,col=pal[4],lwd=2)
mtext(side=4,at=gdp$Britain[length(gdp$Britain)],text="Britain",
col=pal[4],line=0.3,las=2)
lines(gdp$USA~gdp$Year,col=pal[5],lwd=2)
mtext(side=4,at=gdp$USA[length(gdp$USA)]-2,
text="USA",col=pal[5],line=0.3,las=2)

How it works...
We first read the gdp.txt data file using the read.table() function. Next we loaded the RColorBrewer color palette library and set our color palette pal to "Set1" (with five colors).

Before drawing the graph, we used the par() command to add extra space to the right margin, so that we have enough space for the labels. Depending on the size of the text labels you may have to experiment with this margin until you get it right. Finally, we set the box type (bty) to an L-shape ("l") so that there is no line on the right margin. We can also set it to "c" if we want to keep the top line.

We used the mtext() function to label each of the lines individually in the right margin. The first argument we passed to the function is the side where we want the label to be placed. Sides (margins) are numbered starting from 1 for the bottom side and going round in a clockwise direction, so that 2 is left, 3 is top, and 4 is right. The at argument was used to specify the Y co-ordinate of the label. This is a bit tricky because we have to make sure we place the label as close to the corresponding line as possible. So, here we have used the last value of each line. For example, gdp$France[length(gdp$France)] picks the last value in the France vector by using its length as the index. Note that we had to adjust the value for USA by subtracting 2 from its last value so that it doesn't overlap the label for Canada.

We used the text argument to set the text of the labels as country names. We set the col argument to the appropriate element of the pal vector by using a number index. The line argument sets an offset in terms of margin lines, starting at 0 counting outwards. Finally, setting las to 2 rotates the labels to be perpendicular to the axis, instead of the default value of 1 which makes them parallel to the axis.

Sometimes, simply using the last value of a set of values may not work because the value may be missing. In that case we can use the second last value, or visually choose a value that places the label closest to the line.
Also, the size of the plot window and the proximity of the final values may cause overlapping of labels. So, we may need to iterate a few times before we get the placement right. We can write functions to automate this process but it is still good to visually inspect the outcome.
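As a small illustration of such a function (our own sketch, not part of the original recipe), a helper like the one below returns the last non-missing value of a series so that it can be passed to the at argument of mtext():

# Return the last non-missing value of a numeric vector,
# which is a reasonable default position for a margin label
last_value <- function(x) {
  x <- x[!is.na(x)]
  x[length(x)]
}
# Hypothetical usage with the gdp data frame from this recipe:
# mtext(side=4,at=last_value(gdp$France),text="France",
#       col=pal[2],line=0.3,las=2)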

Working with Apps in Splunk

Packt
08 Mar 2013
6 min read
(For more resources related to this topic, see here.)

Defining an app
In the strictest sense, an app is a directory of configurations and, sometimes, code. The directories and files inside have a particular naming convention and structure. All configurations are in plain text, and can be edited using your choice of text editor. Apps generally serve one or more of the following purposes:

A container for searches, dashboards, and related configurations: This is what most users will do with apps. This is not only useful for logical grouping, but also for limiting what configurations are applied and at what time. This kind of app usually does not affect other apps.
Providing extra functionality: Many objects can be provided in an app for use by other apps. These include field extractions, lookups, external commands, saved searches, workflow actions, and even dashboards. These apps often have no user interface at all; instead they add functionality to other apps.
Configuring a Splunk installation for a specific purpose: In a distributed deployment, there are several different purposes that are served by the multiple installations of Splunk. The behavior of each installation is controlled by its configuration, and it is convenient to wrap those configurations into one or more apps. These apps completely change the behavior of a particular installation.

Included apps
Without apps, Splunk has no user interface, rendering it essentially useless. Luckily, Splunk comes with a few apps to get us started. Let's look at a few of these apps:

gettingstarted: This app provides the help screens that you can access from the launcher. There are no searches, only a single dashboard that simply includes an HTML page.
search: This is the app where users spend most of their time. It contains the main search dashboard that can be used from any app, external search commands that can be used from any app, admin dashboards, custom navigation, custom css, a custom app icon, a custom app logo, and many other useful elements.
splunk_datapreview: This app provides the data preview functionality in the admin interface. It is built entirely using JavaScript and custom REST endpoints.
SplunkDeploymentMonitor: This app provides searches and dashboards to help you keep track of your data usage and the health of your Splunk deployment. It also defines indexes, saved searches, and summary indexes. It is a good source for more advanced search examples.
SplunkForwarder and SplunkLightForwarder: These apps, which are disabled by default, simply disable portions of a Splunk installation so that the installation is lighter in weight.

If you never create or install another app, and instead simply create saved searches and dashboards in the app search, you can still be quite successful with Splunk. Installing and creating more apps, however, allows you to take advantage of others' work, organize your own work, and ultimately share your work with others.

Installing apps
Apps can either be installed from Splunkbase or uploaded through the admin interface. To get started, let's navigate to Manager | Apps, or choose Manage apps... from the App menu as shown in the following screenshot:

Installing apps from Splunkbase
If your Splunk server has direct access to the Internet, you can install apps from Splunkbase with just a few clicks. Navigate to Manager | Apps and click on Find more apps online. The most popular apps will be listed as follows:

Let's install a pair of apps and have a little fun.
First, install Geo Location Lookup Script (powered by MAXMIND) by clicking on the Install free button. You will be prompted for your splunk.com login. This is the same login that you created when you downloaded Splunk. If you don't have an account, you will need to create one.

Next, install the Google Maps app. This app was built by a Splunk customer and contributed back to the Splunk community. This app will prompt you to restart Splunk.

Once you have restarted and logged back in, check the App menu. Google Maps is now visible, but where is Geo Location Lookup Script? Remember that not all apps have dashboards; nor do they necessarily have any visible components at all.

Using Geo Location Lookup Script
Geo Location Lookup Script provides a lookup script that returns geolocation information for IP addresses. Looking at the documentation, we see this example:

eventtype=firewall_event | lookup geoip clientip as src_ip

You can find the documentation for any Splunkbase app by searching for it at splunkbase.com, or by clicking on Read more next to any installed app by navigating to Manager | Apps | Browse more apps.

Let's read through the arguments of the lookup command:

geoip: This is the name of the lookup provided by Geo Location Lookup Script. You can see the available lookups by going to Manager | Lookups | Lookup definitions.
clientip: This is the name of the field in the lookup that we are matching against.
as src_ip: This says to use the value of src_ip to populate the field before it; in this case, clientip. I personally find this wording confusing. In my mind, I read this as "using" instead of "as".

Included in the ImplementingSplunkDataGenerator app (available at http://packtpub.com/) is a sourcetype instance named impl_splunk_ips, which looks like this:

2012-05-26T18:23:44 ip=64.134.155.137

The IP addresses in this fictitious log are from one of my websites. Let's see some information about these addresses:

sourcetype="impl_splunk_ips" | lookup geoip clientip AS ip | top client_country

This gives us a table similar to the one shown in the following screenshot:

That's interesting. I wonder who is visiting my site from Slovenia!

Using Google Maps
Now let's do a similar search in the Google Maps app. Choose Google Maps from the App menu. The interface looks like the standard search interface, but with a map instead of an event listing. Let's try this remarkably similar (but not identical) query using a lookup provided in the Google Maps app:

sourcetype="impl_splunk_ips" | lookup geo ip

The map generated looks like this:

Unsurprisingly, most of the traffic to this little site came from my house in Austin, Texas.

Installing apps from a file
It is not uncommon for Splunk servers to not have access to the Internet, particularly in a datacenter. In this case, follow these steps:

1. Download the app from splunkbase.com. The file will have a .spl or .tgz extension.
2. Navigate to Manager | Apps.
3. Click on Install app from file.
4. Upload the downloaded file using the form provided.
5. Restart if the app requires it.
6. Configure the app if required.

That's it. Some apps have a configuration form. If this is the case, you will see a Set up link next to the app when you go to Manager | Apps. If something goes wrong, contact the author of the app.

If you have a distributed environment, in most cases the app only needs to be installed on your search head. The components that your indexers need will be distributed automatically by the search head. Check the documentation for the app.