
How-To Tutorials - Data

1229 Articles
Packt
22 Sep 2015
5 min read

Getting Started with Apache Spark DataFrames

In this article, based on Arun Manivannan’s book Scala Data Analysis Cookbook, we will cover the following recipes:

  • Getting Apache Spark ML – a framework for large-scale machine learning
  • Creating a data frame from CSV

Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Pulling Spark ML into our project is just a matter of adding a few dependencies to our build.sbt file: spark-core, spark-sql, and spark-mllib.

Under a brand new folder (which will be our project root), we create a new file called build.sbt.

Next, let's add the Spark libraries to the project dependencies:

    organization := "com.packt"

    name := "chapter1-spark-csv"

    scalaVersion := "2.10.4"

    val sparkVersion = "1.3.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion
    )

    resolvers ++= Seq(
      "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
      "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
    )

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on.

For this recipe, we skipped Streaming and GraphX and added the remaining three libraries.
There’s more…

Apache Spark is a cluster computing platform that claims to run about 100 times faster than Hadoop (that's a mouthful). In our terms, we can think of it as a means to run our complex logic over a massive amount of data at blazingly high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but it also takes less time to write them in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a Delimiter Separated Values (DSV) file.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks. So, let's go ahead and add it to build.sbt. After adding the spark-csv dependency, our complete build.sbt looks as follows:

    organization := "com.packt"

    name := "chapter1-spark-csv"

    scalaVersion := "2.10.4"

    val sparkVersion = "1.3.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "com.databricks" %% "spark-csv" % "1.0.3"
    )

    resolvers ++= Seq(
      "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
      "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
    )

    fork := true

Before we create the actual data frame, there are three steps that we ought to do: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds all of the information for running this Spark cluster. For this recipe, we are running locally, and we intend to use only two cores in the machine (local[2]):

    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
    // The Spark context and the SQL context are built from the configuration
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

For this recipe, we'll be running Spark in standalone mode.

Now let's load our pipe-separated file:

    import com.databricks.spark.csv._  // adds csvFile to the SQL context

    val students: org.apache.spark.sql.DataFrame =
      sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed.

Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and options as a map. We can specify the same parameters that we specified earlier using Map, like this:

    val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")
    val newStudents = sqlContext.load("com.databricks.spark.csv", options)

Summary

In this article, you learned about Apache Spark ML, a framework for large-scale machine learning. Then we saw how to create a data frame from CSV with the help of example code.
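To make the effect of the useHeader and delimiter flags concrete, here is a minimal sketch in plain Python that uses the standard csv module rather than spark-csv. The sample rows, the read_delimited helper, and the C0, C1, ... placeholder column names are assumptions made for illustration; none of them come from the recipe's code.

```python
import csv
import io

# Hypothetical pipe-separated sample standing in for StudentData.csv.
SAMPLE = "id|name|phone\n1|Burke|1-300-746-8446\n2|Kamal|1-668-571-5046\n"


def read_delimited(text, use_header=True, delimiter=","):
    """Return (column_names, rows) parsed from delimited text.

    With use_header=True the first row supplies the column names;
    otherwise placeholder names C0, C1, ... are generated and the
    first row is kept as data.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if use_header:
        return rows[0], rows[1:]
    names = ["C{}".format(i) for i in range(len(rows[0]))]
    return names, rows


columns, students = read_delimited(SAMPLE, use_header=True, delimiter="|")
print(columns)      # column names taken from the first row
print(students[0])  # first data row
```

Calling the helper with use_header=False keeps the first line as an ordinary data row, which is why the flag matters whenever a file carries its own column names.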

Packt
08 May 2015
3 min read

Learning D3.js Mapping

What is D3.js?

D3.js (Data-Driven Documents) is a JavaScript library for data visualization. It is used to display graphical representations of information in a browser. Because they are, essentially, a collection of JavaScript code, graphical elements produced by D3.js can react to changes made on the client or server side. D3.js has seen major adoption as websites include more dynamic charts, graphs, infographics, and other forms of visualized data.

Why this book?

This book, by Thomas Newton and Oscar Villarreal, explores the D3.js JavaScript library and its ability to help us create maps and amazing visualizations. You will no longer be confined to third-party tools in order to get a nice-looking map. With D3.js, you can build your own maps and customize them as you please. This book goes from the basics of SVG and JavaScript to data trimming and modification with TopoJSON. Using D3.js to glue together these three key ingredients, we will create very attractive maps that cover many common use cases for maps, such as choropleths, data overlays on maps, and interactivity.

Key features

  • Dive into D3.js and apply its powerful data binding ability in order to create stunning visualizations
  • Learn the key concepts of SVG, JavaScript, CSS, and the DOM in order to project images onto the browser
  • Solve a wide range of problems faced while building interactive maps with this solution-based guide

Authors

Thomas Newton has 20 years of experience in the technical industry, working on everything from low-level system designs and data visualization to software design and architecture. Currently, he is creating data visualizations to solve analytical problems for clients.

Oscar Villarreal has been developing interfaces for the past 10 years, and most recently, he has been focusing on data visualization and web applications. In his spare time, he loves to write on his blog, oscarvillarreal.com.

In short

You will find a few books on D3.js, but they are aimed at intermediate-level developers who already know how to use D3.js, and a few of them cover a much wider range of D3 usage. This book is focused exclusively on mapping; it fully explores this core task and takes a solution-based approach. The recommendations and the wealth of knowledge that the authors share in the book are based on many years of experience and many projects delivered using D3.

What this book covers

  • All the tools you need to create a map using D3
  • A high-level overview of Scalable Vector Graphics (SVG) presentation, with an explanation of how it operates and what elements it encompasses
  • Exploring D3.js: producing graphics from data
  • A step-by-step guide to building a map with D3
  • Getting started with interactivity in your D3 map visualizations
  • The most important aspects of map visualization in detail via the use of TopoJSON
  • Assistance for the long-term maintenance of your D3 code base and different techniques to keep it healthy over the lifespan of your project

Summary

By now, you should have an idea of what is covered in the book. This book is carefully designed to allow the reader to jump between chapters based on what they are planning to get out of it. Every chapter is full of pragmatic examples that can easily provide the foundation for more complex work. The authors explain, step by step, how each example works.

Packt
12 Aug 2015
6 min read

Predicting Sports Winners with Decision Trees and pandas

In this article by Robert Craig Layton, author of Learning Data Mining with Python, we will look at predicting the winner of games of the National Basketball Association (NBA) using a different type of classification algorithm: decision trees.

Collecting the data

The data we will be using is the match history data for the NBA for the 2013-2014 season. The Basketball-Reference.com website contains a significant number of resources and statistics collected from the NBA and other leagues. Perform the following steps to download the dataset:

  1. Navigate to http://www.basketball-reference.com/leagues/NBA_2014_games.html in your web browser.
  2. Click on the Export button next to the Regular Season heading.
  3. Download the file to your data folder (and make a note of the path).

This will download a CSV file containing the results of 1,230 games in the regular season of the NBA.

We will load the file with the pandas library, which is an incredibly useful library for manipulating data. Python also contains a built-in library called csv that supports reading and writing CSV files. We will use pandas instead as it provides more powerful functions to work with datasets.

For this article, you will need to install pandas. The easiest way to do that is to use pip3, which you may previously have used to install scikit-learn:

    $ pip3 install pandas

Using pandas to load the dataset

We can load the dataset using the read_csv function in pandas as follows:

    import pandas as pd
    dataset = pd.read_csv(data_filename)

The result of this is a data frame, a data structure used by pandas.
The pandas.read_csv function has parameters to fix some of the problems in the data, such as missing headings, which we can specify when loading the file:

    dataset = pd.read_csv(data_filename, parse_dates=["Date"], skiprows=[0,])
    dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts",
                       "Home Team", "HomePts", "OT?", "Notes"]

We can now view a sample of the data frame:

    dataset.loc[:5]

Extracting new features

We extract our classes as 1 for a home win and 0 for a visitor win. The following code computes this and extracts the values into a NumPy array:

    dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
    y_true = dataset["HomeWin"].values

The first two new features we want to create indicate whether each of the two teams won their previous game. This roughly approximates which team is currently playing well. We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether the team won the last time:

    from collections import defaultdict
    won_last = defaultdict(int)

We can then iterate over all the rows and update the current row with the team's last result (win or loss):

    # Create the new columns first so that each row can be written back
    dataset["HomeLastWin"] = False
    dataset["VisitorLastWin"] = False

    for index, row in dataset.iterrows():
        home_team = row["Home Team"]
        visitor_team = row["Visitor Team"]
        row["HomeLastWin"] = won_last[home_team]
        row["VisitorLastWin"] = won_last[visitor_team]
        dataset.loc[index] = row

We then set our dictionary with each team's result (from this row) for the next time we see these teams. These two lines go at the end of the loop body:

        won_last[home_team] = row["HomeWin"]
        won_last[visitor_team] = not row["HomeWin"]

Decision trees

Decision trees are a class of classification algorithm that resemble a flow chart: they consist of a sequence of nodes, where the values of a sample are used to decide which node to go to next.
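The flow-chart description of a decision tree can be sketched in a few lines of plain Python. The tree below is built by hand purely for illustration: the Node and Leaf types, the predict function, and the chosen splits are assumptions, not what DecisionTreeClassifier would learn from the NBA data.

```python
from collections import namedtuple

# A decision node tests one feature; a leaf holds the predicted class.
Node = namedtuple("Node", ["feature", "if_false", "if_true"])
Leaf = namedtuple("Leaf", ["prediction"])

# Hand-built (hypothetical) tree over the two "won last game" features.
tree = Node(
    feature="HomeLastWin",
    if_true=Leaf(prediction=1),       # home team won last time: predict home win
    if_false=Node(
        feature="VisitorLastWin",
        if_true=Leaf(prediction=0),   # only the visitors won last time
        if_false=Leaf(prediction=1),  # neither won: fall back to the home team
    ),
)


def predict(node, sample):
    """Walk from the root to a leaf, choosing a branch at each node."""
    while isinstance(node, Node):
        node = node.if_true if sample[node.feature] else node.if_false
    return node.prediction


print(predict(tree, {"HomeLastWin": False, "VisitorLastWin": True}))
```

A learned tree has the same shape; the difference is that the training algorithm picks the features and split points from the data instead of having them written out by hand.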
We can use the DecisionTreeClassifier class to create a decision tree:

    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier(random_state=14)

We now need to extract the dataset from our pandas data frame in order to use it with our scikit-learn classifier. We do this by specifying the columns we wish to use and using the values attribute of a view of the data frame:

    X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values

Decision trees are estimators, and therefore, they have fit and predict methods. We can also use the cross_val_score function as before to get the average score:

    import numpy as np
    # In scikit-learn releases before 0.18 this lived in sklearn.cross_validation
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
    print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This scores 56.1%: we are better than choosing randomly!

Predicting sports outcomes

With cross_val_score, we have a method for testing how accurate our models are, which allows us to try new features. For the first feature, we will create a feature that tells us whether the home team is generally better than the visitors by seeing whether they ranked higher in the previous season. To obtain the data, perform the following steps:

  1. Head to http://www.basketball-reference.com/leagues/NBA_2013_standings.html
  2. Scroll down to Expanded Standings. This gives us a single list for the entire league.
  3. Click on the Export link to the right of this heading.
  4. Save the download in your data folder.
  5. In your IPython Notebook, enter the following into a new cell. You'll need to ensure that the file was saved into the location pointed to by the data_folder variable:

    import os
    standings_filename = os.path.join(data_folder, "leagues_NBA_2013_standings_expanded-standings.csv")
    standings = pd.read_csv(standings_filename, skiprows=[0,1])

We then iterate over the rows and compare the team's standings:

    dataset["HomeTeamRanksHigher"] = 0
    for index, row in dataset.iterrows():
        home_team = row["Home Team"]
        visitor_team = row["Visitor Team"]

Between 2013 and 2014, a team was renamed, which we handle inside the loop as follows:

        if home_team == "New Orleans Pelicans":
            home_team = "New Orleans Hornets"
        elif visitor_team == "New Orleans Pelicans":
            visitor_team = "New Orleans Hornets"

Now, we can get the rankings for each team. We then compare them and update the feature in the row:

        home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
        visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
        # Rk 1 is the top-ranked team, so a smaller Rk means a higher ranking
        row["HomeTeamRanksHigher"] = int(home_rank < visitor_rank)
        dataset.loc[index] = row

Next, we use the cross_val_score function to test the result. First, we extract the dataset as before:

    X_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values

Then, we create a new DecisionTreeClassifier and run the evaluation:

    clf = DecisionTreeClassifier(random_state=14)
    scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
    print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This now scores 60.3%, even better than our previous result.
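The ranking feature can also be checked in isolation with a small pure-Python sketch. The standings and games below are made-up placeholder values rather than the real 2013 table, and the RENAMED mapping and home_team_ranks_higher helper are hypothetical names. The sketch assumes that rank 1 is the best team, so a smaller rank number means a higher ranking.

```python
# Placeholder previous-season ranks (1 = best); not the real 2013 standings.
standings = {"Team A": 1, "Team B": 2, "New Orleans Hornets": 3}

# Teams renamed between seasons map to the name used in the older standings.
RENAMED = {"New Orleans Pelicans": "New Orleans Hornets"}

games = [
    {"Home Team": "Team A", "Visitor Team": "Team B"},
    {"Home Team": "New Orleans Pelicans", "Visitor Team": "Team A"},
]


def home_team_ranks_higher(game, standings):
    """Return 1 if the home team finished above the visitors last season."""
    home = RENAMED.get(game["Home Team"], game["Home Team"])
    visitor = RENAMED.get(game["Visitor Team"], game["Visitor Team"])
    return int(standings[home] < standings[visitor])


features = [home_team_ranks_higher(game, standings) for game in games]
print(features)
```

Keeping the rename in a lookup table, rather than in an if/elif chain, means further renamed teams can be added without touching the comparison logic.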

Sugandha Lahoti
05 Jan 2018
5 min read

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market in our second resolution. Now it’s time to think about the kind of person you want to be and the legacy you will leave behind.

3rd Resolution: Choose projects wisely and be mindful of their impact.

Your work has real consequences. And your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes, and assets clearly. The next most important factor is choosing the project team.

1. Seek out, learn from and work with a diverse group of people

To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, but it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals such as web developers, software programmers, data analysts, data administrators, and game developers. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective.

Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project’s success. These include UX designers; people with a humanities background, if you are building a product intended to participate in society (which most products are); business development folks, who actually sell your product and bring in revenue; and marketing people, who are responsible for bringing your product to a much wider audience, to name a few. Working with people of diverse skill sets will help market your product right and make it useful and interpretable to the target audience.

In addition to working with a mélange of people with diverse skill sets and educational backgrounds, it is also important to work with people who think differently from you and whose experiences differ from yours. This gives you a more holistic idea of the problems your project is trying to tackle and helps you arrive at a richer and more distinctive set of solutions.

2. Educate yourself on ethics for data science

As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making. Here are some ways to develop a mind inclined to designing ethically sound data science projects and models.

  • Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford’s talk on The trouble with bias, Tricia Wang on The human insights missing from big data, and Ethics & Data Science by Jeff Hammerbacher.
  • Follow top influencers on social media and catch up with their blogs and their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum, among others.
  • Take up courses that will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic philosophy from the university of your choice.
  • Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop. We recommend browsing through The Stanford Encyclopedia of Philosophy, an online archive of peer-reviewed original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics by Peter Singer and The Elements of Moral Philosophy by James Rachels.
  • Attend or follow upcoming conferences on bringing transparency to socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled for February 23 and 24, 2018 at New York University, NYC. We also have the 5th annual conference of FAT/ML later in the year.

3. Question/Reassess your hypotheses before, during and after actual implementation

Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Ask yourself the following questions at each of these stages and compare your answers with the previous ones.

  • What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact?
  • What data are you using? Is the data type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them?
  • What techniques are you going to try? What algorithms are you going to implement? What would be their complexity? Are they interpretable and transparent?
  • How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible?

These pointers will help you evaluate your project goals from a customer and business point of view. They will also help you build efficient models that can benefit society and your organization at large.

With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job. All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018.

“Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun

Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!

Packt
20 Dec 2013
12 min read

SAP HANA Architecture

Understanding the SAP HANA architecture

Architecture is the key to SAP HANA being a game-changing, innovative technology. SAP HANA's architecture has been designed so well that it makes a significant difference when compared to other traditional databases available today. This section explains the various components of SAP HANA and their functionalities.

Getting ready

Enterprise application requirements have become more demanding: complex reports with heavy computation on huge volumes of transaction data, as well as business data in other formats (both structured and semi-structured). Data is being written or updated, and also read from the database, in parallel. Thus, both transactional and analytical data need to be integrated into a single database, and this is the requirement that SAP HANA has evolved to meet. Columnar storage exploits modern hardware and technology (multiple CPU cores, large main memory, and caches) to meet the requirements of enterprise applications. Apart from this, the database should also support procedural logic for tasks that cannot be completed with simple SQL.

How it works…

The SAP HANA database consists of several services (servers). The index server is the most important of them. The other servers are the name server, preprocessor server, statistics server, and XS Engine:

  • Index server: This server holds the actual data and the engines for processing the data. When SQL or MDX statements are fired against the SAP HANA system within authenticated sessions and transactions, the index server takes care of these commands and processes them.
  • Name server: This server holds complete information about the system landscape and is responsible for the topology of the SAP HANA system. In a distributed system, SAP HANA instances run on multiple hosts. In this kind of setup, the name server knows where the components are running and how data is spread across the different servers.
  • Preprocessor server: This server comes into the picture during text data analysis. The index server uses the capabilities of the preprocessor server for text data analysis and searching. This helps to extract the information on which text search capabilities are based.
  • Statistics server: This server collects the data for the system monitor and helps you know the health of the SAP HANA system. It is responsible for collecting data related to the status, resource allocation/consumption, and performance of the SAP HANA system. Client monitoring and the status of the various alert monitors rely on the data collected by the statistics server. This server also provides a history of measurement data for further analysis.
  • XS Engine: The XS Engine allows external applications and application developers to access the SAP HANA system via the XS Engine clients; for example, a web browser accesses SAP HANA apps built by application developers via HTTP. Application developers build applications by using the XS Engine, and users access the app via HTTP by using a web browser. The persistent model in the SAP HANA database is converted into a consumption model for clients to access it via HTTP. This allows an organization to host system services that are a part of the SAP HANA database (for example, Search service, a built-in web server that provides access to static content in the repository).

The following diagram shows the architecture of SAP HANA:

There's more...

Let us continue learning about the different components:

  • SAP Host Agent: According to the new approach from SAP, the SAP Host Agent should be installed on all machines that are related to the SAP landscape. It is used by Adaptive Computing Controller (ACC) to manage the system and by Software Update Manager (SUM) for automatic updates.
  • LM-structure: LM-structure for SAP HANA contains information about the current installation details. This information is used by SUM during automatic updates.
  • SAP Solution Manager diagnostic agent: This agent provides all the data that SAP Solution Manager (SAP SOLMAN) needs to monitor the SAP HANA system. After SAP SOLMAN is integrated with the SAP HANA system, this agent provides information about the database at a glance, which includes the database state and general information about the system, such as alerts, CPU, memory, and disk usage.
  • SAP HANA Studio repository: This helps end users update SAP HANA Studio to higher versions. The SAP HANA Studio repository is the code that performs this process.
  • Software Update Manager for SAP HANA: This helps with automatic updates of SAP HANA from the SAP Marketplace and with patching the SAP Host Agent. It also allows distribution of the Studio repository to end users.

See also

  • http://help.sap.com/hana/SAP_HANA_Installation_Guide_en.pdf
  • SAP Notes 1793303 and 1514967

Explaining IMCE and its components

We have seen the architecture of SAP HANA and its components. In this section, we will learn about IMCE (in-memory computing engine), its components, and their functionalities.

Getting ready

The SAP in-memory computing engine (formerly Business Analytic Engine (BAE)) is the core engine for SAP's next-generation, high-performance, in-memory solutions. It leverages technologies such as in-memory computing, columnar databases, massively parallel processing (MPP), and data compression to allow organizations to instantly explore and analyze large volumes of transactional and analytical data from across the enterprise in real time.

How it works...

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions.
The SAP in-memory computing database delivers the following capabilities:

  • In-memory computing functionality with native support for row and columnar datastores, providing full ACID (atomicity, consistency, isolation, and durability) transactional capabilities
  • Integrated lifecycle management capabilities and data integration capabilities to access SAP and non-SAP data sources
  • SAP IMCE Studio, which includes tools for data modeling, data and life cycle management, and data security

The SAP IMCE that resides at the heart of SAP HANA is an integrated database and calculation layer that allows the processing of massive quantities of real-time data in the main memory to provide immediate results from analysis and transactions. Like any standard database, the SAP IMCE supports industry standards such as SQL and MDX, but it also incorporates a high-performance calculation engine that embeds procedural language support directly into the database kernel. This approach is designed to remove the need to read data from the database, process it, and then write data back to the database; that is, the data is processed near the database and only the results are returned.

The IMCE is an in-memory, column-oriented database technology and a powerful calculation engine. As data resides in Random Access Memory (RAM), highly accelerated performance can be achieved compared to systems that read data from disks. The heart lies within the IMCE, which allows us to create and perform calculations on data.

The following diagram shows the components of IMCE alone:

There's more…

The SAP HANA database has the following two engines:

  • Column-based store: This engine stores huge amounts of relational data in column-optimized tables, which are aggregated and used in analytical operations.
  • Row-based store: This engine stores relational data in rows, similar to the storage mechanism of traditional database systems. The row store is more optimized for write operations and has a lower compression rate. Also, its query performance is lower when compared to the column-based store.

The engine that is used to store data can be selected on a per-table basis at the time of creating a table. Tables in the row-based store are loaded at startup time. In the case of the column-based store, tables can be loaded either at startup or on demand, that is, during normal operation of the SAP HANA database.

Both engines share a common persistence layer, which provides data persistency that is consistent across both engines. As in a traditional database, we have page management and logging in SAP HANA. The changes made to the in-memory database pages are persisted through savepoints. These savepoints are written to the data volumes on persistent storage, for which the storage medium is hard drives. All transactions committed in the SAP HANA database are recorded by the logger of the persistence layer in a log entry written to the log volumes on persistent storage. To get high I/O performance and low latency, log volumes use flash storage.

The relational engines can be accessed through a variety of interfaces. The SAP HANA database supports SQL (JDBC/ODBC), MDX (ODBO), and BICS (SQLDBC). The calculation engine performs all the calculations in the database. No data moves into the application layer until calculations are completed. It also contains the business functions library that is called by applications to perform calculations based on business rules and logic. The SAP HANA-specific SQLScript language is an extension of SQL that can be used to push down data-intensive application logic into the SAP HANA database for specific requirements.
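The interplay of savepoints and log entries described above can be illustrated with a toy model. This is a sketch of the general savepoint-plus-redo-log idea only, not SAP HANA's implementation; the ToyStore class and its method names are invented for the example, and plain Python dicts and lists stand in for the data and log volumes.

```python
class ToyStore:
    """In-memory key/value store with a savepoint and a redo log."""

    def __init__(self):
        self.memory = {}        # current in-memory state
        self.data_volume = {}   # state as of the last savepoint
        self.log_volume = []    # committed changes since that savepoint

    def commit(self, key, value):
        # Log the change before applying it, so a commit survives a crash.
        self.log_volume.append((key, value))
        self.memory[key] = value

    def savepoint(self):
        # Persist the full in-memory state; older log entries become redundant.
        self.data_volume = dict(self.memory)
        self.log_volume.clear()

    def recover(self):
        # After a restart: load the last savepoint, then replay the log.
        self.memory = dict(self.data_volume)
        for key, value in self.log_volume:
            self.memory[key] = value


store = ToyStore()
store.commit("a", 1)
store.savepoint()
store.commit("b", 2)
store.memory = {}   # simulate losing the contents of main memory
store.recover()
print(store.memory)
```

The point of the split is that the log only has to cover changes made since the last savepoint, so recovery replays a short tail of entries instead of the whole history.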
Session management

This component creates and manages sessions and connections for the database clients. When a session is created, a set of parameters is maintained, such as the auto-commit setting and the current transaction isolation level. After establishing a session, database clients communicate with the SAP HANA database using SQL statements. The SAP HANA database treats all statements as transactions while processing them. Each new session is assigned to a new transaction.

Transaction manager

The transaction manager is the component that coordinates database transactions, takes care of controlling transaction isolation, and keeps track of running and closed transactions. The transaction manager informs the involved storage engines about running or closed transactions, so that they can execute the necessary actions when a transaction is committed or rolled back. The transaction manager cooperates with the persistence layer to achieve atomic and durable transactions.

The client requests are analyzed and executed by a set of components summarized as request processing and execution control. Each client request is analyzed by a request parser and then dispatched to the responsible component. The transaction control statements are forwarded to the transaction manager. The data definition statements are sent to the metadata manager. The object invocations are dispatched to the object store. The data manipulation statements are sent to the optimizer, which creates an optimized execution plan that is given to the execution layer.

The SAP HANA database also has built-in support for domain-specific models (such as for the financial planning domain), and it offers scripting capabilities that allow application-specific calculations to run inside the database. It has its own scripting language named SQLScript that is designed to enable optimizations and parallelization.
SQLScript is based on side-effect-free functions that operate on tables by using SQL queries for set processing.

The SAP HANA database also contains a component called the planning engine that allows financial planning applications to execute basic planning operations in the database layer. For example, a new version of a dataset can be created as a copy of an existing one while filters and transformations are applied. Another example of a planning operation is disaggregation, in which target values are distributed from higher to lower aggregation levels based on a distribution function.

Metadata manager

The metadata manager provides access to metadata. The SAP HANA database's metadata consists of a variety of objects, such as definitions of tables, views, and indexes, SQLScript function definitions, and object store metadata. All these types of metadata are stored in one common catalog for all the SAP HANA database stores. Metadata is stored in tables in the row store. SAP HANA features such as transaction support and multi-version concurrency control (MVCC) are also used for metadata management. Central metadata is shared across the servers in the case of a distributed database system. The background mechanism of metadata storage and sharing is hidden from the components that use the metadata manager.

As row-based tables and columnar tables can be combined in one SQL statement, both the row and column engines must be able to consume each other's intermediate results. The main difference between the two engines is the way they process data: the row store operators process data in a row-at-a-time fashion, whereas column store operations (such as scan and aggregate) require the entire column to be available in contiguous memory locations. To exchange intermediate results, the row store provides its results to the column store materialized as complete rows in memory, while the column store can expose its results using the iterator interface needed by the row store.

Persistence layer

The persistence layer is responsible for the durability and atomicity of transactions. It ensures that the database is restored to the most recent committed state after a restart, and makes sure that transactions are either completely executed or completely rolled back. To achieve this in an efficient way, the persistence layer uses a combination of write-ahead logs, shadow paging, and savepoints. Moreover, the persistence layer also offers interfaces for writing and reading data. It also contains SAP HANA's logger, which manages the transaction log.

Authorization manager

The authorization manager is invoked by other SAP HANA database components to check whether users have the required privileges to execute the requested operations. Privileges can be granted to other users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, or execute) on a specified object, such as a table, view, or SQLScript function. Analytic privileges represent filters or hierarchy drill-down limitations for analytic queries. For example, SAP HANA supports analytic privileges that grant access to values with a certain combination of dimension attributes. Users are authenticated either by the SAP HANA database itself (login with username and password), or authentication can be delegated to third-party external authentication providers, such as an LDAP directory.

See also

SAP HANA in-memory analytics and in-memory computing available at http://scn.sap.com/people/vitaliy.rudnytskiy/blog/2011/03/22/time-to-update-your-sap-hana-vocabulary

Summary

This article explains the SAP HANA architecture and the IMCE feature in brief.

Packt
22 Sep 2014
23 min read

Visualization as a Tool to Understand Data

In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding.

In the last few decades, the quick growth in the volume of information we produce and the capacity of digital information storage have opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques.

Along with the size, the types and categories of data have also evolved. In addition to the typical and popular data domains in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science.

Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of. For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results, would give us our desired outcome.
Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Drawing inferences and refining the analysis methods based on them is an iterative process that needs to be carried out several times until a set of hypotheses is formed, a clear question is asked for further analysis, or a question is answered with enough evidence.

Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain insights about their datasets. The power of visualization resides not only in the ease of interpretation, but also in its ability to reveal visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline (validation, verification, analysis, and inference) to aid the data scientist.

How have you visualized your data recently? If you have not done so yet, it is okay, as this book will teach you exactly that. However, if you have already had the opportunity to play with any kind of data, I want you to take a moment and think about the techniques you have used to visualize your data so far. Make a list of them. Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word clouds, parallel coordinates, isosurfaces, and maps somewhere in that list?
If yes, then you are already familiar with some modern visualization techniques, but if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code.

The aim of this book is to introduce a Mathematica beginner to the data-analysis and visualization powerhouse built into Mathematica, and at the same time, familiarize the reader with some of the modern visualization techniques that can be easily built with Mathematica. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization.

The importance of visualization

Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be regarded as visualizations, as they convey historical data through a visual medium. Map visualizations have been commonly used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century are believed to have built the first visualization of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). It appears, therefore, that many people since ancient times have recognized the importance of visualization in giving some meaning to data.

To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes, the data needs to be changed or transformed to the correct form that allows us to use a particular tool.
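One such transformation is a change of coordinates. The following is a minimal Python sketch of the idea (the book itself works in Mathematica; the synthetic points, the radius, the noise level, and the fit_line helper are all assumptions made for this illustration). Points scattered around a circle are converted from Cartesian to polar coordinates, and a straight line r = intercept + slope * theta is then fitted by ordinary least squares.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Assumed synthetic data: 200 noisy points around a circle of radius 2.
rng = random.Random(0)
radius = 2.0
points = []
for i in range(200):
    theta = 2 * math.pi * i / 200
    r = radius + rng.gauss(0, 0.05)
    points.append((r * math.cos(theta), r * math.sin(theta)))

# Change of variables: Cartesian (x, y) -> polar (theta, r).
thetas = [math.atan2(y, x) for x, y in points]
radii = [math.hypot(x, y) for x, y in points]

intercept, slope = fit_line(thetas, radii)
print(f"r = {intercept:.3f} + {slope:.3f} * theta")
```

In polar coordinates the fitted slope should come out close to zero and the intercept close to the radius, which is another way of saying that the points lie on a circle. A straight line fitted to the same points in x and y would tell us very little.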
Visualizing the data points ensures that we do not fall into any trap. The following screenshot shows a polynomial fit that takes the shape of a circle:

Figure 1.1 Fitting a polynomial

In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit if we try to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (with a slope close to zero, so essentially a circle of constant radius) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1.

Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet:

Figure 1.2 Anscombe's quartet, generated using Mathematica

Downloading the color images of this book

We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF.

Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how simple data visualization like plotting can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and gives rise to the same linear model when a regression routine is run on these datasets. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line on the data.
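The claim that all four datasets yield nearly the same summary statistics and the same regression line can be checked numerically. The following is a minimal Python sketch rather than the book's Mathematica notebook; the values are the commonly published Anscombe data, and the fit_line helper and the dictionary layout are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# The first three datasets share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Collect (mean of x, mean of y, intercept, slope) for each dataset.
summary = {}
for name, (xs, ys) in anscombe.items():
    intercept, slope = fit_line(xs, ys)
    summary[name] = (sum(xs) / len(xs), sum(ys) / len(ys), intercept, slope)
    print(f"{name}: y = {intercept:.2f} + {slope:.3f} * x")
```

The four fitted lines agree to about two decimal places, so the numbers alone give no hint that the datasets look completely different when plotted.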
The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again.

These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose as the graph plotting examples above (to spy on and investigate data in order to infer valuable insights), but on types of datasets other than just point clouds.

Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive: the ability to view data from different perspectives (viewing angles) and at different resolutions. These features help the investigator understand both the micro- and macro-level behavior of their dataset.

Types of datasets

There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book.

Tables

The table is one of the most common data structures in Computer Science.
You might have already encountered this in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following table as an example:

          Attribute 1   Attribute 2   …
Item 1
Item 2
Item 3

When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For three-dimensional vectors or more, we could just increase the number of attributes accordingly.

Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea for us to get introduced to them now. Here, we explain the general concepts.

Scalar fields

There are many kinds of scientific datasets out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas in their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers. We will eventually learn how to create molecular visualizations and biomedical dataset exploration tools when we feel comfortable manipulating these datasets.

In practice, multidimensional data (just like vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room.
In order to tackle the problem, they would begin by measuring the geometry and the shape of the room, and put temperature sensors at certain places to measure the temperature. They would note the exact positions of those sensors relative to the room's coordinate system, and then they would be all set to start measuring the temperature. Thus, the temperature of a room can be represented, in a discrete sense, by using a set of points that represent the temperature sensor locations and the actual temperature at those points. We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D:

Figure 1.3 In practice, scalar fields are discrete and ordered

Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field is a point that has {x, y, z} values, along with a temperature value.

A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point (for 3D) as a row, but this time, an additional attribute for the scalar value will be created in the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique to analyze scalar fields is to find out the isocontours or isosurfaces of interest.
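The {x, y, z, T} table representation described above can be sketched in a few lines of Python (used here for illustration; the book builds such tables in Mathematica). The grid spacing and the temperature function are made-up assumptions that stand in for real sensor readings.

```python
from itertools import product


def temperature(x, y, z):
    """Hypothetical sensor model: the room is warmer near the ceiling."""
    return 20.0 + 0.5 * z


# Regularly spaced sample positions along each axis (assumed values).
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0]
zs = [0.0, 1.0, 2.0]

# One table row per grid point: (x, y, z, T).
scalar_field = [
    (x, y, z, temperature(x, y, z)) for x, y, z in product(xs, ys, zs)
]

# A 2D slice of the 3D field: all rows at a fixed value of z.
slice_z1 = [row for row in scalar_field if row[2] == 1.0]

print(len(scalar_field), "rows in the 3D field")
print(len(slice_z1), "rows in the slice at z = 1.0")
```

Selecting rows with a fixed z value recovers one of the 2D slices mentioned in the description of figure 1.3.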
However, for now, let's take a look at the kind of application areas such analysis will enable one to pursue. Instead of temperature, one could think of associating regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space. Thus, the chemist would create a scalar field with charge values at specific points. For an aerospace engineer, it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica.

Time series

Another widely used data type is the time series. A time series is a sequence of data points that are usually measured over a uniform interval of time. Time series arise in many fields, but in today's world, they are mostly known for their applications in Economics and Finance. Other than these, they are frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own visualization tool to view time series data.

Time series can be easily represented using tables. Each row of the time series table will represent one point in the series, with one attribute denoting the timestamp (the time at which the data point was recorded) and the other attribute storing the actual data value. If the starting time and the time interval are known, then we can get rid of the time attribute and simply store the data value in each row.
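Both table layouts for a time series can be sketched as follows. This is a Python illustration under assumed values: the start time, the 15-minute interval, the readings, and the timestamp_of helper are all made up for the example.

```python
from datetime import datetime, timedelta

start = datetime(2014, 9, 22, 9, 0, 0)  # assumed starting time
step = timedelta(minutes=15)            # assumed uniform sampling interval
values = [21.4, 21.9, 22.3, 22.1]       # made-up readings

# Explicit layout: one (timestamp, value) row per data point.
explicit_rows = [(start + i * step, v) for i, v in enumerate(values)]


def timestamp_of(index):
    """Recover the timestamp of the index-th value from start and step."""
    return start + index * step


for index, value in enumerate(values):
    print(timestamp_of(index).isoformat(), value)
```

With a uniform interval, storing only the values plus the start time and the step loses no information, because every timestamp can be recomputed from the row index.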
The actual timestamp of each value can be calculated using the initial time and time interval.

Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data.

Graphs

Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find useful insights about the dataset. Technically, a graph can be stored as a table. However, Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem, and is an active research area in computer science. A proper visualization layout, along with proper color maps and size distribution, can produce very useful outputs.

Text

The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific visualization package for state-of-the-art text visualization methods.

Cartographic data

As mentioned before, map visualization is one of the most ancient forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps are providing an excellent way to contrast and compare different countries, cities, or even communities. Cartographic data comes in various forms. A common form of a single data item is one that includes latitude, longitude, location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place.
The attributable quantity may not be numerical, but rather something qualitative, like text. Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us.

Mathematica as a tool for visualization

At this point, you might be wondering why Mathematica is suited for visualizing all the kinds of datasets that we have mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user.

Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the worksheet. This makes Mathematica's capability similar to many widely used visualization languages. Mathematica provides a plethora of functions to combine these primitives and make them interactive.

Speaking of interactivity, Mathematica provides a suite of functions to interactively display any of its processes. Not only visualization, but any function or code evaluation can be interactively visualized. This is particularly helpful when managing and visualizing big datasets.

Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Other than these, we will learn to use the advanced functions that will let us build our own visualization tools.

Another interesting feature is the built-in datasets that this software provides to its users. This feature provides a nice playground for the user to experiment with different datasets and visualization functions.
From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited for dealing with petabytes or exabytes of data (and many other popularly used visualization tools are not suited for that either), often, one needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototype such tools because of its efficient and fast data-handling capabilities, along with its loads of convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions.

Another feature that Mathematica presents to a data scientist is the ability to keep the workflow within one worksheet. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite of a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features inside a single worksheet. This helps the user focus on the data instead of irrelevant details.

By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs. If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will be done developing several visualization prototypes, each requiring only a few lines of code!

Getting started with Mathematica

We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, let's open a new notebook by navigating to File|New|Notebook, and do the following experiments.
Creating and selecting cells

In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on the respective rightmost bracket. To select multiple cells, press Ctrl + right-mouse button in Windows or Linux (or cmd + right-mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell:

Figure 1.4 Selecting and evaluating cells in Mathematica

We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor in between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V in Mac) or using the Edit menu bar. In order to delete cell(s), select the cell(s) and press the Delete key.

Evaluating a cell

A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter. In this case, the cells will be evaluated one after the other in the sequence in which they appear in the notebook. To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article, which are the polar coordinates fitting and Anscombe's quartet examples, and select each cell (or all of them) and evaluate them.

If we evaluate a cell that uses variables declared in a previous cell, and the previous cell was not already evaluated, then we may get errors.
It is possible that Mathematica will treat the unevaluated variables as symbolic expressions; in that case, no error will be displayed, but the results will not be numeric anymore.

Suppressing output from a cell

If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line that we want to leave out from the output.

Cell formatting

Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert the cell to other formats for convenient typesetting. In order to change the format of cell(s), select the cell(s) and navigate to Format|Style from the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes|Basic Math Assistant. Note that evaluating a text cell will have no effect/output.

Commenting

We can write any comment in a text cell as it will be ignored during the evaluation of our code. However, if we would like to write a comment inside an input cell, we use (* to open the comment and *) to close it, as shown in the following code snippet:

    (* This is a comment *)

The shortcut Ctrl + / (cmd + / in Mac) is used to comment/uncomment a chunk of code too. This operation is also available in the menu bar.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Aborting evaluation

We can abort the currently running evaluation of a cell by navigating to Evaluation|Abort Evaluation in the menu bar, or simply by pressing Alt + . (period).
This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results at the end of the evaluation, or end a process that might use up the available memory and shut down the Mathematica kernel.

Further reading

The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing very strongly. Michael Friendly, from York University, published a historical development paper that is freely available online, titled Milestones in the History of Data Visualization: A Case Study in Statistical Historiography. This is an entertaining compilation of the history of visualization methods. The book The Visual Display of Quantitative Information by Edward R. Tufte, published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. This is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader can consider reading this book for deeper insights.

Summary

In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of datasets that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package was discussed, and we will see these properties demonstrated throughout the book. Finally, we have taken the first step to writing code in Mathematica.
Amey Varangaonkar
05 Dec 2017
4 min read

Implementing Deep Learning with Keras

Note: The following excerpt is from Chapter 5 of Deep Learning with Theano, written by Christopher Bourez. The book offers a complete overview of deep learning with Theano, a Python-based library that makes optimizing numerical expressions and deep learning models easy on CPU or GPU.

In this article, we introduce the highly popular deep learning library Keras, which sits on top of both Theano and TensorFlow. It is a flexible platform for training deep learning models with ease.

Keras is a high-level neural network API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed to make implementing deep learning models as fast and easy as possible for research and development. You can install Keras using conda, as follows:

    conda install keras

When writing your Python code, importing Keras will tell you which backend is used:

    >>> import keras
    Using Theano backend.
    Using cuDNN version 5110 on context None
    Preallocating 10867/11439 Mb (0.950000) on cuda0
    Mapped name None to device cuda0: Tesla K80 (0000:83:00.0)
    Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
    Using cuDNN version 5110 on context dev1
    Preallocating 10867/11439 Mb (0.950000) on cuda1
    Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)

If you have installed TensorFlow, Keras might not use Theano. To specify which backend to use, write a Keras configuration file, ~/.keras/keras.json:

    {
        "epsilon": 1e-07,
        "floatx": "float32",
        "image_data_format": "channels_last",
        "backend": "theano"
    }

It is also possible to specify the Theano backend directly with the KERAS_BACKEND environment variable:

    KERAS_BACKEND=theano python

Note that the device used is the device we specified for Theano in the ~/.theanorc file.
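The backend selection described above can be sketched in plain Python. This is only an illustration, not Keras's own code: the helper names write_keras_config and resolve_backend are hypothetical, the file is written to a temporary directory rather than to the real ~/.keras folder, and the rule that KERAS_BACKEND takes priority over the file is stated here as an assumption about how Keras behaves.

```python
import json
import os
import tempfile


def write_keras_config(path, backend="theano"):
    """Write a keras.json-style file with the four fields shown in the article."""
    config = {
        "epsilon": 1e-07,
        "floatx": "float32",
        "image_data_format": "channels_last",
        "backend": backend,
    }
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config


def resolve_backend(path, environ):
    """Hypothetical helper: the environment variable, if set, wins over the file."""
    if environ.get("KERAS_BACKEND"):
        return environ["KERAS_BACKEND"]
    with open(path) as handle:
        return json.load(handle)["backend"]


# Use a throwaway directory so the real ~/.keras/keras.json is never touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".keras", "keras.json")
    write_keras_config(demo_path)
    from_file = resolve_backend(demo_path, {})
    from_env = resolve_backend(demo_path, {"KERAS_BACKEND": "tensorflow"})

print(from_file, from_env)  # theano tensorflow
```

Passing the environment as a plain dictionary keeps the sketch testable without changing the real process environment.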
It is also possible to modify these variables with Theano environment variables:

    KERAS_BACKEND=theano THEANO_FLAGS=device=cuda,floatX=float32,mode=FAST_RUN python

Programming with Keras

Keras provides a set of methods for data pre-processing and for building models. Layers and models are callable functions on tensors and return tensors. In Keras, there is no difference between a layer/module and a model: a model can be part of a bigger model and composed of multiple layers. Such a sub-model behaves as a module, with inputs/outputs.

Let's create a network with two linear layers, a ReLU non-linearity in between, and a softmax output:

    from keras.layers import Input, Dense
    from keras.models import Model

    inputs = Input(shape=(784,))
    x = Dense(64, activation='relu')(inputs)
    predictions = Dense(10, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=predictions)

The model module contains methods to get the input and output shapes for either one or multiple inputs/outputs, and to list the submodules of our module:

    >>> model.input_shape
    (None, 784)
    >>> model.get_input_shape_at(0)
    (None, 784)
    >>> model.output_shape
    (None, 10)
    >>> model.get_output_shape_at(0)
    (None, 10)
    >>> model.name
    'Sequential_1'
    >>> model.input
    /dense_3_input
    >>> model.output
    Softmax.0
    >>> model.get_output_at(0)
    Softmax.0
    >>> model.layers
    [<keras.layers.core.Dense object at 0x7f0abf7d6a90>, <keras.layers.core.Dense object at 0x7f0abf74af90>]

To avoid specifying the inputs of every layer, Keras also offers the Sequential module, which builds a new module or model by stacking layers one after another.
The following definition builds exactly the same model as shown previously, with input_dim specifying the input dimension of the first block, which would otherwise be unknown and generate an error:

    from keras.models import Sequential
    from keras.layers import Dense, Activation

    model = Sequential()
    model.add(Dense(units=64, input_dim=784, activation='relu'))
    model.add(Dense(units=10, activation='softmax'))

The model is considered a module or layer that can be part of a bigger model:

    model2 = Sequential()
    model2.add(model)
    model2.add(Dense(units=10, activation='softmax'))

Each module/model/layer can then be compiled and trained with data:

    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(data, labels)

Thus, we see it is fairly easy to train a model in Keras. The simplicity and ease of use that Keras offers make it a very popular tool for deep learning.

If you think the article is useful, check out the book Deep Learning with Theano for interesting deep learning concepts and their implementation using Theano. For more information on the Keras library and how to train efficient deep learning models, make sure to check our highly popular title Deep Learning with Keras.
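As a quick sanity check on the 784-64-10 network built in this article, the number of trainable parameters can be worked out by hand. The sketch below uses no Keras at all; dense_params is a hypothetical helper that assumes each fully connected layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer widths of the network in the article: input, hidden, output.
layer_sizes = [784, 64, 10]

# Pair each layer width with the next one to get (inputs, units) per Dense layer.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [50240, 650] 50890
```

The first layer dominates the total, which is typical when a wide input feeds a much narrower hidden layer.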

Packt
03 Mar 2015
18 min read

PostgreSQL as an Extensible RDBMS

This article by Usama Dar, the author of the book PostgreSQL Server Programming - Second Edition, explains the process of creating a new operator, overloading it, optimizing it, creating index access methods, and much more. PostgreSQL is an extensible database. I hope you've learned this much by now. It is extensible by virtue of the design that it has. As discussed before, PostgreSQL uses a catalog-driven design. In fact, PostgreSQL is more catalog-driven than most of the traditional relational databases. The key benefit here is that the catalogs can be changed or added to, in order to modify or extend the database functionality. PostgreSQL also supports dynamic loading, that is, a user-written code can be provided as a shared library, and PostgreSQL will load it as required. (For more resources related to this topic, see here.) Extensibility is critical for many businesses, which have needs that are specific to that business or industry. Sometimes, the tools provided by the traditional database systems do not fulfill those needs. People in those businesses know best how to solve their particular problems, but they are not experts in database internals. It is often not possible for them to cook up their own database kernel or modify the core or customize it according to their needs. A truly extensible database will then allow you to do the following: Solve domain-specific problems in a seamless way, like a native solution Build complete features without modifying the core database engine Extend the database without interrupting availability PostgreSQL not only allows you to do all of the preceding things, but also does these, and more with utmost ease. 
In terms of extensibility, you can do the following things in a PostgreSQL database: Create your own data types Create your own functions Create your own aggregates Create your own operators Create your own index access methods (operator classes) Create your own server programming language Create foreign data wrappers (SQL/MED) and foreign tables What can't be extended? Although PostgreSQL is an extensible platform, there are certain things that you can't do or change without explicitly doing a fork, as follows: You can't change or plug in a new storage engine. If you are coming from the MySQL world, this might annoy you a little. However, PostgreSQL's storage engine is tightly coupled with its executor and the rest of the system, which has its own benefits. You can't plug in your own planner/parser. One can argue for and against the ability to do that, but at the moment, the planner, parser, optimizer, and so on are baked into the system and there is no possibility of replacing them. There has been some talk on this topic, and if you are of the curious kind, you can read some of the discussion at http://bit.ly/1yRMkK7. We will now briefly discuss some more of the extensibility capabilities of PostgreSQL. We will not dive deep into the topics, but we will point you to the appropriate link where more information can be found. Creating a new operator Now, let's take look at how we can add a new operator in PostgreSQL. Adding new operators is not too different from adding new functions. In fact, an operator is syntactically just a different way to use an existing function. For example, the + operator calls a built-in function called numeric_add and passes it the two arguments. When you define a new operator, you must define the data types that the operator expects as arguments and define which function is to be called. Let's take a look at how to define a simple operator. You have to use the CREATE OPERATOR command to create an operator. 
Let's use an existing fib function to create a new Fibonacci operator, ##, which will have an integer on its left-hand side:

    CREATE OPERATOR ## (PROCEDURE=fib, LEFTARG=integer);

Now, you can use this operator in your SQL to calculate a Fibonacci number:

    testdb=# SELECT 12##;
     ?column?
    ----------
          144
    (1 row)

Note that we defined the operator to have an integer on its left-hand side. If you try to put the value on the right-hand side of the operator, you will get an error:

    postgres=# SELECT ##12;
    ERROR: operator does not exist: ## integer at character 8
    HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
    STATEMENT: select ##12;
    ERROR: operator does not exist: ## integer
    LINE 1: select ##12;
                   ^
    HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

Overloading an operator

Operators can be overloaded in the same way as functions. This means that an operator can have the same name as an existing operator but with a different set of argument types. More than one operator can have the same name, but two operators can't share the same name if they accept the same types and positions of the arguments. As long as there is a function that accepts the same kind and number of arguments that an operator defines, it can be overloaded.

Let's overload the ## operator we defined in the last example, and add the ability to provide an integer on the right-hand side of the operator:

    CREATE OPERATOR ## (PROCEDURE=fib, RIGHTARG=integer);

Now, running the same SQL, which resulted in an error last time, should succeed, as shown here:

    testdb=# SELECT ##12;
     ?column?
    ----------
          144
    (1 row)

You can drop the operator using the DROP OPERATOR command. You can read more about creating and overloading new operators in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/sql-createoperator.html and http://www.postgresql.org/docs/current/static/xoper.html.
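The fib function that the ## operator calls is not shown in this article. As a rough idea of what such a function computes, here is a Python sketch; it is an assumption made for illustration, not the book's SQL definition, and it uses the convention fib(1) = fib(2) = 1, which is consistent with the result 144 shown for 12.

```python
def fib(n):
    """Return the n-th Fibonacci number, with fib(1) == fib(2) == 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    previous, current = 0, 1
    # After k iterations, previous holds fib(k).
    for _ in range(n):
        previous, current = current, previous + current
    return previous


print(fib(12))  # 144
```

In the database, the operator is only a different call syntax: both 12## and ##12 end up invoking the same function with the same single integer argument.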
There are several optional clauses in the operator definition that can optimize the execution time of the operators by providing information about operator behavior. For example, you can specify the commutator and the negator of an operator, which help the planner use the operators in index scans. You can read more about these optional clauses at http://www.postgresql.org/docs/current/static/xoper-optimization.html. Since this article is just an introduction to the additional extensibility capabilities of PostgreSQL, we will introduce only a couple of optimization options; any serious production-quality operator definition should include these optimization clauses, if applicable.

Optimizing operators

The optional clauses tell the PostgreSQL server how the operators behave. These options can result in considerable speedups in the execution of queries that use the operator. However, if you provide these options incorrectly, it can result in a slowdown of the queries. Let's take a look at two optimization clauses called commutator and negator.

COMMUTATOR

This clause defines the commutator of the operator. An operator A is a commutator of operator B if it fulfils the following condition: x A y = y B x. It is important to provide this information for the operators that will be used in indexes and joins. As an example, the commutator of > is <, and the commutator of = is = itself. This helps the optimizer to flip the operator in order to use an index. For example, consider the following query:

    SELECT * FROM employee WHERE new_salary > salary;

If the index is defined on the salary column, then PostgreSQL can rewrite the preceding query as shown:

    SELECT * FROM employee WHERE salary < new_salary;

This allows PostgreSQL to use a range scan on the index column salary.
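The commutator condition x A y = y B x is easy to check by brute force on a handful of values. The Python sketch below is only an illustration of the definition; is_commutator_pair is a hypothetical helper, and the standard-library operator functions stand in for the SQL operators.

```python
import operator


def is_commutator_pair(op_a, op_b, samples):
    """True if x op_a y equals y op_b x for every pair drawn from samples."""
    return all(op_a(x, y) == op_b(y, x) for x in samples for y in samples)


samples = [-2, 0, 1, 5]

gt_lt = is_commutator_pair(operator.gt, operator.lt, samples)  # > and <
eq_eq = is_commutator_pair(operator.eq, operator.eq, samples)  # = with itself
gt_ge = is_commutator_pair(operator.gt, operator.ge, samples)  # > and >= (not a pair)

print(gt_lt, eq_eq, gt_ge)  # True True False
```

The last case fails as soon as x equals y, which is the kind of mistake that declaring a wrong COMMUTATOR would bake into the planner's rewrites.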
For a user-defined operator, the optimizer can only do this flip if the commutator of the operator is defined:

    CREATE OPERATOR > (LEFTARG=integer, RIGHTARG=integer, PROCEDURE=comp, COMMUTATOR = <)

NEGATOR

The negator clause defines the negator of the operator. For example, <> is a negator of =. Consider the following query:

    SELECT * FROM employee WHERE NOT (dept = 10);

Since <> is defined as a negator of =, the optimizer can simplify the preceding query as follows:

    SELECT * FROM employee WHERE dept <> 10;

You can even verify that using the EXPLAIN command:

    postgres=# EXPLAIN SELECT * FROM employee WHERE NOT dept = 'WATER MGMNT';
                              QUERY PLAN
    ---------------------------------------------------------
     Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
       Filter: ((dept)::text <> 'WATER MGMNT'::text)
       Foreign File: /Users/usamadar/testdata.csv
       Foreign File Size: 197
    (4 rows)

Creating index access methods

Let's discuss how to index new data types or user-defined types and operators. In PostgreSQL, an index is more of a framework that can be extended or customized for using different strategies. In order to create new index access methods, we have to create an operator class. Let's take a look at a simple example.

Let's consider a scenario where you have to store some special data such as an ID or a social security number in the database. The number may contain non-numeric characters, so it is defined as a text type:

    CREATE TABLE test_ssn (ssn text);
    INSERT INTO test_ssn VALUES ('222-11-020878');
    INSERT INTO test_ssn VALUES ('111-11-020978');

Let's assume that the correct order for this data is such that it should be sorted on the last six digits and not the ASCII value of the string. The fact that these numbers need a special sort order presents a challenge when it comes to indexing the data. This is where PostgreSQL operator classes are useful. An operator class allows a user to create a custom indexing strategy.
Creating an indexing strategy is about creating your own operators and using them alongside a normal B-tree. Let's start by writing a function that changes the order of digits in the value and also gets rid of the non-numeric characters in the string to be able to compare them better:

    CREATE OR REPLACE FUNCTION fix_ssn(text)
    RETURNS text AS $$
    BEGIN
      RETURN substring($1,8) || replace(substring($1,1,7),'-','');
    END;
    $$ LANGUAGE 'plpgsql' IMMUTABLE;

Let's run the function and verify that it works:

    testdb=# SELECT fix_ssn(ssn) FROM test_ssn;
       fix_ssn
    -------------
     02087822211
     02097811111
    (2 rows)

Before an index can be used with a new strategy, we may have to define some more functions depending on the type of index. In our case, we are planning to use a simple B-tree, so we need a comparison function:

    CREATE OR REPLACE FUNCTION ssn_compareTo(text, text)
    RETURNS int AS $$
    BEGIN
      IF fix_ssn($1) < fix_ssn($2) THEN
        RETURN -1;
      ELSIF fix_ssn($1) > fix_ssn($2) THEN
        RETURN +1;
      ELSE
        RETURN 0;
      END IF;
    END;
    $$ LANGUAGE 'plpgsql' IMMUTABLE;

It's now time to create our operator class:

    CREATE OPERATOR CLASS ssn_ops
    FOR TYPE text USING btree
    AS
      OPERATOR 1 < ,
      OPERATOR 2 <= ,
      OPERATOR 3 = ,
      OPERATOR 4 >= ,
      OPERATOR 5 > ,
      FUNCTION 1 ssn_compareTo(text, text);

You can also overload the comparison operators if you need to compare the values in a special way, and use the functions in the compareTo function as well as provide them in the CREATE OPERATOR CLASS command.
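To see what ordering fix_ssn and ssn_compareTo produce without a database at hand, the same logic can be mirrored in Python. This is a sketch for illustration only: ssn_compare is a hypothetical stand-in for ssn_compareTo, and the slices translate PostgreSQL's 1-based substring positions into Python's 0-based ones.

```python
from functools import cmp_to_key


def fix_ssn(ssn):
    """Move the last part to the front and drop the dashes from the first part."""
    # substring($1, 8) is 1-based, so it corresponds to ssn[7:];
    # substring($1, 1, 7) corresponds to ssn[:7].
    return ssn[7:] + ssn[:7].replace("-", "")


def ssn_compare(left, right):
    """Return -1, 0 or +1, comparing the rearranged values as text."""
    a, b = fix_ssn(left), fix_ssn(right)
    return (a > b) - (a < b)


rows = ["111-11-020978", "222-11-020878"]
fixed = [fix_ssn(r) for r in rows]
ordered = sorted(rows, key=cmp_to_key(ssn_compare))

print(fixed)    # ['02097811111', '02087822211']
print(ordered)  # ['222-11-020878', '111-11-020978']
```

Sorting by the raw text would put 111-11-020978 first; the custom comparison reverses that because 020878 sorts before 020978.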
We will now create our first index using our brand new operator class:

    CREATE INDEX idx_ssn ON test_ssn (ssn ssn_ops);

We can check whether the optimizer is willing to use our special index, as follows:

    testdb=# SET enable_seqscan=off;
    testdb=# EXPLAIN SELECT * FROM test_ssn WHERE ssn = '02087822211';
                                QUERY PLAN
    ------------------------------------------------------------------
     Index Only Scan using idx_ssn on test_ssn (cost=0.13..8.14 rows=1 width=32)
       Index Cond: (ssn = '02087822211'::text)
    (2 rows)

Therefore, we can confirm that the optimizer is able to use our new index. You can read about index access methods in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xindex.html.

Creating user-defined aggregates

User-defined aggregate functions are probably a unique PostgreSQL feature, yet they are quite obscure and perhaps not many people know how to create them. However, once you are able to create such functions, you will wonder how you have lived for so long without them.

This functionality can be incredibly useful, because it allows you to perform custom aggregates inside the database, instead of fetching all the data to the client and doing a custom aggregate in your application code, for example, to count the number of hits on your website per minute from a specific country.

PostgreSQL has a very simple process for defining aggregates. Aggregates can be defined using any functions and in any languages that are installed in the database. Here are the basic steps to building an aggregate function in PostgreSQL:

1. Define a start function that will take in the values of a result set; this function can be defined in any PL language you want.
2. Define an end function that will do something with the final output of the start function. This can be in any PL language you want.
3. Define the aggregate using the CREATE AGGREGATE command, providing the start and end functions you just created.
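The three steps above can be sketched outside the database to show how the pieces fit together. The Python below is an illustration only, using a median as the example aggregate: run_aggregate is a hypothetical driver that plays the role PostgreSQL plays internally, array_append here is a plain Python function rather than the PostgreSQL built-in, and final_median is written directly instead of translating any particular SQL definition.

```python
def array_append(state, value):
    """Start (state) function: collect each incoming value into a list."""
    return state + [value]


def final_median(state):
    """End (final) function: median of the collected non-null values."""
    values = sorted(v for v in state if v is not None)
    if not values:
        return None  # nothing to aggregate
    middle = len(values) // 2
    if len(values) % 2:
        return float(values[middle])
    return (values[middle - 1] + values[middle]) / 2.0


def run_aggregate(rows, sfunc, finalfunc, initcond):
    """Feed every row through sfunc, then hand the final state to finalfunc."""
    state = initcond
    for row in rows:
        state = sfunc(state, row)
    return finalfunc(state)


result = run_aggregate(range(1, 11), array_append, final_median, [])
print(result)  # 5.5
```

The initial state (an empty list here) corresponds to the INITCOND option of CREATE AGGREGATE, and the list type corresponds to STYPE.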
Let's steal an example from the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Aggregate_Median. In this example, we will calculate the statistical median of a set of data. For this purpose, we will define start and end aggregate functions.

Let's define the end function first, which takes an array as a parameter and calculates the median. We are assuming here that our start function will pass an array to the following end function:

    CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$
      WITH q AS (
        SELECT val
        FROM unnest($1) val
        WHERE VAL IS NOT NULL
        ORDER BY 1
      ),
      cnt AS (
        SELECT COUNT(*) AS c FROM q
      )
      SELECT AVG(val)::float8
      FROM (
        SELECT val FROM q
        LIMIT 2 - MOD((SELECT c FROM cnt), 2)
        OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1, 0)
      ) q2;
    $$ LANGUAGE sql IMMUTABLE;

Now, we create the aggregate as shown in the following code:

    CREATE AGGREGATE median(anyelement) (
      SFUNC=array_append,
      STYPE=anyarray,
      FINALFUNC=_final_median,
      INITCOND='{}'
    );

The array_append start function is already defined in PostgreSQL. This function appends an element to the end of an array. In our example, the start function takes all the column values and creates an intermediate array. This array is passed on to the end function, which calculates the median.

Now, let's create a table and some test data to run our function:

    testdb=# CREATE TABLE median_test(t integer);
    CREATE TABLE
    testdb=# INSERT INTO median_test SELECT generate_series(1,10);
    INSERT 0 10

The generate_series function is a set-returning function that generates a series of values, from start to stop with a step size of one. Now, we are all set to test the function:

    testdb=# SELECT median(t) FROM median_test;
     median
    --------
        5.5
    (1 row)

The mechanics of the preceding example are quite easy to understand. When you run the aggregate, the start function is used to append all the table data from column t into an array using the array_append PostgreSQL built-in.
This array is passed on to the final function, _final_median, which calculates the median of the array and returns the result as a float8 value. This process is done transparently to the user of the function, who simply has a convenient aggregate function available to them. You can read more about user-defined aggregates in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xaggr.html.

Using foreign data wrappers

PostgreSQL foreign data wrappers (FDW) are an implementation of SQL Management of External Data (SQL/MED), which is a standard added to SQL in 2003. FDWs are drivers that allow PostgreSQL database users to read and write data to other external data sources, such as other relational databases, NoSQL data sources, files, JSON, LDAP, and even Twitter. You can query the foreign data sources using SQL and create joins across different systems or even across different data sources.

There are several different types of data wrappers developed by different developers and not all of them are production quality. You can see a select list of wrappers on the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Foreign_data_wrappers. Another list of FDWs can be found on PGXN at http://pgxn.org/tag/fdw/.

Let's take a look at a small example of using file_fdw to access data in a CSV file. First, you need to install the file_fdw extension. If you compiled PostgreSQL from the source, you will need to install the file_fdw contrib module that is distributed with the source. You can do this by going into the contrib/file_fdw folder and running make and make install. If you used an installer or a package for your platform, this module might have been installed automatically.
Once the file_fdw module is installed, you will need to create the extension in the database:

    postgres=# CREATE EXTENSION file_fdw;
    CREATE EXTENSION

Let's now create a sample CSV file that uses the pipe, |, as a separator and contains some employee data:

    $ cat testdata.csv
    AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00
    AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00
    AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET MANAGEMNT|77280.00|70174.00

Now, we should create a foreign server, which is pretty much a formality because the file is on the same server. A foreign server normally contains the connection information that a foreign data wrapper uses to access an external data resource. The server name needs to be unique within the database:

    CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

The next step is to create a foreign table that encapsulates our CSV file:

    CREATE FOREIGN TABLE employee (
      emp_name VARCHAR,
      job_title VARCHAR,
      dept VARCHAR,
      salary NUMERIC,
      sal_after_tax NUMERIC
    ) SERVER file_server
    OPTIONS (format 'csv', header 'false', filename '/home/pgbook/14/testdata.csv', delimiter '|', null '');

The CREATE FOREIGN TABLE command creates a foreign table, and the specifications of the file are provided in the OPTIONS section of the preceding code. You can provide the format and state whether the first line of the file is a header; in our case (header 'false') there is no file header. We then provide the name and path of the file and the delimiter used in the file, which in our case is the pipe symbol |. In this example, we also specify that the null values should be represented as an empty string.
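The options given to the foreign table (pipe delimiter, no header row, empty string as null) describe a parsing job that can be reproduced in a few lines of Python, which may help when checking a data file before pointing file_fdw at it. This is a sketch under stated assumptions: read_employees is a hypothetical helper, the sample rows are the ones from this article held in a string instead of a file, and Decimal stands in for the NUMERIC column type.

```python
import csv
import io
from decimal import Decimal

SAMPLE = (
    "AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00\n"
    "AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00\n"
    "AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET MANAGEMNT|77280.00|70174.00\n"
)

COLUMNS = ["emp_name", "job_title", "dept", "salary", "sal_after_tax"]
NUMERIC_COLUMNS = {"salary", "sal_after_tax"}


def read_employees(text):
    """Parse pipe-delimited rows with no header; empty fields become None."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="|"):
        row = {}
        for name, raw in zip(COLUMNS, fields):
            if raw == "":
                row[name] = None          # mirrors the option: null ''
            elif name in NUMERIC_COLUMNS:
                row[name] = Decimal(raw)  # mirrors the NUMERIC columns
            else:
                row[name] = raw
        rows.append(row)
    return rows


employees = read_employees(SAMPLE)
print(len(employees), employees[0]["emp_name"], employees[0]["salary"])
# 3 AARON, ELVIA J 81000.00
```

Because the delimiter is a pipe, the commas inside the employee names are ordinary characters and need no quoting.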
Let's run a SQL command on our foreign table:

    postgres=# select * from employee;
    -[ RECORD 1 ]-+-------------------------
    emp_name      | AARON, ELVIA J
    job_title     | WATER RATE TAKER
    dept          | WATER MGMNT
    salary        | 81000.00
    sal_after_tax | 73862.00
    -[ RECORD 2 ]-+-------------------------
    emp_name      | AARON, JEFFERY M
    job_title     | POLICE OFFICER
    dept          | POLICE
    salary        | 74628.00
    sal_after_tax | 74628.00
    -[ RECORD 3 ]-+-------------------------
    emp_name      | AARON, KIMBERLEI R
    job_title     | CHIEF CONTRACT EXPEDITER
    dept          | FLEET MANAGEMNT
    salary        | 77280.00
    sal_after_tax | 70174.00

Great, it looks like our data is successfully loaded from the file. You can also use the \d meta-command to see the structure of the employee table:

    postgres=# \d employee;
                 Foreign table "public.employee"
        Column     |       Type        | Modifiers | FDW Options
    ---------------+-------------------+-----------+-------------
     emp_name      | character varying |           |
     job_title     | character varying |           |
     dept          | character varying |           |
     salary        | numeric           |           |
     sal_after_tax | numeric           |           |
    Server: file_server
    FDW Options: (format 'csv', header 'false', filename '/home/pgbook/14/testdata.csv', delimiter '|', "null" '')

You can run EXPLAIN on the query to understand what is going on when you run a query on the foreign table:

    postgres=# EXPLAIN SELECT * FROM employee WHERE salary > 5000;
                              QUERY PLAN
    ---------------------------------------------------------
     Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
       Filter: (salary > 5000::numeric)
       Foreign File: /home/pgbook/14/testdata.csv
       Foreign File Size: 197
    (4 rows)

The ALTER FOREIGN TABLE command can be used to modify the options. More information about file_fdw is available at http://www.postgresql.org/docs/current/static/file-fdw.html.

You can take a look at the CREATE SERVER and CREATE FOREIGN TABLE commands in the PostgreSQL documentation for more information on the many options available. Each of the foreign data wrappers comes with its own documentation about how to use the wrapper.
Make sure that an extension is stable enough before it is used in production. The PostgreSQL core development group does not support most of the FDW extensions. If you want to create your own data wrappers, the documentation at http://www.postgresql.org/docs/current/static/fdwhandler.html is an excellent starting point. The best way to learn, however, is to read the code of other available extensions.

Summary

PostgreSQL's extensibility includes the ability to add new operators and new index access methods, and to create your own aggregates. You can access foreign data sources, such as other databases, files, and web services, using PostgreSQL foreign data wrappers. These wrappers are provided as extensions and should be used with caution, as most of them are not officially supported. Even though PostgreSQL is very extensible, you can't plug in a new storage engine or change the parser/planner and executor interfaces. These components are very tightly coupled with each other and are, therefore, highly optimized and mature.

Packt
21 Jan 2015
53 min read

Highcharts Configurations

This article is written by Joe Kuan, the author of Learning Highcharts 4. All Highcharts graphs share the same configuration structure and it is crucial for us to become familiar with the core components. However, it is not possible to go through all the configurations within the book. In this article, we will explore the functional properties that are most used and demonstrate them with examples. We will learn how Highcharts manages layout, and then explore how to configure axes, specify single series and multiple series data, followed by looking at formatting and styling tool tips in both JavaScript and HTML. After that, we will get to know how to polish our charts with various types of animations and apply color gradients. Finally, we will explore the drilldown interactive feature. In this article, we will cover the following topics: Understanding Highcharts layout Framing the chart with axes (For more resources related to this topic, see here.) Configuration structure In the Highcharts configuration object, the components at the top level represent the skeleton structure of a chart. 
The following is a list of the major components that are covered in this article: chart: This has configurations for the top-level chart properties such as layouts, dimensions, events, animations, and user interactions series: This is an array of series objects (consisting of data and specific options) for single and multiple series, where the series data can be specified in a number of ways xAxis/yAxis/zAxis: This has configurations for all the axis properties such as labels, styles, range, intervals, plotlines, plot bands, and backgrounds tooltip: This has the layout and format style configurations for the series data tool tips drilldown: This has configurations for drilldown series and the ID field associated with the main series title/subtitle: This has the layout and style configurations for the chart title and subtitle legend: This has the layout and format style configurations for the chart legend plotOptions: This contains all the plotting options, such as display, animation, and user interactions, for common series and specific series types exporting: This has configurations that control the layout and the function of print and export features For reference information concerning all configurations, go to http://api.highcharts.com. Understanding Highcharts' layout Before we start to learn how Highcharts layout works, it is imperative that we understand some basic concepts first. First, set a border around the plot area. To do that we can set the options of plotBorderWidth and plotBorderColor in the chart section, as follows:         chart: {                renderTo: 'container',                type: 'spline',                plotBorderWidth: 1,                plotBorderColor: '#3F4044'        }, The second border is set around the Highcharts container. Next, we extend the preceding chart section with additional settings:         chart: {                renderTo: 'container',                ....                
borderColor: '#a1a1a1',                borderWidth: 2,                borderRadius: 3        }, This sets the container border color with a width of 2 pixels and corner radius of 3 pixels. As we can see, there is a border around the container and this is the boundary that the Highcharts display cannot exceed: By default, Highcharts displays have three different areas: spacing, labeling, and plot area. The plot area is the area inside the inner rectangle that contains all the plot graphics. The labeling area is the area where labels such as title, subtitle, axis title, legend, and credits go, around the plot area, so that it is between the edge of the plot area and the inner edge of the spacing area. The spacing area is the area between the container border and the outer edge of the labeling area. The following screenshot shows three different kinds of areas. A gray dotted line is inserted to illustrate the boundary between the spacing and labeling areas. Each chart label position can be operated in one of the following two layouts: Automatic layout: Highcharts automatically adjusts the plot area size based on the labels' positions in the labeling area, so the plot area does not overlap with the label element at all. Automatic layout is the simplest way to configure, but has less control. This is the default way of positioning the chart elements. Fixed layout: There is no concept of labeling area. The chart label is specified in a fixed location so that it has a floating effect on the plot area. In other words, the plot area side does not automatically adjust itself to the adjacent label position. This gives the user full control of exactly how to display the chart. The spacing area controls the offset of the Highcharts display on each side. As long as the chart margins are not defined, increasing or decreasing the spacing area has a global effect on the plot area measurements in both automatic and fixed layouts. 
Chart margins and spacing settings In this section, we will see how chart margins and spacing settings have an effect on the overall layout. Chart margins can be configured with the properties margin, marginTop, marginLeft, marginRight, and marginBottom, and they are not enabled by default. Setting chart margins has a global effect on the plot area, so that none of the label positions or chart spacing configurations can affect the plot area size. Hence, all the chart elements are in a fixed layout mode with respect to the plot area. The margin option is an array of four margin values covered for each direction, the same as in CSS, starting from north and going clockwise. Also, the margin option has a lower precedence than any of the directional margin options, regardless of their order in the chart section. Spacing configurations are enabled by default with a fixed value on each side. These can be configured in the chart section with the property names spacing, spacingTop, spacingLeft, spacingBottom, and spacingRight. In this example, we are going to increase or decrease the margin or spacing property on each side of the chart and observe the effect. The following are the chart settings:             chart: {                renderTo: 'container',                type: ...                marginTop: 10,                marginRight: 0,                spacingLeft: 30,                spacingBottom: 0            }, The following screenshot shows what the chart looks like: The marginTop property fixes the plot area's top border 10 pixels away from the container border. It also changes the top border into fixed layout for any label elements, so the chart title and subtitle float on top of the plot area. The spacingLeft property increases the spacing area on the left-hand side, so it pushes the y axis title further in. As it is in automatic layout (without declaring marginLeft), it also pushes the plot area's west border in. 
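The precedence rule described above, where a directional option such as marginTop overrides the corresponding entry of the margin array, can be sketched as a small Python function. This is an illustration, not Highcharts code: effective_margins is a hypothetical helper, it handles only the four-element [top, right, bottom, left] form of margin, and None is used here to mean "not set", that is, the side stays in automatic layout.

```python
SIDES = ("marginTop", "marginRight", "marginBottom", "marginLeft")


def effective_margins(chart):
    """Resolve per-side margins; directional options win over the margin array."""
    shorthand = chart.get("margin") or [None] * 4
    resolved = dict(zip(SIDES, shorthand))  # clockwise, starting from the top
    for side in SIDES:
        if side in chart:                   # order inside the chart section is irrelevant
            resolved[side] = chart[side]
    return resolved


chart = {"margin": [40, 30, 20, 10], "marginTop": 10, "marginRight": 0}
margins = effective_margins(chart)
print(margins)
# {'marginTop': 10, 'marginRight': 0, 'marginBottom': 20, 'marginLeft': 10}
```

With the article's own settings, where only marginTop and marginRight are given and there is no margin array, the bottom and left sides resolve to None, which matches the text's point that those edges remain in automatic layout.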
Setting marginRight to 0 will override all the default spacing on the chart's right-hand side and change it to fixed layout mode. Finally, setting spacingBottom to 0 makes the legend touch the lower bar of the container, so it also stretches the plot area downwards. This is because the bottom edge is still in automatic layout even though spacingBottom is set to 0. Chart label properties Chart labels such as xAxis.title, yAxis.title, legend, title, subtitle, and credits share common property names, as follows: align: This is for the horizontal alignment of the label. Possible keywords are 'left', 'center', and 'right'. As for the axis title, it is 'low', 'middle', and 'high'. floating: This is to give the label position a floating effect on the plot area. Setting this to true will cause the label position to have no effect on the adjacent plot area's boundary. margin: This is the margin setting between the label and the side of the plot area adjacent to it. Only certain label types have this setting. verticalAlign: This is for the vertical alignment of the label. The keywords are 'top', 'middle', and 'bottom'. x: This is for horizontal positioning in relation to alignment. y: This is for vertical positioning in relation to alignment. As for the labels' x and y positioning, they are not used for absolute positioning within the chart. They are designed for fine adjustment with the label alignment. The following diagram shows the coordinate directions, where the center represents the label location: We can experiment with these properties with a simple example of the align and y position settings, by placing both title and subtitle next to each other. The title is shifted to the left with align set to 'left', whereas the subtitle alignment is set to 'right'. 
In order to make both titles appear on the same line, we change the subtitle's y position to 15, which is the same as the title's default y value:

    title: {
        text: 'Web browsers ...',
        align: 'left'
    },
    subtitle: {
        text: 'From 2008 to present',
        align: 'right',
        y: 15
    },

The following is a screenshot showing both titles aligned on the same line:

In the following subsections, we will experiment with how changes in alignment for each label element affect the layout behavior of the plot area.

Title and subtitle alignments

Title and subtitle have the same layout properties; the only differences are their default values and that only the title has the margin setting. Specifying verticalAlign with any value changes the default automatic layout to fixed layout (it internally switches floating to true). However, manually setting the subtitle's floating property to false does not switch back to automatic layout. The following is an example of title in automatic layout and subtitle in fixed layout:

    title: {
        text: 'Web browsers statistics'
    },
    subtitle: {
        text: 'From 2008 to present',
        verticalAlign: 'top',
        y: 60
    },

The verticalAlign property for the subtitle is set to 'top', which switches it into fixed layout, and the y offset is increased to 60. The y offset pushes the subtitle's position further down. Because the plot area is no longer in an automatic layout relationship with the subtitle, the top border of the plot area goes above the subtitle. However, the plot area is still in automatic layout towards the title, so the title is still above the plot area:

Legend alignment

Legends show different behavior for the verticalAlign and align properties. Apart from setting the alignment to 'center', all other settings in verticalAlign and align remain in automatic positioning. The following is an example of a legend located on the right-hand side of the chart.
The verticalAlign property is switched to the middle of the chart, where the horizontal align is set to 'right':           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical'          }, The layout property is assigned to 'vertical' so that it causes the items inside the legend box to be displayed in a vertical manner. As we can see, the plot area is automatically resized for the legend box: Note that the border decoration around the legend box is disabled in the newer version. To display a round border around the legend box, we can add the borderWidth and borderRadius options using the following:           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical',                borderWidth: 1,                borderRadius: 3          }, Here is the legend box with a round corner border: Axis title alignment Axis titles do not use verticalAlign. Instead, they use the align setting, which is either 'low', 'middle', or 'high'. The title's margin value is the distance between the axis title and the axis line. The following is an example of showing the y-axis title rotated horizontally instead of vertically (which it is by default) and displayed on the top of the axis line instead of next to it. We also use the y property to fine-tune the title location:             yAxis: {                title: {                    text: 'Percentage %',                    rotation: 0,                    y: -15,                    margin: -70,                    align: 'high'                },                min: 0            }, The following is a screenshot of the upper-left corner of the chart showing that the title is aligned horizontally at the top of the y axis. Alternatively, we can use the offset option instead of margin to achieve the same result. Credits alignment Credits is a bit different from other label elements. 
It only supports the align, verticalAlign, x, and y properties in the credits.position property (shorthand for credits: { position: … }), and is also not affected by any spacing setting. Suppose we have a graph without a legend and we have to move the credits to the lower-left area of the chart. The following code snippet shows how to do it:

    legend: {
        enabled: false
    },
    credits: {
        position: {
            align: 'left'
        },
        text: 'Joe Kuan',
        href: 'http://joekuan.wordpress.com'
    },

However, the credits text is off the edge of the chart, as shown in the following screenshot:

Even if we move the credits label to the right with x positioning, the label is still a bit too close to the x axis interval label. We can introduce extra spacingBottom to put a gap between both labels, as follows:

    chart: {
        spacingBottom: 30,
        ....
    },
    credits: {
        position: {
            align: 'left',
            x: 20,
            y: -7
        }
    },
    ....

The following is a screenshot of the credits with the final adjustments:

Experimenting with an automatic layout

In this section, we will examine the automatic layout feature in more detail. For the sake of simplifying the example, we will start with only the chart title and without any chart spacing settings:

    chart: {
        renderTo: 'container',
        // border and plotBorder settings
        borderWidth: 2,
        .....
    },
    title: {
        text: 'Web browsers statistics'
    },

From the preceding example, the chart title should appear as expected between the container and the plot area's borders:

The space between the title and the top border of the container has the default setting spacingTop for the spacing area (a default value of 10 pixels high).
The gap between the title and the top border of the plot area is the default setting for title.margin, which is 15-pixels high. By setting spacingTop in the chart section to 0, the chart title moves up next to the container top border. Hence the size of the plot area is automatically expanded upwards, as follows: Then, we set title.margin to 0; the plot area border moves further up, hence the height of the plot area increases further, as follows: As you may notice, there is still a gap of a few pixels between the top border and the chart title. This is actually due to the default value of the title's y position setting, which is 15 pixels, large enough for the default title font size. The following is the chart configuration for setting all the spaces between the container and the plot area to 0: chart: {     renderTo: 'container',     // border and plotBorder settings     .....     spacingTop: 0},title: {     text: null,     margin: 0,     y: 0} If we set title.y to 0, all the gap between the top edge of the plot area and the top container edge closes up. The following is the final screenshot of the upper-left corner of the chart, to show the effect. The chart title is not visible anymore as it has been shifted above the container: Interestingly, if we work backwards to the first example, the default distance between the top of the plot area and the top of the container is calculated as: spacingTop + title.margin + title.y = 10 + 15 + 15 = 40 Therefore, changing any of these three variables will automatically adjust the plot area from the top container bar. Each of these offset variables actually has its own purpose in the automatic layout. Spacing is for the gap between the container and the chart content; thus, if we want to display a chart nicely spaced with other elements on a web page, spacing elements should be used. Equally, if we want to use a specific font size for the label elements, we should consider adjusting the y offset. 
Hence, the labels are still maintained at a distance and do not interfere with other components in the chart. Experimenting with a fixed layout In the preceding section, we have learned how the plot area dynamically adjusted itself. In this section, we will see how we can manually position the chart labels. First, we will start with the example code from the beginning of the Experimenting with automatic layout section and set the chart title's verticalAlign to 'bottom', as follows: chart: {    renderTo: 'container',    // border and plotBorder settings    .....},title: {    text: 'Web browsers statistics',    verticalAlign: 'bottom'}, The chart title is moved to the bottom of the chart, next to the lower border of the container. Notice that this setting has changed the title into floating mode; more importantly, the legend still remains in the default automatic layout of the plot area: Be aware that we haven't specified spacingBottom, which has a default value of 15 pixels in height when applied to the chart. This means that there should be a gap between the title and the container bottom border, but none is shown. This is because the title.y position has a default value of 15 pixels in relation to spacing. According to the diagram in the Chart label properties section, this positive y value pushes the title towards the bottom border; this compensates for the space created by spacingBottom. Let's make a bigger change to the y offset position this time to show that verticalAlign is floating on top of the plot area:  title: {     text: 'Web browsers statistics',     verticalAlign: 'bottom',     y: -90 }, The negative y value moves the title up, as shown here: Now the title is overlapping the plot area. 
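The top-edge arithmetic from the automatic and fixed layout experiments can be sketched as a small helper. It is a simplified model, not Highcharts code: in automatic layout the plot area starts at spacingTop + title.margin + title.y, and a title with verticalAlign set is assumed to be floating, so that only spacingTop contributes. The function name plotTop and its option names are hypothetical, and the model ignores subtitles, chart margins, and font metrics.

```javascript
// Defaults quoted in the text: spacingTop 10, title.margin 15, title.y 15.
const DEFAULTS = { spacingTop: 10, titleMargin: 15, titleY: 15 };

// Hypothetical model of the distance, in pixels, between the container's
// top edge and the plot area's top edge.
function plotTop(options = {}) {
    const o = { ...DEFAULTS, ...options };
    // Assumption: a title with verticalAlign set is floating (fixed layout),
    // so it no longer pushes the plot area down.
    if (o.titleVerticalAlign !== undefined) {
        return o.spacingTop;
    }
    return o.spacingTop + o.titleMargin + o.titleY;
}

console.log(plotTop());                                             // 40
console.log(plotTop({ spacingTop: 0 }));                            // 30
console.log(plotTop({ spacingTop: 0, titleMargin: 0, titleY: 0 })); // 0
console.log(plotTop({ titleVerticalAlign: 'bottom' }));             // 10
```

The first three calls reproduce the automatic layout steps (40, then 30, then 0 pixels), while the last call models the title moved to the bottom in fixed layout.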
To demonstrate that the legend is still in automatic layout with regard to the plot area, here we change the legend's y position and the margin settings, which is the distance from the axis label:                legend: {                   margin: 70,                   y: -10               }, This has pushed up the bottom side of the plot area. However, the chart title still remains in fixed layout and its position within the chart hasn't been changed at all after applying the new legend setting, as shown in the following screenshot: By now, we should have a better understanding of how to position label elements, and their layout policy relating to the plot area. Framing the chart with axes In this section, we are going to look into the configuration of axes in Highcharts in terms of their functional area. We will start off with a plain line graph and gradually apply more options to the chart to demonstrate the effects. Accessing the axis data type There are two ways to specify data for a chart: categories and series data. For displaying intervals with specific names, we should use the categories field that expects an array of strings. Each entry in the categories array is then associated with the series data array. Alternatively, the axis interval values are embedded inside the series data array. Then, Highcharts extracts the series data for both axes, interprets the data type, and formats and labels the values appropriately. 
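The two ways of supplying axis data can be compared in plain JavaScript. In this sketch, pairWithCategories is a hypothetical helper that lines up each category name with the series value at the same index, which mirrors the association described above for a category axis. The sample values are the first three Nasdaq points used in this section.

```javascript
// Hypothetical helper: associate categories[i] with data[i].
function pairWithCategories(categories, data) {
    return data.map((y, i) => ({ name: categories[i], y }));
}

const categories = ['9:30 am', '10:00 am', '10:30 am'];
const data = [2606.01, 2622.08, 2636.03];

// Style 1: interval names live in xAxis.categories, values in series.data.
console.log(pairWithCategories(categories, data));

// Style 2: the x values are embedded in the series data as [x, y] pairs.
const embedded = [
    [Date.UTC(2012, 4, 11, 9, 30), 2606.01],   // month 4 is May (zero-based)
    [Date.UTC(2012, 4, 11, 10), 2622.08],
    [Date.UTC(2012, 4, 11, 10, 30), 2636.03]
];
console.log(embedded.map(([x, y]) => [new Date(x).toISOString(), y]));
```

With the first style the axis labels are fixed strings; with the second, the x values are real timestamps that the axis can interpret and format.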
The following is a straightforward example showing the use of categories:     chart: {        renderTo: 'container',        height: 250,        spacingRight: 20    },    title: {        text: 'Market Data: Nasdaq 100'    },    subtitle: {        text: 'May 11, 2012'    },    xAxis: {        categories: [ '9:30 am', '10:00 am', '10:30 am',                       '11:00 am', '11:30 am', '12:00 pm',                       '12:30 pm', '1:00 pm', '1:30 pm',                       '2:00 pm', '2:30 pm', '3:00 pm',                       '3:30 pm', '4:00 pm' ],         labels: {             step: 3         }     },     yAxis: {         title: {             text: null         }     },     legend: {         enabled: false     },     credits: {         enabled: false     },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ 2606.01, 2622.08, 2636.03, 2637.78, 2639.15,                 2637.09, 2633.38, 2632.23, 2632.33, 2632.59,                 2630.34, 2626.89, 2624.59, 2615.98 ]     }] The preceding code snippet produces a graph that looks like the following screenshot: The first name in the categories field corresponds to the first value, 9:30 am, 2606.01, in the series data array, and so on. Alternatively, we can specify the time values inside the series data and use the type property of the x axis to format the time. The type property supports 'linear' (default), 'logarithmic', or 'datetime'. The 'datetime' setting automatically interprets the time in the series data into human-readable form. Moreover, we can use the dateTimeLabelFormats property to predefine the custom format for the time unit. The option can also accept multiple time unit formats. This is for when we don't know in advance how long the time span is in the series data, so each unit in the resulting graph can be per hour, per day, and so on. The following example shows how the graph is specified with predefined hourly and minute formats. 
The syntax of the format string is based on the PHP strftime function:     xAxis: {         type: 'datetime',          // Format 24 hour time to AM/PM          dateTimeLabelFormats: {                hour: '%I:%M %P',              minute: '%I %M'          }               },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                  [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                   [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                  .....                ]     }] Note that the x axis is in the 12-hour time format, as shown in the following screenshot: Instead, we can define the format handler for the xAxis.labels.formatter property to achieve a similar effect. Highcharts provides a utility routine, Highcharts.dateFormat, that converts the timestamp in milliseconds to a readable format. In the following code snippet, we define the formatter function using dateFormat and this.value. The keyword this is the axis's interval object, whereas this.value is the UTC time value for the instance of the interval:     xAxis: {         type: 'datetime',         labels: {             formatter: function() {                 return Highcharts.dateFormat('%I:%M %P', this.value);             }         }     }, Since the time values of our data points are in fixed intervals, they can also be arranged in a cut-down version. All we need is to define the starting point of time, pointStart, and the regular interval between them, pointInterval, in milliseconds: series: [{     name: 'Nasdaq',     color: '#4572A7',     pointStart: Date.UTC(2012, 4, 11, 9, 30),     pointInterval: 30 * 60 * 1000,     data: [ 2606.01, 2622.08, 2636.03, 2637.78,             2639.15, 2637.09, 2633.38, 2632.23,             2632.33, 2632.59, 2630.34, 2626.89,             2624.59, 2615.98 ] }] Adjusting intervals and background We have learned how to use axis categories and series data arrays in the last section. 
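As a recap of the pointStart and pointInterval shorthand, the following sketch expands a plain value array into explicit [timestamp, value] pairs, which is the equivalent long form of the series data. expandSeries is a hypothetical helper written for this illustration, not a Highcharts function.

```javascript
// Hypothetical helper: expand the pointStart/pointInterval shorthand
// into explicit [x, y] pairs (x in milliseconds since the epoch).
function expandSeries({ pointStart, pointInterval, data }) {
    return data.map((y, i) => [pointStart + i * pointInterval, y]);
}

const points = expandSeries({
    pointStart: Date.UTC(2012, 4, 11, 9, 30),   // May 11, 2012, 09:30 UTC
    pointInterval: 30 * 60 * 1000,              // 30 minutes in milliseconds
    data: [2606.01, 2622.08, 2636.03]
});

console.log(points[1][0] === Date.UTC(2012, 4, 11, 10));      // true
console.log(points[2][0] === Date.UTC(2012, 4, 11, 10, 30));  // true
```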
In this section, we will see how to format interval lines and the background style to produce a graph with more clarity. We will continue from the previous example. First, let's create some interval lines along the y axis. In the chart, the interval is automatically set to 20. However, it would be clearer to double the number of interval lines. To do that, simply assign the tickInterval value to 10. Then, we use minorTickInterval to put another line in between the intervals to indicate a semi-interval. In order to distinguish between interval and semi-interval lines, we set the semi-interval lines, minorGridLineDashStyle, to a dashed and dotted style. There are nearly a dozen line style settings available in Highcharts, from 'Solid' to 'LongDashDotDot'. Readers can refer to the online manual for possible values. The following is the first step to create the new settings:             yAxis: {                 title: {                     text: null                 },                 tickInterval: 10,                 minorTickInterval: 5,                 minorGridLineColor: '#ADADAD',                 minorGridLineDashStyle: 'dashdot'            } The interval lines should look like the following screenshot: To make the graph even more presentable, we add a striping effect with shading using alternateGridColor. Then, we change the interval line color, gridLineColor, to a similar range with the stripes. 
The following code snippet is added into the yAxis configuration:                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                   } The following is the graph with the new shading background: The next step is to apply a more professional look to the y axis line. We are going to draw a line on the y axis with the lineWidth property, and add some measurement marks along the interval lines with the following code snippet:                  lineWidth: 2,                  lineColor: '#92A8CD',                  tickWidth: 3,                  tickLength: 6,                  tickColor: '#92A8CD',                  minorTickLength: 3,                  minorTickWidth: 1,                  minorTickColor: '#D8D8D8' The tickWidth and tickLength properties add the effect of little marks at the start of each interval line. We apply the same color on both the interval mark and the axis line. Then we add the ticks minorTickLength and minorTickWidth into the semi-interval lines in a smaller size. 
This gives a nice measurement mark effect along the axis, as shown in the following screenshot:

Now, we apply a similar polish to the xAxis configuration, as follows:

    xAxis: {
        type: 'datetime',
        labels: {
            formatter: function() {
                return Highcharts.dateFormat('%I:%M %P', this.value);
            }
        },
        gridLineDashStyle: 'dot',
        gridLineWidth: 1,
        tickInterval: 60 * 60 * 1000,
        lineWidth: 2,
        lineColor: '#92A8CD',
        tickWidth: 3,
        tickLength: 6,
        tickColor: '#92A8CD'
    },

We set the x axis interval lines to an hourly interval and switch the line style to a dotted line. Then, we apply the same color, thickness, and interval ticks as on the y axis. The following is the resulting screenshot:

However, there are some defects along the x axis line. To begin with, the meeting point between the x axis and y axis lines does not align properly. Secondly, the interval labels at the x axis are touching the interval ticks. Finally, part of the first data point is covered by the y-axis line. The following is an enlarged screenshot showing the issues:

There are two ways to resolve the axis line alignment problem, as follows:

Shift the plot area 1 pixel away from the x axis. This can be achieved by setting the offset property of xAxis to 1.
Increase the x-axis line width to 3 pixels, which is the same width as the y-axis interval ticks (tickWidth).

As for the x-axis label, we can simply solve the problem by introducing the y offset value into the labels setting. Finally, to avoid the first data point touching the y-axis line, we can impose minPadding on the x axis. This adds padding space at the minimum value of the axis, the first point. The minPadding value is based on the ratio of the graph width.
In this case, setting the property to 0.02 is equivalent to shifting along the x axis 5 pixels to the right (250 px * 0.02). The following are the additional settings to improve the chart:     xAxis: {         ....         labels: {                formatter: ...,                y: 17         },         .....         minPadding: 0.02,         offset: 1     } The following screenshot shows that the issues have been addressed: As we can see, Highcharts has a comprehensive set of configurable variables with great flexibility. Using plot lines and plot bands In this section, we are going to see how we can use Highcharts to place lines or bands along the axis. We will continue with the example from the previous section. Let's draw a couple of lines to indicate the day's highest and lowest index points on the y axis. The plotLines field accepts an array of object configurations for each plot line. There are no width and color default values for plotLines, so we need to specify them explicitly in order to see the line. The following is the code snippet for the plot lines:       yAxis: {               ... ,               plotLines: [{                    value: 2606.01,                    width: 2,                    color: '#821740',                    label: {                        text: 'Lowest: 2606.01',                        style: {                            color: '#898989'                        }                    }               }, {                    value: 2639.15,                    width: 2,                    color: '#4A9338',                    label: {                        text: 'Highest: 2639.15',                        style: {                            color: '#898989'                        }                    }               }]         } The following screenshot shows what it should look like: We can improve the look of the chart slightly. First, the text label for the top plot line should not be next to the highest point. 
Second, the label for the bottom line should not be covered by the series and interval lines, as follows:

To resolve these issues, we can assign the plot line's zIndex to 1, which brings the text label above the interval lines. We also set the x position of the label to shift the text next to the point. The following are the new changes:

    plotLines: [{
        ... ,
        label: {
            ... ,
            x: 25
        },
        zIndex: 1
    }, {
        ... ,
        label: {
            ... ,
            x: 130
        },
        zIndex: 1
    }]

The following graph shows that the label has been moved away from the plot line and over the interval line:

Now, we are going to change the preceding example with a plot band area that shows the index change between the market's opening and closing values. The plot band configuration is very similar to plot lines, except that it uses the to and from properties, and the color property accepts gradient settings or a color code. We create a plot band with a triangle text symbol and values to signify a positive close.
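The values used for the plot lines and for the plot band label do not have to be typed by hand; they can be derived from the series data. The sketch below uses the Nasdaq values from the earlier example. bandLabel is a hypothetical helper, and the two-decimal formatting and the down-triangle for a negative close are choices made for this illustration.

```javascript
const data = [2606.01, 2622.08, 2636.03, 2637.78, 2639.15,
              2637.09, 2633.38, 2632.23, 2632.33, 2632.59,
              2630.34, 2626.89, 2624.59, 2615.98];

// Lowest and highest points of the day, as used for the two plot lines.
const lowest = Math.min(...data);    // 2606.01
const highest = Math.max(...data);   // 2639.15

// Hypothetical helper: build the plot band label from open and close values.
function bandLabel(open, close) {
    const change = close - open;
    const percent = (change / open) * 100;
    const symbol = change >= 0 ? '\u25B2' : '\u25BC';   // up or down triangle
    return symbol + ' ' + Math.abs(change).toFixed(2) +
           ' (' + Math.abs(percent).toFixed(2) + '%)';
}

const open = data[0];
const close = data[data.length - 1];
console.log(lowest, highest);
console.log(bandLabel(open, close));   // ▲ 9.97 (0.38%)
```

The open and close values also give the from and to properties of the plot band.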
Instead of using the x and y properties to fine-tune label position, we use the align option to adjust the text to the center of the plot area (replace the plotLines setting from the above example):               plotBands: [{                    from: 2606.01,                    to: 2615.98,                    label: {                        text: '▲ 9.97 (0.38%)',                        align: 'center',                        style: {                            color: '#007A3D'                        }                    },                    zIndex: 1,                    color: {                        linearGradient: {                            x1: 0, y1: 1,                            x2: 1, y2: 1                        },                        stops: [ [0, '#EBFAEB' ],                                 [0.5, '#C2F0C2'] ,                                 [0.8, '#ADEBAD'] ,                                 [1, '#99E699']                        ]                    }               }] The triangle is an alt-code character; hold down the left Alt key and enter 30 in the number keypad. See http://www.alt-codes.net for more details. This produces a chart with a green plot band highlighting a positive close in the market, as shown in the following screenshot: Extending to multiple axes Previously, we ran through most of the axis configurations. Here, we explore how we can use multiple axes, which are just an array of objects containing axis configurations. Continuing from the previous stock market example, suppose we now want to include another market index, Dow Jones, along with Nasdaq. However, both indices are different in nature, so their value ranges are vastly different. First, let's examine the outcome by displaying both indices with the common y axis. We change the title, remove the fixed interval setting on the y axis, and include data for another series:             chart: ... 
,             title: {                 text: 'Market Data: Nasdaq & Dow Jones'             },             subtitle: ... ,             xAxis: ... ,             credits: ... ,             yAxis: {                 title: {                     text: null                 },                 minorGridLineColor: '#D8D8D8',                 minorGridLineDashStyle: 'dashdot',                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                 },                 lineWidth: 2,                 lineColor: '#92A8CD',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#92A8CD',                 minorTickLength: 3,                 minorTickWidth: 1,                 minorTickColor: '#D8D8D8'             },             series: [{               name: 'Nasdaq',               color: '#4572A7',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                          [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                          ...                        ]             }, {               name: 'Dow Jones',               color: '#AA4643',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 12598.32 ],                          [ Date.UTC(2012, 4, 11, 10), 12538.61 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 12549.89 ],                          ...                        ]             }] The following is the chart showing both market indices: As expected, the index changes that occur during the day have been normalized by the vast differences in value. 
Both lines look roughly straight, which falsely implies that the indices have hardly changed. Let us now explore putting both indices onto separate y axes. We should remove any background decoration on the y axis, because we now have a different range of data shared on the same background. The following is the new setup for yAxis:            yAxis: [{                  title: {                     text: 'Nasdaq'                 },               }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true             }], Now yAxis is an array of axis configurations. The first entry in the array is for Nasdaq and the second is for Dow Jones. This time, we display the axis title to distinguish between them. The opposite property is to put the Dow Jones y axis onto the other side of the graph for clarity. Otherwise, both y axes appear on the left-hand side. The next step is to align indices from the y-axis array to the series data array, as follows:             series: [{                 name: 'Nasdaq',                 color: '#4572A7',                 yAxis: 0,                 data: [ ... ]             }, {                 name: 'Dow Jones',                 color: '#AA4643',                 yAxis: 1,                 data: [ ... ]             }]          We can clearly see the movement of the indices in the new graph, as follows: Moreover, we can improve the final view by color-matching the series to the axis lines. The Highcharts.getOptions().colors property contains a list of default colors for the series, so we use the first two entries for our indices. Another improvement is to set maxPadding for the x axis, because the new y-axis line covers parts of the data points at the high end of the x axis:             xAxis: {                 ... 
, minPadding: 0.02, maxPadding: 0.02 }, yAxis: [{ title: { text: 'Nasdaq' }, lineWidth: 2, lineColor: '#4572A7', tickWidth: 3, tickLength: 6, tickColor: '#4572A7' }, { title: { text: 'Dow Jones' }, opposite: true, lineWidth: 2, lineColor: '#AA4643', tickWidth: 3, tickLength: 6, tickColor: '#AA4643' }],

The following screenshot shows the improved look of the chart:

We can extend the preceding example and have more than a couple of axes, simply by adding entries into the yAxis and series arrays, and mapping both together. The following screenshot shows a 4-axis line graph:

Summary

In this article, the major configuration components were discussed and experimented with, and examples were shown. By now, we should be comfortable with what we have covered and ready to plot some basic graphs with more elaborate styles.
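As a closing sketch for the multiple-axes example, the yAxis and series arrays can be generated from one list, so that each series is always mapped to the axis at the same index. buildAxes and the shape of its input are hypothetical. The option names (title, opposite, lineWidth, lineColor, tickColor, yAxis) are the ones used in the configuration above, and placing every second axis on the opposite side is a choice made for this illustration.

```javascript
// Hypothetical helper: one y axis per entry, each series mapped by position.
function buildAxes(indices) {
    const yAxis = indices.map((idx, i) => ({
        title: { text: idx.name },
        opposite: i % 2 === 1,      // alternate sides: left, right, left, ...
        lineWidth: 2,
        lineColor: idx.color,
        tickColor: idx.color
    }));
    const series = indices.map((idx, i) => ({
        name: idx.name,
        color: idx.color,
        yAxis: i,                   // index into the yAxis array
        data: idx.data
    }));
    return { yAxis, series };
}

const config = buildAxes([
    { name: 'Nasdaq', color: '#4572A7', data: [2606.01, 2622.08] },
    { name: 'Dow Jones', color: '#AA4643', data: [12598.32, 12538.61] }
]);
console.log(config.yAxis.map(a => a.title.text));
console.log(config.series.map(s => s.yAxis));
```

Adding a third or fourth index to the input list produces the additional axis and the matching series entry in one step.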
Amarabha Banerjee
04 Dec 2017
7 min read

Understanding Streaming Applications in Spark SQL

This article is a book excerpt from Learning Spark SQL written by Aurobindo Sarkar. This book gives an insight into the engineering practices used to design and build real-world Spark-based applications. The hands-on examples in the book will give you the confidence to work on future Spark SQL projects.

In this article, we talk about Spark SQL and its use in streaming applications.

What are streaming applications?

A streaming application is a program that processes an unbounded sequence of events continuously as they arrive, rather than working on a fixed dataset that is fully available up front. Streaming applications are getting increasingly complex, because such computations don't run in isolation. They need to interact with batch data, support interactive analysis, support sophisticated machine learning applications, and so on. Typically, such applications store incoming event stream(s) on long-term storage, continuously monitor events, and run machine learning models on the stored data, while simultaneously enabling continuous learning on the incoming stream. They also have the capability to interactively query the stored data while providing exactly-once write guarantees, handling late-arriving data, performing aggregations, and so on. These types of applications are a lot more than mere streaming applications and have, therefore, been termed continuous applications.

SparkSQL and Structured Streaming

Before Spark 2.0, streaming applications were built on the concept of DStreams. There were several pain points associated with using DStreams. In DStreams, the timestamp was when the event actually came into the Spark system; the time embedded in the event was not taken into consideration.
In addition, although the same engine can process both batch and streaming computations, the APIs involved, while similar between RDDs (batch) and DStreams (streaming), required the developer to make code changes. The DStream streaming model placed the burden on the developer to address various failure conditions, and it was hard to reason about data consistency issues. In Spark 2.0, Structured Streaming was introduced to deal with all of these pain points.

Structured Streaming is a fast, fault-tolerant, exactly-once stateful stream processing approach. It enables streaming analytics without having to reason about the underlying mechanics of streaming. In the new model, the input can be thought of as data from an append-only table (that grows continuously). A trigger specifies the time interval for checking the input for the arrival of new data. As shown in the following figure, the query represents the queries or the operations, such as map, filter, and reduce on the input, and result represents the final table that is updated in each trigger interval, as per the specified operation. The output defines the part of the result to be written to the data sink in each time interval.

The output modes can be complete, delta, or append: the complete output mode writes the full result table every time, the delta output mode writes only the rows that changed since the previous batch, and the append output mode writes only the new rows. (In the released Structured Streaming API, the delta mode is named update.)

In Spark 2.0, in addition to the static bounded DataFrames, we have the concept of a continuous unbounded DataFrame. Both static and continuous DataFrames use the same API, thereby unifying streaming, interactive, and batch queries. For example, you can aggregate data in a stream and then serve it using JDBC. The high-level streaming API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame/Dataset APIs.
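To make the three output modes described above more concrete, here is a toy model in plain Python. It is a sketch that follows the wording of this article, not Spark's implementation: the function name simulate_output_modes and the city labels are made up for illustration, and real Spark restricts which queries may use the append mode.

```python
from collections import Counter


def simulate_output_modes(batches):
    """Return, for every trigger, what each output mode would write to the sink.

    Toy model only: the result table is a running count per key.
    """
    result_table = Counter()
    writes = []
    for batch in batches:
        previous = dict(result_table)
        result_table.update(batch)  # fold the newly arrived keys into the counts
        current = dict(result_table)
        writes.append({
            # complete: the whole result table, every time
            "complete": current,
            # delta: only rows whose value changed in this trigger
            "delta": {k: v for k, v in current.items() if previous.get(k) != v},
            # append: only rows that did not exist before this trigger
            "append": {k: v for k, v in current.items() if k not in previous},
        })
    return writes


# Two triggers' worth of made-up input keys.
writes = simulate_output_modes([["beijing", "shanghai"], ["beijing", "tianjin"]])
for trigger, written in enumerate(writes):
    print(trigger, written)
```

In the second trigger, the complete mode rewrites all three rows, the delta mode writes the changed and new rows, and the append mode writes only the row that is new.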
The primary benefit of this integration is that you use the same high-level Spark DataFrame and Dataset APIs, and the Spark engine figures out the incremental and continuous execution required for operations. Additionally, there are query management APIs that you can use to manage multiple, concurrently running streaming queries. For instance, you can list running queries, stop and restart queries, retrieve exceptions in case of failures, and so on.

In the example code below, we use two bid files from the iPinYou Dataset as the source for our streaming data. First, we define our input records schema and create a streaming input DataFrame:

    scala> import org.apache.spark.sql.types._
    scala> import org.apache.spark.sql.functions._
    scala> import scala.concurrent.duration._
    scala> import org.apache.spark.sql.streaming.ProcessingTime
    scala> import org.apache.spark.sql.streaming.OutputMode.Complete
    scala> val bidSchema = new StructType().add("bidid", StringType).add("timestamp", StringType).add("ipinyouid", StringType).add("useragent", StringType).add("IP", StringType).add("region", IntegerType).add("city", IntegerType).add("adexchange", StringType).add("domain", StringType).add("url", StringType).add("urlid", StringType).add("slotid", StringType).add("slotwidth", StringType).add("slotheight", StringType).add("slotvisibility", StringType).add("slotformat", StringType).add("slotprice", StringType).add("creative", StringType).add("bidprice", StringType)
    scala> val streamingInputDF = spark.readStream.format("csv").schema(bidSchema).option("header", false).option("inferSchema", true).option("sep", "\t").option("maxFilesPerTrigger", 1).load("file:///Users/aurobindosarkar/Downloads/make-ipinyou-datamaster/original-data/ipinyou.contest.dataset/bidfiles")

Next, we define our query with a time interval of 20 seconds and the output mode as Complete:

    scala> val streamingCountsDF = streamingInputDF.groupBy($"city").count()
    scala> val query = streamingCountsDF.writeStream.format("console").trigger(ProcessingTime(20.seconds)).queryName("counts").outputMode(Complete).start()

In the output, you can observe that the count of bids from each city gets updated in each time interval as new data arrives. To see the updated results, drop new bid files from the original Dataset into the bidfiles directory (or start with multiple bid files, as they are picked up for processing one at a time, based on the value of maxFilesPerTrigger).

Structured Streaming Internals

The planner polls the sources for new data and incrementally executes the computation on it before writing the result to the sink. In addition, any running aggregates required by your application are maintained as in-memory states backed by a Write-Ahead Log (WAL). The in-memory state data is generated and used across incremental executions. The fault tolerance requirements for such applications include the ability to recover and replay all data and metadata in the system. The planner writes offsets to a fault-tolerant WAL on persistent storage, such as HDFS, before execution, as illustrated in the figure:

If the planner fails on the current incremental execution, the restarted planner reads from the WAL and re-executes the exact range of offsets required. Typically, sources such as Kafka are also fault-tolerant and can regenerate the original transaction data, given the appropriate offsets recovered by the planner. The state data is usually maintained in a versioned key-value map in Spark workers and is backed by a WAL on HDFS. The planner ensures that the correct version of the state is used to re-execute the transactions after a failure. Additionally, the sinks are idempotent by design, and can handle the re-executions without double commits of the output. Hence, the overall combination of offset tracking in the WAL, state management, and fault-tolerant sources and sinks provides the end-to-end exactly-once guarantees.
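The recovery sequence described above (log the offsets first, re-execute the same range after a failure, and rely on an idempotent sink) can be illustrated with a small toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's implementation: run_batch, the in-memory wal list, the committed set, and the dictionary used as a sink are all hypothetical stand-ins.

```python
def run_batch(source, wal, sink, committed, batch_size, crash_after_sink_write=False):
    """Plan and execute one incremental batch over a replayable source."""
    pending = [entry for entry in wal if entry["batch_id"] not in committed]
    if pending:
        # Recovery path: an offset range was logged but never committed,
        # so re-execute exactly that range.
        entry = pending[0]
    else:
        start = wal[-1]["end"] if wal else 0
        end = min(start + batch_size, len(source))
        entry = {"batch_id": len(wal), "start": start, "end": end}
        wal.append(entry)  # offsets are logged BEFORE the batch is executed

    # Idempotent sink: output is keyed by batch id, so a replay overwrites
    # the earlier write instead of adding a duplicate.
    sink[entry["batch_id"]] = sum(source[entry["start"]:entry["end"]])

    if crash_after_sink_write:
        raise RuntimeError("simulated planner crash before the batch is marked done")
    committed.add(entry["batch_id"])
    return entry


source = [1, 2, 3, 4, 5, 6]          # replayable by offset, like a Kafka topic
wal, sink, committed = [], {}, set()

run_batch(source, wal, sink, committed, batch_size=3)
try:
    run_batch(source, wal, sink, committed, batch_size=3, crash_after_sink_write=True)
except RuntimeError:
    pass                              # the "restarted planner" continues below
run_batch(source, wal, sink, committed, batch_size=3)

print(sum(sink.values()), sum(source))
```

Although the second batch is executed twice, the sink ends up with each source element counted once, which is the effect the combination of offset logging and idempotent writes is meant to achieve.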
Summary

SparkSQL provides one of the best platforms for implementing streaming applications. Its internal architecture and fault-tolerant behavior mean that developers who want to build data-intensive applications with data streaming capabilities have good reason to use the power of SparkSQL.

If you liked our post, please be sure to check out Learning Spark SQL, which contains more useful techniques on data extraction and data analysis using Spark SQL.
Packt
19 Oct 2015
8 min read

SQL Server with PowerShell

In this article, Donabel Santos, author of the book SQL Server 2014 with PowerShell v5 Cookbook, explains scripts and snippets of code that accomplish basic SQL Server tasks using PowerShell. She discusses simple tasks such as Listing SQL Server Instances and Discovering SQL Server Services to make you comfortable working with SQL Server programmatically.

However, even as you explore how to create some common database objects using PowerShell, keep in mind that PowerShell will not always be the best tool for the task. There will be tasks that are best completed using T-SQL. It is still good to know what is possible in PowerShell and how to do it, so you know that you have alternatives depending on your requirements or situation.

For the recipes, we are going to use PowerShell ISE quite a lot. If you prefer running the script from the PowerShell console rather than running the commands from the ISE, you can save the scripts in a .ps1 file and run it from the PowerShell console.

Listing SQL Server Instances

In this recipe, we will list all SQL Server Instances in the local network.

Getting ready

Log in to the server that has your SQL Server development instance as an administrator.

How to do it...

Let's look at the steps to list your SQL Server instances:

Open PowerShell ISE as administrator.

Let's use the Start-Service cmdlet to start the SQL Browser service:

    Import-Module SQLPS -DisableNameChecking
    #out of the box, the SQLBrowser is disabled. To enable:
    Set-Service SQLBrowser -StartupType Automatic
    #sql browser must be installed and running for us
    #to discover SQL Server instances
    Start-Service "SQLBrowser"

Next, you need to create a ManagedComputer object to get access to instances.
Type the following script and run:

    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName
    #list server instances
    $managedComputer.ServerInstances

Your result should look similar to the one shown in the following screenshot:

Notice that $managedComputer.ServerInstances gives you not only instance names, but also additional properties such as ServerProtocols, Urn, State, and so on.

Confirm that these are the same instances you see from SQL Server Management Studio. Open SQL Server Management Studio. Go to Connect | Database Engine. In the Server Name dropdown, click on Browse for More. Select the Network Servers tab and check the instances listed. Your screen should look similar to this:

How it works...

All services in a Windows operating system are exposed and accessible using Windows Management Instrumentation (WMI). WMI is Microsoft's framework for listing, setting, and configuring any Microsoft-related resource. This framework follows Web-based Enterprise Management (WBEM). The Distributed Management Task Force, Inc. (http://www.dmtf.org/standards/wbem) defines WBEM as follows:

A set of management and Internet standard technologies developed to unify the management of distributed computing environments. WBEM provides the ability for the industry to deliver a well-integrated set of standard-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms.

In order to access SQL Server WMI-related objects, you can create a WMI ManagedComputer instance:

    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName

The ManagedComputer object has access to a ServerInstances property, which in turn lists all available instances in the local network. These instances, however, are only identifiable if the SQL Server Browser service is running.
The SQL Server Browser is a Windows Service that can provide information on installed instances in a box. You need to start this service if you want to list the SQL Server-related services.

There's more...

The Services instance of the ManagedComputer object can also provide similar information, but you will have to filter for the server type SqlServer:

    #list server instances
    $managedComputer.Services |
    Where-Object Type -eq "SqlServer" |
    Select-Object Name, State, Type, StartMode, ProcessId

Your result should look like this:

Instead of creating a WMI instance by using the New-Object method, you can also use the Get-WmiObject cmdlet when creating your variable. Get-WmiObject, however, will not expose exactly the same properties exposed by the Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer object.

To list instances using Get-WmiObject, you will need to discover what namespace is available in your environment:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -Namespace root\Microsoft\SQLServer -Class "__NAMESPACE" |
    Where-Object Name -like "ComputerManagement*"
    #see matching namespace objects
    $namespace
    #see namespace names
    $namespace | Select-Object -ExpandProperty "__NAMESPACE"
    $namespace | Select-Object -ExpandProperty "Name"

If you are using PowerShell v2, you will have to change the Where-Object cmdlet usage to use the curly braces {} and the $_ variable:

    Where-Object {$_.Name -like "ComputerManagement*" }

For SQL Server 2014, the namespace value is:

    ROOT\Microsoft\SQLServer\ComputerManagement12

This value can be derived from $namespace.__NAMESPACE and $namespace.Name. Once you have the namespace, you can use this with Get-WmiObject to retrieve the instances. We can use the SqlServiceType property to filter.
According to MSDN (http://msdn.microsoft.com/en-us/library/ms179591.aspx), these are the values of SqlServiceType:

    SqlServiceType   Description
    1                SQL Server Service
    2                SQL Server Agent Service
    3                Full-Text Search Engine Service
    4                Integration Services Service
    5                Analysis Services Service
    6                Reporting Services Service
    7                SQL Browser Service

Thus, to retrieve the SQL Server instances, we need to provide the full namespace ROOT\Microsoft\SQLServer\ComputerManagement12. We also need to filter for SQL Server Service type, or SQLServiceType = 1. The code is as follows:

    Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
    Where-Object SQLServiceType -eq 1 |
    Select-Object ServiceName, DisplayName, SQLServiceType |
    Format-Table -AutoSize

Your result should look similar to the following screenshot:

Yet another way to list all the SQL Server instances in the local network is by using the System.Data.Sql.SqlDataSourceEnumerator class, instead of ManagedComputer. This class exposes a static Instance property, whose GetDataSources method lists all SQL Server instances:

    [System.Data.Sql.SqlDataSourceEnumerator]::Instance.GetDataSources() | Format-Table -AutoSize

When you execute, your result should look similar to the following:

If you have multiple SQL Server versions, you can use the following code to display your instances:

    #list services using WMI
    foreach ($path in $namespace)
    {
        Write-Verbose "SQL Services in:$($path.__NAMESPACE)\$($path.Name)"
        Get-WmiObject -ComputerName $hostName `
            -Namespace "$($path.__NAMESPACE)\$($path.Name)" `
            -Class SqlService |
        Where-Object SQLServiceType -eq 1 |
        Select-Object ServiceName, DisplayName, SQLServiceType |
        Format-Table -AutoSize
    }

Discovering SQL Server Services

In this recipe, we will enumerate all SQL Server Services and list their statuses.

Getting ready

Check which SQL Server services are installed in your instance. Go to Start | Run and type services.msc.
You should see a screen similar to this:

How to do it...

Let's assume you are running this script on the server box:

Open PowerShell ISE as administrator.

Add the following code and execute:

    Import-Module SQLPS -DisableNameChecking
    #you can replace localhost with your instance name
    $instanceName = "localhost"
    $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName
    #list services
    $managedComputer.Services |
    Select-Object Name, Type, ServiceState, DisplayName |
    Format-Table -AutoSize

Your result will look similar to the one shown in the following screenshot:

Items listed in your screen will vary depending on the features installed and running in your instance.

Confirm that these are the services that exist in your server. Check your services window.

How it works...

Services that are installed on a system can be queried using WMI. Specific services for SQL Server are exposed through SMO's WMI ManagedComputer object. Some of the exposed properties are as follows:

ClientProtocols
ConnectionSettings
ServerAliases
ServerInstances
Services

There's more...

An alternative way to get SQL Server-related services is by using Get-WMIObject. We will need to pass in the host name as well as the SQL Server WMI Provider for the ComputerManagement namespace. For SQL Server 2014, this value is ROOT\Microsoft\SQLServer\ComputerManagement12. The script to retrieve the services is provided here. Note that we are dynamically composing the WMI namespace.
The code is as follows:

    $hostName = "localhost"
    $namespace = Get-WMIObject -ComputerName $hostName -NameSpace root\Microsoft\SQLServer -Class "__NAMESPACE" |
    Where-Object Name -like "ComputerManagement*"
    Get-WmiObject -ComputerName $hostname -Namespace "$($namespace.__NAMESPACE)\$($namespace.Name)" -Class SqlService |
    Select-Object ServiceName

If you have multiple SQL Server versions installed and want to see just the most recent version's services, you can limit to the latest namespace by adding Select-Object -Last 1:

    $namespace = Get-WMIObject -ComputerName $hostName -NameSpace root\Microsoft\SQLServer -Class "__NAMESPACE" |
    Where-Object Name -like "ComputerManagement*" |
    Select-Object -Last 1

Yet another alternative but less accurate way of listing possible SQL Server-related services is the following snippet of code:

    #alternative - but less accurate
    Get-Service *SQL*

This uses the Get-Service cmdlet and filters based on the service name. This is less accurate because it grabs all services that have SQL in the name, even those that are not related to SQL Server. For example, if you have MySQL installed, it will get picked up. Conversely, this will not pick up SQL Server-related services that do not have SQL in the name, such as ReportServer.

Summary

You will find that many of the tasks can be accomplished using PowerShell and SQL Management Objects (SMO). SMO is a library that exposes SQL Server classes that allow programmatic manipulation and automation of many database tasks. For some tasks, we will also explore alternative ways of accomplishing the same tasks using different native PowerShell cmdlets.

Now that we have a gist of SQL Server 2014 with PowerShell, let's build a full-fledged e-commerce project with SQL Server 2014 with Powershell v5 Cookbook.
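The caveat about name-based filtering with Get-Service *SQL* in the recipe above can be reproduced with a short, language-neutral experiment. The Python sketch below applies the same wildcard to a made-up list of service names; the names and the sql_server_related set are hypothetical and exist only to show how a name pattern can both over-match and under-match.

```python
from fnmatch import fnmatchcase

# Hypothetical service names, made up for illustration only.
services = ["MSSQLSERVER", "SQLBrowser", "SQLSERVERAGENT", "MySQL57", "ReportServer"]
# Hypothetical ground truth: which of them actually belong to SQL Server.
sql_server_related = {"MSSQLSERVER", "SQLBrowser", "SQLSERVERAGENT", "ReportServer"}

# PowerShell wildcards ignore case, so compare against upper-cased names.
matched = [name for name in services if fnmatchcase(name.upper(), "*SQL*")]

false_positives = [name for name in matched if name not in sql_server_related]
missed = [name for name in services
          if name in sql_server_related and name not in matched]

print("matched:", matched)
print("false positives:", false_positives)
print("missed:", missed)
```

With this sample list, the MySQL entry is picked up by the pattern and the ReportServer entry is not, which is why the WMI-based approach is the more accurate of the two.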

Packt
25 May 2010
13 min read

Use of macros in IBM Cognos 8 Report Studio

Cognos Report Studio is widely used for creating and managing business reports in medium to large companies. It is simple enough for any business analyst, power user, or developer to pick up and start developing basic reports. However, when it comes to developing more sophisticated, fully functional business reports for wider audiences, report authors will need guidance.

In this article, by Abhishek Sanghani, author of IBM Cognos 8 Report Studio Cookbook, we will show you that even though macros are often considered a Framework Modeler's tool, they can be used within Report Studio as well. These recipes will show you some very useful macros around security, string manipulation, and prompting.

Introduction

This article will introduce you to an interesting and useful tool of Cognos BI, called 'macros'. They can be used in Framework Manager as well as Report Studio.

The Cognos engine recognizes a macro because it is written within a pair of hashes (#). It executes the macros first and puts the result back into the report specification, like a literal string replacement. We can use this to alter data items, filters, and slicers at run time.

You won't find the macro functions and their details within the Report Studio environment (which is strange, as it fully supports them). However, you can always open Framework Manager and check the different macro functions and their syntax from there. Also, there is documentation available in Cognos' help and online materials.

An earlier example of a macro appears in Working with Dimensional Model (in the "Swapping dimension" recipe). In this article, I will show you more examples and introduce you to more functions which you can later build upon to achieve sophisticated functionality.

We will be writing some SQL straight against the GO Data Warehouse data source. Also, we will use the "GO Data Warehouse (Query)" package for some recipes.
Add data level security using CSVIdentityNameList macro

A report shows the employee names by region and country. We need to implement data security in this report such that a user can see the records only for the country he belongs to. There are already User Groups defined on the Cognos server (in the directory) and users are made members of appropriate groups. For this sample, I have added my user account to a user group called 'Spain'.

Getting ready

Open a new list report with GO Data Warehouse (Query) as the package.

How to do it...

Drag the appropriate columns (Region, Country, and Employee name) on to the report from the Employee by Region query subject.

Go to Query Explorer and drag a new detail filter.

Define the filter as:

    [Country] in (#CSVIdentityNameList(',')#)

Run the report to test it. You will notice that a user can see only the rows of the country or countries whose group he is a member of.

How it works...

Here we are using a macro function called CSVIdentityNameList. This function returns a list of groups and roles that the user belongs to, along with the user's account name. Hence, when I run the report, one of the values returned will be 'Spain' and I will see data for Spain.

The function accepts a string parameter which is used as a separator in the result. Here we are passing a comma (,) as the separator.

If a user belongs to multiple country groups, he will see data for all the countries listed in the result of the macro.

There's more...

This solution has an obvious limitation. None of the user accounts or roles should have the same name as a country, because that would wrongly show data for a country the user does not belong to. For example, for a user called 'Paris', it will show data for the 'Paris' region. So, there need to be certain restrictions on naming.

However, you can build upon the knowledge of this macro function and use it in many practical business scenarios.
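The effect of the filter in the recipe above can be mimicked with a small Python sketch. It is a toy model only: the helper csv_identity_name_list, the identity values, and the sample rows are hypothetical, and the assumption that the macro yields single-quoted names joined by the separator should be checked against the Cognos documentation.

```python
def csv_identity_name_list(identity_names, separator=","):
    """Toy stand-in for the macro: quoted identity names joined by a separator.

    Assumption: each name is wrapped in single quotes, as a SQL literal would be.
    """
    return separator.join(
        "'" + name.replace("'", "''") + "'" for name in identity_names
    )


# Hypothetical identity: group names, a role, and the account name.
identity = ["Everyone", "Spain", "jsmith"]

# The text the filter expression would contain after macro substitution.
filter_text = "[Country] in (" + csv_identity_name_list(identity) + ")"
print(filter_text)

# Made-up report rows: (region, country, employee name).
rows = [
    ("Europe", "Spain", "A. Lopez"),
    ("Europe", "France", "B. Martin"),
]
visible = [row for row in rows if row[1] in identity]
print(visible)
```

Because the list mixes group, role, and account names, adding an account name that happens to equal a country would make that country's rows visible as well, which is the limitation noted above.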
Using prompt macro in native SQL

In this recipe, we will write an SQL statement to be fired directly on the data source. We will use the Prompt macro to dynamically change the filter condition.

We will write a report that shows a list of employees by Region and Country. We will use the Prompt macro to ask the users to enter a country name. Then the SQL statement will search for the employees belonging to that country.

Getting ready

Create a new blank list report against the 'GO Data Warehouse (Query)' package.

How to do it...

Go to the Query Explorer and drag an SQL object on the Query Subject that is linked to the list (Query1 in the usual case).

Select the SQL object and ensure that great_outdoor_warehouse is selected as the data source.

Open the SQL property and add the following statement:

    select distinct
        "Branch_region_dimension"."REGION_EN" "Region",
        "Branch_region_dimension"."COUNTRY_EN" "Country",
        "EMP_EMPLOYEE_DIM"."EMPLOYEE_NAME" "Employee_name"
    from
        "GOSALESDW"."GO_REGION_DIM" "Branch_region_dimension",
        "GOSALESDW"."EMP_EMPLOYEE_DIM" "EMP_EMPLOYEE_DIM",
        "GOSALESDW"."GO_BRANCH_DIM" "GO_BRANCH_DIM"
    where
        ("Branch_region_dimension"."COUNTRY_EN" in (#prompt('Region')#))
        and "Branch_region_dimension"."COUNTRY_CODE" = "GO_BRANCH_DIM"."COUNTRY_CODE"
        and "EMP_EMPLOYEE_DIM"."BRANCH_CODE" = "GO_BRANCH_DIM"."BRANCH_CODE"

Hit the OK button. This will validate the query and close the dialog box. You will see that three data items (Region, Country, and Employee_Name) are added to Query1.

Now go to the report page. Drag these data items on the list and run the report to test it.

How it works...

Here we are using the macro in a native SQL statement. Native SQL allows us to directly fire a query on the data source and use the result on the report. This is useful in certain scenarios where we don't need to define any Framework Model. If you examine the SQL statement, you will notice that it is a very simple one that joins three tables and returns appropriate columns.
We have added a filter condition on country name which is supposed to dynamically change depending on the value entered by the user. The macro function that we have used here is Prompt(). As the name suggests, it is used to generate a prompt and return the parameter value to be used in an SQL statement.

The Prompt() function takes up to six arguments. The first argument is the parameter name and it is mandatory. It allows us to link a prompt page object (value prompt, date prompt, and so on) to the prompt function. The remaining arguments are optional and we are not using them here. You will read about them in the next recipe.

Please note that we also have the option of adding a detail filter in the query subject instead of using the Prompt() macro within the query. However, sometimes you would want to filter a table before joining it with other tables. In that case, using the Prompt() macro within the query helps.

There's more...

Similar to the Prompt() function, there is a promptmany macro function. This works in exactly the same way and allows users to enter multiple values for the parameter. Those values are returned as a comma-separated list.

Making prompt optional

The previous recipe showed you how to generate a prompt through a macro. In this recipe, we will see how to make it optional using other arguments of the function.

We will generate two simple list reports, both based on native SQL. These lists will show product details for the selected product line. However, the product line prompt will be made optional using two different approaches.

Getting ready

Create a report with two simple list objects based on native SQL. For that, create the Query Subjects in the same way as we did in the previous recipe.
Use the following query in the SQL objects: select distinct "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" "Product_line" , "SLS_PRODUCT_LOOKUP"."PRODUCT_NAME" "Product_name" , "SLS_PRODUCT_COLOR_LOOKUP"."PRODUCT_COLOR_EN" "Product_color" , "SLS_PRODUCT_SIZE_LOOKUP"."PRODUCT_SIZE_EN" "Product_size"from "GOSALESDW"."SLS_PRODUCT_DIM" "SLS_PRODUCT_DIM","GOSALESDW"."SLS_PRODUCT_LINE_LOOKUP" "SLS_PRODUCT_LINE_LOOKUP","GOSALESDW"."SLS_PRODUCT_TYPE_LOOKUP" "SLS_PRODUCT_TYPE_LOOKUP", "GOSALESDW"."SLS_PRODUCT_LOOKUP" "SLS_PRODUCT_LOOKUP","GOSALESDW"."SLS_PRODUCT_COLOR_LOOKUP" "SLS_PRODUCT_COLOR_LOOKUP","GOSALESDW"."SLS_PRODUCT_SIZE_LOOKUP" "SLS_PRODUCT_SIZE_LOOKUP","GOSALESDW"."SLS_PRODUCT_BRAND_LOOKUP" "SLS_PRODUCT_BRAND_LOOKUP"where "SLS_PRODUCT_LOOKUP"."PRODUCT_LANGUAGE" = N'EN' and "SLS_PRODUCT_DIM"."PRODUCT_LINE_CODE" = "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_NUMBER" = "SLS_PRODUCT_LOOKUP"."PRODUCT_NUMBER" and "SLS_PRODUCT_DIM"."PRODUCT_SIZE_CODE"= "SLS_PRODUCT_SIZE_LOOKUP"."PRODUCT_SIZE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_TYPE_CODE" = "SLS_PRODUCT_TYPE_LOOKUP"."PRODUCT_TYPE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_COLOR_CODE" = "SLS_PRODUCT_COLOR_LOOKUP"."PRODUCT_COLOR_CODE" and "SLS_PRODUCT_BRAND_LOOKUP"."PRODUCT_BRAND_CODE" = "SLS_PRODUCT_DIM"."PRODUCT_BRAND_CODE" This is a simple query that joins product related tables and retrieves required columns. How to do it... We have created two list reports based on two SQL query subjects. Both the SQL objects use the same query as mentioned above. Now, we will start with altering them. For that open Query Explorer. Rename first query subject as Optional_defaultValue and the second one as Pure_Optional. 
In the Optional_defaultValue SQL object, amend the query with the following lines:

    and "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" = #sq(prompt ('ProductLine','string','Golf Equipment'))#

Similarly, amend the Pure_Optional SQL object query with the following line:

    #prompt ('Product Line','string','and 1=1', ' and "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" = ')#

Now run the report. You will be prompted to enter a product line. Don't enter any value and just hit the OK button. Notice that the report runs (which means the prompt is optional). The first list object returns rows for 'Golf Equipment'. The second list is populated by all the products.

How it works...

Fundamentally, this report works the same as the one in the previous recipe. We are firing the SQL statements straight on the data source. The filter conditions in the WHERE clause use the Prompt macro.

Optional_defaultValue

In this query, we are using the second and third arguments of the Prompt() function. The second argument defines the data type of the value, which is 'string' in our case. The third argument defines the default value of the prompt. When the user doesn't enter any value for the prompt, this default value is used. This is what makes the prompt optional. As we have defined 'Golf Equipment' as the default value, the first list object shows data for 'Golf Equipment' when the prompt is left unfilled.

Pure_Optional

In this query, we are using the fourth argument of the Prompt() function. This argument is of string type. If the user provides any value for the prompt, the prompt value is concatenated to this string argument and the result is returned. In our case, the fourth argument is the left part of the filtering condition, that is, ' and "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" = '.
So, if the user enters the value as 'XYZ', the macro is replaced by the following filter:

    and "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" = 'XYZ'

Interestingly, if the user doesn't provide any prompt value, then the fourth argument is simply ignored. The macro is then replaced by the third argument, which in our case is 'and 1=1'. Hence, the second list returns all the rows when the user doesn't provide any value for the prompt. This way, it makes the PRODUCT_LINE_EN filter purely optional.

There's more...

The Prompt macro accepts two more arguments (fifth and sixth). Please check the help documents or internet sources to find information and examples about them.

Adding token using macro

In this recipe, we will see how to dynamically change the field on which the filter is being applied using a macro. We will use the prompt macro to generate one of the possible tokens and then use it in the query.

Getting ready

Create a list report based on native SQL similar to the previous recipe. We will use the same query that works on the product tables, but the filtering will be different.
For that, define the SQL as following: select distinct "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN" "Product_line" , "SLS_PRODUCT_LOOKUP"."PRODUCT_NAME" "Product_name" , "SLS_PRODUCT_COLOR_LOOKUP"."PRODUCT_COLOR_EN" "Product_color" , "SLS_PRODUCT_SIZE_LOOKUP"."PRODUCT_SIZE_EN" "Product_size"from "GOSALESDW"."SLS_PRODUCT_DIM" "SLS_PRODUCT_DIM","GOSALESDW"."SLS_PRODUCT_LINE_LOOKUP" "SLS_PRODUCT_LINE_LOOKUP","GOSALESDW"."SLS_PRODUCT_TYPE_LOOKUP" "SLS_PRODUCT_TYPE_LOOKUP", "GOSALESDW"."SLS_PRODUCT_LOOKUP" "SLS_PRODUCT_LOOKUP","GOSALESDW"."SLS_PRODUCT_COLOR_LOOKUP" "SLS_PRODUCT_COLOR_LOOKUP","GOSALESDW"."SLS_PRODUCT_SIZE_LOOKUP" "SLS_PRODUCT_SIZE_LOOKUP","GOSALESDW"."SLS_PRODUCT_BRAND_LOOKUP" "SLS_PRODUCT_BRAND_LOOKUP"where "SLS_PRODUCT_LOOKUP"."PRODUCT_LANGUAGE" = N'EN' and "SLS_PRODUCT_DIM"."PRODUCT_LINE_CODE" = "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_NUMBER" = "SLS_PRODUCT_LOOKUP"."PRODUCT_NUMBER" and "SLS_PRODUCT_DIM"."PRODUCT_SIZE_CODE"= "SLS_PRODUCT_SIZE_LOOKUP"."PRODUCT_SIZE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_TYPE_CODE" = "SLS_PRODUCT_TYPE_LOOKUP"."PRODUCT_TYPE_CODE" and "SLS_PRODUCT_DIM"."PRODUCT_COLOR_CODE" = "SLS_PRODUCT_COLOR_LOOKUP"."PRODUCT_COLOR_CODE" and "SLS_PRODUCT_BRAND_LOOKUP"."PRODUCT_BRAND_CODE" = "SLS_PRODUCT_DIM"."PRODUCT_BRAND_CODE"and#prompt ('Field','token','"SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN"')# like #prompt ('Value','string')# This is the same basic query that joins the product related tables and fetches required columns. The last statement in WHERE clause uses two prompt macros. We will talk about it in detail. How to do it... We have already created a list report based on an SQL query subject as mentioned previously. Drag the columns from the query subject on the list over the report page. Now create a new prompt page. Add a value prompt on the prompt page. Define two static choices for this. 
    Display value             Use value
    Filter on product line    "SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN"
    Filter on product name    "SLS_PRODUCT_LOOKUP"."PRODUCT_NAME"

Set the parameter for this prompt to 'Field'. This will come pre-populated as an existing parameter, as it is defined in the query subject. Choose the UI as a radio button group and Filter on product line as the default selection. Now add a text box prompt to the prompt page. Set its parameter to Value, which is offered as an existing parameter (as it is already defined in the query). Run the report to test it. You will see an option to filter on product line or product name. The value you provide in the text box prompt will be used to filter on whichever field is selected with the radio buttons.
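To make the substitution concrete, here is a minimal Python sketch that imitates how the two prompt macros in the WHERE clause expand. It is only a simulation for illustration, not Cognos code: the function names expand_prompt and build_filter and the simplified quoting rules are assumptions, and the real macro engine supports more data types and arguments.

```python
def expand_prompt(params, name, datatype, default=None):
    """Imitate #prompt(name, datatype, default)# for two data types only.

    'token' values are pasted into the SQL verbatim, while 'string'
    values are wrapped in single quotes (embedded quotes are doubled).
    """
    value = params.get(name, default)
    if value is None:
        raise ValueError("no value supplied for prompt %r" % name)
    if datatype == "token":
        return value
    if datatype == "string":
        return "'" + value.replace("'", "''") + "'"
    raise ValueError("unsupported datatype %r" % datatype)


def build_filter(params):
    """Build the last predicate of the WHERE clause from the recipe."""
    field = expand_prompt(
        params, "Field", "token",
        '"SLS_PRODUCT_LINE_LOOKUP"."PRODUCT_LINE_EN"')
    value = expand_prompt(params, "Value", "string")
    return "and " + field + " like " + value


# The user picks "Filter on product name" and types Star%.
print(build_filter({"Field": '"SLS_PRODUCT_LOOKUP"."PRODUCT_NAME"',
                    "Value": "Star%"}))
# No field chosen: the default token (product line) is used.
print(build_filter({"Value": "Camping%"}))
```

Because a token is pasted into the query verbatim, offering it only through static choices, as this recipe does, keeps arbitrary text out of the SQL.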
Packt
19 Sep 2014
19 min read

Driving Visual Analyses with Automobile Data (Python)

This article written by Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, and Abhijit Dasgupta, authors of the book Practical Data Science Cookbook, will cover the following topics: Getting started with IPython Exploring IPython Notebook Preparing to analyze automobile fuel efficiencies Exploring and describing the fuel efficiency data with Python (For more resources related to this topic, see here.) The dataset, available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, contains fuel efficiency performance metrics over time for all makes and models of automobiles in the United States of America. This dataset also contains numerous other features and attributes of the automobile models other than fuel economy, providing an opportunity to summarize and group the data so that we can identify interesting trends and relationships. We will perform the entire analysis using Python. However, we will ask the same questions and follow the same sequence of steps as before, again following the data science pipeline. With study, this will allow you to see the similarities and differences between the two languages for a mostly identical analysis. In this article, we will take a very different approach using Python as a scripting language in an interactive fashion that is more similar to R. We will introduce the reader to the unofficial interactive environment of Python, IPython, and the IPython notebook, showing how to produce readable and well-documented analysis scripts. Further, we will leverage the data analysis capabilities of the relatively new but powerful pandas library and the invaluable data frame data type that it offers. pandas often allows us to complete complex tasks with fewer lines of code. The drawback to this approach is that while you don't have to reinvent the wheel for common data manipulation tasks, you do have to learn the API of a completely different package, which is pandas. 
The goal of this article is not to guide you through an analysis project that you have already completed but to show you how that project can be completed in another language. More importantly, we want to get you, the reader, to become more introspective with your own code and analysis. Think not only about how something is done but why something is done that way in that particular language. How does the language shape the analysis?

Getting started with IPython

IPython is the interactive computing shell for Python that will change the way you think about interactive shells. It brings to the table a host of very useful functionalities that will most likely become part of your default toolbox, including magic functions, tab completion, easy access to command-line tools, and much more. We will only scratch the surface here and strongly recommend that you keep exploring what can be done with IPython.

Getting ready

If you have completed the installation, you should be ready to tackle the following recipes. Note that IPython 2.0, which is a major release, was launched in 2014.

How to do it…

The following steps will get you up and running with the IPython environment:

Open up a terminal window on your computer and type ipython. You should be immediately presented with the following text:

    Python 2.7.5 (default, Mar 9 2014, 22:15:05)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.1.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]:

Note that your version might be slightly different than what is shown in the preceding command-line output.

Just to show you how great IPython is, type in ls, and you should be greeted with the directory listing! Yes, you have access to common Unix commands straight from your Python prompt inside the Python interpreter.
Now, let's try changing directories. Type cd at the prompt, hit space, and now hit Tab. You should be presented with a list of directories available from within the current directory. Start typing the first few letters of the target directory, and then, hit Tab again. If there is only one option that matches, hitting the Tab key will automatically insert that name. Otherwise, the list of possibilities will show only those names that match the letters that you have already typed. Each letter that is entered acts as a filter when you press Tab.

Now, type ?, and you will get a quick introduction to and overview of IPython's features.

Let's take a look at the magic functions. These are special functions that IPython understands and will always start with the % symbol. The %paste function is one such example and is amazing for copying and pasting Python code into IPython without losing proper indentation.

We will try the %timeit magic function, which intelligently benchmarks Python code. Enter the following commands:

    n = 100000
    %timeit range(n)
    %timeit xrange(n)

We should get an output like this:

    1000 loops, best of 3: 1.22 ms per loop
    1000000 loops, best of 3: 258 ns per loop

This shows you how much faster xrange is than range (1.22 milliseconds versus 258 nanoseconds!) and helps show you the utility of lazy, generator-like objects in Python: in Python 2, range builds the entire list in memory, whereas xrange produces its values on demand.

You can also easily run system commands by prefacing the command with an exclamation mark. Try the following command:

    !ping www.google.com

You should see the following output:

    PING google.com (74.125.22.101): 56 data bytes
    64 bytes from 74.125.22.101: icmp_seq=0 ttl=38 time=40.733 ms
    64 bytes from 74.125.22.101: icmp_seq=1 ttl=38 time=40.183 ms
    64 bytes from 74.125.22.101: icmp_seq=2 ttl=38 time=37.635 ms

Finally, IPython provides an excellent command history. Simply press the up arrow key to access the previously entered command.
Continue to press the up arrow key to walk backwards through the command list of your session and the down arrow key to come forward. Also, the magic %history command allows you to jump to a particular command number in the session. Type the following command to see the first command that you entered: %history 1 Now, type exit to drop out of IPython and back to your system command prompt. How it works… There isn't much to explain here and we have just scratched the surface of what IPython can do. Hopefully, we have gotten you interested in diving deeper, especially with the wealth of new features offered by IPython 2.0, including dynamic and user-controllable data visualizations. See also IPython at http://ipython.org/ The IPython Cookbook at https://github.com/ipython/ipython/wiki?path=Cookbook IPython: A System for Interactive Scientific Computing at http://fperez.org/papers/ipython07_pe-gr_cise.pdf Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, available at http://www.packtpub.com/learning-ipython-for-interactive-computing-and-data-visualization/book The future of IPython at http://www.infoworld.com/print/236429 Exploring IPython Notebook IPython Notebook is the perfect complement to IPython. As per the IPython website: "The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document." While this is a bit of a mouthful, it is actually a pretty accurate description. In practice, IPython Notebook allows you to intersperse your code with comments and images and anything else that might be useful. You can use IPython Notebooks for everything from presentations (a great replacement for PowerPoint) to an electronic laboratory notebook or a textbook. Getting ready If you have completed the installation, you should be ready to tackle the following recipes. 
How to do it… These steps will get you started with exploring the incredibly powerful IPython Notebook environment. We urge you to go beyond this simple set of steps to understand the true power of the tool. Type ipython notebook --pylab=inline in the command prompt. The --pylab=inline option should allow your plots to appear inline in your notebook. You should see some text quickly scroll by in the terminal window, and then, the following screen should load in the default browser (for me, this is Chrome). Note that the URL should be http://127.0.0.1:8888/, indicating that the browser is connected to a server running on the local machine at port 8888. You should not see any notebooks listed in the browser (note that IPython Notebook files have a .ipynb extension) as IPython Notebook searches the directory you launched it from for notebook files. Let's create a notebook now. Click on the New Notebook button in the upper right-hand side of the page. A new browser tab or window should open up, showing you something similar to the following screenshot: From the top down, you can see the text-based menu followed by the toolbar for issuing common commands, and then, your very first cell, which should resemble the command prompt in IPython. Place the mouse cursor in the first cell and type 5+5. Next, either navigate to Cell | Run or press Shift + Enter as a keyboard shortcut to cause the contents of the cell to be interpreted. You should now see something similar to the following screenshot. Basically, we just executed a simple Python statement within the first cell of our first IPython Notebook. Click on the second cell, and then, navigate to Cell | Cell Type | Markdown. Now, you can easily write markdown in the cell for documentation purposes. Close the two browser windows or tabs (the notebook and the notebook browser). Go back to the terminal in which you typed ipython notebook, hit Ctrl + C, then hit Y, and press Enter. 
This will shut down the IPython Notebook server.

How it works…

Those of you coming from either more traditional statistical software packages, such as Stata, SPSS, or SAS, or more traditional mathematical software packages, such as MATLAB, Mathematica, or Maple, are probably used to the very graphical and feature-rich interactive environments provided by the respective companies. From this background, IPython Notebook might seem a bit foreign but hopefully much more user friendly and less intimidating than the traditional Python prompt. Further, IPython Notebook offers an interesting combination of interactivity and sequential workflow that is particularly well suited for data analysis, especially during the prototyping phases. R has a library called knitr (http://yihui.name/knitr/) that offers report-generating capabilities similar to those of IPython Notebook.

When you type in ipython notebook, you are launching a server running on your local machine, and IPython Notebook itself is really a web application that uses a server-client architecture. The IPython Notebook server, as per ipython.org, uses a two-process kernel architecture with ZeroMQ (http://zeromq.org/) and Tornado. ZeroMQ is an intelligent socket library for high-performance messaging, helping IPython manage distributed compute clusters among other tasks. Tornado is a Python web framework and asynchronous networking module that serves IPython Notebook's HTTP requests. The project is open source and you can contribute to the source code if you are so inclined.

IPython Notebook also allows you to export your notebooks, which are actually just text files filled with JSON, to a large number of alternative formats using the command-line tool called nbconvert (http://ipython.org/ipython-doc/rel-1.0.0/interactive/nbconvert.html). Available export formats include HTML, LaTeX, reveal.js HTML slideshows, Markdown, simple Python scripts, and reStructuredText for the Sphinx documentation.
Finally, there is IPython Notebook Viewer (nbviewer), which is a free web service where you can both post and browse static HTML versions of notebook files hosted on remote servers (these servers are currently donated by Rackspace). Thus, if you create an amazing .ipynb file that you want to share, you can upload it to http://nbviewer.ipython.org/ and let the world see your efforts.

There's more…

We will try not to sing the praises of Markdown too loudly, but if you are unfamiliar with the tool, we strongly suggest that you try it out. Markdown is actually two different things: a syntax for formatting plain text in a way that can be easily converted to a structured document, and a software tool that converts said text into HTML and other languages. Basically, Markdown enables the author to use any simple text editor that can capture plain text (VI, VIM, Emacs, Sublime editor, TextWrangler, Crimson Editor, or Notepad) yet still describe relatively complex structures such as different levels of headers, ordered and unordered lists, and block quotes, as well as some formatting such as bold and italics. In short, Markdown offers a very human-readable alternative to HTML, much as JSON offers a very human-readable data format.
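Since a notebook file is just JSON text, its structure can be explored with nothing more than the standard library. The sketch below builds a tiny notebook with one Markdown cell and one code cell and round-trips it through JSON. It assumes the nbformat 4 layout used by later notebook releases, which differs from the file format written by IPython 2.x, and the cell contents are made up for illustration.

```python
import json

# A minimal notebook in the (assumed) nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A Markdown cell: 'source' holds the raw Markdown text.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Fuel efficiency notes\n", "Some *formatted* text."],
        },
        {
            # A code cell that has not been executed yet.
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["5 + 5"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file ...
text = json.dumps(notebook, indent=1)

# ... and parse it back to confirm that nothing is lost.
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Writing the text variable to a file with an .ipynb extension is what would turn this dictionary into something the notebook server can list and open.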
See also

IPython Notebook at http://ipython.org/notebook.html
The IPython Notebook documentation at http://ipython.org/ipython-doc/stable/interactive/notebook.html
An interesting IPython Notebook collection at https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
The IPython Notebook development retrospective at http://blog.fperez.org/2012/01/ipython-notebook-historical.html
Setting up a remote IPython Notebook server at http://nbviewer.ipython.org/github/Unidata/tds-python-workshop/blob/master/ipython-notebook-server.ipynb
The Markdown home page at https://daringfireball.net/projects/markdown/basics

Preparing to analyze automobile fuel efficiencies

In this recipe, we are going to start our Python-based analysis of the automobile fuel efficiencies data.

Getting ready

If you completed the first installation successfully, you should be ready to get started.

How to do it…

The following steps will see you through setting up your working directory and IPython for the analysis for this article:

Create a project directory called fuel_efficiency_python.

Download the automobile fuel efficiency dataset from http://fueleconomy.gov/feg/epadata/vehicles.csv.zip and store it in the preceding directory. Extract the vehicles.csv file from the zip file into the same directory.

Open a terminal window and change the current directory (cd) to the fuel_efficiency_python directory.

At the terminal, type the following command:

    ipython notebook

Once the new page has loaded in your web browser, click on New Notebook.

Click on the current name of the notebook, which is untitled0, and enter in a new name for this analysis (mine is fuel_efficiency_python).

Let's use the top-most cell for import statements. Type in the following commands:

    import pandas as pd
    import numpy as np
    from ggplot import *
    %matplotlib inline

Then, hit Shift + Enter to execute the cell.
This imports both the pandas and numpy libraries, assigning them local names to save a few characters while typing commands. It also imports the ggplot library. Please note that using the from ggplot import * command is not a best practice in Python, as it pours the ggplot package contents into our default namespace. However, we are doing this so that our ggplot syntax most closely resembles the R ggplot2 syntax, which is decidedly not Pythonic. Finally, we use a magic command to tell IPython Notebook that we want matplotlib graphs to render in the notebook.

In the next cell, let's import the data and look at the first few records:

    vehicles = pd.read_csv("vehicles.csv")
    vehicles.head

Then, press Shift + Enter. A summary of the data frame should be shown. However, notice that a red warning message appears as follows:

    /Library/Python/2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (22,23,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
      data = self._reader.read(nrows)

This tells us that columns 22, 23, 70, 71, 72, and 73 contain mixed data types. Let's find the corresponding names using the following commands:

    column_names = vehicles.columns.values
    column_names[[22, 23, 70, 71, 72, 73]]

    array([cylinders, displ, fuelType2, rangeA, evMotor, mfrCode], dtype=object)

Mixed data types sound like they could be problematic, so make a mental note of these column names. Remember, data cleaning and wrangling often consume 90 percent of project time.

How it works…

With this recipe, we are simply setting up our working directory and creating a new IPython Notebook that we will use for the analysis. We have imported the pandas library and very quickly read the vehicles.csv data file directly into a data frame. Speaking from experience, pandas' robust data import capabilities will save you a lot of time.
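One way to act on the warning's suggestion is to pass an explicit dtype for the affected columns. The following is a minimal sketch that uses a small in-memory CSV instead of vehicles.csv, so the rows are made up for illustration; only the column names cylinders and displ come from the output above.

```python
import io

import pandas as pd

# Made-up sample rows: 'cylinders' mixes numbers with a text marker.
csv_text = "make,cylinders,displ\nA,4,2.0\nB,6,1.6\nC,unknown,3.5\n"

# Reading the ambiguous column as str gives it one predictable type
# instead of leaving pandas to guess.
vehicles_sample = pd.read_csv(io.StringIO(csv_text), dtype={"cylinders": str})

# Convert afterwards, turning anything non-numeric into a missing value.
cylinders = pd.to_numeric(vehicles_sample["cylinders"], errors="coerce")
print(cylinders.tolist())
```

Reading as text first and converting explicitly makes the cleaning decision visible in the notebook, rather than hidden in the parser's type inference.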
Although we imported data directly from a comma-separated value file into a data frame, pandas is capable of handling many other formats, including Excel, HDF, SQL, JSON, Stata, and even the clipboard using the reader functions. We can also write out the data from data frames in just as many formats using writer functions accessed from the data frame object. Using the bound method head that is part of the Data Frame class in pandas, we have received a very informative summary of the data frame, including a per-column count of non-null values and a count of the various data types across the columns. There's more… The data frame is an incredibly powerful concept and data structure. Thinking in data frames is critical for many data analyses yet also very different from thinking in array or matrix operations (say, if you are coming from MATLAB or C as your primary development languages). With the data frame, each column represents a different variable or characteristic and can be a different data type, such as floats, integers, or strings. Each row of the data frame is a separate observation or instance with its own set of values. For example, if each row represents a person, the columns could be age (an integer) and gender (a category or string). Often, we will want to select the set of observations (rows) that match a particular characteristic (say, all males) and examine this subgroup. The data frame is conceptually very similar to a table in a relational database. See also Data structures in pandas at http://pandas.pydata.org/pandas-docs/stable/dsintro.html Data frames in R at http://www.r-tutor.com/r-introduction/data-frame Exploring and describing the fuel efficiency data with Python Now that we have imported the automobile fuel efficiency dataset into IPython and witnessed the power of pandas, the next step is to replicate the preliminary analysis performed in R, getting your feet wet with some basic pandas functionality. 
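As a small warm-up, the data frame ideas described above (one observation per row, one variable per column, and selecting a subgroup of rows) can be sketched on a few made-up records. The names and values below are hypothetical and have nothing to do with the vehicles dataset.

```python
import pandas as pd

# Hypothetical data: each row is one person, each column one variable.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [34, 27, 45, 31],
    "gender": ["female", "male", "male", "female"],
})

# Columns can hold different data types (integers and strings here).
print(people.dtypes)

# A boolean mask selects the rows that match a characteristic.
males = people[people["gender"] == "male"]

# The subgroup is itself a data frame and can be summarized directly.
print(len(males), males["age"].mean())
```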
Getting ready We will continue to grow and develop the IPython Notebook that we started in the previous recipe. If you've completed the previous recipe, you should have everything you need to continue. How to do it… First, let's find out how many observations (rows) are in our data using the following command: len(vehicles) 34287 If you switch back and forth between R and Python, remember that in R, the function is length and in Python, it is len. Next, let's find out how many variables (columns) are in our data using the following command: len(vehicles.columns) 74 Let's get a list of the names of the columns using the following command: print(vehicles.columns) Index([u'barrels08', u'barrelsA08', u'charge120', u'charge240', u'city08', u'city08U', u'cityA08', u'cityA08U', u'cityCD', u'cityE', u'cityUF', u'co2', u'co2A', u'co2TailpipeAGpm', u'co2TailpipeGpm', u'comb08', u'comb08U', u'combA08', u'combA08U', u'combE', u'combinedCD', u'combinedUF', u'cylinders', u'displ', u'drive', u'engId', u'eng_dscr', u'feScore', u'fuelCost08', u'fuelCostA08', u'fuelType', u'fuelType1', u'ghgScore', u'ghgScoreA', u'highway08', u'highway08U', u'highwayA08', u'highwayA08U', u'highwayCD', u'highwayE', u'highwayUF', u'hlv', u'hpv', u'id', u'lv2', u'lv4', u'make', u'model', u'mpgData', u'phevBlended', u'pv2', u'pv4', u'range', u'rangeCity', u'rangeCityA', u'rangeHwy', u'rangeHwyA', u'trany', u'UCity', u'UCityA', u'UHighway', u'UHighwayA', u'VClass', u'year', u'youSaveSpend', u'guzzler', u'trans_dscr', u'tCharger', u'sCharger', u'atvType', u'fuelType2', u'rangeA', u'evMotor', u'mfrCode'], dtype=object) The u letter in front of each string indicates that the strings are represented in Unicode (http://docs.python.org/2/howto/unicode.html) Let's find out how many unique years of data are included in this dataset and what the first and last years are using the following command: len(pd.unique(vehicles.year)) 31 min(vehicles.year) 1984 max(vehicles["year"]) 2014 Note that again, we have used 
two different syntaxes to reference individual columns within the vehicles data frame.

Next, let's find out what types of fuel are used as the automobiles' primary fuel types. In R, we have the table function that will return a count of the occurrences of a variable's various values. In pandas, we use the following:

    pd.value_counts(vehicles.fuelType1)

    Regular Gasoline     24587
    Premium Gasoline      8521
    Diesel                1025
    Natural Gas             57
    Electricity             56
    Midgrade Gasoline       41
    dtype: int64

Now if we want to explore what types of transmissions these automobiles have, we immediately try the following command:

    pd.value_counts(vehicles.trany)

However, this results in a bit of unexpected and lengthy output. What we really want to know is the number of cars with automatic and manual transmissions. We notice that the trany variable always starts with the letter A when it represents an automatic transmission and M for manual transmission. Thus, we create a new variable, trany2, that contains the first character of the trany variable, which is a string:

    vehicles["trany2"] = vehicles.trany.str[0]
    pd.value_counts(vehicles.trany2)

The preceding command yields the answer that we wanted: roughly twice as many automatics as manuals.

    A    22451
    M    11825
    dtype: int64

How it works…

In this recipe, we looked at some basic functionality in Python and pandas. We have used two different syntaxes (vehicles['trany'] and vehicles.trany) to access variables within the data frame. We have also used some of the core pandas functions to explore the data, such as the incredibly useful unique and value_counts functions.

There's more...

In terms of the data science pipeline, we have touched on two stages in a single recipe: data cleaning and data exploration.
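The transmission steps above can be reproduced on a few made-up rows, which makes it easy to see what each call does. The sample values below are invented in the style of the trany column and are not taken from the dataset. The method form Series.value_counts() is used here because the top-level pd.value_counts function has been deprecated in newer pandas releases.

```python
import pandas as pd

# Invented sample values in the style of the 'trany' column.
sample = pd.DataFrame({
    "trany": ["Automatic 4-spd", "Manual 5-spd", "Automatic 3-spd",
              "Manual 6-spd", "Automatic (S6)"],
})

# Both column syntaxes refer to the same data.
assert sample["trany"].equals(sample.trany)

# Counting the raw values gives one entry per distinct description.
print(sample["trany"].nunique())

# Keep only the first character: 'A' for automatic, 'M' for manual.
sample["trany2"] = sample["trany"].str[0]

# Counting the derived column collapses the data into two groups.
counts = sample["trany2"].value_counts()
print(counts)
```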
Often, when working with smaller datasets where the time to complete a particular action is quite short and can be completed on our laptop, we will very quickly go through multiple stages of the pipeline and then loop back, depending on the results. In general, the data science pipeline is a highly iterative process. The faster we can accomplish steps, the more iterations we can fit into a fixed time, and often, we can create a better final analysis. See also The pandas API overview at http://pandas.pydata.org/pandas-docs/stable/api.html Summary This article took you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time using the powerful programming language, Python. Resources for Article: Further resources on this subject: Importing Dynamic Data [article] MongoDB data modeling [article] Report Data Filtering [article]
Sunith Shetty
28 Sep 2018
3 min read

BrainNet, an interface to communicate between human brains, could soon make Telepathy real

BrainNet provides the first multi-person brain-to-brain interface which allows a nonthreatening direct collaboration between human brains. It can help small teams collaborate to solve a range of tasks using direct brain-to-brain communication. How does BrainNet operate? The noninvasive interface combines electroencephalography (EEG) to record brain signals and transcranial magnetic stimulation (TMS) to deliver the required information to the brain. For now, the interface allows three human subjects to collaborate, handle and solve a task using direct brain-to-brain communication. Two out of three human subjects are “Senders”. The senders’ brain signals are decoded using real-time EEG data analysis. This technique allows extracting decisions which are vital in communicating in order to solve the required challenges. Let’s take an example of a Tetris-like game--where you need quick decisions to decide whether to rotate a block or drop as it is in order to fill a line. The senders’ signals (decisions) are transmitted to the third subject human brain via the Internet, the “Receiver” in this case. The decisions are sent to the receiver brain via magnetic stimulation of the occipital cortex. The receiver can’t see the game screen to decide if the rotation of the block is required. The receiver integrates the decisions received and makes an informed call using an EEG interface regarding turning the position of the block or keeping it in the same position. The second round of the game allows the senders to validate the previous move and provide the necessary feedback to the receiver’s action. How did the results look? 
The group of researchers implemented this technique for the Tetris game to evaluate the performance of BrainNet, considering the following factors:

    Group-level performance during the game
    True/false positive rates of subjects' decisions
    Mutual information between subjects

Five groups of three human subjects each performed the Tetris task using the BrainNet interface. The average accuracy for the task was 0.813. Furthermore, the researchers also tried varying the information reliability by injecting artificially generated noise into one of the senders' signals. Even so, the receiver was able to identify which sender was more reliable based on the information transmitted to their brain. These positive results have opened the gates to future brain-to-brain interfaces, which hold the power of enabling cooperative problem solving by humans using a "social network" of connected brains. To know more, you can refer to the research paper.

Read more

Diffractive Deep Neural Network (D2NN): UCLA-developed AI device can identify objects at the speed of light
Baidu announces ClariNet, a neural network for text-to-speech synthesis
Optical training of Neural networks is making AI more efficient
Packt
25 Nov 2013
7 min read

Apache Solr PHP Integration

(For more resources related to this topic, see here.)

We will be looking at installation on both Windows and Linux environments. We will be using the Solarium library for communication between Solr and PHP. This article will give a brief overview of the Solarium library and showcase some of the concepts and configuration options on the Solr end for implementing certain features.

Calling Solr using PHP code

A ping query is used in Solr to check the status of the Solr server. The Solr URL for executing the ping query is http://localhost:8080/solr/collection1/admin/ping/?wt=json.

Response of Solr ping query in browser

We can use Curl to get the ping response from Solr via PHP code; a sample code for executing the previous ping query is as follows:

    $curl = curl_init("http://localhost:8080/solr/collection1/admin/ping/?wt=json");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($curl);
    $data = json_decode($output, true);
    echo "Ping Status : ".$data["status"].PHP_EOL;

Though Curl can be used to execute almost any query on Solr, it is preferable to use a library that does the work for us. In our case we will be using Solarium. To execute the same query on Solr using the Solarium library, the code is as follows:

    include_once("vendor/autoload.php");
    $config = array("endpoint" => array("localhost" => array("host"=>"127.0.0.1", "port"=>"8080", "path"=>"/solr", "core"=>"collection1",) ) );

We have included the Solarium library in our code and defined the connection parameters for our Solr server. Next, we need to create a Solarium client with the previous Solr configuration and call the createPing() function to create the ping query.

    $client = new Solarium\Client($config);
    $ping = $client->createPing();

Finally, execute the ping query and get the result.

    $result = $client->ping($ping);
    $result->getStatus();

The output should be similar to the one shown below.
Output of ping query using PHP

Adding documents to Solr index

To create a Solr index, we need to add documents to the Solr index using the command line, the Solr web interface, or our PHP program. But before we create a Solr index, we need to define the structure, or schema, of the Solr index. The schema consists of fields and field types. It defines how each field will be treated and handled during indexing or during search. Let us see a small piece of code for adding documents to the Solr index using PHP and the Solarium library.

Create a Solarium client. Create an instance of the update query. Create the document in PHP and finally add fields to the document.

    $client = new Solarium\Client($config);
    $updateQuery = $client->createUpdate();
    $doc1 = $updateQuery->createDocument();
    $doc1->id = 112233445;
    $doc1->cat = 'book';
    $doc1->name = 'A Feast For Crows';
    $doc1->price = 8.99;
    $doc1->inStock = 'true';
    $doc1->author = 'George R.R. Martin';
    $doc1->series_t = '"A Song of Ice and Fire"';

The id field has been marked as unique in our schema, so we have to use a different id value for each document that we add to Solr.

Add documents to the update query, followed by the commit command. Finally, execute the query.

    $updateQuery->addDocuments(array($doc1));
    $updateQuery->addCommit();
    $result = $client->update($updateQuery);

Let us execute the code.

    php insertSolr.php

After executing the code, a search for martin will give these documents in the result.

    http://localhost:8080/solr/collection1/select/?q=martin

Document added to Solr index

Executing search on Solr Index

Documents added to the Solr index can be searched using the following piece of PHP code.
    $selectConfig = array(
        'query' => 'cat:book AND author:Martin',
        'start' => 3,
        'rows' => 3,
        'fields' => array('id','name','price','author'),
        'sort' => array('price' => 'asc')
    );
    $query = $client->createSelect($selectConfig);
    $resultSet = $client->select($query);

The above code creates a simple Solr query and searches for book in the cat field and Martin in the author field. The results are sorted in ascending order of price, and the fields returned are the id, name, price, and author of the book. Pagination is implemented as 3 results per page; because start is a zero-based offset, this query returns the second page, beginning with the fourth matching document.

In addition to this simple select query, Solr also supports some advanced query modes known as dismax and edismax. With the help of these query modes, we can boost certain fields to give them more importance in our query. We can also use function queries to do some type of dynamic boosting based on values in fields. If no sorting is provided, the Solr results are sorted by the score of the documents, which is calculated based on the terms in the query and the matching terms in the documents in the index. The score is calculated for each document in the result set using two main factors: term frequency, known as tf, and inverse document frequency, known as idf.

In addition to these, Solr provides a way of narrowing down the results using filter queries. Facets can also be created based on fields in the index, and end users can use them to narrow down the results.

Highlighting search results using PHP and Solr

Solr can be used to highlight the fields returned in a search result based on the query. Here is a sample code for highlighting the results for the search keyword harry.

    $query->setQuery('harry');
    $query->setFields(array('id','name','author','series_t','score','last_modified'));

Get the highlighting component from the query, set the fields to be highlighted and also set the html tags to be used for highlighting.
    $hl = $query->getHighlighting();
    $hl->setFields('name,series_t');
    $hl->setSimplePrefix('<strong>')->setSimplePostfix('</strong>');

Once the query is run and the result set is received, we need to retrieve the highlighted results from the result set. Here is the output for the highlighting code.

Highlighted search results

In addition to highlighting, Solr can be used to create a spelling suggester and a spell checker. A spelling suggester can be used to suggest queries to the end user as the user types. Spell check can be used to prompt spelling corrections, similar to 'did you mean', to the user. Solr can also be used for finding documents which are similar to a certain document based on words in certain fields. This functionality of Solr is known as more like this and is exposed via Solarium by the MoreLikeThis component. Solr also provides grouping of the results based on a particular query or a certain field.

Scaling Solr

Solr can be scaled to handle a large number of search requests by using a master-slave architecture. Also, if the index is huge, it can be sharded across multiple Solr instances and we can run a distributed search to get results for our query from all the sharded instances. Solarium provides a load balancing plug-in which can be used to load balance queries across a master-slave architecture.

Summary

Solr provides an extensive list of features for implementing search. These features can be easily accessed in PHP using the Solarium library to build a full-featured search application which can be used to power search on any website.

Resources for Article:

Further resources on this subject:
Apache Solr Configuration [Article]
Getting Started with Apache Solr [Article]
Making Big Data Work for Hadoop and Solr [Article]
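The relationship between page numbers and the start and rows values used in the select query earlier in this article is simple arithmetic, sketched here in Python for brevity rather than PHP. The helper names page_to_start and select_url and the 1-based page convention are assumptions for illustration; the host, port, and core are the ones used in this article, and q, start, rows, fl, sort, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode


def page_to_start(page, rows):
    """Return the zero-based Solr offset for a 1-based page number."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    return (page - 1) * rows


def select_url(query, page, rows, fields, sort):
    """Build a select URL for the local Solr core used in this article."""
    params = {
        "q": query,
        "start": page_to_start(page, rows),
        "rows": rows,
        "fl": ",".join(fields),
        "sort": sort,
        "wt": "json",
    }
    return "http://localhost:8080/solr/collection1/select?" + urlencode(params)


# Page 2 with 3 rows per page corresponds to start=3, as in the PHP example.
url = select_url("cat:book AND author:Martin", page=2, rows=3,
                 fields=["id", "name", "price", "author"], sort="price asc")
print(url)
```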