
How-To Tutorials - Data


Oracle E-Business Suite: Entering and Reconciling Bank Statements

Packt
23 Aug 2011
4 min read
Oracle E-Business Suite 12 Financials Cookbook: take the hard work out of your daily interactions with E-Business Suite financials by using the 50+ recipes from this cookbook.

Entering bank statements

Bank statements are downloaded from the bank to a local directory. Once the file is received, the bank account balance and statement information can be loaded into the bank statement open interface tables, using the bank statement loader program or a custom loader program. The files can also be loaded automatically using an interface program or using the XML Gateway. Bank statements can also be entered manually. In this recipe, we will look at how to enter bank statements.

Getting ready

The bank statement shown next has been loaded into the open interface table. Let's review the transactions in the open interface:

1. Select the Cash Management responsibility.
2. Navigate to Bank Statements | Bank Statement Interface Lines.
3. Select 95-6891-3074 in the Account field.
4. Click on the Lines button to view the transactions in the interface tables.

How to do it...

Let's list the steps required to automatically enter the bank statements from the import and AutoReconciliation program:

1. Select the Cash Management responsibility.
2. Navigate to Other | Programs | Run, or select View | Requests from the menu.
3. Click on the Submit a New Request button.
4. Select Single Request from the Options, and click on the OK button.
5. In the Submit Request form, select Bank Statement Import & AutoReconciliation from the list of values. Please note that we could run the Bank Statement Import program to run only the import.
6. Select the Parameters field, select Kings Cross as the Bank Branch Name, select 95-6891-3074 as the Bank Account Number, and select 20110314-0001 as the parameter for both the Statement Number From and the Statement Number To fields. Accept the default values for the remaining fields, and click on the OK button. We can schedule the program to run periodically, for example, every day.
7. Click on the Submit button to submit the request.

Let's review the imported bank statements:

1. Navigate to Bank Statement | Bank Statements and Reconciliation. The imported statement is displayed.
2. Click on the Review button.
3. In the Bank Statement window, select the Lines button. The imported lines are displayed.

How it works...

Bank statements can be imported automatically, using a SQL*Loader script against the bank file to populate the bank statement open interface. The bank statement information is then imported into the Bank Statement windows using the Bank Statement Import program.

There's more...

Now, let's look at how to enter statements manually.

Entering bank statements manually

Let's enter the bank statement for the 15th of March manually. The lines on the statement are as follows:

- Payment of 213.80.
- Receipt of 3,389.89 from A.C. Networks.
- Credit of 7,500.00 for Non Sufficient Funds for the receipt from Advantage Corp.
- Bank Transfer payment of 1,000.00.

1. Select the Cash Management responsibility.
2. Navigate to Bank Statement | Bank Statements and Reconciliation.
3. In the Reconcile Bank Statements window, click on the New button.
4. In the Account Number field, enter 95-6891-3074; the other details are entered automatically.
5. In the Date field, enter 15-MAR-2011. In the Statement Number field, enter 20110314-0002.
6. In the Control Totals region, let's enter control totals based on our bank statement. The Opening Balance of 125,727.21 is entered based on the previous opening balance. In the Receipts field, enter 3,389.89 and 1 in the Lines field. In the Payments field, enter 8,713.80 and 3 in the Lines field. The Closing Balance of 98,495.56 is entered automatically.

Let's enter the bank statement lines:

1. Click on the Lines button.
2. In the Bank Statements Lines form, enter 1 in the Line field. Select Payment as the Type. Enter 100 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 213.80.
3. Select the next line, and enter 2 in the Line field. Select Receipt as the Type. Enter 200 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 3,389.89. Select the Reference tab, and enter A.C. Networks.
4. Select the next line, and enter 3 in the Line field. Select NSF as the Type. Enter 500 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 7,500.00. Select the Reference tab, and enter Advantage Corp.
5. Select the next line, and enter 4 in the Line field. Select Payment as the Type. Enter 140 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 1,000.00.
6. Save the record.
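The recipe assumes the statement file has already been loaded into the open interface. As a rough, hedged illustration of what the custom loader route can look like, the following SQL*Loader control file sketch maps a simple comma-separated bank file onto the statement lines open interface table; the file name, column list, and date format here are assumptions and must be adapted to your bank's file layout and verified against the open interface definition for your E-Business Suite release.

-- load_bank_stmt.ctl : illustrative only; verify table and column names
-- against the Cash Management open interface in your release
LOAD DATA
INFILE 'bank_stmt_20110315.dat'
APPEND
INTO TABLE ce_statement_lines_interface
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
  bank_account_num,
  statement_number,
  line_number,
  trx_date  DATE "DD-MON-YYYY",
  trx_code,
  amount
)

The control file would then be run with sqlldr (supplying your database credentials), for example sqlldr control=load_bank_stmt.ctl, before submitting the Bank Statement Import program as described above.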


How to format and publish code using R Markdown

Savia Lobo
29 Dec 2017
6 min read
Note: This article is an excerpt from Practical Business Intelligence, written by Ahmed Sherif. The book is a complete guide to implementing business intelligence with the help of powerful tools such as D3.js, R, Tableau, QlikView, and Python. It starts off by preparing you for data analytics and then moves on to teach you a range of techniques to fetch important information from various databases.

Today you will explore how to use R Markdown, a format that allows reproducible reports with embedded R code to be published as slideshows, Word documents, PDF files, and HTML web pages.

Getting started with R Markdown

R Markdown documents have the .Rmd extension and are created by selecting R Markdown from the menu bar of RStudio. If this is the first time you are creating an R Markdown report, you may be prompted to install some additional packages for R Markdown to work. Once the packages are installed, we can define a title, author, and default output format; for our purposes we will use HTML output. The new document opens with a default template.

R Markdown features and components

We can go ahead and delete everything below line 7 of the default template, as we will create our own template with our embedded code and formatting. Header levels are generated using # in front of a title: the largest font size has a single #, and each subsequent # decreases the header level font. Whenever we wish to embed actual R code into the report, we can include it inside a code chunk by clicking on the insert-chunk icon in the editor toolbar. Once that icon is selected, a shaded region is created between two ``` markers where R code can be written exactly as it is in RStudio. The first header generated will be for the results, and the subsequent header will indicate the libraries used to generate the report. This can be generated using the following script:

# Results
###### Libraries used are RODBC, plotly, and forecast

Executing R code inside of R Markdown

The next step is to run the actual R code inside of a chunk snippet that calls the required libraries needed to generate the report. This can be generated using the following script:

```{r}
# We will not see the actual libraries loaded
# as it is not necessary for the end user
library('RODBC')
library('plotly')
library('forecast')
```

We can then click on the Knit HTML icon on the menu bar to generate a preview of our code results in R Markdown. Unfortunately, this output of library information is not useful to the end user.

Exporting tips for R Markdown

The report output includes all the messages and potential warnings that are the result of calling a package. This is not information that is useful to the report consumer.
Fortunately for R developers, these types of messages can be concealed by tweaking the R chunk snippets to include the following logic in their script:

```{r echo = FALSE, results = 'hide', message = FALSE}
```

We can continue embedding R code into our report to run queries against the SQL Server database and produce summary data of the dataframe, as well as the three main plots: the time series plot, observed versus fitted smoothing, and Holt-Winters forecasting:

###### Connectivity to Data Source is through ODBC

```{r echo = FALSE, results = 'hide', message = FALSE}
connection_SQLBI <- odbcConnect("SQLBI")
#Get Connection Details
connection_SQLBI

##query fetching begin##
SQL_Query_1 <- sqlQuery(connection_SQLBI,
  'SELECT [WeekInYear], [DiscountCode]
   FROM [AdventureWorks2014].[dbo].[DiscountCodebyWeek]')
##query fetching end##

#begin table manipulation
colnames(SQL_Query_1) <- c("Week", "Discount")
SQL_Query_1$Weeks <- as.numeric(SQL_Query_1$Week)
SQL_Query_1 <- SQL_Query_1[,-1]    #removes first column
SQL_Query_1 <- SQL_Query_1[c(2,1)] #reverses columns 1 and 2
#end table manipulation
```

### Preview of First 6 rows of data

```{r echo = FALSE, message = FALSE}
head(SQL_Query_1)
```

### Summary of Table Observations

```{r echo = FALSE, message = FALSE}
str(SQL_Query_1)
```

### Time Series and Forecast Plots

```{r echo = FALSE, message = FALSE}
Query1_TS <- ts(SQL_Query_1$Discount)
par(mfrow=c(3,1))
plot.ts(Query1_TS, xlab = 'Week (1-52)', ylab = 'Discount',
        main = 'Time Series of Discount Code by Week')
discountforecasts <- HoltWinters(Query1_TS, beta=FALSE, gamma=FALSE)
plot(discountforecasts)
discountforecasts_8periods <- forecast.HoltWinters(discountforecasts, h=8)
plot.forecast(discountforecasts_8periods, ylab='Discount', xlab = 'Weeks (1-60)',
              main = 'Forecasting 8 periods')
```

The final output

Before publishing the output with the results, R Markdown offers the developer opportunities to prettify the end product. One effect I like to add to a report is a logo of some kind. This can be done by applying the following code to any line in R Markdown:

![](http://website.com/logo.jpg)  # image is on a website
![](images/logo.jpg)              # image is locally on your machine

The first option adds an image from a website, and the second option adds an image stored locally. For my purposes, I will add a PacktPub logo right above the Results section in the R Markdown document. To learn more about customizing an R Markdown document, visit http://rmarkdown.rstudio.com/authoring_basics.html.

Once we are ready to preview the results of the R Markdown output, we can once again select the Knit to HTML button on the menu. As can be seen in the final output, even though the R code is embedded within the R Markdown document, we can suppress the unnecessary technical output and reveal the relevant tables, fields, and charts that will provide the most benefit to end users and report consumers. If you have enjoyed reading this article and want to develop the ability to think along the right lines and use more than one tool to perform analysis depending on the needs of your business, do check out Practical Business Intelligence.
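As a side note on the chunk options used above: rather than repeating echo = FALSE, results = 'hide', message = FALSE in every chunk, knitr also lets these options be set once for the whole document. A minimal sketch of such a setup chunk (the chunk label setup is just a convention) placed near the top of the .Rmd file might look like this:

```{r setup, include = FALSE}
# Apply these defaults to every chunk that follows so that code,
# messages, and warnings stay hidden from the report consumer
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Individual chunks can still override these defaults where, for example, a particular block of code should remain visible.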


An Introduction to Kibana

Packt
28 Oct 2015
28 min read
In this article, Yuvraj Gupta, author of the book Kibana Essentials, explains that Kibana is a tool that is part of the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It is built and developed by Elastic. Kibana is a visualization platform that is built on top of Elasticsearch and leverages its functionalities.

The flow of data through the stack is straightforward: Logstash is used to push data directly into Elasticsearch. This data is not limited to log data; it can be any type of data. Elasticsearch stores the data that comes as input from Logstash, and Kibana uses the data stored in Elasticsearch to provide visualizations. So, Logstash provides an input stream of data to Elasticsearch, from which Kibana accesses the data and uses it to create visualizations. Kibana acts as an over-the-top layer of Elasticsearch, providing beautiful visualizations for data (structured or nonstructured) stored in it.

Kibana is an open source analytics product used to search, view, and analyze data. It provides various types of visualizations, such as tables, charts, maps, and histograms, and a web-based interface that can easily handle a large amount of data. It makes it easy to create dashboards and to query data in real time. Dashboards are simply an interface to the underlying JSON documents; they are used for saving, templating, and exporting, and they are simple to set up and use, which lets us play with data stored in Elasticsearch in minutes without requiring any coding.

Kibana is an Apache-licensed product that aims to provide a flexible interface combined with the powerful searching capabilities of Elasticsearch. It requires a web server (included in the Kibana 4 package) and any modern web browser, that is, a browser that supports industry standards and renders pages consistently, to work. It connects to Elasticsearch using the REST API and helps visualize data in real time with the use of dashboards to provide real-time insights. As Kibana uses the functionalities of Elasticsearch, it is easier to learn Kibana by understanding the core functionalities of Elasticsearch.

In this article, we are going to take a look at the following topics:

- The basic concepts of Elasticsearch
- Installation of Java
- Installation of Elasticsearch
- Installation of Kibana
- Importing a JSON file into Elasticsearch

Understanding Elasticsearch

Elasticsearch is a search server built on top of Lucene (licensed under Apache), which is completely written in Java. It supports distributed search in a multitenant environment. It is a scalable search engine that allows machines to be added easily. It provides a full-text search engine combined with a RESTful web interface and JSON documents. Elasticsearch harnesses the functionality of the Lucene Java libraries, adding proper APIs, scalability, and flexibility on top of the Lucene full-text search library. All querying done using Elasticsearch, that is, searching text, matching text, creating indexes, and so on, is implemented by Apache Lucene. Without Elasticsearch Shield or some other proxy mechanism in place, any user with access to the Elasticsearch API can view all the data stored in the cluster.
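To illustrate that last point: once Elasticsearch is running (installation is covered below), anyone who can reach port 9200 can query the cluster. For example, the following request, which assumes a default local installation, lists every index in the cluster along with its document count and size:

curl -XGET 'http://localhost:9200/_cat/indices?v'

This is exactly why production clusters are normally protected with Shield or a reverse proxy, as noted above.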
The basic concepts of Elasticsearch

Let's explore some of the basic concepts of Elasticsearch:

Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. A field can hold a single value, such as an integer (27) or a string ("Kibana"), or multiple values, such as an array [1, 2, 3, 4, 5]. The field type is responsible for specifying which type of data can be stored in a particular field, for example, integer, string, or date.

Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields and is considered similar to a row of a table in a traditional relational database. A document can contain any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are expressed in JavaScript Object Notation (JSON), which is a language-independent data interchange format of key-value pairs. Every document that is stored in Elasticsearch is indexed, and every document contains a type and an ID. An example of a document is as follows:

{
  "name": "Yuvraj",
  "age": 22,
  "birthdate": "2015-07-27",
  "bank_balance": 10500.50,
  "interests": ["playing games", "movies", "travelling"],
  "movie": {"name": "Titanic", "genre": "Romance", "year": 1997}
}

In the preceding example, the document holds key-value pairs of different types:

- The name field is of the string type
- The age field is of the numeric type
- The birthdate field is of the date type
- The bank_balance field is of the float type
- The interests field contains an array
- The movie field contains an object (dictionary)

Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical segregation of an index, whose interpretation/semantics entirely depends on you. For example, if you have data about the world and you put all of it into one index, you can define one type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with a mapping API, which specifies the type of each field. An example of a type mapping is as follows:

{
  "user": {
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "integer" },
      "birthdate": { "type": "date" },
      "bank_balance": { "type": "float" },
      "interests": { "type": "string" },
      "movie": {
        "properties": {
          "name": { "type": "string" },
          "genre": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    }
  }
}

Now, let's take a look at the core data types specified in Elasticsearch:

- string: contains text, for example, "Kibana"
- integer: contains a 32-bit integer, for example, 7
- long: contains a 64-bit integer
- float: an IEEE float, for example, 2.7
- double: a double-precision float
- boolean: true or false
- date: a UTC date/time, for example, "2015-06-30T13:10:10"
- geo_point: a latitude/longitude pair

Index: This is a collection of documents (one or more than one). It is similar to a database in the analogy with traditional relational databases. For example, you can have an index for user information, another for transaction information, and another for product data. An index has a mapping, and this mapping is used to define multiple types. In other words, an index can contain single or multiple types.
An index is defined by a name, which is always used whenever referring to an index to perform search, update, and delete operations on documents. You can define as many indexes as you require. Indexes also act as logical namespaces that map documents to primary shards, each of which can have zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is similar to the following:

MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indexes => Types => Documents with Fields

You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within that index. Moreover, the maximum number of documents that you can store in a single shard is 2,147,483,519 (roughly 2.1 billion), which is close to Integer.MAX_VALUE.

ID: This is an identifier for a document. It is used to identify each document; if it is not defined, it is autogenerated for every document. The combination of index, type, and ID must be unique for each document.

Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how each field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are to be overridden, then the mapping definition has to be provided explicitly.

Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node; multiple nodes can be started on a single standalone machine or in a single cluster. The node is responsible for storing data and contributes to the indexing and searching capabilities of a cluster. By default, whenever a node is started, it is identified and assigned a random Marvel Comics character name; you can change the configuration file to name nodes as per your requirement. A node also needs to be configured to join a cluster, which is identifiable by the cluster name. By default, all nodes join the elasticsearch cluster; that is, if any number of nodes are started up on a network/machine, they will automatically join the elasticsearch cluster.

Cluster: This is a collection of one or more nodes that share a single cluster name. Each cluster automatically chooses a master node, which is replaced if it fails; that is, if the master node fails, another random node will be chosen as the new master node, thus providing high availability. The cluster is responsible for holding all of the data stored and provides a unified view for search capabilities across all nodes. By default, the cluster name is elasticsearch, and it is the identifying parameter for all nodes in a cluster. All nodes, by default, join the elasticsearch cluster. While using a cluster in production, it is advisable to change the cluster name for ease of identification, but the default name can be used for any other purpose, such as development or testing. The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields.

Sharding: This is an important concept of Elasticsearch for understanding how Elasticsearch allows scaling across nodes when handling a large amount of data, termed big data.
An index can store any amount of data, but if it exceeds its disk limit, searching becomes slow and is affected. For example, if the disk limit is 1 TB and an index contains a large number of documents, they may not fit completely within 1 TB on a single node. To counter such problems, Elasticsearch provides shards. These break the index into multiple pieces, and each shard acts as an independent index that is hosted on a node within a cluster. Elasticsearch is responsible for distributing shards among nodes. Sharding serves two purposes: allowing horizontal scaling of the content volume, and improving performance by providing parallel operations across the shards distributed over the nodes (single or multiple, depending on the number of nodes running). Elasticsearch moves shards among nodes when new nodes are added or a node fails.

There are two types of shards, as follows:

Primary shard: Every document is stored within a primary shard. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. The number of primary shards has to be defined before the creation of an index; if no parameter is defined, five primary shards are created automatically. Whenever a document is indexed, it is usually written to a primary shard first, followed by its replicas. The number of primary shards defined for an index cannot be altered once the index is created.

Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard, but every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. To counter this problem, Elasticsearch provides a mechanism for creating single or multiple copies of an index's shards, termed replica shards or replicas. A replica shard is a full copy of the primary shard, and the number of replicas can be altered dynamically. Replicas serve two purposes: they provide high availability in the event of failure of a node or a primary shard (if a primary shard fails, a replica shard is automatically promoted to primary), and they increase performance by handling search requests in parallel. A replica shard is never kept on the same node as the primary shard from which it was copied.

Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of scanning the text of every document, it searches an index. It creates an index that lists the unique words occurring in the documents, along with the list of documents in which each word occurs. For example, suppose we have three documents.
They have a text field that contains the following:

1. I am learning Kibana
2. Kibana is an amazing product
3. Kibana is easy to use

To create an inverted index, the text field of each document is broken into words (also known as terms), a list of unique terms is created, and for each term the documents in which it occurs are listed, as shown in this table:

Term      Doc 1  Doc 2  Doc 3
I         X
am        X
learning  X
Kibana    X      X      X
is               X      X
an               X
amazing          X
product          X
easy                    X
to                      X
use                     X

Now, if we search for "is Kibana", Elasticsearch will use the inverted index to display the results:

Term      Doc 1  Doc 2  Doc 3
is               X      X
Kibana    X      X      X

With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results. An inverted index uses an index based on keywords (terms) instead of a document-based index.

REST API: This stands for Representational State Transfer. It is a stateless client-server protocol that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) over HTTP. It is used to communicate with Elasticsearch and has client implementations in virtually every language. Elasticsearch listens on port 9200 (by default), which is accessible from any web browser. Elasticsearch can also be communicated with directly from the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, following the HTTP structure. A cURL request is similar to an HTTP request, and looks as follows:

curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The terms marked within the <> tags are variables, which are defined as follows:

- VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data).
- PROTOCOL: This defines whether the HTTP or HTTPS protocol is used to send requests.
- HOSTNAME: This is used to define the hostname of a node in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost.
- PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200.
- PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID.
- QUERY_STRING: This is used to define any additional query parameters for searching data.
- BODY: This is used to define a JSON-encoded request body.

In order to put data into Elasticsearch, the following curl command is used:

curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }'

Here, testing is the name of the index, test is the name of the type within the index, and 1 is the ID number. To search for the data just stored, the following curl command is used:

curl -XGET 'http://localhost:9200/testing/_search?

The preceding commands are provided just to give you an overview of the format of the curl command.

Prerequisites for installing Kibana 4.1.1

The following pieces of software need to be installed before installing Kibana 4.1.1:

- Java 1.8u20+
- Elasticsearch v1.4.4+
- A modern web browser (IE 10+, Firefox, Chrome, Safari, and so on)

The installation process will be covered separately for Windows and Ubuntu so that both types of users are able to follow it easily.

Installation of Java

In this section, the JDK needs to be installed so as to access Elasticsearch.
Oracle Java 8 (update 20 onwards) will be installed, as it is the recommended version for Elasticsearch from version 1.4.4 onwards.

Installation of Java on Ubuntu 14.04

Install Java 8 using the terminal and the apt package manager in the following manner:

1. Add the Oracle Java Personal Package Archive (PPA) to the apt repository list:

sudo add-apt-repository -y ppa:webupd8team/java

In this case, we use a third-party repository; however, the WebUpd8 team is trusted to install Java. The PPA does not include any Java binaries; instead, it downloads Java directly from Oracle and installs it. You will initially be prompted for your password when running the sudo command (only when you are not logged in as root), and on successful addition of the repository, you will receive an OK message, which means that the repository has been imported.

2. Update the apt package database to include all the latest files under the packages:

sudo apt-get update

3. Install the latest version of Oracle Java 8:

sudo apt-get -y install oracle-java8-installer

During the installation, you will be prompted to accept the license agreement.

4. To check whether Java has been successfully installed, type the following command in the terminal:

java -version

If the Java version is displayed, Java has been installed successfully.

Installation of Java on Windows

We can install Java on Windows by going through the following steps:

1. Download the latest version of the Java JDK from the Oracle site at http://www.oracle.com/technetwork/java/javase/downloads/index.html. Click on the DOWNLOAD button for the JDK; you will be redirected to the download page. There, first click on the Accept License Agreement radio button, and then click on the Windows version to download the .exe file.

2. Double-click on the downloaded file and it will open as an installer. Click on Next, accept the license by reading it, and keep clicking on Next until it shows that the JDK has been installed successfully.

3. Now, to run Java on Windows, you need to set the JAVA path in the environment variable settings of Windows. First, open the properties of My Computer, select Advanced system settings, click on the Advanced tab, and then click on the Environment Variables button.

4. Under System variables, click on New and give the variable name as JAVA_HOME and the variable value as C:\Program Files\Java\jdk1.8.0_45 (check where the JDK has been installed on your system and provide the path corresponding to the installed version).

5. Then, double-click on the Path variable (under System variables) and move to the end of the textbox. Insert a semicolon if one is not already there, and add the location of the bin folder of the JDK, like this: %JAVA_HOME%\bin. Next, click on OK in all the open windows. Do not delete anything already within the Path variable textbox.

6. To check whether Java is installed, type the following command in Command Prompt:

java -version

If the Java version is displayed, Java has been installed successfully.

Installation of Elasticsearch

In this section, Elasticsearch, which is required to access Kibana, will be installed.
Elasticsearch v1.5.2 will be installed; this section covers the installation on Ubuntu and Windows separately.

Installation of Elasticsearch on Ubuntu 14.04

To install Elasticsearch on Ubuntu, perform the following steps:

1. Download Elasticsearch v1.5.2 as a .tar.gz file using the following command in the terminal:

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz

curl may not be installed on Ubuntu by default. To use curl, install the curl package with the following command:

sudo apt-get -y install curl

2. Extract the downloaded .tar.gz file using this command:

tar -xvzf elasticsearch-1.5.2.tar.gz

This will extract the files and folders into the current working directory.

3. Navigate to the bin directory within the elasticsearch-1.5.2 directory:

cd elasticsearch-1.5.2/bin

4. Now run Elasticsearch to start the node and cluster, using the following command:

./elasticsearch

The Elasticsearch node starts and is given a random Marvel Comics character name. If this terminal is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down one node will not shut down Elasticsearch.

5. To verify the Elasticsearch installation, open http://localhost:9200 in your browser.

Installation of Elasticsearch on Windows

The installation on Windows follows steps similar to those for Ubuntu. To use curl commands on Windows, we will install GIT. GIT will also be used to import a sample JSON file into Elasticsearch using elasticdump, as described in the Importing a JSON file into Elasticsearch section.

Installation of GIT

To run curl commands on Windows, first download and install GIT by performing the following steps:

1. Download the GIT ZIP package from https://git-scm.com/download/win.
2. Double-click on the downloaded file, which will walk you through the installation process. Keep clicking on Next, leaving the default options, until the Finish button is clicked.
3. To validate the GIT installation, right-click on any folder; you should see GIT options, such as GIT Bash, in the context menu.

The following are the steps required to install Elasticsearch on Windows:

1. Open GIT Bash and enter the following command in the terminal:

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip

2. Extract the downloaded ZIP package either by unzipping it using WinRAR, 7-Zip, and so on (if you don't have any of these, download one of them), or by using the following command in GIT Bash:

unzip elasticsearch-1.5.2.zip

This will extract the files and folders into the directory.

3. Navigate through the extracted folder to reach the bin folder, and double-click on the elasticsearch.bat file to run Elasticsearch. The Elasticsearch node starts and is given a random Marvel Comics character name. Again, if this window is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down one node will not shut down Elasticsearch.

4. To verify the Elasticsearch installation, open http://localhost:9200 in your browser.

Installation of Kibana

In this section, Kibana will be installed.
We will install Kibana v4.1.1; this section covers the installation on Ubuntu and Windows separately.

Installation of Kibana on Ubuntu 14.04

To install Kibana on Ubuntu, follow these steps:

1. Download Kibana version 4.1.1 as a .tar.gz file using the following command in the terminal:

curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz

2. Extract the downloaded .tar.gz file using this command:

tar -xvzf kibana-4.1.1-linux-x64.tar.gz

The preceding command will extract the files and folders into the current working directory.

3. Navigate to the bin directory within the kibana-4.1.1-linux-x64 directory:

cd kibana-4.1.1-linux-x64/bin

4. Now run Kibana using the following command:

./kibana

Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, an error will be displayed after you run the preceding command.

5. To verify the Kibana installation, open http://localhost:5601 in your browser.

Installation of Kibana on Windows

To install Kibana on Windows, perform the following steps:

1. Open GIT Bash and enter the following command in the terminal:

curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-windows.zip

2. Extract the downloaded ZIP package either by unzipping it using WinRAR or 7-Zip (download one if you don't have it), or by using the following command in GIT Bash:

unzip kibana-4.1.1-windows.zip

This will extract the files and folders into the directory.

3. Navigate through the extracted folder to reach the bin folder, and double-click on the kibana.bat file to run Kibana. Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, an error will be displayed after you click on the kibana.bat file.

4. Again, to verify the Kibana installation, open http://localhost:5601 in your browser.

Additional information

You can change the Elasticsearch configuration for your production environment, where you may need to change parameters such as the cluster name, node name, network address, and so on. This can be done using the information in the upcoming sections.

Changing the Elasticsearch configuration

To change the Elasticsearch configuration, perform the following steps:

1. Run the following command in the terminal to open the configuration file:

sudo vi ~/elasticsearch-1.5.2/config/elasticsearch.yml

Windows users can open the elasticsearch.yml file from the config folder.

2. The cluster name can be changed by editing the cluster.name entry, from #cluster.name: elasticsearch to cluster.name: "your_cluster_name", for example cluster.name: test. Then save the file.

3. To verify that the cluster name has been changed, run Elasticsearch as described in the earlier section and open http://localhost:9200 in the browser. The response shows that cluster_name has been changed to test, as specified earlier.
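For reference, a minimal elasticsearch.yml for such a test setup might contain nothing more than the following; the node name and the value test are only examples, and every other setting can be left at its default:

# elasticsearch.yml (illustrative values only)
cluster.name: test
node.name: "node-1"
# network.host: 192.168.0.1   # uncomment to bind to a specific address

After editing the file, restart the node for the change to take effect.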
Changing the Kibana configuration

To change the Kibana configuration, follow these steps:

1. Run the following command in the terminal to open the configuration file:

sudo vi ~/kibana-4.1.1-linux-x64/config/kibana.yml

Windows users can open the kibana.yml file from the config folder.

2. In this file, you can change various parameters, such as the port on which Kibana runs, the host address on which Kibana listens, the URL of the Elasticsearch instance that you wish to connect to, and so on. For example, the port on which Kibana works can be changed by editing the port entry: port: 5601 can be changed to any other port, such as port: 5604. Then save the file.

3. To check whether Kibana is now running on port 5604, run Kibana as mentioned earlier and open http://localhost:5604 in the browser to verify.

Importing a JSON file into Elasticsearch

To import a JSON file into Elasticsearch, we will use the elasticdump package. It is a set of import and export tools for Elasticsearch that makes it easier to copy, move, and save indexes. To install elasticdump, we will require npm and Node.js as prerequisites.

Installation of npm

In this section, npm will be installed along with Node.js. This section covers the installation of npm and Node.js on Ubuntu and Windows separately.

Installation of npm on Ubuntu 14.04

To install npm on Ubuntu, perform the following steps:

1. Add the official Node.js PPA:

sudo curl --silent --location https://deb.nodesource.com/setup_0.12 | sudo bash -

This command adds the official Node.js repository to the system and updates the apt package database to include all the latest files under the packages. At the end of the execution of this command, we are prompted to install Node.js and npm.

2. Install Node.js by entering this command in the terminal:

sudo apt-get install --yes nodejs

This automatically installs Node.js and npm, as npm is bundled with Node.js.

3. To check whether Node.js has been installed successfully, type the following command in the terminal:

node -v

Upon successful installation, it will display the version of Node.js.

4. Now, to check whether npm has been installed successfully, type the following command in the terminal:

npm -v

Upon successful installation, it will show the version of npm.

Installation of npm on Windows

To install npm on Windows, follow these steps:

1. Download the Windows Installer (.msi) file from https://nodejs.org/en/download/.
2. Double-click on the downloaded file and keep clicking on Next to install the software.
3. To validate the successful installation of Node.js, right-click and select GIT Bash. In GIT Bash, enter:

node -v

Upon successful installation, you will be shown the version of Node.js.

4. To validate the successful installation of npm, enter the following in GIT Bash:

npm -v

Upon successful installation, it will show the version of npm.

Installing elasticdump

In this section, elasticdump will be installed. It will be used to import a JSON file into Elasticsearch and requires npm and Node.js to be installed. This section covers the installation on Ubuntu and Windows separately.
Installing elasticdump on Ubuntu 14.04

Perform these steps to install elasticdump on Ubuntu:

1. Install elasticdump by typing the following command in the terminal:

sudo npm install elasticdump -g

2. Then run elasticdump by typing this command in the terminal:

elasticdump

3. Import the sample data (JSON) file into Elasticsearch. It can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It is imported into Elasticsearch using the following command in the terminal:

elasticdump --bulk=true --input="/home/yuvraj/Desktop/tweet.json" --output=http://localhost:9200/

Here, input provides the location of the file. Data is imported into Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records have been imported successfully. Elasticsearch should be running while importing the sample file.

Installing elasticdump on Windows

To install elasticdump on Windows, perform the following steps:

1. Install elasticdump by typing the following command in GIT Bash:

npm install elasticdump -g

2. Then run elasticdump by typing this command in GIT Bash:

elasticdump

3. Import the sample data (JSON) file into Elasticsearch. It can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It is imported into Elasticsearch using the following command in GIT Bash:

elasticdump --bulk=true --input="C:\Users\ygupta\Desktop\tweet.json" --output=http://localhost:9200/

Here, input provides the location of the file. Data is imported into Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records have been imported successfully. Elasticsearch should be running while importing the sample file.

To verify that the data has been imported into Elasticsearch, open http://localhost:5601 in your browser. When Kibana is opened, you have to configure an index pattern. If the data has been imported, you can enter the index name, which is mentioned in the tweet.json file as index: tweet. After the page loads, you can see the name of the imported index (tweet) to the left, under Index Patterns. Enter the index name as tweet; Kibana will then automatically detect the timestamped field and provide an option to select it. If there are multiple fields, you can select one by clicking on Time-field name, which provides a drop-down list of all available fields. Finally, click on Create to create the index pattern in Kibana. After you have clicked on Create, it will display the various fields present in this index. If you do not get the Time-field name and Create options after entering the index name as tweet, it means that the data has not been imported into Elasticsearch.

Summary

In this article, you learned about Kibana, along with the basic concepts of Elasticsearch that make Kibana easier to understand. We also looked at the prerequisites for installing Kibana, followed by a detailed explanation of how to install each component individually on Ubuntu and Windows.


Creating a View with MySQL Query Browser

Packt
23 Oct 2009
2 min read
Please refer to an earlier article by the author to learn how to build queries visually.

Creating a View from an Existing Query

To create a view from a query, you must have executed the query successfully. To be more precise, the view is created from the latest successfully executed query, not necessarily from the query currently in the Query Area. To further clarify, the following three examples are cases where the view is not created from the current query:

- Your current query fails, and immediately afterwards you create a view from the query. The view created is not from the failed query. If the failed query is the first query in your MySQL Query Browser session, you can't create any view.
- You have just moved forward or backward through the queries in the Query Area without executing, so your current query is not the latest successfully executed one.
- You open a saved query that you have never executed successfully in your active Resultset.

Additionally, if you're changing your Resultset, the view is created from the latest successfully executed query that uses the currently active Resultset to display its output. To make sure your view is built from the query you want, select the query, confirm it as written in the Query Area, execute the query, and then immediately create its view.

You create a view from an existing query by selecting Query | Create View from Select from the Menu bar. Type in the name you want to give the view, and then click Create View. MySQL Query Browser creates the view; when it is successfully created, you can see the view in the Schemata.

You can modify a view by editing it: right-click the view and select Edit View. The CREATE VIEW statement opens in its own Script tab. When you finish editing, execute the modified statement; if it succeeds, the existing view is replaced with the modified one. If you want to create a new view instead of replacing the one you're editing, change the name of the view before you execute it, and remove the DROP VIEW statement so that the original view is kept.
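For readers who prefer to see what the Script tab works with, here is a hedged sketch of the kind of DROP VIEW/CREATE VIEW pair referred to above; the view, table, and column names are invented purely for illustration:

DROP VIEW IF EXISTS big_orders;
CREATE VIEW big_orders AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 1000;

Removing the DROP VIEW line and giving the CREATE VIEW statement a new name leaves the original view in place, which is the behavior described at the end of the previous paragraph.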


Including Charts and Graphics in Pentaho Reports (Part 1)

Packt
09 Nov 2009
8 min read
Supported charts

Pentaho Reporting relies on JFreeChart, an open source Java chart library, for charting visualization within reports. From within Report Designer, many chart types are supported. In the chart editor, two areas of properties appear when editing a chart: the first is related to chart rendering, and the second tabbed area is related to the data that populates the chart.

All chart types receive their data from three general types of datasets.

The first type is known as a Category Dataset, where the dataset series and values are grouped by categories. A series is like a sub-group. If the exact category and series appear more than once, the chart will sum the values into a single result. The following table is a simple example of a category dataset:

Category       Series  Sale Price
Store 1 Sales  Cash    $14
Store 1 Sales  Credit  $12
Store 2 Sales  Cash    $100
Store 2 Sales  Credit  $120

Pentaho Reporting builds a Category Dataset using the CategorySetDataCollector. Also available is the PivotCategorySetCollector, which pivots the category and series data. Collector classes implement Pentaho Reporting's Function API.

The second type of dataset is known as an XY Series Dataset, which is a two-dimensional group of values that may be plotted in various forms. In this dataset, the series may be used to draw different lines, and so on. Here is a simple example of an XY series dataset:

Series  Cost of Goods (X)  Sale Price (Y)
Cash    10                 14
Credit  11                 12
Cash    92                 100
Credit  105                120

Note that X is often referred to as the domain, and Y as the range. Pentaho Reporting builds an XY Series Dataset using the XYSeriesCollector. The XYZSeriesCollector also exists for three-dimensional data.

The third type of dataset is known as a Time Series Dataset, which is a two-dimensional group of values that are plotted based on a date and time. The Time Series Dataset is more like an XY Series than a Category Dataset, as the time scale is displayed in a linear fashion with appropriate distances between the different time references.

Time                   Series  Sale Price
May 05, 2009 11:05pm   Cash    $14
June 07, 2009 12:42pm  Credit  $12
June 14, 2009 4:20pm   Cash    $100
June 01, 2009 1:22pm   Credit  $120

Pentaho Reporting builds a Time Series Dataset using the TimeSeriesCollector.

Common chart rendering properties

Most charts share a common set of properties. The following properties are common across most charts; any exceptions are mentioned as part of the specific chart type.

Required Property Group

- name: The name of the chart object within the report. This is not displayed during rendering, but must be unique in the report. A default name is generated for each chart added to the report.
- data-source: The dataset name for the chart, which is automatically populated with the name of the dataset in the Primary DataSource panel of the chart editor.
- no-data-message: The message to display if no data is available to render the chart.

Title Property Group

- chart-title: The title of the chart, which is rendered in the report.
- chart-title-field: A field representing the chart title.
- title-font: The chart title's font family, size, and style.

Options Property Group

- horizontal: If set to True, the chart's X and Y axes are rotated horizontally. The default value is False.
- series-colors: The color in which to render each series. The defaults for the first three series colors are red, blue, and green.

General Property Group

- 3-D: If set to True, renders the chart in a 3D perspective. The default value is False.
- anti-alias: If set to True, renders chart fonts as anti-aliased. The default value is True.
- bg-color: Sets the background around the chart to the specified color. If not set, defaults to gray.
- bg-image: Sets the background of the chart area to the specified image. If not set, the background of the chart area defaults to white. The chart area is the area within the axes of the chart. Supported image types include the PNG, JPG, and GIF file formats.
- show-border: If set to True, displays a border around the chart. The default value is True.
- border-color: Sets the border to the specified color. If not set, defaults to black.
- plot-border: If set to False, clears the default rendering value of the chart border.
- plot-bg-color: Sets the plot background color to the specified color. If not set, defaults to white.
- plot-fg-alpha: Sets the alpha value of the plot foreground colors relative to the plot background. The default value is 1.0.
- plot-bg-alpha: Sets the alpha value of the plot background color relative to the chart background color. The default value is 1.0.

Legend Property Group

- show-legend: If set to True, displays the legend for the chart. The default value is False.
- location: The location of the legend in relation to the chart, which may be set to top, bottom, left, or right. The default location is bottom.
- legend-border: If set to True, renders a border around the legend. The default value is True.
- legend-font: The Java font to render the legend labels in.
- legend-bg-color: Sets the legend background color. If not set, defaults to white.
- legend-font-color: Sets the legend font color. If not set, defaults to black.

Advanced Property Group

- dependencyLevel: Informs the reporting engine what order the chart should be executed in relative to other items in the report. This is useful if you are using special functions that may need to execute prior to generating the chart. The default value is 0; negative values execute before 0, and positive values execute after 0.

Common category series rendering properties

The following properties appear in charts that render category information:

Options Property Group

- stacked: If set to True, the series values will appear layered on top of one another instead of being displayed relative to one another.
- stacked-percent: If set to True, determines the percentages of each series and renders the bar height based on those percentages. The stacked property must be set to True for this property to have an effect.

General Property Group

- gridlines: If set to True, displays category grid lines. This value is True by default.

X-Axis Property Group

- label-rotation: If set, adjusts the inline item label rotation value, specified in degrees. If not specified, labels are rendered horizontally. show-labels must be set to true for this value to be relevant.
- date-format: If the item value is a date, a Java date format string may be provided to format the date appropriately. Please see Java's SimpleDateFormat JavaDoc for formatting details.
- numeric-format: If the item value is a decimal number, a Java decimal format string may be provided to format the number appropriately. Please see Java's DecimalFormat JavaDoc for formatting details.
- text-format: The label format used for displaying category items within the chart. This property is required if you would like to display the category item values. The following parameters may be defined in the format string to access details of the item: {0} accesses the series name, {1} accesses the category, and {2} accesses the item value. To display just the item value, set the format string to "{2}" (see the short example after these property listings).
- x-axis-title: If set, displays a label describing the category axis.
- show-labels: If set to true, displays x-axis labels in the chart.
- x-axis-label-width: Sets the maximum category label width ratio, which determines the maximum length each category label should render in. This might be useful if you have really long category names.
- x-axis-label-rotation: If set, adjusts the category item label rotation value, specified in degrees. If not specified, labels are rendered horizontally.
- x-font: The font to render the category axis title and labels in.

Y-Axis Property Group

- y-axis-title: If set, displays a label along the value axis of the chart.
- label-rotation: If set, determines the upward angle position of the label, where the value passed to JFreeChart is pi divided by the value. Unfortunately, this property is not very flexible and you may find it difficult to use.
- y-tick-interval: The numeric interval value used to separate range ticks in the chart.
- y-font: The font to render the range axis title in.
- y-sticky-0: If the range includes zero in the axis, making it sticky will force truncation of the axis to zero if set to True. The default value of this property is True.
- y-incl-0: If set to True, the range axis will force zero to be included in the axis.
- y-min: The minimum value to render in the range axis.
- y-max: The maximum value to render in the range axis.
- y-tick-font: The font to render the range tick values in.
- y-tick-fmt-str: The DecimalFormat string used to render the numeric range tick values.
- enable-log-axis: If set to true, displays the y-axis on a logarithmic scale.
- log-format: If set to true, presents the logarithmic scale in a human-readable view.
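To make the text-format parameters concrete, here is a small hedged example based on the category dataset shown earlier; the rendered labels are what the standard message formatting described above would produce, not output captured from a specific report. For the row with category "Store 1 Sales", series "Cash", and value 14:

text-format "{2}"        renders the label: 14
text-format "{0}: {2}"   renders the label: Cash: 14
text-format "{1} ({0})"  renders the label: Store 1 Sales (Cash)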

Getting Started with Pentaho Data Integration and Pentaho BI Suite

Vijin Boricha
24 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from Learning Pentaho Data Integration 8 CE - Third Edition written by María Carina Roldán.  In this book you will explore the features and capabilities of Pentaho Data Integration 8 Community Edition.[/box] In today’s tutorial, we will introduce you to Pentaho Data Integration (PDI) and learn to use it in real world scenario. Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are: Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server. Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. In the Enterprise Edition of Pentaho, you can also generate interactive Reports. Data mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to Weka project. Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). CTools is a set of tools and components created to help the user to build custom dashboards on top of Pentaho. There are specific CTools for different purposes, including a Community Dashboard Editor (CDE), a very powerful charting library (CCC), and a plugin for accessing data with great flexibility (CDA), among others. While the Ctools allow to develop advanced and custom dashboards, there is a Dashboard Designer, available only in Pentaho Enterprise Edition, that allows to build dashboards in an easy way. Data integration: Data integration is used to integrate scattered information from different sources (for example, applications, databases, and files) and make the integrated information available to the final user. PDI—the tool that we will learn to use throughout the book—is the engine that provides this functionality. PDI also interacts with the rest of the tools, as, for example, reading OLAP cubes, generating Pentaho Reports, and doing data mining with R Executor Script and the CPython Script Executor. All of these tools can be used standalone but also integrated. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example: Authentication and authorization Scheduling Security Web services Scalability and failover This set of software and services forms a complete BI Suite, which makes Pentaho the world's leading open source BI option on the market. Note: You can find out more about the platform at https://community.hds.com/community/products-and-solutions/pentaho/. There is also an Enterprise Edition with additional features and support. You can find more on this at http://www.pentaho.com/. Introducing Pentaho Data Integration Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception; Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. 
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. Since that moment, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance, existing functionality, new functionality, and ease of use, along with major changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 was released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.

November 2007: PDI 3.0 emerged totally redesigned. Its major library changed to gain massive performance improvements. The look and feel had also changed completely.

April 2009: PDI 3.2 was released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge number of bug fixes.

June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements.

November 2013: PDI 5.0 was released, offering better previewing of data, easier looping, a lot of big data improvements, an improved plugin marketplace, and hundreds of bug fixes and feature enhancements, as in all releases. In its Enterprise version, it offered interesting low-level features, such as step load balancing, Job transactions, and restartability.

December 2015: PDI 6.0 was released with new features such as data services, data lineage, broader support for Big Data, and several changes in the graphical designer to improve the PDI user experience. Some months later, PDI 6.1 was released, including metadata injection, a feature that enables the user to modify Transformations at runtime. Metadata injection had been available in earlier versions, but it was in 6.1 that Pentaho started to put significant effort into implementing this powerful feature.

November 2016: PDI 7.0 emerged with many improvements in the enterprise version, including data inspection capabilities, more support for Big Data technologies, and improved repository management. In the community version, the main change was expanded metadata injection support.

November 2017: Pentaho 8.0 was released. The highlights of this latest version are the optimization of processing resources, a better user experience, and enhanced connectivity to streaming data sources for real-time processing.

Using PDI in real-world scenarios

Judging by its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI does not only serve as a data integrator or an ETL tool; it is such a powerful tool that it is common to see it being used for these and for many other purposes. Here are some examples.

Loading data warehouses or data marts

The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules. However, in every case, with no exception, the process involves the following steps:

Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transforming includes tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

Loading the transformed data into the target database or file store. Depending on the requirements, the loading may overwrite the existing information or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application, even though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversion, validation, and transfer of data has to be done. PDI is meant to do all these tasks.

Data cleansing

Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying that the data meets certain rules, discarding or correcting data that doesn't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company, of any size, that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of the budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch or type the information in by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

Installing PDI

In order to work with PDI, you need to install the software. Following are the instructions to install the PDI software, irrespective of the operating system you may be using:

Go to the Download page at http://sourceforge.net/projects/pentaho/files/DataIntegration.
Choose the newest stable release. At this time, it is 8.0, as shown in the following screenshot:
Download the available zip file, which will serve you for all platforms.
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.

And that's all. You have installed the tool in just a few minutes. We have learned about installing and using PDI. You can learn more about extending PDI functionality and launching the PDI graphical designer in Learning Pentaho Data Integration 8 CE - Third Edition.

Excel 2010 Financials: Identifying the Profitability of an Investment

Packt
12 Jul 2011
5 min read
Excel 2010 Financials Cookbook: Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel.

Calculating the depreciation of assets

Assets within a business are extremely important for a number of reasons. Assets can become investments for growth or investments in another line of business. Assets can also take on many forms, such as computer equipment, vehicles, furniture, buildings, land, and so on. Assets are not only important within the business in which they are used, but they are also used as a method of reducing the tax burden on a business. As a financial manager, you are tasked with calculating the depreciation expense for a laptop computer with a useful life of five years. In this recipe, you will learn to calculate the depreciation of an asset over the life of the asset.

Getting ready

There are several different methods of depreciation. A business may use straight-line depreciation, declining depreciation, double-declining depreciation, or a variation of these methods. Excel has the functionality to calculate each of the methods with a slight variation to the function; however, in this recipe, we will use straight-line depreciation. Straight-line depreciation provides an equal reduction in an asset's value over its life.

How to do it...

We will first need to set up the Excel worksheet to hold the depreciation values for the laptop computer:

In cell A5 list Year and in cell B5 list 1. This will account for the depreciation for the year that the asset was purchased. Continue this list until all five years are listed.
In cell B2, list the purchase price of the laptop computer; the purchase price is $2500.
In cell B3, enter the salvage value of the asset. The salvage value is the estimated resale value of the asset when its useful life, as determined by generally accepted accounting principles, has elapsed. Enter $500 in cell B3.
In cell C5, enter the formula =SLN($B$2,$B$3,5) and press Enter.
Copy the formula from cell C5 and paste it through cell C9.

Excel now has listed the straight-line depreciation expense for each of the five years. As you can see in this schedule, the depreciation expense remains consistent through each year of the asset's useful life.

How it works...

Straight-line depreciation takes the purchase price minus the salvage value and divides the remainder evenly across the useful life. The function accepts inputs as follows: =SLN(purchase price, salvage value, useful life).

There's more...

Excel provides similar functions for other depreciation methods, such as DB (declining balance), DDB (double-declining balance), and SYD (sum-of-years' digits).

Calculating the future versus current value of your money

When working within finance, accounting, or general business, it is important to know how much money you have. However, knowing how much money you have now is only a portion of the whole financial picture. You must also know how much your money will be worth in the future. Knowing the future value allows you to know truly how much your money is worth, and with this knowledge, you can decide what you need to do with it. As a financial manager, you must provide feedback on whether to introduce a new product line. As with any new venture, there will be several related costs, including start-up costs, operational costs, and more. Initially, you must spend $20,000 to account for most start-up costs, and you will potentially, for the sake of the example, earn a profit of $5500 per year for five years.
You also know that, due to expenditures, you expect your cost of capital to be 10%. In this recipe, you will learn to use Excel functions to calculate the future value of the venture and to determine whether it proves to be profitable.

How to do it...

We will first need to enter all known values and variables into the worksheet:

In cell B2, enter the initial cost of the business venture.
In cell B3, enter the discount rate, or cost of capital, of 10%.
In cells B4 through B8, enter the five years' worth of expected net profit from the business venture.
In cell B10, we will calculate the net present value: enter the formula =NPV(B3,B4:B8) and press Enter.

We now see that, accounting for future inflows, the net present value of the business venture is $20,849.33. Our last step is to account for the initial start-up costs and determine the overall profitability. In cell B11, enter the formula =B10/B2 and press Enter. As a financial manager, we now see that for every $1 invested in this venture, you will receive $1.04 in present value inflows.

How it works...

NPV, or net present value, is calculated in Excel using all of the inflow information that was entered across the estimated period. For the five years used in this recipe, the venture shows a profit of $5500 for each year. This number cannot be used directly, because there is a cost to making money. The cost in this instance pertains to taxes and other expenditures. In the NPV formula, we did not include the initial start-up cost, because this cost is exempt from the cost of capital; however, it must be used at the end of the calculation to account for the outflow compared to the inflows.

There's more...

The $1.04 value calculated at the end of this recipe is also known as the profitability index. When this index is greater than one, the venture is said to be a positive investment.
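Although both recipes are worked entirely in Excel, the same arithmetic is easy to sanity-check outside the spreadsheet. The short Python sketch below is not part of the original recipes; the helper functions sln and npv are our own illustrative stand-ins for Excel's SLN and NPV, and they simply reproduce the figures used above.

# Illustrative cross-check of the SLN and NPV recipe figures.
def sln(cost, salvage, life):
    # Straight-line depreciation: an equal expense in every period.
    return (cost - salvage) / life

def npv(rate, cashflows):
    # Excel-style NPV: the first cash flow is discounted by one full period.
    return sum(cf / (1 + rate) ** (i + 1) for i, cf in enumerate(cashflows))

print(sln(2500.0, 500.0, 5))                # 400.0 of depreciation per year
present_value = npv(0.10, [5500.0] * 5)
print(round(present_value, 2))              # 20849.33
print(round(present_value / 20000.0, 2))    # 1.04, the profitability index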

IBM Cognos 10 BI dashboarding components

Packt
16 Jul 2012
7 min read
Introducing IBM Cognos 10 BI Cognos Connection

In this recipe we will be exploring Cognos Connection, which is the user interface presented to the user when he/she logs in to IBM Cognos 10 BI for the first time. IBM Cognos 10 BI, once installed and configured, can be accessed through the Web using supported web browsers. For a list of supported web browsers, refer to the Installation and Configuration Guide shipped with the product.

Getting ready

As stated earlier, make sure that IBM Cognos 10 BI is installed and configured. Install and configure the GO Sales and GO Data Warehouse samples. Use the gateway URI to log on to the web interface called Cognos Connection.

How to do it...

To explore Cognos Connection, perform the following steps:

Log on to Cognos Connection using the gateway URI, which may be similar to http://<HostName>:<PortNumber>/ibmcognos/cgi-bin/cognos.cgi.
Take note of the Cognos Connection interface. It has the GO Sales and GO Data Warehouse samples visible.
Note the blue-colored folder icon shown in the preceding screenshot. It represents metadata model packages that are published to Cognos Connection using the Cognos Framework Manager tool. These packages have objects that represent business data objects, relationships, and calculations, which can be used to author reports and dashboards. Refer to the book IBM Cognos TM1 Cookbook by Packt Publishing to learn how to create metadata model packages.
From the toolbar, click on Launch. This will open a menu showing different studios, each having different functionality, as shown in the following screenshot:
We will use Business Insight and Business Insight Advanced, which are the first two choices in the preceding menu. These are the two components used to create and view dashboards. For the other options, refer to the corresponding books by the same publisher. For instance, refer to the book IBM Cognos 8 Report Studio Cookbook to learn more about creating and distributing complex reports. Query Studio and Analysis Studio are meant to provide business users with the facility to slice and dice business data themselves. Event Studio is meant to define business situations and corresponding actions.
Coming back to Cognos Connection, note that a yellow-colored folder icon represents a user-defined folder, which may or may not contain other published metadata model packages, reports, dashboards, and other content. In our case, we have a user-defined folder called Samples. This was created when we installed and configured the samples shipped with the product.
Click on the New Folder icon on the toolbar to create a user-defined folder. Other options are also visible here, for instance, to create a new dashboard.
Click on the user-defined folder, Samples, to view its contents, as shown in the following screenshot:
As shown in the preceding screenshot, it has more such folders, each having its own content. The top part of the pane shows the navigation path. Let's navigate deeper into Models | Business Insight Samples to show some sample dashboards, created using IBM Cognos Business Insight, as shown in the following screenshot:
Click on one of these links to view the corresponding dashboard. For instance, click on Sales Dashboard (Interactive) to view the dashboard, as shown in the following screenshot:
The dashboard can also be opened in the authoring tool, which in this case is IBM Cognos Business Insight, by clicking on the icon on the extreme right in Cognos Connection.
It will show the same result as shown in the preceding screenshot. We will see the Business Insight interface in detail later in this article.

How it works...

Cognos Connection is the primary user interface that the user sees when he/she logs in for the first time. Business data first has to be identified and imported into a metadata model using the Cognos Framework Manager tool. Relationships (inner/outer joins) and calculations are then created, and the resultant metadata model package is published to the IBM Cognos 10 BI Server. This becomes available on Cognos Connection. Users are given access to the appropriate studios on Cognos Connection, according to their needs. Analyses, reports, and dashboards are then created and distributed using one of these studios. The preceding sample used Business Insight, for instance. Later sections in this article will look more into Business Insight and Business Insight Advanced. The next section focuses on the Business Insight interface details from the navigation perspective.

Exploring the IBM Cognos Business Insight User Interface

In this recipe we will explore the IBM Cognos Business Insight user interface in more detail. We will explore various areas of the UI, each dedicated to performing different actions.

Getting ready

As stated earlier, we will be exploring different sections of Cognos Business Insight. Hence, make sure that the IBM Cognos 10 BI installation is open and the samples are set up properly. We will start the recipe assuming that the IBM Cognos Connection window is already open on the screen.

How to do it...

To explore the IBM Cognos Business Insight user interface, perform the following steps:

In the IBM Cognos Connection window, navigate to Business Insight Samples, as shown in the following screenshot:
Click on one of the dashboards, for instance Marketing Dashboard, to open the dashboard in Business Insight. Different areas are labeled, as shown in the following figure:
The overall layout is termed the Dashboard. The topmost toolbar is called the Application bar. The Application bar contains different icons to manage the dashboard as a whole. For instance, we can create, open, e-mail, share, or save the dashboard using one of the icons on the Application bar. The user can explore the different icons on the Application bar by hovering the mouse pointer over them. Hovering displays a tooltip, which has brief but self-explanatory help text.
Similarly, there is a Widget toolbar for every widget, which gets activated when the user clicks on the corresponding widget. When the mouse is focused away from the widget, the Widget toolbar disappears. It has various options, for instance, to refresh the widget data, print as PDF, resize to fit content, and so on. It also provides the user with the capability to change the chart type as well as the color palette. However, all these options have help text associated with them, which is activated on mouse hover.
The Content tab and Content pane show the list of objects available on Cognos Connection. The directory structure on Cognos Connection can be navigated using the Content pane and Content tab, and hence, available objects can be added to or removed from the dashboard. Drag-and-drop functionality is provided, as a result of which creating and editing a dashboard is as simple as moving objects between the Dashboard area and Cognos Connection.
The Toolbox tab displays additional widgets. The Slider Filter and Select Value Filter widgets allow the user to filter report content.
The other toolbox widgets allow the user to add more report content to the dashboard, such as HTML content, images, RSS feeds, and rich text.

How it works...

In the preceding section, we have seen the basic areas of Business Insight. More than one user can log on to the IBM Cognos 10 BI server and create various objects on Cognos Connection. These objects include packages, reports, cubes, templates, and statistics, to name a few. These objects can be created using one or more of the tools available to users. For instance, reports can be created using one of the available studios. Cubes can be created using IBM Cognos TM1 or IBM Cognos Transformer and published on Cognos Connection. Metadata model packages can be created using IBM Cognos Framework Manager and published on Cognos Connection. These objects can then be dragged, dropped, and formatted as standalone objects in Cognos Business Insight, and hence, dashboards can be created.

Some Basic Concepts of Theano

Packt
21 Feb 2018
13 min read
In this article, Christopher Bourez, the author of the book Deep Learning with Theano, presents Theano as a compute engine and covers the basics of symbolic computing with Theano. Symbolic computing consists of building graphs of operations that are optimized later for a specific architecture, using the computation libraries available for that architecture.

Although this might sound far from practical application, Theano may be defined simply as a library for scientific computing; it has been available since 2007 and is particularly well suited to deep learning. Two important features are at the core of any deep learning library: tensor operations, and the capability to run the same code on either CPU or GPU. These two features enable us to work with massive amounts of multi-dimensional data. Moreover, Theano offers automatic differentiation, a very useful feature for solving a wider range of numeric optimization problems than deep learning alone.

The content of the article covers the following points:

- Theano installation and loading
- Tensors and algebra
- Symbolic programming

Need for tensors

Usually, input data is represented with multi-dimensional arrays:

- Images have three dimensions: the number of channels, and the width and height of the image
- Sounds and time series have one dimension: the time length
- Natural language sequences can be represented by two-dimensional arrays: the time length and the alphabet or vocabulary length

In Theano, multi-dimensional arrays are implemented with an abstraction class named tensor, with many more transformations available than for traditional arrays in a language like Python. At each stage of a neural net, computations such as matrix multiplications involve multiple operations on these multi-dimensional arrays.

Classical arrays in programming languages do not have enough built-in functionality to handle multi-dimensional computations and manipulations well and quickly. Computations on multi-dimensional arrays have a long history of optimization, backed by a wealth of libraries and hardware. One of the most important gains in speed has come from the massively parallel architecture of the graphics processing unit (GPU), with its ability to compute on a large number of cores, from a few hundred to a few thousand. Compared to a traditional CPU, for example a quad-core, 12-core, or 32-core engine, the gain with a GPU can range from a 5x to a 100x speedup, even if part of the code is still executed on the CPU (data loading, GPU piloting, result outputting). The main bottleneck when using a GPU is usually the transfer of data between CPU memory and GPU memory; still, when well programmed, using a GPU brings a significant, order-of-magnitude increase in speed. Getting results in days rather than months, or hours rather than days, is an undeniable benefit for experimentation.

The Theano engine has been designed to address these two challenges of multi-dimensional arrays and architecture abstraction from the beginning. There is another undeniable benefit of Theano for scientific computation: the automatic differentiation of functions of multi-dimensional arrays, a feature well suited to model parameter inference via objective function minimization. Such a feature facilitates experimentation by relieving the pain of computing derivatives, which might not be so complicated, but is prone to many errors.
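To make the automatic differentiation point concrete before moving on to installation, here is a minimal sketch of what it looks like in practice. This snippet is our own illustration rather than part of the original article, and it assumes a working Theano install; the gradient machinery itself is covered in depth later in the book.

import theano
import theano.tensor as T

x = T.scalar('x')
y = x ** 2

# Theano builds the symbolic derivative dy/dx = 2*x for us.
dy_dx = theano.grad(y, x)
f = theano.function([x], dy_dx)
print(f(3.0))   # 6.0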
Installing and loading Theano

Conda package and environment manager

The easiest way to install Theano is to use conda, a cross-platform package and environment manager. If conda is not already installed on your operating system, the fastest way to get it is to download the Miniconda installer from https://conda.io/miniconda.html. For example, for conda under 64-bit Linux and Python 2.7:

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
bash ./Miniconda2-latest-Linux-x86_64.sh

Conda enables you to create new environments in which the versions of Python (2 or 3) and the installed packages may differ. The conda root environment uses the same version of Python as the one installed on the system on which you installed conda.

Install and run Theano on CPU

Lastly, let's install Theano:

conda install theano

Run a Python session and try the following commands to check your configuration:

>>> import theano
>>> theano.config.device
'cpu'
>>> theano.config.floatX
'float64'
>>> print(theano.config)

The last command prints the full Theano configuration. The theano.config object contains keys for many configuration options. To infer the configuration options, Theano looks first at the ~/.theanorc file, then at any available environment variables, which override the former, and last at variables set in the code, which take the highest precedence:

>>> theano.config.floatX = 'float32'

Some of the properties are read-only and cannot be changed in the code, but the floatX property, which sets the default floating-point precision, is among the properties that can be changed directly in code. It is advised to use float32, since GPUs have a long history without float64 support, float64 execution on GPUs is slower (sometimes much slower, 2x to 32x on the latest-generation Pascal hardware), and float32 precision is enough in practice.

GPU drivers and libraries

Theano enables the use of the GPU (graphics processing unit), the unit usually used to compute the graphics displayed on the computer screen. To have Theano work on the GPU as well, a GPU backend library is required on your system.

The CUDA library (for NVIDIA GPU cards only) is the main choice for GPU computation. There is also the OpenCL standard, which is open source, but far less developed and much more experimental and rudimentary in Theano. Most scientific computation still occurs on NVIDIA cards today. If you have an NVIDIA GPU card, download CUDA from the NVIDIA website at https://developer.nvidia.com/cuda-downloads and install it. The installer will first install the latest version of the GPU drivers if they are not already installed. It will install the CUDA library in the /usr/local/cuda directory.

Install the cuDNN library, also from NVIDIA, which offers faster implementations of some operations for the GPU. To install it, I usually copy the /usr/local/cuda directory to a new directory, /usr/local/cuda-{CUDA_VERSION}-cudnn-{CUDNN_VERSION}, so that I can choose the version of CUDA and cuDNN depending on the deep learning technology I use and its compatibility.
In your .bashrc profile, add the following lines to set the $PATH and $LD_LIBRARY_PATH variables:

export PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0-cudnn-5.1/lib64:/usr/local/cuda-8.0-cudnn-5.1/lib:$LD_LIBRARY_PATH

Install and run Theano on GPU

N-dimensional GPU arrays have been implemented in Python in six different GPU libraries (Theano/CudaNdarray, PyCUDA/GPUArray, CUDAMAT/CUDAMatrix, PyOpenCL/GPUArray, Clyther, Copperhead), each a subset of NumPy.ndarray. Libgpuarray is a backend library that presents them behind a common interface with the same properties. To install libgpuarray with conda:

conda install pygpu

To run Theano in GPU mode, you need to configure the config.device variable before execution, since it is a read-only variable once the code is run. Here it is set with the THEANO_FLAGS environment variable:

THEANO_FLAGS="device=cuda,floatX=float32" python
>>> import theano
Using cuDNN version 5110 on context None
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)
>>> theano.config.device
'gpu'
>>> theano.config.floatX
'float32'

The first lines of output show that the GPU device has been correctly detected, and specify which GPU is used. By default, Theano activates CNMeM, a faster CUDA memory allocator; an initial preallocation can be specified with the gpuarray.preallocate option. In the end, my launch command will be:

THEANO_FLAGS="device=cuda,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9151/11439 Mb (0.800000) on cuda
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)

The first line confirms that cuDNN is active, and the second confirms the memory preallocation. The third line gives the default context name (that is, None when the flag device=cuda is set) and the model of the GPU used, while the default context name for the CPU will always be cpu.

It is possible to specify a GPU other than the first one by setting the device to cuda0, cuda1, and so on, on multi-GPU computers. It is also possible to run a program on multiple GPUs in parallel or in sequence (when the memory of one GPU is not sufficient), in particular when training very deep neural nets. In this case, the contexts flag, contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3, activates multiple GPUs instead of one and assigns a context name to each GPU device to be used in the code. For example, on a 4-GPU instance:

THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9177/11471 Mb (0.800000) on cuda0
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 9177/11471 Mb (0.800000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)
Using cuDNN version 5110 on context dev2
Preallocating 9177/11471 Mb (0.800000) on cuda2
Mapped name dev2 to device cuda2: Tesla K80 (0000:87:00.0)
Using cuDNN version 5110 on context dev3
Preallocating 9177/11471 Mb (0.800000) on cuda3
Mapped name dev3 to device cuda3: Tesla K80 (0000:88:00.0)

To assign computations to a specific GPU in this multi-GPU setting, the names we chose, dev0, dev1, dev2, and dev3, have been mapped to each device (cuda0, cuda1, cuda2, cuda3). This name mapping enables us to write code that is independent of the underlying GPU assignments and libraries (CUDA or other).
To keep the current configuration flags active for every Python session or execution without using environment variables, save your configuration in the ~/.theanorc file as:

[global]
floatX = float32
device = cuda0

[gpuarray]
preallocate = 1

Now, you can simply run the python command. You are all set.

Tensors

In Python, some scientific libraries such as NumPy provide multi-dimensional arrays. Theano doesn't replace NumPy, but works in concert with it. In particular, NumPy is used for the initialization of tensors.

To perform computation on CPU and GPU indifferently, variables are symbolic and represented by the tensor class, an abstraction, and writing numerical expressions consists of building a computation graph of Variable nodes and Apply nodes. Depending on the platform for which the computation graph will be compiled, tensors are replaced either:

By a TensorType variable, whose data has to be on the CPU
By a GpuArrayType variable, whose data has to be on the GPU

That way, code can be written independently of the platform on which it will be executed.

Here are a few tensor objects:

theano.tensor.scalar: a 0-dimensional array, for example 1 or 2.5
theano.tensor.vector: a 1-dimensional array, for example [0, 3, 20]
theano.tensor.matrix: a 2-dimensional array, for example [[2, 3], [1, 5]]
theano.tensor.tensor3: a 3-dimensional array, for example [[[2, 3], [1, 5]], [[1, 2], [3, 4]]]

Playing with these Theano objects in the Python shell gives a better idea:

>>> import theano.tensor as T
>>> T.scalar()
<TensorType(float32, scalar)>
>>> T.iscalar()
<TensorType(int32, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> T.dscalar()
<TensorType(float64, scalar)>

With an i, l, f, or d letter in front of the object name, you instantiate a tensor of a given type: int32, int64, float32, or float64. For real-valued (floating-point) data, it is advised to use the direct form T.scalar() instead of the f or d variants, since the direct form will use your current configuration for floats:

>>> theano.config.floatX = 'float64'
>>> T.scalar()
<TensorType(float64, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> theano.config.floatX = 'float32'
>>> T.scalar()
<TensorType(float32, scalar)>

Symbolic variables either:

Play the role of placeholders, as a starting point to build your graph of numerical operations (such as addition or multiplication): they receive the flow of the incoming data during evaluation, once the graph has been compiled
Represent intermediate or output results

Symbolic variables and operations are both part of a computation graph that will be compiled towards either the CPU or the GPU for fast execution. Let's write a first computation graph consisting of a simple addition:

>>> x = T.matrix('x')
>>> y = T.matrix('y')
>>> z = x + y
>>> theano.pp(z)
'(x + y)'
>>> z.eval({x: [[1, 2], [1, 3]], y: [[1, 0], [3, 4]]})
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)

First, two symbolic variables, or Variable nodes, are created with the names x and y, and an addition operation, an Apply node, is applied between both of them to create a new symbolic variable, z, in the computation graph. The pretty-print function pp prints the expression represented by Theano symbolic variables. eval evaluates the value of the output variable z when the first two variables, x and y, are initialized with two numerical 2-dimensional arrays.
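eval is convenient for quick checks like this, but in practice a graph is usually compiled once into a callable with theano.function and then evaluated many times. The following is a minimal sketch of that pattern, added here for illustration (it is not part of the original excerpt) and reusing the x, y, and z symbols defined above:

>>> f = theano.function(inputs=[x, y], outputs=z)
>>> f([[1, 2], [1, 3]], [[1, 0], [3, 4]])
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)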
The following example makes explicit the difference between the variables and their names:

>>> a = T.matrix()
>>> b = T.matrix()
>>> theano.pp(a + b)
'(<TensorType(float32, matrix)> + <TensorType(float32, matrix)>)'

Without names, it is more complicated to trace the nodes in a large graph. When printing the computation graph, names significantly help to diagnose problems, while the variables themselves are only used to handle the objects in the graph:

>>> x = T.matrix('x')
>>> x = x + x
>>> theano.pp(x)
'(x + x)'

Here, the original symbolic variable, named x, does not change and stays part of the computation graph. x + x creates a new symbolic variable that we assign to the Python variable x.

Note also that the plural form initializes multiple tensors, with their names, at the same time:

>>> x, y, z = T.matrices('x', 'y', 'z')

Now, let's have a look at the different functions to display the graph.

Summary

This article gave a brief idea of how to download and install Theano on various platforms, along with packages such as NumPy and SciPy, and introduced its basic tensor objects.

Resources for Article:

Further resources on this subject:
Introduction to Deep Learning [article]
Getting Started with Deep Learning [article]
Practical Applications of Deep Learning [article]

Python Graphics: Combining Raster and Vector Pictures

Packt
23 Nov 2010
12 min read
Python 2.6 Graphics Cookbook: Over 100 great recipes for creating and animating graphics using Python.

Because we are not altering and manipulating the actual properties of the images, we do not need the Python Imaging Library (PIL) in this chapter. We need to work exclusively with GIF format images because that is what Tkinter deals with. We will also see how to use The GIMP as a tool to prepare images suitable for animation.

Simple animation of a GIF beach ball

We want to animate a raster image derived from a photograph. To keep things simple and clear, we are just going to move a photographic image (in GIF format) of a beach ball across a black background.

Getting ready

We need a suitable GIF image of an object that we want to animate. An example, named beachball.gif, has been provided.

How to do it...

Copy a .gif file from somewhere and paste it into a directory where you want to keep your work-in-progress pictures. Ensure that the path in your computer's file system leads to the image to be used. In the example below, the instruction ball = PhotoImage(file = "/constr/pics2/beachball.gif") says that the image to be used will be found in a directory (folder) called pics2, which is a sub-folder of another folder called constr. Then execute the following code.

# photoimage_animation_1.py
from Tkinter import *

root = Tk()
cycle_period = 100
cw = 300   # canvas width
ch = 200   # canvas height
canvas_1 = Canvas(root, width=cw, height=ch, bg="black")
canvas_1.grid(row=0, column=1)
posn_x = 10
posn_y = 10
shift_x = 2
shift_y = 1
ball = PhotoImage(file = "/constr/pics2/beachball.gif")
for i in range(1,100):               # end the program after 100 position shifts.
    posn_x += shift_x
    posn_y += shift_y
    canvas_1.create_image(posn_x, posn_y, anchor=NW, image=ball)
    canvas_1.update()                # This refreshes the drawing on the canvas.
    canvas_1.after(cycle_period)     # This makes execution pause for 100 milliseconds.
    canvas_1.delete(ALL)             # This erases everything on the canvas.
root.mainloop()

How it works...

The image of the beach ball is shifted across a canvas. Photo-type images always occupy a rectangular area of the screen. The size of this box, called the bounding box, is the size of the image. We have used a black background, so the black corners on the image of our beach ball cannot be seen.

The vector walking creature

We make a pair of walking legs using vector graphics. We want to use these legs together with pieces of raster images and see how far we can go in making appealing animations. We import the Tkinter, math, and time modules. The math module is needed to provide the trigonometry that sustains the geometric relations that move the parts of the leg in relation to each other.

Getting ready

We will be using the Tkinter and time modules to animate the movement of lines and circles. You will see some trigonometry in the code. If you do not like mathematics, you can just cut and paste the code without needing to understand exactly how the maths works.
However, if you are a friend of mathematics it is fun to watch sine, cosine, and tangent working together to make a child smile. How to do it... Execute the program as shown in the previous image. # walking_creature_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("The thing that Strides") cw = 400 # canvas width ch = 100 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 100 # time between new positions of the ball (milliseconds). base_x = 20 base_y = 100 hip_h = 40 thy = 20 #=============================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] # The merging of the separate x and y lists into a single sequence. #================================== # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" . This returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2((y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) if L1/2 < thy: # The sign of alpha determines which way the knees bend. alpha = -math.acos(L1/(2*thy)) # Avian #alpha = math.acos(L1/(2*thy)) # Mammalian else: alpha = 0.0 theta_2 = alpha + theta_1 x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 milliseconds. chart_1.delete(ALL) # This erases *almost* everything on the canvas. # Does not delete the text from inside a function. bx_stay = base_x by_stay = base_y for j in range(0,11): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] chart_1.create_line(cx0, cy0 ,cx1 ,cy1) chart_1.create_oval(cx1-10 ,cy1-10 ,cx1+10 ,cy1+10, fill="orange") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - step_y[k] ay1 = base_y - step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. 
bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - step_y[k-aa] by1 = base_y - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_oval(ax_stay-5 ,ay1-5 ,ax1+5 ,ay1+5, fill="green") chart_1.create_oval(bx_stay-5 ,by_stay-5 ,bx_stay+5 ,by_stay+5, fill="blue") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") animdelay() root.mainloop() How it works... Without getting bogged down in detail, the strategy in the program consists of defning the motion of a foot while walking one stride. This motion is defned by eight relative positions given by the two lists step_x (horizontal) and step_y (vertical). The motion of the hips is given by a separate pair of x- and y-positions hip_x and hip_y. Trigonometry is used to work out the position of the knee on the assumption that the thigh and lower leg are the same length. The calculation is based on the sine rule taught in high school. Yes, we do learn useful things at school! The time-animation regulation instructions are assembled together as a function animdelay(). There's more In Python math module, two arc-tangent functions are available for calculating angles given the lengths of two adjacent sides. atan2(y,x) is the best because it takes care of the crazy things a tangent does on its way around a circle - tangent ficks from minus infnity to plus infnity as it passes through 90 degrees and any multiples thereof. A mathematical knee is quite happy to bend forward or backward in satisfying its equations. We make the sign of the angle negative for a backward-bending bird knee and positive for a forward bending mammalian knee. More Info Section 1 This animated walking hips-and-legs is used in the recipes that follow this to make a bird walk in the desert, a diplomat in palace grounds, and a spider in a forest. Bird with shoes walking in the Karroo We now coordinate the movement of four GIF images and the striding legs to make an Apteryx (a fightless bird like the kiwi) that walks. Getting ready We need the following GIF images: A background picture of a suitable landscape A bird body without legs A pair of garish-colored shoes to make the viewer smile The walking avian legs of the previous recipe The images used are karroo.gif, apteryx1.gif, and shoe1.gif. Note that the images of the bird and the shoe have transparent backgrounds which means there is no rectangular background to be seen surrounding the bird or the shoe. In the recipe following this one, we will see the simplest way to achieve the necessary transparency. How to do it... Execute the program shown in the usual way. # walking_birdy_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("A Walking birdy gif and shoes images") cw = 800 # canvas width ch = 200 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 80 # time between new positions of the ball (milliseconds). 
im_backdrop = "/constr/pics1/karoo.gif" im_bird = "/constr/pics1/apteryx1.gif" im_shoe = "/constr/pics1/shoe1.gif" birdy =PhotoImage(file= im_bird) shoey =PhotoImage(file= im_shoe) backdrop = PhotoImage(file= im_backdrop) chart_1.create_image(0 ,0 ,anchor=NW, image=backdrop) base_x = 20 base_y = 190 hip_h = 70 thy = 60 #========================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] #============================================= # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" this returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2(-(y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) alpha = math.atan2(hip_h,L1) theta_2 = -(theta_1 - alpha) x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # Refresh the drawing on the canvas. chart_1.after(cycle_period) # Pause execution pause for X millise-conds. chart_1.delete("walking") # Erases everything on the canvas. bx_stay = base_x by_stay = base_y for j in range(0,13): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] #chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - 10 - step_y[k] ay1 = base_y - 10 -step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - 10 - step_y[k-aa] by1 = base_y - 10 - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 chart_1.create_image(ax_stay-5 ,ay_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") chart_1.create_image(bx_stay-5 ,by_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay-15 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay-15 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") animdelay() root.mainloop() How it works... The same remarks concerning the trigonometry made in the previous recipe apply here. What we see here now is the ease with which vector objects and raster images can be combined once suitable GIF images have been prepared. There's more... 
For teachers and their students who want to make lessons on a computer, these techniques offer all kinds of possibilities: history tours and re-enactments, geography tours, and science experiments. Get the students to do projects telling stories. Animated year books?
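As a footnote to the arc-tangent note in the walking creature recipe above, the following tiny sketch (ours, not from the book) shows why math.atan2 is the safer choice compared with taking a plain arc-tangent of y/x:

import math

# atan2 uses the signs of both arguments, so it returns the correct quadrant
# and never divides by zero the way atan(y/x) would when x is zero.
print(math.degrees(math.atan2(1, 0)))     # 90.0; math.atan(1/0) would raise ZeroDivisionError
print(math.degrees(math.atan2(1, -1)))    # 135.0; math.atan(1/-1) gives -45.0, the wrong quadrant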

Working with large data sources

Packt
08 Jul 2015
20 min read
In this article, by Duncan M. McGreggor, author of the book Mastering matplotlib, we come across the use of NumPy in the world of matplotlib and big data, problems with large data sources, and the possible solutions to these problems. (For more resources related to this topic, see here.) Most of the data that users feed into matplotlib when generating plots is from NumPy. NumPy is one of the fastest ways of processing numerical and array-based data in Python (if not the fastest), so this makes sense. However by default, NumPy works on in-memory database. If the dataset that you want to plot is larger than the total RAM available on your system, performance is going to plummet. In the following section, we're going to take a look at an example that illustrates this limitation. But first, let's get our notebook set up, as follows: In [1]: import matplotlib        matplotlib.use('nbagg')        %matplotlib inline Here are the modules that we are going to use: In [2]: import glob, io, math, os         import psutil        import numpy as np        import pandas as pd        import tables as tb        from scipy import interpolate        from scipy.stats import burr, norm        import matplotlib as mpl        import matplotlib.pyplot as plt        from IPython.display import Image We'll use the custom style sheet that we created earlier, as follows: In [3]: plt.style.use("../styles/superheroine-2.mplstyle") An example problem To keep things manageable for an in-memory example, we're going to limit our generated dataset to 100 million points by using one of SciPy's many statistical distributions, as follows: In [4]: (c, d) = (10.8, 4.2)        (mean, var, skew, kurt) = burr.stats(c, d, moments='mvsk') The Burr distribution, also known as the Singh–Maddala distribution, is commonly used to model household income. Next, we'll use the burr object's method to generate a random population with our desired count, as follows: In [5]: r = burr.rvs(c, d, size=100000000) Creating 100 million data points in the last call took about 10 seconds on a moderately recent workstation, with the RAM usage peaking at about 2.25 GB (before the garbage collection kicked in). Let's make sure that it's the size we expect, as follows: In [6]: len(r) Out[6]: 100000000 If we save this to a file, it weighs in at about three-fourths of a gigabyte: In [7]: r.tofile("../data/points.bin") In [8]: ls -alh ../data/points.bin        -rw-r--r-- 1 oubiwann staff 763M Mar 20 11:35 points.bin This actually does fit in the memory on a machine with a RAM of 8 GB, but generating much larger files tends to be problematic. We can reuse it multiple times though, to reach a size that is larger than what can fit in the system RAM. Before we do this, let's take a look at what we've got by generating a smooth curve for the probability distribution, as follows: In [9]: x = np.linspace(burr.ppf(0.0001, c, d),                          burr.ppf(0.9999, c, d), 100)          y = burr.pdf(x, c, d) In [10]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.plot(x, y, linewidth=5, alpha=0.7)          axes.hist(r, bins=100, normed=True)          plt.show() The following plot is the result of the preceding code: Our plot of the Burr probability distribution function, along with the 100-bin histogram with a sample size of 100 million points, took about 7 seconds to render. This is due to the fact that NumPy handles most of the work, and we only displayed a limited number of visual elements. 
What would happen if we did try to plot all the 100 million points? This can be checked by the following code: In [11]: (figure, axes) = plt.subplots()          axes.plot(r)          plt.show() formatters.py:239: FormatterWarning: Exception in image/png formatter: Allocated too many blocks After about 30 seconds of crunching, the preceding error was thrown—the Agg backend (a shared library) simply couldn't handle the number of artists required to render all the points. But for now, this case clarifies the point that we stated a while back—our first plot rendered relatively quickly because we were selective about the data we chose to present, given the large number of points with which we are working. However, let's say we have data from the files that are too large to fit into the memory. What do we do about this? Possible ways to address this include the following: Moving the data out of the memory and into the filesystem Moving the data off the filesystem and into the databases We will explore examples of these in the following section. Big data on the filesystem The first of the two proposed solutions for large datasets involves not burdening the system memory with data, but rather leaving it on the filesystem. There are several ways to accomplish this, but the following two methods in particular are the most common in the world of NumPy and matplotlib: NumPy's memmap function: This function creates memory-mapped files that are useful if you wish to access small segments of large files on the disk without having to read the whole file into the memory. PyTables: This is a package that is used to manage hierarchical datasets. It is built on the top of the HDF5 and NumPy libraries and is designed to efficiently and easily cope with extremely large amounts of data. We will examine each in turn. NumPy's memmap function Let's restart the IPython kernel by going to the IPython menu at the top of notebook page, selecting Kernel, and then clicking on Restart. When the dialog box pops up, click on Restart. Then, re-execute the first few lines of the notebook by importing the required libraries and getting our style sheet set up. Once the kernel is restarted, take a look at the RAM utilization on your system for a fresh Python process for the notebook: In [4]: Image("memory-before.png") Out[4]: The following screenshot shows the RAM utilization for a fresh Python process: Now, let's load the array data that we previously saved to disk and recheck the memory utilization, as follows: In [5]: data = np.fromfile("../data/points.bin")        data_shape = data.shape        data_len = len(data)        data_len Out[5]: 100000000 In [6]: Image("memory-after.png") Out[6]: The following screenshot shows the memory utilization after loading the array data: This took about five seconds to load, with the memory consumption equivalent to the file size of the data. This means that if we wanted to build some sample data that was too large to fit in the memory, we'd need about 11 of those files concatenated, as follows: In [7]: 8 * 1024 Out[7]: 8192 In [8]: filesize = 763        8192 / filesize Out[8]: 10.73656618610747 However, this is only if the entire memory was available. Let's see how much memory is available right now, as follows: In [9]: del data In [10]: psutil.virtual_memory().available / 1024**2 Out[10]: 2449.1796875 That's 2.5 GB. So, to overrun our RAM, we'll just need a fraction of the total. 
This is done in the following way: In [11]: 2449 / filesize Out[11]: 3.2096985583224114 The preceding output means that we only need four of our original files to create a file that won't fit in memory. However, in the following section, we will still use 11 files to ensure that the data, if loaded into the memory, would be much larger than the memory. How do we create this large file for demonstration purposes (knowing that in a real-life situation, the data would already be created and potentially quite large)? We can try to use numpy.tile to create a file of the desired size (larger than memory), but this can make our system unusable for a significant period of time. Instead, let's use numpy.memmap, which will treat a file on the disk as an array, thus letting us work with data that is too large to fit into the memory. Let's load the data file again, but this time as a memory-mapped array, as follows: In [12]: data = np.memmap(            "../data/points.bin", mode="r", shape=data_shape) The loading of the array to a memmap object was very quick (compared to the process of bringing the contents of the file into the memory), taking less than a second to complete. Now, let's create a new file to write the data to. This file must be larger in size than our total system memory if held in memory (it will be smaller on the disk): In [13]: big_data_shape = (data_len * 11,)          big_data = np.memmap(              "../data/many-points.bin", dtype="uint8",              mode="w+", shape=big_data_shape) The preceding code creates a 1 GB file, which is mapped to an array that has the shape we requested and just contains zeros: In [14]: ls -alh ../data/many-points.bin          -rw-r--r-- 1 oubiwann staff 1.0G Apr 2 11:35 many-points.bin In [15]: big_data.shape Out[15]: (1100000000,) In [16]: big_data Out[16]: memmap([0, 0, 0, ..., 0, 0, 0], dtype=uint8) Now, let's fill the empty data structure with copies of the data we saved to the 763 MB file, as follows: In [17]: for x in range(11):              start = x * data_len              end = (x * data_len) + data_len              big_data[start:end] = data          big_data Out[17]: memmap([ 90, 71, 15, ..., 33, 244, 63], dtype=uint8) If you check your system memory before and after, you will only see minimal changes, which confirms that we are not creating an 8 GB data structure in memory. Furthermore, checking your system only takes a few seconds. Now, we can do some sanity checks on the resulting data and ensure that we have what we were trying to get, as follows: In [18]: big_data_len = len(big_data)          big_data_len Out[18]: 1100000000 In [19]: data[100000000 - 1] Out[19]: 63 In [20]: big_data[100000000 - 1] Out[20]: 63 Attempting to get the next index from our original dataset will throw an error (as shown in the following code), since it didn't have that index: In [21]: data[100000000] ----------------------------------------------------------- IndexError               Traceback (most recent call last) ... IndexError: index 100000000 is out of bounds ... But our new data does have an index, as shown in the following code: In [22]: big_data[100000000] Out[22]: 90 And then some: In [23]: big_data[1100000000 - 1] Out[23]: 63 We can also plot data from a memory-mapped array without a significant lag time.
However, note that in the following code, we will create a histogram from 1.1 billion points of data, so the plotting won't be instantaneous: In [24]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(big_data, bins=100)          plt.show() The following plot is the result of the preceding code: The plotting took about 40 seconds to generate. The odd shape of the histogram is due to the fact that, with our data file-hacking, we have radically changed the nature of our data since we've increased the sample size linearly without regard for the distribution. The purpose of this demonstration wasn't to preserve a sample distribution, but rather to show how one can work with large datasets. What we have seen is not too shabby. Thanks to NumPy, matplotlib can work with data that is too large for memory, even if it is a bit slow iterating over hundreds of millions of data points from the disk. Can matplotlib do better? HDF5 and PyTables A commonly used file format in the scientific computing community is Hierarchical Data Format (HDF). HDF is a set of file formats (namely HDF4 and HDF5) that were originally developed at the National Center for Supercomputing Applications (NCSA), a unit of the University of Illinois at Urbana-Champaign, to store and organize large amounts of numerical data. The NCSA has been a great source of technical innovation in the computing industry: a Telnet client, the first graphical web browser, a web server that evolved into the Apache HTTP server, and HDF, which is of particular interest to us, were all developed there. It is a little-known fact that NCSA's web browser code was the ancestor of both the Netscape web browser and a prototype of Internet Explorer that was provided to Microsoft by a third party. HDF is supported by Python, R, Julia, Java, Octave, IDL, and MATLAB, to name a few. HDF5 offers significant improvements and useful simplifications over HDF4. It uses B-trees to index table objects and, as such, works well for write-once/read-many time series data. Common use cases span fields such as meteorological studies, biosciences, finance, and aviation. HDF5 files of multiterabyte sizes are common in these applications. Such a file is typically constructed from the analyses of multiple HDF5 source files, thus providing a single (and often extensive) source of grouped data for a particular application. The PyTables library is built on top of the Python HDF5 library and NumPy. As such, it not only provides access to one of the most widely used large data file formats in the scientific computing community, but also links data extracted from these files with the data types and objects provided by the fast Python numerical processing library. PyTables is also used in other projects. Pandas wraps PyTables, thus extending its convenient in-memory data structures, functions, and objects to large on-disk files. To use HDF data with Pandas, you'll want to create a pandas.HDFStore, read from HDF data sources with pandas.read_hdf, or write to one with the DataFrame's to_hdf method. Files that are too large to fit in the memory may be read and written by utilizing chunking techniques. Pandas does support disk-based DataFrame operations, but these are not very efficient due to the required assembly of columns of data upon reading back into the memory.
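To make the Pandas route concrete, here is a small sketch that is not from the book: it assumes PyTables is installed, and the file name, key, and column names are invented purely for illustration. It writes a DataFrame to a table-format HDF5 store, reads a filtered subset back, and iterates over the store in chunks so that the whole file never has to fit in memory.

import numpy as np
import pandas as pd

# Hypothetical store and columns, for illustration only.
store_path = "../data/example-store.h5"
frame = pd.DataFrame({
    "town": np.random.randint(0, 1000, size=1000000),
    "temp": np.random.randn(1000000),
})

# "table" format supports queries and chunked reads; data_columns makes
# the town column usable in where clauses.
frame.to_hdf(store_path, key="weather", format="table", mode="w",
             data_columns=["town"])

# Read back only the rows we need rather than the whole file.
subset = pd.read_hdf(store_path, "weather", where="town < 10")

# Or walk the store in manageable chunks.
with pd.HDFStore(store_path, mode="r") as store:
    total = 0.0
    for chunk in store.select("weather", chunksize=100000):
        total += chunk["temp"].sum()

The same pattern scales to files much larger than RAM, since only one chunk is materialized at a time.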
One project to keep an eye on under the PyData umbrella of projects is Blaze. It's an open wrapper and utility framework that can be used when you wish to work with large datasets and generalize actions such as creation, access, updates, and migration. Blaze supports not only HDF, but also SQL, CSV, and JSON. The API usage between Pandas and Blaze is very similar, and it offers a nice tool for developers who need to support multiple backends. In the following example, we will use PyTables directly to create an HDF5 file that is too large to fit in the memory (for a machine with 8 GB of RAM). We will follow these steps: Create a series of CSV source data files that take up approximately 14 GB of disk space Create an empty HDF5 file Create a table in the HDF5 file and provide the schema metadata and compression options Load the CSV source data into the HDF5 table Query the new data source once the data has been migrated Remember the temperature and precipitation data for St. Francis, in Kansas, USA, from a previous notebook? We are going to generate random data with similar columns for the purposes of the HDF5 example. This data will be generated from a normal distribution, which will be used in the guise of the temperature and precipitation data for hundreds of thousands of fictitious towns across the globe for the last century, as follows: In [25]: head = "country,town,year,month,precip,temp\n"          row = "{},{},{},{},{},{}\n"          filename = "../data/{}.csv"          town_count = 1000          (start_year, end_year) = (1894, 2014)          (start_month, end_month) = (1, 13)          sample_size = (1 + 2                        * town_count * (end_year - start_year)                        * (end_month - start_month))          countries = range(200)          towns = range(town_count)          years = range(start_year, end_year)          months = range(start_month, end_month)          for country in countries:             with open(filename.format(country), "w") as csvfile:                  csvfile.write(head)                  csvdata = ""                  weather_data = norm.rvs(size=sample_size)                  weather_index = 0                  for town in towns:                    for year in years:                          for month in months:                              csvdata += row.format(                                  country, town, year, month,                                  weather_data[weather_index],                                  weather_data[weather_index + 1])                              weather_index += 2                  csvfile.write(csvdata) Note that we generated a sample data population that was twice as large as the expected size in order to pull both the simulated temperature and precipitation data at the same time (from the same set). This will take about 30 minutes to run. When complete, you will see the following files: In [26]: ls -rtm ../data/*.csv          ../data/0.csv, ../data/1.csv, ../data/2.csv,          ../data/3.csv, ../data/4.csv, ../data/5.csv,          ...          ../data/194.csv, ../data/195.csv, ../data/196.csv,          ../data/197.csv, ../data/198.csv, ../data/199.csv A quick look at just one of the files reveals the size of each, as follows: In [27]: ls -lh ../data/0.csv          -rw-r--r-- 1 oubiwann staff 72M Mar 21 19:02 ../data/0.csv With each file being 72 MB in size, we have data that takes up 14 GB of disk space, which exceeds the size of the RAM of the system in question. Furthermore, running queries against so much data in the .csv files isn't going to be very efficient.
It's going to take a long time. So what are our options? Well, to read this data, HDF5 is a very good fit. In fact, it is designed for jobs like this. We will use PyTables to convert the .csv files to a single HDF5. We'll start by creating an empty table file, as follows: In [28]: tb_name = "../data/weather.h5t"          h5 = tb.open_file(tb_name, "w")          h5 Out[28]: File(filename=../data/weather.h5t, title='', mode='w',              root_uep='/', filters=Filters(                  complevel=0, shuffle=False, fletcher32=False,                  least_significant_digit=None))          / (RootGroup) '' Next, we'll provide some assistance to PyTables by indicating the data types of each column in our table, as follows: In [29]: data_types = np.dtype(              [("country", "<i8"),              ("town", "<i8"),              ("year", "<i8"),              ("month", "<i8"),               ("precip", "<f8"),              ("temp", "<f8")]) Also, let's define a compression filter that can be used by PyTables when saving our data, as follows: In [30]: filters = tb.Filters(complevel=5, complib='blosc') Now, we can create a table inside our new HDF5 file, as follows: In [31]: tab = h5.create_table(              "/", "weather",              description=data_types,              filters=filters) With that done, let's load each CSV file, read it in chunks so that we don't overload the memory, and then append it to our new HDF5 table, as follows: In [32]: for filename in glob.glob("../data/*.csv"):          it = pd.read_csv(filename, iterator=True, chunksize=10000)          for chunk in it:              tab.append(chunk.to_records(index=False))            tab.flush() Depending on your machine, the entire process of loading the CSV file, reading it in chunks, and appending to a new HDF5 table can take anywhere from 5 to 10 minutes. However, what started out as a collection of the .csv files that weigh in at 14 GB is now a single compressed 4.8 GB HDF5 file, as shown in the following code: In [33]: h5.get_filesize() Out[33]: 4758762819 Here's the metadata for the PyTables-wrapped HDF5 table after the data insertion: In [34]: tab Out[34]: /weather (Table(288000000,), shuffle, blosc(5)) '' description := { "country": Int64Col(shape=(), dflt=0, pos=0), "town": Int64Col(shape=(), dflt=0, pos=1), "year": Int64Col(shape=(), dflt=0, pos=2), "month": Int64Col(shape=(), dflt=0, pos=3), "precip": Float64Col(shape=(), dflt=0.0, pos=4), "temp": Float64Col(shape=(), dflt=0.0, pos=5)} byteorder := 'little' chunkshape := (1365,) Now that we've created our file, let's use it. 
Let's excerpt a few lines with an array slice, as follows: In [35]: tab[100000:100010] Out[35]: array([(0, 69, 1947, 5, -0.2328834718674, 0.06810312195695),          (0, 69, 1947, 6, 0.4724989007889, 1.9529216219569),          (0, 69, 1947, 7, -1.0757216683235, 1.0415374480545),          (0, 69, 1947, 8, -1.3700249968748, 3.0971874991576),          (0, 69, 1947, 9, 0.27279758311253, 0.8263207523831),          (0, 69, 1947, 10, -0.0475253104621, 1.4530808932953),          (0, 69, 1947, 11, -0.7555493935762, -1.2665440609117),          (0, 69, 1947, 12, 1.540049376928, 1.2338186532516),          (0, 69, 1948, 1, 0.829743501445, -0.1562732708511),          (0, 69, 1948, 2, 0.06924900463163, 1.187193711598)],          dtype=[('country', '<i8'), ('town', '<i8'),                ('year', '<i8'), ('month', '<i8'),                ('precip', '<f8'), ('temp', '<f8')]) In [36]: tab[100000:100010]["precip"] Out[36]: array([-0.23288347, 0.4724989 , -1.07572167,                -1.370025 , 0.27279758, -0.04752531,                -0.75554939, 1.54004938, 0.8297435 ,                0.069249 ]) When we're done with the file, we do the same thing that we would do with any other file-like object: In [37]: h5.close() If we want to work with it again, simply load it, as follows: In [38]: h5 = tb.open_file(tb_name, "r")          tab = h5.root.weather Let's try plotting the data from our HDF5 file: In [39]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(tab[:1000000]["temp"], bins=100)          plt.show() Here's a plot for the first million data points: This histogram was rendered quickly, with a much better response time than what we've seen before. Hence, the process of accessing the HDF5 data is very fast. The next question might be "What about executing calculations against this data?" Unfortunately, running the following will consume an enormous amount of RAM: tab[:]["temp"].mean() We've just asked for all of the data—all of its 288 million rows. We are going to end up loading everything into the RAM, grinding the average workstation to a halt. Ideally though, when you iterate through the source data and create the HDF5 file, you also crunch the numbers that you will need, adding supplemental columns or groups to the HDF5 file that can be used later by you and your peers. If we have data that we will mostly be selecting (extracting portions) and which has already been crunched and grouped as needed, HDF5 is a very good fit. This is why one of the most common use cases that you see for HDF5 is the sharing and distribution of the processed data. However, if we have data that we need to process repeatedly, then we will either need to use another method besides the one that will cause all the data to be loaded into the memory, or find a better match for our data processing needs. We saw in the previous section that the selection of data is very fast in HDF5. What about calculating the mean for a small section of data? If we've got a total of 288 million rows, let's select a divisor of the number that gives us several hundred thousand rows at a time—2,81,250 rows, to be more precise. Let's get the mean for the first slice, as follows: In [40]: tab[0:281250]["temp"].mean() Out[40]: 0.0030696632864265312 This took about 1 second to calculate. What about iterating through the records in a similar fashion? Let's break up the 288 million records into chunks of the same size; this will result in 1024 chunks. 
We'll start by getting the ranges needed for an increment of 281,250 and then, we'll examine the first and the last row as a sanity check, as follows: In [41]: limit = 281250          ranges = [(x * limit, x * limit + limit)              for x in range(2 ** 10)]          (ranges[0], ranges[-1]) Out[41]: ((0, 281250), (287718750, 288000000)) Now, we can use these ranges to generate the mean for each chunk of 281,250 rows of temperature data and print the total number of means that we generated to make sure that we're getting our counts right, as follows: In [42]: means = [tab[x * limit:x * limit + limit]["temp"].mean()              for x in range(2 ** 10)]          len(means) Out[42]: 1024 Depending on your machine, that should take between 30 and 60 seconds. With this work done, it's now easy to calculate the mean for all of the 288 million points of temperature data: In [43]: sum(means) / len(means) Out[43]: -5.3051780413782918e-05 Through HDF5's efficient file format and implementation, combined with the splitting of our operations into tasks that would not copy the HDF5 data into memory, we were able to perform calculations across a significant fraction of a billion records in less than a minute. HDF5 even supports parallelization. So, this can be improved upon with a little more time and effort. However, there are many cases where HDF5 is not a practical choice. You may have some free-form data, and preprocessing it will be too expensive. Alternatively, the datasets may be actually too large to fit on a single machine. This is when you may consider using matplotlib with distributed data. Summary In this article, we covered the role of NumPy in the world of big data and matplotlib as well as the process and problems in working with large data sources. Also, we discussed the possible solutions to these problems using NumPy's memmap function and HDF5 and PyTables. Resources for Article: Further resources on this subject: First Steps [article] Introducing Interactive Plotting [article] The plot function [article]

Creating Themes for a Report using BIRT

Packt
17 Jul 2010
3 min read
Creating themes Using the power of stylesheets and libraries, one has the ability to apply general formatting to an entire report design using themes and reuse these among different reports. Themes provide a simple mechanism to apply a wide range of styles to an entire report design without the need to manually apply them. The following example will move the styles that we have created in our library and will show how to apply them to our report using a theme. For each of the styles we have created, select them under the Outline tab and choose Export to Library…. Choose the ClassicCarsLibrary.rptLibrary file. All of the styles will reside under the defaultTheme themes section, so select this from the drop-down list that appears next to the Theme option. Repeat these steps for all styles we have created in Customer Orders.rptDesign. Delete all of the styles stored in the Customer Orders.rptDesign file. You will notice all the styles disappear from the report designer. In the Outline tab, under the Customer Orders.rptDesign file, select the root element titled Customer Orders.rptDesign. Right-click the element and select Refresh Library. The library should already be shared since we built the report using the library's data source and datasets. If it is not, open the Resource Explorer, choose ClassicCarsLibrary.rptLibrary, right-click and choose Use Library. Under the Property Editor, change the Themes drop down to ClassicCarsLibrarydefaultTheme. When we apply the theme, we will see the detail table header automatically apply the style for table-header. Apply the remaining custom styles to the two columns in the customer information section and the order detail row. Now, we know that we can create several different themes by grouping styles together in libraries. So, when developing reports, you can create several different looks that can be applied to reports, simply by applying themes to reports with the help of libraries. Using external CSS stylesheets Another stylesheet feature is the ability to use external stylesheets and simply link to them. This works out very well when report documents are embedded into existing web portals by using the portals stylesheets to keep a consistent look and feel. This creates a sense of uniformity in the overall site. Imagine that our graphics designer gives us a CSS file and asks us to design our reports around it. There are two ways one can use CSS files in BIRT: Importing CSS files Using CSS as a resource In the following examples we are going to illustrate both scenarios. I have a CSS file containing six styles—five styles that are for predefined elements in reports and one style that is a custom style. The following is the CSS stylesheet for the given report: .page { background-color: #FFFFFF; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; line-height: 24px; color: #336699;}.table-group-footer-1 { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; line-height: 24px; color: #333333; background-color: #FFFFCC;}.title { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 24px; line-height: 40px; background-color: #99CC00; color: #003333; font-weight: bolder;}.table-header { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 20px; background-color: #669900; color: #FFFF33;}.table-footer { font-family: Arial, Helvetica, sans-serif; font-size: 14px; font-weight: bold; line-height: 22px; color: #333333; background-color: #CCFF99;}

Deep learning and regression analysis

Packt
09 Jan 2017
6 min read
In this article by Richard M. Reese and Jennifer L. Reese, authors of the book Java for Data Science, we will discuss how neural networks can be used to perform regression analysis. However, other techniques may offer a more effective solution. With regression analysis, we want to predict a result based on several input variables. (For more resources related to this topic, see here.) We can perform regression analysis using an output layer that consists of a single neuron that sums the weighted input plus bias of the previous hidden layer. Thus, the result is a single value representing the regression. Preparing the data We will use a car evaluation database to demonstrate how to predict the acceptability of a car based on a series of attributes. The file containing the data we will be using can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data. It consists of car data such as price, number of passengers, and safety information, and an assessment of its overall quality. It is this latter element that we will try to predict. The comma-delimited values in each attribute are shown next, along with substitutions. The substitutions are needed because the model expects numeric data:

Attribute | Original value | Substituted value
Buying price | vhigh, high, med, low | 3,2,1,0
Maintenance price | vhigh, high, med, low | 3,2,1,0
Number of doors | 2, 3, 4, 5-more | 2,3,4,5
Seating | 2, 4, more | 2,4,5
Cargo space | small, med, big | 0,1,2
Safety | low, med, high | 0,1,2

There are 1,728 instances in the file. The cars are marked with four classes:

Class | Number of instances | Percentage of instances | Original value | Substituted value
Unacceptable | 1210 | 70.023% | unacc | 0
Acceptable | 384 | 22.222% | acc | 1
Good | 69 | 3.99% | good | 2
Very good | 65 | 3.76% | v-good | 3

Setting up the class We start with the definition of a CarRegressionExample class, as shown next: public class CarRegressionExample { public CarRegressionExample() { try { ... } catch (IOException | InterruptedException ex) { // Handle exceptions } } public static void main(String[] args) { new CarRegressionExample(); } } Reading and preparing the data The first task is to read in the data. We will use the CSVRecordReader class to get the data: RecordReader recordReader = new CSVRecordReader(0, ","); recordReader.initialize(new FileSplit(new File("car.txt"))); DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 1728, 6, 4); With this dataset, we will split the data into two sets. Sixty-five percent of the data is used for training and the rest for testing: DataSet dataset = iterator.next(); dataset.shuffle(); SplitTestAndTrain testAndTrain = dataset.splitTestAndTrain(0.65); DataSet trainingData = testAndTrain.getTrain(); DataSet testData = testAndTrain.getTest(); The data now needs to be normalized: DataNormalization normalizer = new NormalizerStandardize(); normalizer.fit(trainingData); normalizer.transform(trainingData); normalizer.transform(testData); We are now ready to build the model. Building the model A MultiLayerConfiguration instance is created using a series of NeuralNetConfiguration.Builder methods. The following is the code used; we will discuss the individual methods following the code. Note that this configuration uses two layers.
The last layer uses the softmax activation function, which is used for regression analysis: MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder() .iterations(1000) .activation("relu") .weightInit(WeightInit.XAVIER) .learningRate(0.4) .list() .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) .backprop(true).pretrain(false) .build(); Two layers are created. The first is the input layer. The DenseLayer.Builder class is used to create this layer. The DenseLayer class is a feed-forward and fully connected layer. The created layer uses the six car attributes as input. The output consists of three neurons that are fed into the output layer; the layer is duplicated here for your convenience: .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) The second layer is the output layer created with the OutputLayer.Builder class. It uses a loss function as the argument of its constructor. The softmax activation function is used since we are performing regression, as shown here: .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) Next, a MultiLayerNetwork instance is created using the configuration. The model is initialized, its listeners are set, and then the fit method is invoked to perform the actual training. The ScoreIterationListener instance will display information as the model trains, which we will see shortly in the output of this example. Its constructor argument specifies the frequency with which information is displayed: MultiLayerNetwork model = new MultiLayerNetwork(conf); model.init(); model.setListeners(new ScoreIterationListener(100)); model.fit(trainingData); We are now ready to evaluate the model. Evaluating the model In the next sequence of code, we evaluate the model against the test dataset. An Evaluation instance is created using an argument specifying that there are four classes. The test data is fed into the model using the output method. The eval method takes the output of the model and compares it against the test data classes to generate statistics. The getLabels method returns the expected values: Evaluation evaluation = new Evaluation(4); INDArray output = model.output(testData.getFeatureMatrix()); evaluation.eval(testData.getLabels(), output); out.println(evaluation.stats()); The output of the training follows, which is produced by the ScoreIterationListener class. However, the values you get may differ due to how the data is selected and analyzed.
Notice that the score improves with the iterations but levels out after about 500 iterations: 12:43:35.685 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 0 is 1.443480901811554 12:43:36.094 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 100 is 0.3259061845624861 12:43:36.390 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 200 is 0.2630572026049783 12:43:36.676 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 300 is 0.24061281470878784 12:43:36.977 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 400 is 0.22955121170274934 12:43:37.292 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 500 is 0.22249920540161677 12:43:37.575 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 600 is 0.2169898450109222 12:43:37.872 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 700 is 0.21271599814600958 12:43:38.161 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 800 is 0.2075677126088741 12:43:38.451 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 900 is 0.20047317735870715 This is followed by the results of the stats method as shown next. The first part reports on how examples are classified and the second part displays various statistics: Examples labeled as 0 classified by model as 0: 397 times Examples labeled as 0 classified by model as 1: 10 times Examples labeled as 0 classified by model as 2: 1 times Examples labeled as 1 classified by model as 0: 8 times Examples labeled as 1 classified by model as 1: 113 times Examples labeled as 1 classified by model as 2: 1 times Examples labeled as 1 classified by model as 3: 1 times Examples labeled as 2 classified by model as 1: 7 times Examples labeled as 2 classified by model as 2: 21 times Examples labeled as 2 classified by model as 3: 14 times Examples labeled as 3 classified by model as 1: 2 times Examples labeled as 3 classified by model as 3: 30 times ==========================Scores======================================== Accuracy: 0.9273 Precision: 0.854 Recall: 0.8323 F1 Score: 0.843 ======================================================================== The regression model does a reasonable job with this dataset. Summary In this article, we examined deep learning and regression analysis. We showed how to prepare the data and class, build the model, and evaluate the model. We used sample data and displayed output statistics to demonstrate the relative effectiveness of our model. Resources for Article: Further resources on this subject: KnockoutJS Templates [article] The Heart of It All [article] Bringing DevOps to Network Operations [article]

Did unfettered growth kill Maker Media? Financial crisis leads company to shutdown Maker Faire and lay off all staff

Savia Lobo
10 Jun 2019
5 min read
Updated: On July 10, 2019, Dougherty announced the relaunch of Maker Faire and Maker Media with the new name “Make Community“. Maker Media Inc., the company behind Maker Faire, the popular event that hosts arts, science, and engineering DIY projects for children and their parents, has laid off all its employees--22 employees--and have decided to shut down due to financial troubles. In January 2005, the company first started off with MAKE, an American bimonthly magazine focused on do it yourself and/or DIWO projects involving computers, electronics, robotics, metalworking, woodworking, etc. for both adults and children. In 2006, the company first held its Maker Faire event, that lets attendees wander amidst giant, inspiring art and engineering installations. Maker Faire now includes 200 owned and licensed events per year in over 40 countries. The Maker movement gained momentum and popularity when MAKE magazine first started publishing 15 years ago.  The movement emerged as a dominant source of livelihood as individuals found ways to build small businesses using their creative activity. In 2014, The WhiteHouse blog posted an article stating, “Maker Faires and similar events can inspire more people to become entrepreneurs and to pursue careers in design, advanced manufacturing, and the related fields of science, technology, engineering and mathematics (STEM).” With funding from the Department of Labor, “the AFL-CIO and Carnegie Mellon University are partnering with TechShop Pittsburgh to create an apprenticeship program for 21st-century manufacturing and encourage startups to manufacture domestically.” Recently, researchers from Baylor University and the University of North Carolina, in their research paper, have highlighted opportunities for studying the conditions under which the Maker movement might foster entrepreneurship outcomes. Dale Dougherty, Maker Media Inc.’s founder and CEO, told TechCrunch, “I started this 15 years ago and it’s always been a struggle as a business to make this work. Print publishing is not a great business for anybody, but it works…barely. Events are hard . . . there was a drop off in corporate sponsorship”. “Microsoft and Autodesk failed to sponsor this year’s flagship Bay Area Maker Faire”, TechCrunch reports. Dougherty further told that the company is trying to keep the servers running. “I hope to be able to get control of the assets of the company and restart it. We’re not necessarily going to do everything we did in the past but I’m committed to keeping the print magazine going and the Maker Faire licensing program”, he further added. In 2016, the company laid off 17 of its employees, followed by 8 employees recently in March. “They’ve been paid their owed wages and PTO, but did not receive any severance or two-week notice”, TechCrunch reports. These layoffs may have hinted the staff of the financial crisis affecting the company. Maker Media Inc. had raised $10 million from Obvious Ventures, Raine Ventures, and Floodgate. Dougherty says, “It started as a venture-backed company but we realized it wasn’t a venture-backed opportunity. The company wasn’t that interesting to its investors anymore. It was failing as a business but not as a mission. Should it be a non-profit or something like that? Some of our best successes, for instance, are in education.” The company has a huge public following for its products. Dougherty told TechCrunch that despite the rain, Maker Faire’s big Bay Area event last week met its ticket sales target. 
Also, about 1.45 million people attended its events in 2016. “MAKE: magazine had 125,000 paid subscribers and the company had racked up over one million YouTube subscribers. But high production costs in expensive cities and a proliferation of free DIY project content online had strained Maker Media”, writes TechCrunch. Dougherty told TechCrunch he has been overwhelmed by the support shown by the Maker community. As of now, licensed Maker Faire events around the world will proceed as planned. “Dougherty also says he’s aware of Oculus co-founder Palmer Luckey’s interest in funding the company, and a GoFundMe page started for it”, TechCrunch reports. Mike Senese, Executive Editor, MAKE magazine, tweeted, “Nothing but love and admiration for the team that I got to spend the last six years with, and the incredible community that made this amazing part of my life a reality.” https://twitter.com/donttrythis/status/1137374732733493248 https://twitter.com/xeni/status/1137395288262373376 https://twitter.com/chr1sa/status/1137518221232238592 Former Mythbusters co-host Adam Savage, who was a regular presence at the Maker Faire, told The Verge, “Make Media has created so many important new connections between people across the world. It showed the power from the act of creation. We are the better for its existence and I am sad. I also believe that something new will grow from what they built. The ground they laid is too fertile to lie fallow for long.” On July 10, 2019, Dougherty announced he’ll relaunch Maker Faire and Maker Media with the new name “Make Community“. The official launch of Make Community will supposedly be next week. The company is also working on a new issue of Make Magazine that is planned to be published quarterly and the online archives of its do-it-yourself project guides will remain available. Dougherty told TechCrunch “with the goal that we can get back up to speed as a business, and start generating revenue and a magazine again. This is where the community support needs to come in because I can’t fund it for very long.” GitHub introduces ‘Template repository’ for easy boilerplate code management and distribution 12 Visual Studio Code extensions that Node.js developers will love [Sponsored by Microsoft] Shoshana Zuboff on 21st century solutions for tackling the unique complexities of surveillance capitalism

Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article written by David Julian, author of the book Designing Machine Learning Systems with Python, we will first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification. (For more resources related to this topic, see here.) There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These types of problems are regression problems. Both classification and regression require a training set of correctly labelled data. They are supervised learning problems. Originating from these basic machine learning tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to a prediction to establish a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain. We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even the search for extraterrestrial life. An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship but rather a group of instances that are different in ways that are important in the domain. For instance, establishing the subgroups smoker = true and family history = true for a target variable of heart disease = true. Finally, we consider control-type tasks. These act to optimize control settings to maximize a payoff given different conditions. This can be achieved in several ways. We can clone expert behavior: the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal actions. Clustering, on the other hand, is the task of grouping items without any information on that group; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems and the "customers who bought this also bought..." suggestions on the checkout pages of online stores.
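The differences between these task types are easier to see in code. The following is a small illustrative sketch that is not taken from the book; it assumes scikit-learn is available and uses synthetic data to place a classification task, a regression task, and an unsupervised clustering task side by side.

# Synthetic data; the models and numbers are purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(200, 2)

# Classification: predict a discrete label from the features.
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
classifier = LogisticRegression().fit(X, y_class)
print(classifier.predict([[0.5, 0.5]]))   # most likely class 1

# Regression: predict a real-valued target instead of a class.
y_real = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.randn(200) * 0.1
regressor = LinearRegression().fit(X, y_real)
print(regressor.predict([[0.5, 0.5]]))    # roughly 0.5

# Clustering: group instances with no labels at all (unsupervised).
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(clusterer.labels_[:10])

Both the classifier and the regressor needed a correctly labelled training set, while the clustering step worked from the similarity structure of the data alone, which is exactly the supervised/unsupervised distinction described above.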
Data for machine learning When considering raw data for machine learning applications, there are three separate aspects: The volume of the data The velocity of the data The variety of the data Data volume The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force, and throwing as much hardware at a problem as you can. With Moore's law, which predicts the trend of computer power doubling every two years and reaching its limit, it is clear that scalability is not, by its self, going to be able to keep pace with the ever increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches from harnessing capabilities of multi core processors, to large scale distributed computing on many different platforms. Probably, the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries, and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop. Data velocity The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of streaming processing. When input data is at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence, deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking at the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly used in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load. 
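As a toy illustration of the parallelism point above (the same algorithm run many times with different parameter settings), the following sketch is not from the book or any particular framework: it uses Python's multiprocessing pool as a local stand-in for a cluster, and the model and parameter grid are invented for the example.

from multiprocessing import Pool

import numpy as np

def evaluate(alpha):
    # One unit of work: fit a closed-form ridge-style model for a single
    # regularization value and return its training error.
    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true + rng.randn(500) * 0.5
    w = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
    error = float(np.mean((X @ w - y) ** 2))
    return (alpha, error)

if __name__ == "__main__":
    alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
    with Pool(processes=4) as pool:
        results = pool.map(evaluate, alphas)
    best_alpha, best_error = min(results, key=lambda pair: pair[1])
    print("best alpha:", best_alpha)

Swapping the local pool for a distributed scheduler or a MapReduce-style job changes the plumbing but not the shape of the computation, which is why this pattern is so common when the data volume outgrows a single machine.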
Data variety Collecting data from different sources invariably means dealing with misaligned data structures, and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a pretty different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application than the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units. Models The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output, and in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as degree of similarity. In both cases, the process is iterative. It repeats a well-defined set of tasks, that moves the model closer to a correct hypothesis. There are many models and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem solving powers. However, it can also be a bit daunting for the designer to decide what is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project. There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data. The data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are. 
The distinction between grouping and grading is not absolute, and many models contain elements of both. Geometric models One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyper space. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them. Probabilistic models Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the down side, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners. Linear models A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variation in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making it less sensitive to nuanced data. This is described by the terms variance (for over fitting models) and bias (for under fitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective. 
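Before returning to the geometric view, the variance-versus-bias contrast just described can be made concrete with a small sketch; it is not from the book, assumes scikit-learn is available, and uses synthetic data. The same tree and linear models are refit on two noisy versions of one dataset, and the tree's predictions move around far more than the linear model's.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y_true = np.sin(X).ravel()

# Refit each model on two different noisy samples of the same curve and
# measure how far apart the two sets of predictions end up.
grid = np.linspace(0, 10, 50).reshape(-1, 1)
shifts = {}
for name, make_model in (("tree", DecisionTreeRegressor),
                         ("linear", LinearRegression)):
    predictions = []
    for seed in (2, 3):
        noisy = y_true + np.random.RandomState(seed).randn(100) * 0.3
        predictions.append(make_model().fit(X, noisy).predict(grid))
    shifts[name] = float(np.mean(np.abs(predictions[0] - predictions[1])))

print(shifts)   # the tree's shift is typically several times the linear model's

The unconstrained tree chases the noise in each sample (high variance), while the straight line barely moves but cannot follow the sine curve at all (high bias).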
We know we can easily plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space. Model ensembles Ensemble techniques can be divided broadly into two types. The Averaging Method: With this method, several estimators are run independently, and their predictions are averaged. This includes the random forests and bagging methods. The Boosting Methods: With this method, weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space. Neural nets When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard wired into different parts of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks. 
Like in real brains, using a singular algorithm to describe how each neuron communicates with the other neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved to possessing an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding bicycle. Resources for Article: Further resources on this subject: Python Data Structures [article] Exception Handling in MySQL for Python [article] Python Data Analysis Utilities [article]