
How-To Tutorials - Data


Creating a sample web scraper

Packt
11 Sep 2013
10 min read
As web scrapers like to say: "Every website has an API. Sometimes it's just poorly documented and difficult to use."

To say that web scraping is a useful skill is an understatement. Whether you're satisfying a curiosity by writing a quick script in an afternoon or building the next Google, the ability to grab any online data, in any amount, in any format, while choosing how you want to store and retrieve it, is a vital part of any good programmer's toolkit.

By virtue of reading this article, I assume that you already have some idea of what web scraping entails, and perhaps have specific motivations for using Java to do it. Regardless, it is important to cover some basic definitions and principles. Very rarely does one hear about web scraping in Java—a language often thought to be solely the domain of enterprise software and interactive web apps of the 90's and early 2000's. However, there are many reasons why Java is an often underrated language for web scraping:

- Java's excellent exception handling lets you write code that deals elegantly with the often-messy Web
- Reusable data structures allow you to write once and use everywhere with ease and safety
- Java's concurrency libraries allow you to write code that can process other data while waiting for servers to return information, the slowest part of any scraper (a short sketch of this idea appears at the end of this introduction)
- The Web is big and slow, but Java RMI allows you to write code across a distributed network of machines in order to collect and process data quickly
- There are a variety of standard libraries for getting data from servers, as well as third-party libraries for parsing this data and even executing JavaScript (which is needed for scraping some websites)

In this article, we will explore these and other benefits of Java in web scraping, and build several scrapers ourselves. Although it is possible, and recommended, to skip to the sections you already have a good grasp of, keep in mind that some sections build on the code and concepts of other sections. When this happens, it will be noted at the beginning of the section.
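To make the concurrency point above concrete, here is a minimal, self-contained sketch that uses nothing beyond the JDK's ExecutorService. The class name, pool size, and URLs are arbitrary illustrations and not part of the article's own scraper, which is built step by step below.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList(
                "http://en.wikipedia.org/wiki/Java",
                "http://en.wikipedia.org/wiki/Web_scraping");
        // A small thread pool downloads several pages at once, so the program
        // is not sitting idle while it waits on each server in turn.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> pages = new ArrayList<Future<String>>();
        for (final String url : urls) {
            pages.add(pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(new URL(url).openStream()));
                    StringBuilder html = new StringBuilder();
                    String line;
                    while ((line = in.readLine()) != null) {
                        html.append(line);
                    }
                    in.close();
                    return html.toString();
                }
            }));
        }
        for (Future<String> page : pages) {
            // get() blocks until that particular download has finished
            System.out.println("Fetched " + page.get().length() + " characters");
        }
        pool.shutdown();
    }
}

The total wall-clock time is roughly that of the slowest download rather than the sum of all of them, which is exactly the property a scraper processing many pages cares about.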
How is this legal?

Web scraping has always had a "gray hat" reputation. While websites are generally meant to be viewed by actual humans sitting at a computer, web scrapers find ways to subvert that. While APIs are designed to accept obviously computer-generated requests, web scrapers must find ways to imitate humans, often by modifying headers, forging POST requests, and other methods. Web scraping often requires a great deal of problem solving and ingenuity to figure out how to get the data you want. There are often few roadmaps or tried-and-true procedures to follow, and you must carefully tailor the code to each website—often walking the line between what is intended and what is possible. Although this sort of hacking can be fun and challenging, you have to be careful to follow the rules. As in many technology fields, the legal precedent for web scraping is scant. A good rule of thumb to keep yourself out of trouble is to always follow the terms of use and copyright documents on the websites that you scrape (if any). There are some cases in which the act of web crawling is itself in murky legal territory, regardless of how the data is used. Crawling is often prohibited in the terms of service of websites where the aggregated data is valuable (for example, a site that contains a directory of personal addresses in the United States), or where a commercial or rate-limited API is available.

Twitter, for example, explicitly prohibits web scraping (at least of any actual tweet data) in its terms of service: "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited." Unless explicitly prohibited by the terms of service, there is no fundamental difference between accessing a website (and its information) via a browser and accessing it via an automated script. The robots.txt file alone has not been shown to be legally binding, although in many cases the terms of service can be.

Writing a simple scraper (Simple)

Wikipedia is not just a helpful resource for researching or looking up information but also a very interesting website to scrape. They make no effort to prevent scrapers from accessing the site, and, with very well-marked-up HTML, they make it very easy to find the information you're looking for. In this project, we will scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.

Getting ready

It is recommended that you have some working knowledge of Java, and the ability to create and execute Java programs at this point. As an example, we will use the article from the following Wikipedia link: http://en.wikipedia.org/wiki/Java

Note that this article is about the Indonesian island of Java, not the programming language. Regardless, it seems appropriate to use it as a test subject. We will be using the jsoup library to parse HTML data from websites and convert it into a manipulatable object with traversable, indexed values (much like an array). In this exercise, we will show you how to download, install, and use Java libraries. In addition, we'll also be covering some of the basics of the jsoup library in particular.

How to do it...

Now that we're starting to get into writing scrapers, let's create a new project to keep them all bundled together. Carry out the following steps for this task:

1. Open Eclipse and create a new Java project called Scraper.
2. Packages are handy for bundling collections of classes together within a single project (projects contain multiple packages, and packages contain multiple classes). You can create a new package by highlighting the Scraper project in Eclipse and going to File | New | Package. By convention, in order to prevent programmers from creating packages with the same name (and causing namespace problems), packages are named starting with the reverse of your domain name (for example, com.mydomain.mypackagename). For the rest of the article, we will begin all our packages with com.packtpub.JavaScraping appended with the package name. Let's create a new package called com.packtpub.JavaScraping.SimpleScraper.
3. Create a new class, WikiScraper, inside the src folder of the package.
4. Download the jsoup core library, the first link, from the following URL: http://jsoup.org/download
5. Place the .jar file you downloaded into the lib folder of the package you just created.
6. In Eclipse, right-click in the Package Explorer window and select Refresh. This will allow Eclipse to update the Package Explorer to the current state of the workspace folder. Eclipse should show your jsoup-1.7.2.jar file (this file may have a different name depending on the version you're using) in the Package Explorer window.
7. Right-click on the jsoup JAR file and select Build Path | Add to Build Path.
In your WikiScraper class file, write the following code:

package com.packtpub.JavaScraping.SimpleScraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.*;
import java.io.*;

public class WikiScraper {

    public static void main(String[] args) {
        scrapeTopic("/wiki/Python");
    }

    public static void scrapeTopic(String url) {
        String html = getUrl("http://www.wikipedia.org/" + url);
        Document doc = Jsoup.parse(html);
        // Select the first paragraph of the article body and strip its markup
        String contentText = doc.select("#mw-content-text > p").first().text();
        System.out.println(contentText);
    }

    public static String getUrl(String url) {
        URL urlObj = null;
        try {
            urlObj = new URL(url);
        } catch (MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try {
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            // Read the raw HTML one line at a time and accumulate it
            while ((line = in.readLine()) != null) {
                outputText += line;
            }
            in.close();
        } catch (IOException e) {
            System.out.println("There was an error connecting to the URL");
            return "";
        }
        return outputText;
    }
}

Assuming you're connected to the Internet, this should compile and run with no errors, and print the first paragraph of text from the article.

How it works...

Unlike our HelloWorld example, there are a number of libraries needed to make this code work. We can incorporate all of these using the import statements before the class declaration. There are a number of jsoup objects needed, along with two Java libraries, java.io and java.net, which are needed for creating the connection and retrieving information from the Web.

As always, our program starts out in the main method of the class. This method calls the scrapeTopic method, which will eventually print the data that we are looking for (the first paragraph of text in the Wikipedia article) to the screen. scrapeTopic requires another method, getUrl, in order to do this.

getUrl is a function that we will be using throughout the article. It takes in an arbitrary URL and returns the raw source code as a string. Essentially, it creates a Java URL object from the URL string and calls the openConnection method on that URL object. The openConnection method returns a URLConnection object, which can be used to create a BufferedReader object. BufferedReader objects are designed to read from potentially very long streams of text, stopping at a certain size limit or, very commonly, reading streams one line at a time. Depending on the potential size of the pages you might be reading (or if you're reading from an unknown source), it might be important to set a buffer size here. To simplify this exercise, however, we will continue to read as long as Java is able to. The while loop here retrieves the text from the BufferedReader object one line at a time and adds it to outputText, which is then returned.

After the getUrl method has returned the HTML string to scrapeTopic, jsoup is used. jsoup is a Java library that turns HTML strings (such as the string returned by our scraper) into more accessible objects. There are many ways to access and manipulate these objects, but the function you'll likely find yourself using most often is the select function. The jsoup select function returns a jsoup object (or an array of objects, if more than one element matches the argument to the select function), which can be further manipulated, or turned into text, and printed.
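Incidentally, jsoup can also perform the HTTP request itself, so you are not obliged to write a getUrl-style method at all. This is not how the recipe above is structured, just an alternative worth knowing; a minimal sketch follows, in which the class name, user-agent string, and timeout value are arbitrary choices for illustration:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WikiScraperConnect {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one call; userAgent() and timeout()
        // are optional but often useful when scraping real sites.
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Java")
                .userAgent("Mozilla/5.0 (sample scraper)")
                .timeout(10 * 1000)
                .get();
        System.out.println(doc.select("#mw-content-text > p").first().text());
    }
}

Either approach ends with the same Document object, so the select call discussed next works unchanged.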
The crux of our script can be found in this line:

String contentText = doc.select("#mw-content-text > p").first().text();

This finds all the elements that match #mw-content-text > p (that is, all p elements that are children of the element with the CSS ID mw-content-text), selects the first element of this set, and turns the resulting object into plain text (stripping out all tags, such as <a> tags, or other formatting that might be in the text). The program ends by printing this line out to the console.

There's more...

Jsoup is a powerful library that we will be working with in many applications throughout this article. For uses that are not covered here, I encourage you to read the complete documentation at http://jsoup.org/apidocs/. What if you find yourself working on a project where jsoup's capabilities aren't quite meeting your needs? There are literally dozens of Java-based HTML parsers on the market.

Summary

In this article we took the first step towards web scraping with Java, and learned how to scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article.


Facebook's CEO, Mark Zuckerberg summoned for hearing by UK and Canadian Houses of Commons

Bhagyashree R
01 Nov 2018
2 min read
Yesterday, the chairs of the UK and Canadian Houses of Commons issued a letter calling for Mark Zuckerberg, Facebook's CEO, to appear before them. The primary aim of this hearing is to get a clear idea of what measures Facebook is taking to curb the spread of disinformation on the social media platform and to protect user data. It is scheduled to happen at the Westminster Parliament on Tuesday, 27th November.

The committee has already gathered evidence regarding several data breaches and process failures, including the Cambridge Analytica scandal, and is now seeking answers from Mark Zuckerberg on what led to all of these incidents. Zuckerberg last attended a hearing in April this year with the Senate's Commerce and Judiciary committees, in which he was asked about the company's failure to protect its user data, its perceived bias against conservative speech, and its use for selling illegal material like drugs. Since then he has not attended any further hearings, instead sending other senior representatives such as Sheryl Sandberg, Facebook's COO. The letter pointed out: "You have chosen instead to send less senior representatives, and have not yourself appeared, despite having taken up invitations from the US Congress and Senate, and the European Parliament."

Throughout this year we saw major security and data breaches involving Facebook. The social media platform faced a security issue last month which impacted almost 50 million user accounts. Its engineering team discovered that hackers had found a way to exploit a series of bugs related to Facebook's View As feature. Earlier this year, Facebook faced a backlash over the Facebook-Cambridge Analytica data scandal, a major political scandal in which Cambridge Analytica used the personal data of millions of Facebook users for political purposes without their permission.

The report of this hearing will be shared in December, if Zuckerberg agrees to attend it. The committee has requested his response by 7th November.


Controlling Relevancy

Packt
18 Jan 2016
19 min read
In this article written by Bharvi Dixit, author of the book Elasticsearch Essentials, we understand that getting a search engine to behave can be very hard. It does not matter if you are a newbie or have years of experience with Elasticsearch or Solr; you have almost certainly struggled with low-quality search results in your application. The default algorithms of Lucene often do not come close to meeting your requirements, and there is always a struggle to deliver relevant search results. We will be covering the following topics:

- Introducing relevant search
- The Elasticsearch out-of-the-box tools
- Controlling relevancy with custom scoring

Introducing relevant search

Relevancy is the root of a search engine's value proposition and can be defined as the art of ranking content for a user's search based on how much that content satisfies the needs of the user or the business. In an application, it does not matter how beautiful your user interface looks or how many functionalities you provide to the user; search relevancy cannot be neglected at any cost. So, despite the seemingly mystical behavior of search engines, you have to find a solution to get relevant results. Relevancy becomes all the more important because a user does not care about the whole bunch of documents that you have. The user enters their keywords, selects filters, and focuses on a very small amount of data—the relevant results. And if your search engine fails to deliver according to expectations, the user might be annoyed, which might be a loss for your business.

A search engine like Elasticsearch comes with built-in intelligence. You enter the keyword and, within the blink of an eye, it returns the results that it thinks are relevant according to its intelligence. However, Elasticsearch does not have built-in intelligence specific to your application domain. Relevancy is not defined by a search engine; rather, it is defined by your users, their business needs, and the domain. Take Google or Twitter as examples: they have put in years of engineering experience, but they still fail occasionally to provide relevancy. Don't they? Further, the challenges of search differ with the domain: search on an e-commerce platform is about driving sales and bringing positive customer outcomes, whereas in fields such as medicine it is a matter of life and death. The lives of search engineers become more complicated because they do not have domain-specific knowledge that can be used to understand the semantics of user queries. However, despite all the challenges, the implementation of search relevancy is up to you, and it depends on what information you can extract from the users, their queries, and the content they see. We continuously take feedback from users, create funnels, or enable logging to capture the search behavior of users so that we can improve our algorithms and provide relevant results.

The Elasticsearch out-of-the-box tools

Elasticsearch primarily works with two models of information retrieval: the Boolean model and the Vector Space model. In addition to these, there are other scoring algorithms available in Elasticsearch as well, such as Okapi BM25, Divergence from Randomness (DFR), and Information Based (IB). Working with these three requires extensive mathematical knowledge and some extra configuration in Elasticsearch.

The Boolean model uses the AND, OR, and NOT conditions in a query to find all the matching documents. This Boolean model can be further combined with the Lucene scoring formula, TF/IDF, to rank documents. The Vector Space model works differently from the Boolean model, as it represents both queries and documents as vectors. In the vector space model, each number in the vector is the weight of a term, calculated using TF/IDF. The queries and documents are then compared using cosine similarity, in which the angle between the two vectors is compared to find the similarity, which ultimately leads to finding the relevancy of the documents.
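For readers who want to see these two ideas written down, a textbook formulation (Lucene's actual similarity adds its own normalization and boost factors on top of this) weights a term t in a document d and compares a query vector q with a document vector d as follows:

w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t), \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{similarity}(\vec{q},\vec{d}) = \cos\theta = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}

Here N is the total number of documents in the index, df(t) is the number of documents containing the term t, and tf(t, d) is how often t occurs in document d; the larger the cosine, the closer a document vector is to the query vector and the higher that document ranks.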
An example: why defaults are not enough

Let's build an index with sample documents to understand the examples in a better way. First, create an index with the name profiles:

curl -XPUT 'localhost:9200/profiles'

Then, put the mapping for the document type candidate:

curl -XPUT 'localhost:9200/profiles/candidate/_mapping' -d '{
  "properties": {
    "geo_code": {
      "type": "geo_point",
      "lat_lon": true
    }
  }
}'

Please note that in the preceding mapping we are putting a mapping only for the geo data type. The rest of the fields will be indexed dynamically. Now, you can create a data.json file with the following content in it:

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 1 }}
{ "name" : "Sam", "geo_code" : "12.9545163,77.3500487", "total_experience":5, "skills":["java","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 2 }}
{ "name" : "Robert", "geo_code" : "28.6619678,77.225706", "total_experience":2, "skills":["java"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 3 }}
{ "name" : "Lavleen", "geo_code" : "28.6619678,77.225706", "total_experience":4, "skills":["java","Elasticsearch"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 4 }}
{ "name" : "Bharvi", "geo_code" : "28.6619678,77.225706", "total_experience":3, "skills":["java","lucene"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 5 }}
{ "name" : "Nips", "geo_code" : "12.9545163,77.3500487", "total_experience":7, "skills":["grails","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 6 }}
{ "name" : "Shikha", "geo_code" : "28.4250666,76.8493508", "total_experience":10, "skills":["c","java"] }

If you are indexing skills that are separated by spaces or that include non-English characters, that is, c++, c#, or core java, you need to create the mapping for the skills field as not_analyzed in advance to have exact term matching.

Once the file is created, execute the following command to put the data inside the index we have just created:

curl -XPOST 'localhost:9200/_bulk' --data-binary @data.json

If you look carefully at the example, the documents contain the data of candidates who might be looking for jobs. For hiring candidates, a recruiter can have the following criteria:

- Candidates should know about Java
- Candidates should have between 3 and 5 years of experience
- Candidates should fall within a distance of 100 kilometers from the office of the recruiter

You can construct a simple bool query combining a term query on the skills field with geo_distance and range filters on the geo_code and total_experience fields respectively; a rough sketch of such a query with the Java client is shown below.
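The sketch below is only an illustration of that naive query using the Java client API that appears later in this article; the class and method names wrapping it are made up for the example, and the geo_distance filter is left as a comment because its builder differs between client versions.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class NaiveCandidateSearch {

    public static SearchResponse findCandidates(Client client) {
        // Hard filters: must know Java and have 3 to 5 years of experience.
        BoolQueryBuilder query = QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("skills", "java"))
                .must(QueryBuilders.rangeQuery("total_experience").from(3).to(5));
        // A geo_distance filter on geo_code (within 100 km of the office)
        // would be added here as well; it is omitted from this sketch.
        return client.prepareSearch().setIndices("profiles")
                .setTypes("candidate")
                .setQuery(query)
                .execute().actionGet();
    }
}

Every clause in it is a hard yes/no condition, which is exactly the weakness discussed next: a strong candidate sitting at 101 kilometers, or with five and a half years of experience, is excluded outright instead of being ranked a little lower.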
However, does this give a relevant set of results? The answer is no. The problem is that if you restrict the range of experience and distance, you might even get zero results or miss a suitable candidate. For example, you can put a range of 0 to 100 kilometers of distance, but your perfect candidate might be at a distance of 101 kilometers. At the same time, if you define a very wide range, you might get a huge number of non-relevant results.

The other problem is that if you search for candidates who know Java, there is a chance that a person who knows only Java and no other programming language will be at the top, while a person who knows other languages apart from Java will be at the bottom. This happens because, during the ranking of documents with TF/IDF, the lengths of the fields are taken into account: if the length of a field is small, the document is considered more relevant. Elasticsearch is not intelligent enough to understand the semantic meaning of your queries, but for these scenarios it offers you the full power to redefine how scoring and document ranking should be done.

Controlling relevancy with custom scoring

In most cases, you are good to go with the default scoring algorithms of Elasticsearch to return the most relevant results. However, some cases require you to have more control over the calculation of a score. This is especially required while implementing domain-specific logic, such as finding the relevant candidates for a job, where you need to implement a very specific scoring formula. Elasticsearch provides you with the function_score query to take control of all these things.

Here we cover the code examples only in Java, because the Python client lets you pass the query inside the body parameter of a search function. Python programmers can simply use the example queries in the same way; there is no extra module required to execute them.

function_score query

The function_score query allows you to take complete control of how a score is calculated for a particular query. The syntax of a function_score query is as follows:

{
  "query": {
    "function_score": {
      "query": {},
      "boost": "boost for the whole query",
      "functions": [
        {}
      ],
      "max_boost": number,
      "score_mode": "(multiply|max|...)",
      "boost_mode": "(multiply|replace|...)",
      "min_score": number
    }
  }
}

The function_score query has two parts: the first is the base query that finds the overall pool of results you want; the second is a list of functions, which are used to adjust the scoring. These functions can be applied to each document that matches the main query in order to alter or completely replace the original query _score. In a function_score query, each function is composed of an optional filter that tells Elasticsearch which records should have their scores adjusted (this defaults to all records) and a description of how to adjust the score.

The other parameters that can be used with a function_score query are as follows:

- boost: An optional parameter that defines the boost for the entire query.
- max_boost: The maximum boost that will be applied by a function score.
- boost_mode: An optional parameter, which defaults to multiply. It defines how the combined result of the score functions will influence the final score together with the subquery score. This can be replace (only the function score is used and the query score is ignored), max (the maximum of the query score and the function score), min (the minimum of the query score and the function score), sum (the query score and the function score are added), avg, or multiply (the query score and the function score are multiplied).
- score_mode: This parameter specifies how the results of the individual score functions will be aggregated. The possible values are first (only the first function that has a matching filter is applied), avg, max, sum, min, and multiply.
- min_score: The minimum score to be used. To exclude documents that do not meet a certain score threshold, set min_score to the desired threshold.

The following are the built-in functions that are available to be used with the function_score query:

- weight
- field_value_factor
- script_score
- The decay functions: linear, exp, and gauss

Let's see them one by one, and then you will learn how to combine them in a single query.

weight

A weight function allows you to apply a simple boost to each document without the boost being normalized: a weight of 2 results in 2 * _score. For example:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "filter": {
            "term": {
              "skills": "python"
            }
          },
          "weight": 2
        }
      ],
      "boost_mode": "replace"
    }
  }
}

The preceding query will match all the candidates who know Java, but will give a higher score to the candidates who also know Python. Please note that boost_mode is set to replace, so the _score calculated by the query is overridden by the weight function for the documents matching our filter clause. The query output will contain, at the top, the candidates with a _score of 2 who know both Java and Python.

Java example

The previous query can be implemented in Java in the following way. First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

Then the following code snippet can be used to implement the query:

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(QueryBuilders.termQuery("skills", "python"),
        ScoreFunctionBuilders.weightFactorFunction(2))
    .boostMode("replace");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();

field_value_factor

This function uses the value of a field in the document to alter the _score:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "total_experience"
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

The preceding query finds all the candidates with java in their skills, but influences the total score depending on the total experience of the candidate. So, the more experience a candidate has, the higher the ranking they will get.
Please note that boost_mode is set to multiply, which yields the following formula for the final score:

_score = _score * doc['total_experience'].value

However, there are two issues with the preceding approach. First, documents with a total experience value of 0 will have their final score reset to 0. Second, the Lucene _score usually falls between 0 and 10, so a candidate with more than 10 years of experience will completely swamp the effect of the full-text search score. To get rid of this problem, apart from the field parameter, the field_value_factor function provides you with the following extra parameters:

- factor: An optional factor to multiply the field value with. It defaults to 1.
- modifier: A mathematical modifier to apply to the field value. This can be none, log, log1p, log2p, ln, ln1p, ln2p, square, sqrt, or reciprocal. It defaults to none.

Java example

The preceding query can be implemented in Java in the following way. First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;

Then the following code snippet can be used to implement the query:

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(new FieldValueFactorFunctionBuilder("total_experience"))
    .boostMode("multiply");

SearchResponse response = client.prepareSearch().setIndices("profiles")
    .setTypes("candidate").setQuery(functionQuery)
    .execute().actionGet();

script_score

script_score is the most powerful function available in Elasticsearch. It uses a custom script to take complete control of the scoring logic. You can write a custom script to implement the logic you need, from very simple to very complex, and scripts are cached to speed up repetitive executions. Let's see an example:

{
  "script_score": {
    "script": "doc['total_experience'].value"
  }
}

Look at the special syntax for accessing field values inside the script parameter. This is how the values of fields are accessed using the Groovy scripting language. Scripting is disabled by default in Elasticsearch, so to use script score functions you first need to add this line to your elasticsearch.yml file:

script.inline: on

To see some of the power of this function, look at the following example:

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "term": {
          "skills": {
            "value": "java"
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "params": {
              "skill_array_provided": [
                "java",
                "python"
              ]
            },
            "script": "final_score=0; skill_array = doc['skills'].toArray(); counter=0; while(counter<skill_array.size()){for(skill in skill_array_provided){if(skill_array[counter]==skill){final_score = final_score+doc['total_experience'].value};};counter=counter+1;};return final_score"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}

Let's understand the preceding query:

- params is the placeholder where you can pass parameters to your function, similar to how you use parameters inside a method signature in other languages.
- Inside the script parameter, you write your complete logic.

This script iterates through each document that has Java mentioned in its skills and, for each document, fetches all the skills and stores them inside the skill_array variable. Finally, each skill that we have passed inside the params section is compared with the skills inside skill_array. If a skill matches, the value of the final_score variable is incremented by the value of the total_experience field of that document. The score calculated by the script will be used to rank the documents because boost_mode is set to replace the original _score value.

Do not try to work with analyzed fields while writing scripts; you might get weird results. This is because, had our skills field contained a value such as "core java", you could not have got an exact match for it inside the script section. So, fields with space-separated values need to be set as not_analyzed (or indexed with the keyword analyzer) in advance.

To write these script functions, you need to have some command of Groovy scripting. However, if you find it complex, you can write these scripts in other languages, such as Python, using the language plugins of Elasticsearch. More on this can be found here: https://github.com/elastic/elasticsearch-lang-python

For fast performance, use Groovy or Java functions. Python and JavaScript code requires the marshalling and unmarshalling of values, which kills performance due to higher CPU/memory usage.

Java example

The previous query can be implemented in Java in the following way. First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;
import org.elasticsearch.script.Script;

Then the following code snippet can be used to implement the query:

String script = "final_score=0; skill_array = doc['skills'].toArray(); "
    + "counter=0; while(counter<skill_array.size())"
    + "{for(skill in skill_array_provided)"
    + "{if(skill_array[counter]==skill)"
    + "{final_score = final_score+doc['total_experience'].value};};"
    + "counter=counter+1;};return final_score";

ArrayList<String> skills = new ArrayList<String>();
skills.add("java");
skills.add("python");

Map<String, Object> params = new HashMap<String, Object>();
params.put("skill_array_provided", skills);

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder(QueryBuilders.termQuery("skills", "java"))
    .add(new ScriptScoreFunctionBuilder(new Script(script, ScriptType.INLINE, "groovy", params)))
    .boostMode("replace");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();

As you can see, the script logic is a simple string that is used to instantiate the Script class passed to the ScriptScoreFunctionBuilder constructor.

Decay functions - linear, exp, gauss

We have seen the problems of restricting the range of experience and distance, which could result in getting zero results or missing suitable candidates. Maybe a recruiter would like to hire a candidate from a different province because of a good candidate profile.
So, instead of completely restricting with the range filters, we can incorporate sliding-scale values such as geo_location or dates into _score to prefer documents near a latitude/longitude point or recently published documents. The function_score query lets you work with this sliding scale with the help of three decay functions: linear, exp (that is, exponential), and gauss (that is, Gaussian). All three functions take the same parameters, which are required to control the shape of the curve created for the decay function:

- origin: The point used to calculate distance. For date fields, the default is the current timestamp.
- scale: Defines the distance from the origin at which the computed score will be equal to the decay parameter. The origin and scale parameters can be thought of as your min and max, defining a bounding box within which the curve is defined. If we want to give more of a boost to the documents that have been published in the past 10 days, it would be best to define the origin as the current timestamp and the scale as 10d.
- offset: The decay function is only computed for documents with a distance greater than the defined offset. The default is 0.
- decay: Alters how severely the document is demoted based on its position. The default decay value is 0.5.

All three decay functions work only on numeric, date, and geo-point fields.

GET profiles/candidate/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "exp": {
            "geo_code": {
              "origin": {
                "lat": 28.66,
                "lon": 77.22
              },
              "scale": "100km"
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

In the preceding query, we have used the exponential decay function, which tells Elasticsearch to start decaying the score after a distance of 100 km from the given origin. So, candidates who are at a distance of more than 100 km from the given origin will be ranked low, but not discarded. These candidates can still get a higher rank if we combine other function score queries, such as weight or field_value_factor, with the decay function and combine the results of all the functions together.
Java example

The preceding query can be implemented in Java in the following way. First, you need to import the following classes into your code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.*;

Then the following code snippet can be used to implement the query:

Map<String, Object> origin = new HashMap<String, Object>();
String scale = "100km";
origin.put("lat", "28.66");
origin.put("lon", "77.22");

FunctionScoreQueryBuilder functionQuery = new FunctionScoreQueryBuilder()
    .add(new ExponentialDecayFunctionBuilder("geo_code", origin, scale))
    .boostMode("multiply");
// For the linear decay function, use:
// .add(new LinearDecayFunctionBuilder("geo_code", origin, scale)).boostMode("multiply");
// For the Gauss decay function, use:
// .add(new GaussDecayFunctionBuilder("geo_code", origin, scale)).boostMode("multiply");

SearchResponse response = client.prepareSearch().setIndices(indexName)
    .setTypes(docType).setQuery(functionQuery)
    .execute().actionGet();

In the preceding example, we have used the exp decay function, but the commented lines show how the other decay functions can be used. Finally, remember that Elasticsearch lets you use multiple functions in a single function_score query to calculate a score that combines the results of each function.

Summary

In this article we covered one of the most important aspects of search engines: relevancy. We discussed the powerful scoring capabilities available in Elasticsearch and practical examples showing how you can control the scoring process according to your needs. Despite the relevancy challenges faced while working with search engines, out-of-the-box features such as function scores and custom scoring allow us to tackle those challenges with ease.


Techniques for Creating a Multimedia Database

Packt
17 May 2013
37 min read
Tier architecture

The rules surrounding technology are constantly changing. Decisions and architectures based on current technology might easily become out of date with hardware changes. To best understand how multimedia and unstructured data fit and can adapt to the changing technology, it's important to understand how and why we arrived at our different current architectural positions. In some cases we have come full circle and reinvented concepts that were in use 20 years ago. Only by learning from the lessons of the past can we see how to move forward to deal with this complex environment.

In the past 20 years a variety of architectures have come about in an attempt to satisfy some core requirements:

- Allow as many users as possible to access the system
- Ensure those users have good performance for accessing the data
- Enable those users to perform DML (insert/update/delete) safely and securely (safely implies the ability to restore data in the event of failure)

The goal of a database management system was to provide an environment where these points could be met. The first databases were not relational. They were heavily I/O focused, as the computers did not have much memory and the idea of caching data was deemed to be too expensive. The servers had kilobytes, and then eventually megabytes, of memory. This memory was required foremost by the programs running in them. The most efficient architecture was to use pointers to link the data together. The architecture that emerged naturally was hierarchical, and a program would navigate the hierarchy to find rows related to each other.

Users connected in via a dumb terminal. This was a monitor with a keyboard that could process input and output from a basic protocol and display it on the screen. All the processing of information, including how the screen should display it (using simple escape sequence commands), was controlled in the server.

Traditional no tier

The mainframes used a block mode structure, where the user would enter a screen full of data and press the Enter key. After doing this, the whole screen of information was sent to the server for processing. Other servers used asynchronous protocols, where each letter, as it was typed, was sent to the server for processing. This method was not as efficient as block mode because it required more server processing power to handle the data coming in. It did provide a friendlier interface for data entry, as mistakes made could be relayed immediately back to the user. Block mode could only display errors once the screen of data was sent, processed, and returned.

As more users started using these systems, the amount of data in them began to grow and the users wanted to get more intelligence out of the data entered. Requirements for reporting appeared, as well as the ability to do ad hoc querying. The databases were also very hard to maintain and enhance, as the pointer structure linked everything together tightly. It was very difficult to perform maintenance and changes to code.

In the 1970s the relational database concept was formulated, and it was based on sound mathematical principles. In the early 1980s the first conceptual relational databases appeared in the marketplace, with Oracle leading the way. The relational databases were not received well. They performed poorly and used a huge amount of server resources.
Though they achieved a stated goal of being flexible and adaptable, enabling more complex applications to be built quicker, the performance overhead of performing joins proved to be a major issue. Benefits could be seen in them, but they could never be seen as usable in any environment that required tens to hundreds or thousands of concurrent users. The technology wasn't there to handle them.

To initially achieve better performance, the relational database vendors focused on a changing hardware feature: memory. By the late 1980s computer servers were starting to move from 16 bit to 32 bit. Memory was increasing and there was a drop in its price. By adapting to this, the vendors managed to take advantage of memory and improve join performance.

The relational databases in effect achieved a balancing act between memory and disk I/O. Accessing a disk was about a thousand times slower than accessing memory. Memory was transient, meaning that if there was a power failure, any data stored in memory would be lost. Memory was also measured in megabytes, whereas disk was measured in gigabytes. Disk was not transient and generally reliable, but still required safeguards to be put in place to protect from disk failure. So the balancing act the databases performed involved caching frequently accessed data in memory, while ensuring any modifications made to that data were always stored to disk. Additionally, the database had to ensure no data was lost if a disk failed.

To improve join performance the database vendors came up with their own solutions involving indexing, optimization techniques, locking, and specialized data storage structures. Databases were judged on the speed at which they could perform joins. The flexibility and ease with which applications could be updated and modified compared to the older systems soon made the relational database popular and a must-have. As all relational databases conformed to an international SQL standard, there was a perception that a customer was never locked into a proprietary system and could move their data between different vendors. Though there were elements of truth to this, the reality has shown otherwise. The Oracle Database's key strength was that you were not locked into the hardware; it offered the ability to move a database from a mainframe to Windows to Unix. This portability across hardware effectively broke the stranglehold a number of hardware vendors had, and opened up the competition, enabling hardware vendors to focus on the physical architecture rather than the operating system within it.

In the early 1990s, with the rise in popularity of the Apple Macintosh, the rules changed dramatically and the concept of a user-friendly graphical environment appeared. The Graphical User Interface (GUI) screen offered a powerful interface for the user to perform data entry. Though it can be argued that data entry was not (and is still not) as fast as data entry via a dumb terminal interface, the use of colors, varying fonts, widgets, comboboxes, and a whole repository of specialized frontend data entry features made the interface easier to use, and more data could be entered with less typing. Arguably, the GUI opened up the computer to users who could not type well. The interface was easier to learn and less training was needed to use it.

Two tier

The GUI interface had one major drawback: it was expensive to run on the CPU.
Some vendors experimented with running the GUI directly on the server (the Solaris operating system offered this capability), but it became obvious that this solution would not scale. To address this, the two-tier architecture was born. This involved using the GUI, which was running on an Apple Macintosh, Microsoft Windows, or another windowing environment (Microsoft Windows wasn't the only GUI to run on Intel platforms), to handle the display processing. This was achieved by moving the application, and its display, to the computer that the user was using, thus splitting the GUI presentation layer and application from the database.

This seemed like an ideal solution, as the database could now just focus on handling and processing SQL queries and DML. It did not have to be burdened with application processing as well. As there were no agreed network protocols, a number had to be used, including named pipes, LU6.2, DECNET, and TCP/IP. The database had to handle language conversion as the data was moved between the client and the server. The client might be running on a 16-bit platform using US7ASCII as the character set, while the server might be running on 32-bit using EBCDIC as the character set. The network suddenly became very complex to manage.

What proved to be the ultimate showstopper with the architecture had nothing to do with the scalability of client or database performance, but rather something which is always neglected in any architecture: the scalability of maintenance. Having an environment of a hundred users, each with their own computer accessing the server, requires a team of experts to manage those computers and ensure the software on them is correct. Application upgrades meant upgrading hundreds of computers at the same time. This was a time-consuming and manual task. Compounding this, if the client computer was running multiple applications, upgrading one might impact the other applications. Even applying an operating system patch could impact other applications. Users also might install their own software on their computer and impact the application running on it. A lot of time was spent supporting users and ensuring their computers were stable and could correctly communicate with the server.

Three tier

Specialized software vendors tried to come to the rescue by offering the ability to lock down a client computer from being modified and allowing remote access to the computer to perform remote updates. Even then, the maintenance side proved very difficult to deal with, and when the idea of a three-tier architecture was pushed by vendors, it was very quickly adopted as the ideal solution to move towards because it critically addressed the maintenance issue.

In the mid 1990s the rules changed again. The Internet started to gain in popularity and the web browser was invented. The browser opened up the concept of a smart presentation layer that is very flexible and configured using a simple markup language. The browser ran on top of the protocol called HTTP, which uses TCP/IP as the underlying network protocol. The idea of splitting the presentation layer from the application became a reality as more applications appeared in the browser. The web browser was not an ideal platform for data entry, as the HTTP protocol was stateless, making it very hard to perform transactions in it. The HTTP protocol could scale, however. The actual usage involved the exact same concepts as block mode data entry performed on mainframe computers.
In a web browser all the data is entered on the screen and then sent in one go to the application handling the data. The web browser also pushed the idea that the operating system the client is running on is immaterial. Web browsers were ported to Apple computers, Windows, Solaris, and Unix platforms. The web browser also introduced the idea of a standard for the presentation layer. All vendors producing a web browser had to conform to the agreed HTML standard. This ensured that any application built to conform to HTML would be able to run in any web browser. The web browser pushed the concept that the presentation layer had to run on any client computer (later on, any mobile device as well), irrespective of the operating system and whatever else was installed on it. The web browser was essentially immune from anything else running on the client computer. If all the client had to use was a browser, maintenance on the client machine would be simplified.

HTML had severe limitations and it was not designed for data entry. To address this, the Java language came about and provided the concept of an applet, which could run inside the browser, be safe, and provide an interface to the user for data entry. Different vendors came up with different architectures for splitting their two-tier applications into three-tier ones. Oracle achieved this by taking their Oracle Forms product and moving it to the middle application tier, and providing a framework where the presentation layer would run as a Java applet inside the browser. The Java applet would communicate with a process on the application server, which would give it its own instructions for how to draw the display. When the Forms product was replaced with JDeveloper, the same concept was maintained and enhanced. The middle tier became more flexible and multiple middle application tiers could be configured, enabling more concurrent users.

The three-tier architecture has proven to be an ideal environment for legacy systems, giving them a new life and enabling them to be put in an environment where they can scale. The three-tier environment has a major flaw preventing it from truly scaling, however: the bottleneck between the application layer and the database. The three-tier environment is also designed for relational databases; it is not designed for multimedia databases. In this architecture, if the digital objects are stored in the database, then to be delivered to the customer they need to pass through the application-database network (exacerbating the bottleneck capacity issues), and from there be passed to the presentation layer. Those building in this environment naturally lend themselves to the concept that the best location for the digital objects is the middle tier. This then leads to issues of security, backup, and management, and all the issues previously cited for why storing the digital objects in the database is ideal. The logical conclusion to this is to move the database to the middle tier to address it. In reality, the logical conclusion is to move the application tier back into the database tier.

Virtualized architecture

In the mid 2000s the idea of virtualization began to appear in the marketplace. Virtualization was not really a new idea and the concept had existed in the IBM MVS environment since the late 1980s. What made this virtualization concept powerful was that it could run Windows, Linux, Solaris, and Mac environments within it.
A virtualized environment was basically the ability to run a complete operating system within another operating system. If the computer server had sufficient power and memory, it could run multiple virtualizations (VMs). We can take a snapshot of a VM, which involves taking a view of the disk and memory and storing it. It then becomes possible to roll back to the snapshot. A VM could be easily cloned (copied) and backed up. VMs could also be easily transferred to different computer servers. The VM was not tied to a physical server, and the same environment could be moved to new servers as their capacity increased.

A VM environment became attractive to administrators simply because it was easy to manage. Rather than running five separate servers, an administrator could have one server with five virtualizations in it.

The VM environment entered at a critical moment in the evolution of computer servers. Prior to 2005 most computer servers had one or two CPUs in them. The more advanced could have as many as 64 (for example, the Sun E10000), but generally, one or two was the simplest solution. The reason was that computer power was doubling every two years following Moore's law. By around 2005 the market began to realize that there was a limit to the speed of an individual CPU due to physical limitations in the size of the transistors in the chips. The solution was to grow the CPUs sideways, and the concept of cores came about. A CPU could be broken down into multiple cores, where each one acted like a separate CPU but was contained in one chip. With the introduction of smart threading, the number of virtual cores increased. A single CPU could now simulate eight or more CPUs.

This concept has changed the rules. A server can now run with a large number of cores, whereas 10 years ago it was physically limited to one or two CPUs. Previously, if a process went wild and consumed all the resources of one CPU, it impacted all users. In the multicore CPU environment, a rogue process will not impact the others. In a VM, the controlling operating system (which is also called a hypervisor, and can be hardware, firmware, or software centric) can constrain VMs to certain cores as well as to CPU thresholds within those cores. This allows a VM to be fenced in. This concept was taken up by Amazon, and the concept of the cloud environment formed.

This architecture is now moving onto a new path where users can use remote desktop into their own VM on a server. The user now needs only a simple laptop (resulting in the demise of the tower computer) to remote desktop (or equivalent) into the virtualization. They then become responsible for managing their own laptop, and in the event of an issue, it can be replaced or wiped and reinstalled with a base operating system on it. This simplifies the management. As all the business data and application logic is in the VM, the administrator can now control it, easily back it up, and access it.

Though this VM cloud environment seems like a good solution for resolving the maintenance scalability issue, a spanner has been thrown in the works: at the same time as VMs were becoming popular, the mobile phone was evolving into a portable handheld device with applications running on it.

Mobile applications architecture

The iPhone, iPad, Android, Samsung, and other devices have caused a disruption in the marketplace as to how the relationship between the user and the application is perceived and managed.
These devices are simpler and on the face of it employ a variety of architectures, including two tier and three tier. Quality control of the application is managed by having an independent and separate environment where the user can obtain the application for their mobile device. The strict controls Apple employs for using iTunes are primarily to ensure that Trojan code or viruses are not embedded in the application, meaning a mobile device does not require complex and constantly updating anti-virus software. Though the interface is not ideal for heavy data entry, the applications are naturally designed to be very friendly and use touch screen controls. The low cost combined with the simple interface has made these devices an ideal product for most people, and they are replacing the need for a laptop in a number of cases. Application vendors whose applications naturally lend themselves to this environment are taking full advantage of it to provide a powerful interface for clients to use.

The result is that there are two architectures today that exist and are moving in different directions. Each one is popular and resolves certain issues. Each has different interfaces, and when building and configuring a storage repository for digital objects, both these environments need to be taken into consideration. For a multimedia environment the ideal solution for implementing the application is based on the Web. This is because the web environment over the last 15 years has evolved into one which is very flexible and adaptable for dealing with the display of those objects. From the display of digital images to streaming video, the web browser (sometimes with plugins to improve the display) is ideal. This includes the display of documents. The browser environment, though, is not strong for the editing of these digital objects. Adobe Photoshop, Gimp, Garage Band, Office, and a whole suite of other products are available that are designed to edit each type of digital object perfectly. This means that currently the editing of digital objects requires a different solution to the loading, viewing, and delivery of those objects. There is no single right tier architecture for managing digital objects.

The N-Tier model moves the application back into the database tier. An HTTP server can also be located in this tier, or for higher availability it can be located externally. Optimal performance is achieved by locating the application as close to the database as possible. This reduces the network bottleneck. By locating the application within the database (in Oracle this is done by using PL/SQL or Java) an ideal environment is configured where there is no overhead between the application and the database. The N-Tier model also supports the concept of having the digital objects stored outside the environment and delivered using other methods. This could include a streaming server. The N-Tier model also supports the concept of transformation servers. Scalability is achieved by adding more tiers and spreading the database between them. The model also deals with the issue of the connection to the Internet becoming a bottleneck: a database server in the tier can be moved to another network to help balance the load. For Oracle this can be done using RAC to achieve a form of transparent scalability. In most situations, though, scalability at the server is achieved manually using a form of application partitioning.
Basic database configuration concepts

When a database administrator first creates a database that they know will contain digital objects, they will be confronted with some basic database configuration questions covering key sizing features of the database.

When looking at the Oracle Database there are a number of physical and logical structures built inside the database. To avoid confusion with other database management systems, it's important to note that an Oracle Database is a collection of schemas, whereas in other database management systems the term database equates to exactly one schema. This confusion has caused a lot of issues in the past. An Oracle Database administrator will say it can take 30 minutes to an hour to create a database, whereas a SQL Server administrator will say it takes seconds to create a database. In Oracle, creating a schema (the equivalent of a SQL Server database) also takes seconds to perform.

For the physical storage of tables, the Oracle Database is composed of logical structures called tablespaces. The tablespace is designed to provide a transparent layer between the developer creating a table and the physical disk system, and to ensure the two are independent. Data in a table that resides in a tablespace can span multiple disks, a disk subsystem, or a network storage system. A subsystem equating to a RAID structure is covered in greater detail at the end of this article.

A tablespace is composed of many physical datafiles. Each datafile equates to one physical file on the disk. The goal when creating a datafile is to ensure its allocation of storage is contiguous, so that the operating system doesn't split its location into different areas on the disk (RAID and NAS structures store the data in different locations based on their core structure, so this rule does not apply to them). A contiguous file will result in less disk activity being performed when full tablespace scans are performed. In some cases, especially when reading in very large images, this can improve performance.

A datafile is divided (when using locally managed tablespaces, the default in Oracle) into fixed size extents. Access to the extents is controlled via a bitmap which is managed in the header of the tablespace (which will reside on a datafile). An extent is based on the core Oracle block size. So if the extent is 128 KB and the database block size is 8 KB, 16 Oracle blocks will exist within the extent. An Oracle block is the smallest unit of storage within the database. Blocks are read into memory for caching, updated, and changes stored in the redo logs.

Even though the Oracle block is the smallest unit of storage within the database, as a datafile is an operating system file, the unit of storage at the filesystem level can differ based on the type of server filesystem (UNIX can be UFS and Windows can be NTFS). The default in Windows was once 512 bytes, but with NTFS it can be as high as 64 KB. This means every time a request is made to the disk to retrieve data from the filesystem it does a read to return this amount of data. So if the Oracle block size was 8 KB and the filesystem block size was 64 KB, when Oracle requests a block to be read in, the filesystem will read in 64 KB, return the 8 KB requested, and discard the rest. Most filesystems cache this data to improve performance, but this example highlights how in some cases not balancing the database block size with the filesystem block size can result in wasted I/O.
The actual answer to this is operating system and filesystem dependent, and it also depends on whether Oracle is doing read aheads (using the init.ora parameter db_file_multiblock_read_count).

When Oracle introduced the Exadata they put forward the idea of putting smarts into the disk layer. Rather than the database working out how best to retrieve the physical blocks of data, the database passes a request for information to the disk system. As the Exadata knows about its own disk performance, channel speed, and I/O throughput, it is in a much better position to work out the optimal method for extracting the data. It then works out the best way of retrieving it based on the request (which can be a query). In some cases it might do a full table scan because it can process the blocks faster than if it used an index. It now becomes a smart disk system rather than a dumb/blind one. This capability has changed the rules for how a database works with the underlying storage system.

ASM—Automated Storage Management

In Oracle 10g, Oracle introduced ASM primarily to improve the performance of Oracle RAC (clustered systems, where multiple separate servers share the same database on the same disk). It replaces the server filesystem and can handle mirroring and load balancing of datafiles. ASM takes the filesystem and operating system out of the equation and enables the database administrator to have a different degree of control over the management of the disk system.

Block size

The database block size is the fundamental unit of storage within an Oracle Database. Though the database can support different block sizes, a tablespace is restricted to one fixed block size. The block sizes available are 4 KB, 8 KB, 16 KB, and 32 KB (a 32 KB block size is valid only on 64-bit platforms). The current tuning mentality says it's best to have one block size for the whole database. This is based on the idea that one block size makes it easier to manage the SGA and ensure that memory isn't wasted. If multiple block sizes are used, the database administrator has to partition the SGA into multiple areas and assign each a block size. So if the administrator decided to have the database at 8 KB and 16 KB, they would have to set up database startup parameters indicating the size of each:

DB_8K_CACHE_SIZE = 2G
DB_16K_CACHE_SIZE = 1G

The problem that an administrator faces is that it can be hard to judge memory usage against table usage. In the above scenario the tables residing in the 8 KB blocks might be accessed a lot more than the 16 KB ones, meaning the memory needs to be adjusted to deal with that. This balancing act of tuning invariably results in the decision that, unless exceptional situations warrant its use, it's best to keep to the same database block size across the whole database. This makes the job of tuning simpler.

As is always the case when dealing with unstructured data, the rules change. The current thinking is that it's more efficient to store the data in a large block size. This ensures there is less wasted overhead and fewer block reads to read in a row of data. The challenge is that the size of the unstructured data can vary dramatically. It's realistic for an image thumbnail to be under 4 KB in size. This makes it an ideal candidate to be stored in the row with the other relational data. Even if an 8 KB block size is used, the thumbnail and other relational data might happily exist in the one block. A photo might be 10 MB in size, requiring a large number of blocks to store it.
If a 16 KB block size is used, it requires about 64 blocks to store 1 MB (assuming some overhead that requires extra storage for the block header). An 8 KB block size requires about 130 blocks. If you have to store 10 MB, the number of blocks increases 10 times. For an 8 KB block size that is over 1,300 block reads for one small 10 MB image. With images now coming close to 100 MB in size, this figure again increases by a factor of 10. It soon becomes obvious that a very large block size is needed. When storing video at over 4 GB in size, even a 32 KB block size seems too small.

As is covered later in the article, unstructured data stored in an Oracle blob does not have to be cached in the SGA. In fact, it's discouraged because in most situations the data is not likely to be accessed on a frequent basis. This generally holds true, but there are cases, especially with video, where it does not, and this situation is covered later. Under the assumption that the thumbnails are accessed frequently and should be cached and the originals are accessed infrequently and should not be cached, the conclusion is that it now becomes practical to split the SGA in two. The unstructured, uncached data is stored in a tablespace using a large block size (32 KB) and the remaining data is stored in a more acceptable and reasonable 8 KB block. The SGA for the 32 KB block size is kept to a bare minimum as it will not be used, thus bypassing the issue of perceived wasted memory by splitting the SGA in two.

In the following table a simple test was done using three tablespace block sizes. The aim was to see if the block size would impact load and read times. The load involved reading in 67 TIF images totaling 3 GB in size. The result was that the tablespace block size made no statistically significant difference. The test was done using a 50 MB extent size and, as shown in the next section, this size will impact performance. So to correctly understand how important block size can be, one has to look at not only the block size but also the extent size.

Details of the environment used to perform these tests:

CREATE TABLESPACE tbls_name BLOCKSIZE 4096/8192/16384
EXTENT MANAGEMENT LOCAL UNIFORM SIZE 50M
SEGMENT SPACE MANAGEMENT AUTO
DATAFILE 'directory/datafile' SIZE 5G REUSE;

The following table compares the various block sizes:

Tablespace block size | Blocks  | Extents | Load time    | Read time
4 KB                  | 819200  | 64      | 3.49 minutes | 1.02 minutes
8 KB                  | 403200  | 63      | 3.46 minutes | 0.59 minutes
16 KB                 | 201600  | 63      | 3.55 minutes | 0.59 minutes

UNIFORM extent size and AUTOALLOCATE

When creating a tablespace to store the unstructured data, the next step after the block size is determined is to work out what the most efficient extent size will be. As a table might contain data ranging from hundreds of gigabytes to terabytes, determining the extent size is important. The larger the extent, the greater the potential to waste space if the table doesn't use it all. The smaller the extent size, the greater the risk that the table will grow into tens or hundreds of thousands of extents. As a locally managed tablespace uses a bitmap to manage access to the extents and is generally quite fast, having it manage tens of thousands of extents might be pushing its performance capabilities.

There are two methods available to the administrator when creating a tablespace: they can manually specify the fragment size using the UNIFORM extent size clause, or they can let the Oracle Database calculate it using the AUTOALLOCATE clause (both are sketched below).
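A minimal sketch of the two approaches follows. It assumes an 8 KB block size; the tablespace names, datafile paths, and sizes are placeholders chosen to echo the figures used in the tests above, not values taken from the original test environment.

-- Manually controlled extent size using the UNIFORM clause
CREATE TABLESPACE tbls_uniform
  DATAFILE '/u01/oradata/db/tbls_uniform01.dbf' SIZE 5G REUSE
  BLOCKSIZE 8192
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 50M
  SEGMENT SPACE MANAGEMENT AUTO;

-- Letting the database manage extent sizing using the AUTOALLOCATE clause
CREATE TABLESPACE tbls_auto
  DATAFILE '/u01/oradata/db/tbls_auto01.dbf' SIZE 5G REUSE
  BLOCKSIZE 8192
  EXTENT MANAGEMENT LOCAL AUTOALLOCATE
  SEGMENT SPACE MANAGEMENT AUTO;

With UNIFORM, every extent in the tablespace is exactly 50 MB; with AUTOALLOCATE, Oracle chooses the extent size itself and varies it as the segment grows.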
Tests were done to determine the optimal fragment size when AUTOALLOCATE was not used. AUTOALLOCATE is a more set-and-forget method, and one goal was to see if this clause was as efficient as manually setting the extent size.

Locally managed tablespace UNIFORM extent size

This covers testing performed to try to find an optimal extent and block size. The results showed that a block size of 16384 (16 KB) is ideal, though 8192 (8 KB) is acceptable. The block size of 32 KB was not tested. An administrator might be tempted to think that the larger the extent size, the better the performance, but the results show that this is not always the case; an extent size between 50 MB and 200 MB is optimal. For reads with SECUREFILES the number of extents was not a major performance factor, but it was for writes.

When compared to the AUTOALLOCATE clause, it was shown there was no real performance improvement or loss when it was used. The testing showed that an administrator can use this clause knowing they will get a good all round result when it comes to performance. The syntax for configuration is as follows:

EXTENT MANAGEMENT LOCAL AUTOALLOCATE
segment space management auto

Repeated tests showed that this configuration produced optimal read/write times without the database administrator having to worry about what the extent size should be. For a 300 GB tablespace it produced a similar number of extents as when a 50M extent size was used.

As has been covered, once an image is loaded it is rarely updated. In a relational database, fragmentation within a tablespace is caused by repeated creation/dropping of schema objects and extents of different sizes, resulting in physical storage gaps which are not easily reused. Storage is lost. This is analogous to the Microsoft Windows environment with its disk storage. After a period of time the disk becomes fragmented, making it hard to find contiguous storage and locate similar items together. Locating all the pieces in a file as close together as possible can dramatically reduce the number of disk reads required to read it in. With NTFS (a Microsoft disk filesystem format) the system administrator can on creation determine whether extents are autoallocated or fragmented. This is similar in concept to Oracle tablespace creation. Testing was not done to check if the fragmentation scenario is avoided with the AUTOALLOCATE clause. The database administrator should therefore be aware of the tablespace usage and whether it is likely to be stable once rows are added (in which case AUTOALLOCATE can be used, simplifying storage management). If it is volatile, the UNIFORM clause might be considered the better option.

Temporary tablespace

For working with unstructured data, the primary use of the TEMPORARY tablespace is to hold the contents of temporary tables and temporary lobs. A temporary lob is used for processing a temporary multimedia object. In the following example, a temporary blob is created. It is not cached in memory. A multimedia image type is created and loaded into it. Information is extracted and the blob is freed. This is useful if images are stored temporarily outside the database. This is not the same case as using a bfile, which Oracle Multimedia supports. The bfile is a permanent pointer to an image stored outside the database.
SQL> declare
       image ORDSYS.ORDImage;
       ctx raw(4000);
     begin
       image := ordsys.ordimage.init();
       dbms_lob.createtemporary(image.source.localdata, FALSE);
       image.importfrom(ctx, 'file', 'LOADING_DIR', 'myimg.tif');
       image.setProperties;
       dbms_output.put_line('width x height = ' || image.width || 'x' || image.height);
       dbms_lob.freetemporary(image.source.localdata);
     end;
     /
width x height = 2809x4176

It's important when using this tablespace to ensure that all code, especially on failure, calls the dbms_lob.freetemporary function, so that storage leakage doesn't occur. Otherwise the tablespace will continue to grow until it runs out of room. In this case the only way to clean it up is to either stop all database processes referencing it and then resize the datafile (or drop and recreate the temporary tablespace after creating another interim one), or to restart the database and mount it. The tablespace can then be resized or dropped and recreated.

UNDO tablespace

The UNDO tablespace is used by the database to store sufficient information to roll back a transaction. In a database containing a lot of digital objects, the size of the database just for storage of the objects can exceed terabytes. In this situation the UNDO tablespace can be sized larger, giving added opportunity for the database administrator to perform flashback recovery from user error. It's reasonable to size the UNDO tablespace at 50 GB, even growing it to 100 GB in size. The larger the UNDO tablespace, the further back in time the administrator can go and the greater the breathing space between user failure, user failure detected and reported, and the database administrator doing the flashback recovery.

The following is an example flashback SQL statement. The as of timestamp clause tells Oracle to find rows that match the timestamp from the current time going back, so that we can have a look at a table as it was an hour ago:

select t.vimg.source.srcname || '=' || dbms_lob.getlength(t.vimg.source.localdata)
from test_load as of timestamp systimestamp - (1/24) t;

SYSTEM tablespace

The SYSTEM tablespace contains the data dictionary. In Oracle 11g R2 it also contains any compiled PL/SQL code (where PLSQL_CODE_TYPE=NATIVE). The recommended initial starting size of the tablespace is 1500 MB.

Redo logs

The following test results highlight how important it is to get the size and placement of the redo logs correct. The goal was to determine what combination of database parameters and redo/undo size was optimal. In addition, an SSD was used as a comparison. Based on the result of each test, the parameters and/or storage were modified to see whether they would improve the results. When it appeared an optimal parameter/storage setting was found, it was locked in while the other parameters were tested further. This enabled multiple concurrent configurations to be tested and an optimal result to be calculated.

The test involved loading 67 images into the database. Each image varied in size between 40 and 80 MB, resulting in 2.87 GB of data being loaded. As the test involved only image loading, no processing such as setting properties or extraction of metadata was performed. Archiving on the database was not enabled. All database files resided on hard disk unless specified. In between each test a full database reboot was done.
The test was run at least three times, with the range of results shown in the following table. Database parameter descriptions used: Redo Buffer Size = LOG_BUFFER, Multiblock Read Count = db_file_multiblock_read_count.

Source disk | Redo logs            | Database parameters                                                                                        | Fastest time     | Slowest time
Hard disk   | Hard disk, 3 x 50 MB | Redo buffer size = 4 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD  | 3 minutes 22 sec | 3 minutes 53 sec
Hard disk   | Hard disk, 3 x 1 GB  | Redo buffer size = 4 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD  | 2 minutes 49 sec | 2 minutes 57 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 4 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD  | 1 minute 30 sec  | 1 minute 41 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 64 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD | 1 minute 23 sec  | 1 minute 48 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 8 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD  | 1 minute 18 sec  | 1 minute 29 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 16 MB, multiblock read count = 64, UNDO tablespace on HD (10 GB), table datafile on HD | 1 minute 19 sec  | 1 minute 27 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 16 MB, multiblock read count = 256, UNDO tablespace on HD (10 GB), table datafile on HD | 1 minute 27 sec  | 1 minute 41 sec
Hard disk   | SSD, 3 x 1 GB        | Redo buffer size = 8 MB, multiblock read count = 64, UNDO tablespace = 1 GB on SSD, table datafile on HD  | 1 minute 21 sec  | 1 minute 49 sec
SSD         | SSD, 3 x 1 GB        | Redo buffer size = 8 MB, multiblock read count = 64, UNDO tablespace = 1 GB on SSD, table datafile on HD  | 53 sec           | 54 sec
SSD         | SSD, 3 x 1 GB        | Redo buffer size = 8 MB, multiblock read count = 64, UNDO tablespace = 1 GB on SSD, table datafile on SSD | 1 minute 20 sec  | 1 minute 20 sec

Analysis

The tests show a huge improvement when the redo logs were moved to a Solid State Drive (SSD). Though this appears to be the obvious optimization to perform, it might be self defeating. A number of SSD manufacturers acknowledge there are limitations with SSDs when it comes to repeated writes. The Mean Time To Failure might be 2 million hours for reads; for writes the failure rate can be very high. Modern SSDs and flash cards offer much improved wear leveling algorithms to reduce failures and make performance more consistent. No doubt improvements will continue in the future. A redo log by its nature is constant and has heavy writes. So, moving the redo logs to the SSD might quickly result in it becoming damaged and failing. For an organization that on configuration performs one very large load of multimedia, the solution might be to initially keep the redo logs on SSD and, once the load is finished, move the redo logs to a hard drive.

Increasing the size of the redo logs from 50 MB to 1 GB improves performance, and all databases containing unstructured data should have a redo log size of at least 1 GB. The number of logs should be at least 10; preferably 50 to 100. As is covered later, disk is cheaper today than it once was, and 100 GB of redo logs is not as large a volume of data as it once was. The redo logs should always be mirrored. The placement or size of the UNDO tablespace made no difference to performance. The redo buffer size (LOG_BUFFER) showed a minor improvement when it was increased in size, but the results were inconclusive as the figures varied.
A figure of LOG_BUFFER=8691712 showed the best results, and database administrators might use this figure as a starting point for tuning. Changing the multiblock read count (DB_FILE_MULTIBLOCK_READ_COUNT) from the default value of 64 to 256 showed no improvement. As the default value (in this case 64) is set by the database as optimal for the platform, the conclusion that can be drawn is that the database has set this figure to a good size.

Moving the original images to an SSD showed another huge improvement in performance. This highlighted how the I/O bottleneck of reading from disk and writing to disk (redo logs) is so critical for digital object loading.

The final test involved moving the datafile containing the table to the SSD. It highlighted a realistic issue that DBAs face in dealing with I/O. The disk speed and seek time might not be critical in tuning if the bottleneck is the actual time it takes to transfer the data to and from the disk to the server. In the test case the datafile was moved to the same SSD as the redo logs, resulting in I/O competition. In the previous tests the datafile was on the hard disk and the database could write to the disk (separate I/O channel) and to the redo logs (separate I/O channel) without one impacting the other. Even though the SSD is a magnitude faster in performance than the disk, it quickly became swamped with calls for reads and writes. The lesson is that it's better to have multiple smaller SSDs on different I/O channels into the server than one large SSD on a single channel. Sites using a SAN will soon realize that even though the SAN might offer speed, unless it offers multiple I/O channels into the server, its channel to the server will quickly become the bottleneck, especially if the datafiles and the images for loading are all located on the server.

The original tuning notion of separating datafiles onto separate disks that was practiced more than 15 years ago still makes sense when it comes to image loading into a multimedia database. It's important to stress that this is a tuning issue when dealing with image loading, not when running the database in general. Tuning the database in general is a completely different story and might result in a completely different architecture.
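The following is a minimal sketch of the redo log and log buffer recommendations above (1 GB logs, mirrored members, and a LOG_BUFFER of around 8 MB). The file paths and group numbers are placeholders and would need to match the actual environment; LOG_BUFFER is a static parameter, so the change only takes effect after a restart.

-- Set the redo buffer starting point suggested above (takes effect after restart)
ALTER SYSTEM SET log_buffer = 8691712 SCOPE = SPFILE;

-- Add mirrored 1 GB redo log groups, placing the two members on separate devices;
-- repeat for as many groups as the site requires
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u01/oradata/db/redo04a.log', '/u02/oradata/db/redo04b.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u01/oradata/db/redo05a.log', '/u02/oradata/db/redo05b.log') SIZE 1G;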
Working With ASP.NET DataList Control

Packt
19 Feb 2010
8 min read
In this article by Joydip Kanjilal, we will discuss the ASP.NET DataList control, which can be used to display a list of repeated data items. We will learn about the following:

- Using the DataList control
- Binding images to a DataList control dynamically
- Displaying data using the DataList control
- Selecting, editing and deleting data using this control
- Handling the DataList control events

The ASP.NET DataList Control

The DataList control, like the Repeater control, is a template-driven, lightweight control that acts as a container of repeated data items. The templates in this control are used to define the data that it will contain. It is flexible in the sense that you can easily customize the display of one or more records that are displayed in the control. You have a property in the DataList control called RepeatDirection that can be used to customize the layout of the control. The RepeatDirection property can accept one of two values, that is, Vertical or Horizontal. The RepeatDirection is Vertical by default. However, if you change it to Horizontal, rather than displaying the data as rows and columns, the DataList control will display them as a list of records with the columns in the rendered data displayed as rows.

This comes in handy, especially in situations where you have too many columns in your database table or columns with larger widths of data. As an example, imagine what would happen if there is a field called Address in our Employee table holding data of a large size and you are displaying the data using a Repeater, a DataGrid, or a GridView control. You will not be able to display columns of such large data sizes with any of these controls as the display would look awkward. This is where the DataList control fits in. In a sense, you can think of the DataList control as a combination of the DataGrid and the Repeater controls. You can use templates with it much as you did with a Repeater control, and you can also edit the records displayed in the control, much like the DataGrid control of ASP.NET. The next section compares the features of the three controls that we have mentioned so far, that is, the Repeater, the DataList, and the DataGrid control of ASP.NET.

When the web page is in execution with the data bound to it using the Page_Load event, the data in the DataList control is rendered as DataListItem objects, that is, each item displayed is actually a DataListItem. Similar to the Repeater control, the DataList control does not have paging and sorting functionalities built into it.

Using the DataList Control

To use this control, drag and drop the control from the toolbox onto the web form in the design view. Refer to the following screenshot, which displays a DataList control on a web form:

The following list outlines the steps that you can follow to add a DataList control to a web page and make it work:

1. Drag and drop a DataList control onto the web form from the toolbox.
2. Set the DataSourceID property of the control to the data source that you will use to bind data to the control, that is, you can set this to an SQL Data Source control.
3. Open the .aspx file, declare the <ItemTemplate> element and define the fields as per your requirements.
4. Use data binding syntax through the Eval() method to display data in these defined fields of the control.

You can bind data to the DataList control in two different ways, that is, using the DataSourceID and the DataSource properties.
You can use the inbuilt features like selecting and updating data when using the DataSourceID property. Note that you need to write custom code for selecting and updating data for any data source that implements the ICollection and IEnumerable interfaces. We will discuss more on this later. The next section discusses how you can handle the events in the DataList control.

Displaying Data

Similar to the Repeater control, the DataList control contains a template that is used to display the data items within the control. Since there are no data columns associated with this control, you use templates to display data. Every column in a DataList control is rendered as a <span> element. A DataList control is useless without templates. Let us now learn what templates are, the types of templates, and how to work with them.

A template is a combination of HTML elements, controls, and embedded server controls, and can be used to customize and manipulate the layout of a control. A template comprises HTML tags and controls that can be used to customize the look and feel of controls like Repeater, DataGrid, or DataList. There are seven templates and seven styles in all. You can use templates for the DataList control in the same way you did when using the Repeater control. The following is the list of templates and their associated styles in the DataList control.

The templates are as follows:

- ItemTemplate
- AlternatingItemTemplate
- EditItemTemplate
- FooterTemplate
- HeaderTemplate
- SelectedItemTemplate
- SeparatorTemplate

The following screenshot illustrates the different templates of this control. As you can see from this figure, the templates are grouped under three broad categories. These are:

- Item Templates
- Header and Footer Templates
- Separator Template

Note that out of the templates given above, the ItemTemplate is the one and only mandatory template that you have to use when working with a DataList control. Here is a sample of how your DataList control's templates are arranged:

<asp:DataList id="dlEmployee" runat="server">
  <HeaderTemplate>...</HeaderTemplate>
  <ItemTemplate>...</ItemTemplate>
  <AlternatingItemTemplate>...</AlternatingItemTemplate>
  <FooterTemplate>...</FooterTemplate>
</asp:DataList>

The following screenshot displays a DataList control populated with data and with its templates indicated.

Customizing a DataList control at run time

You can customize the DataList control at run time using the ListItemType property in the ItemCreated event of this control as follows:

private void DataList1_ItemCreated(object sender,
    System.Web.UI.WebControls.DataListItemEventArgs e)
{
    switch (e.Item.ItemType)
    {
        case System.Web.UI.WebControls.ListItemType.Item:
            e.Item.BackColor = Color.Red;
            break;
        case System.Web.UI.WebControls.ListItemType.AlternatingItem:
            e.Item.BackColor = Color.Blue;
            break;
        case System.Web.UI.WebControls.ListItemType.SelectedItem:
            e.Item.BackColor = Color.Green;
            break;
        default:
            break;
    }
}

The styles that you can use with the DataList control to customize the look and feel are:

- AlternatingItemStyle
- EditItemStyle
- FooterStyle
- HeaderStyle
- ItemStyle
- SelectedItemStyle
- SeparatorStyle

You can use any of these styles to format the control, that is, format the HTML code that is rendered. You can also use layouts of the DataList control for formatting, that is, further customization of your user interface.
The available layouts are as follows:

- FlowLayout
- TableLayout
- VerticalLayout
- HorizontalLayout

You can specify your desired flow or table format at design time by specifying the following in the .aspx file:

RepeatLayout = "Flow"

You can also do the same at run time by specifying your desired layout using the RepeatLayout property of the DataList control as shown in the following code snippet:

DataList1.RepeatLayout = RepeatLayout.Flow

In the code snippet, it is assumed that the name of the DataList control is DataList1. Let us now understand how we can display data using the DataList control. For this, we would first drag and drop a DataList control onto our web form and specify the templates for displaying data. The code in the .aspx file is as follows:

<asp:DataList ID="DataList1" runat="server">
  <HeaderTemplate>
    <table border="1">
      <tr>
        <th>Employee Code</th>
        <th>Employee Name</th>
        <th>Basic</th>
        <th>Dept Code</th>
      </tr>
  </HeaderTemplate>
  <ItemTemplate>
    <tr bgcolor="#0xbbbb">
      <td><%# DataBinder.Eval(Container.DataItem, "EmpCode")%></td>
      <td><%# DataBinder.Eval(Container.DataItem, "EmpName")%></td>
      <td><%# DataBinder.Eval(Container.DataItem, "Basic")%></td>
      <td><%# DataBinder.Eval(Container.DataItem, "DeptCode")%></td>
    </tr>
  </ItemTemplate>
  <FooterTemplate>
    </table>
  </FooterTemplate>
</asp:DataList>

The DataList control is populated with data in the Page_Load event of the web form using the DataManager class as usual:

protected void Page_Load(object sender, EventArgs e)
{
    DataManager dataManager = new DataManager();
    DataList1.DataSource = dataManager.GetEmployees();
    DataList1.DataBind();
}

Note that the DataBinder.Eval() method has been used as usual to display the values of the corresponding fields from the data container in the DataList control. The data container in our case is the DataSet instance that is returned by the GetEmployees() method of the DataManager class. When you execute the application, the DataList control renders the employee records as an HTML table.
New programming video courses for March 2019

Richard Gall
04 Mar 2019
6 min read
It’s not always easy to know what to learn next if you’re a programmer. Industry shifts can be subtle but they can sometimes be dramatic, making it incredibly important to stay on top of what’s happening both in your field and beyond. No one person can make that decision for you. All the thought leadership and mentorship in the world isn’t going to be able to tell you what’s right for you when it comes to your career. But this list of videos, released last month, might give you a helping hand as to where to go next when it comes to your learning…

New data science and artificial intelligence video courses for March

Apache Spark is carving out a big presence as the go-to software for big data. Two videos from February focus on Spark - Distributed Deep Learning with Apache Spark and Apache Spark in 7 Days. If you’re new to Spark and want a crash course on the tool, then clearly, our video aims to get you up and running quickly. However, Distributed Deep Learning with Apache Spark offers a deeper exploration that shows you how to develop end to end deep learning pipelines that can leverage the full potential of cutting edge deep learning techniques.

While we’re on the subject of machine learning, other choice video courses for March include TensorFlow 2.0 New Features (we’ve been eagerly awaiting it and it finally looks like we can see what it will be like), Hands On Machine Learning with JavaScript (yes, you can now do machine learning in the browser), and a handful of interesting videos on artificial intelligence and finance:

- AI for Finance
- Machine Learning for Algorithmic Trading Bots with Python
- Hands on Python for Finance

Elsewhere, a number of data visualization video courses prove that communicating and presenting data remains an urgent challenge for those in the data space. Tableau remains one of the definitive tools - you can learn the latest version with Tableau 2019.1 for Data Scientists and Data Visualization Recipes with Python and Matplotlib 3.

New app and web development video courses for March 2019

There are a wealth of video courses for web and app developers to choose from this month. True, Hands-on Machine Learning for JavaScript is well worth a look, but moving past the machine learning hype, there are a number of video courses that take a practical look at popular tools and new approaches to app and web development.

Angular’s death has been greatly exaggerated - it remains a pillar of the JavaScript world. While the project’s versioning has arguably been lacking some clarity, if you want to get up to speed with where the framework is today, try Angular 7: A Practical Guide. It’s a video that does exactly what it says on the proverbial tin - it shows off Angular 7 and demonstrates how to start using it in web projects. We’ve also been seeing some uptake of Angular by ASP.NET developers, as it offers a nice complement to the Microsoft framework on the front end side. Our latest video on the combination, Hands-on Web Development with ASP.NET Core and Angular, is another practical look at an effective and increasingly popular approach to full-stack development.

Other picks for March include Building Mobile Apps with Ionic 4, a video that brings you right up to date with the recent update that launched in January (interestingly, the project is now backed by web components, not Angular), and a couple of Redux videos - Mastering Redux and Redux Recipes. Redux is still relatively new.
Essentially, it’s a JavaScript library that helps you manage application state - because it can be used with a range of different frameworks and libraries, including both Angular and React, it’s likely to go from strength to strength in 2019.

Infrastructure, admin and security video courses for March 2019

Node.js is becoming an important library for infrastructure and DevOps engineers. As we move to a cloud native world, it’s a great tool for developing lightweight and modular services. That’s why we’re picking Learn Serverless App Development with Node.js and Azure Functions as one of our top videos for this month. Azure has been growing at a rapid rate over the last 12 months, and while it’s still some way behind AWS, Microsoft’s focus on developer experience is making Azure an increasingly popular platform with developers. For Node developers, this video is a great place to begin - it’s also useful for anyone who simply wants to find out what serverless development actually feels like.

Read next: Serverless computing wars: AWS Lambda vs. Azure Functions

A partner to this, for anyone beginning Node, is the new Node.js Design Patterns video. In particular, if Node.js is an important tool in your architecture, following design patterns is a robust method of ensuring reliability and resilience. Elsewhere, we have Modern DevOps in Practice, cutting through the consultancy-speak to give you useful and applicable guidance on how to use DevOps thinking in your workflows and processes, and DevOps with Azure, another video that again demonstrates just how impressive Azure is. For those not Azure-inclined, there’s AWS Certified Developer Associate - A Practical Guide, a video that takes you through everything you need to know to pass the AWS Developer Associate exam. There’s also a completely cloud-agnostic video course in the form of Creating a Continuous Deployment Pipeline for Cloud Platforms that’s essential for infrastructure and operations engineers getting to grips with cloud native development.

Learn a new programming language with these new video courses for March

Finally, there are a number of new video courses that can help you get to grips with a new programming language. So, perfect if you’ve been putting off your new year’s resolution to learn a new language… Java 11 in 7 Days is a new video that brings you bang up to date with everything in the latest version of Java, while Hands-on Functional Programming with Java will help you rethink and reevaluate the way you use Java. Together, the two videos are a great way for Java developers to kick start their learning and update their skill set.
Scientific Computing APIs for Python

Packt
20 Aug 2015
23 min read
In this article by Hemant Kumar Mehta, author of the book Mastering Python Scientific Computing, we will have a comprehensive discussion of the features and capabilities of various scientific computing APIs and toolkits in Python. Besides the basics, we will also discuss some example programs for each of the APIs. As symbolic computing is a relatively different area of computerized mathematics, we have kept a special subsection within the SymPy section to discuss the basics of computerized algebra systems. In this article, we will cover the following topics:

- Scientific numerical computing using NumPy and SciPy
- Symbolic computing using SymPy

(For more resources related to this topic, see here.)

Numerical Scientific Computing in Python

Scientific computing mainly demands the facility of performing calculations on algebraic equations, matrices, differentiations, integrations, differential equations, statistics, equation solvers, and much more. By default, Python doesn't come with these functionalities. However, the development of NumPy and SciPy has enabled us to perform these operations and much more advanced functionality beyond them. NumPy and SciPy are very powerful Python packages that enable users to efficiently perform the desired operations for all types of scientific applications.

NumPy package

NumPy is the basic Python package for scientific computing. It provides the facility of multi-dimensional arrays and basic mathematical operations such as linear algebra. Python provides several data structures to store user data; the most popular data structures are lists and dictionaries. List objects may store any type of Python object as an element. These elements can be processed using loops or iterators. Dictionary objects store the data in key-value format.

The ndarrays data structure

The ndarrays are similar to lists but highly flexible and efficient. An ndarray is an array object that represents a multidimensional array of fixed-size items. This array should be homogeneous. It has an associated dtype object to define the data type of elements in the array. This object defines the type of the data (integer, float, or Python object), the size of the data in bytes, and the byte ordering (big-endian or little-endian). Moreover, if the type of data is a record or sub-array then it also contains details about them. The actual array can be constructed using any one of the array, zeros, or empty methods.

Another important aspect of ndarrays is that the size of arrays can be dynamically modified. Moreover, if the user needs to remove some elements from the arrays then it can be done using the module for masked arrays. In a number of situations, scientific computing demands deletion/removal of some incorrect or erroneous data. The numpy.ma module provides the facility of masked arrays to easily remove selected elements from arrays. A masked array is nothing but a normal ndarray with a mask. The mask is another associated array with true or false values. If for a particular position the mask has a true value then the corresponding element in the main array is valid, and if the mask is false then the corresponding element in the main array is invalid or masked. In such a case, while performing any computation on such ndarrays the masked elements will not be considered.

File handling

Another important aspect of scientific computing is storing the data into files, and NumPy supports reading and writing of both text and binary files.
Mostly, text files are a good way for reading, writing, and data exchange as they are inherently portable and most platforms by default have capabilities to manipulate them. However, for some applications it is sometimes better to use binary files, or the desired data for such applications can only be stored in binary files. Sometimes the size and nature of the data, such as images or sound, requires it to be stored in binary files. In comparison to text files, binary files are harder to manage as they have specific formats. Moreover, the size of binary files is comparatively very small and their read/write operations are much faster than reading/writing text files. This fast read/write is most suitable for applications working on large datasets. The only drawback of binary files manipulated with NumPy is that they are accessible only through NumPy.

Python has text file manipulation functions such as the open, readlines, and writelines functions. However, it is not performance efficient to use these functions for scientific data manipulation. These default Python functions are very slow in reading and writing data in files. NumPy has a high-performance alternative that loads the data into ndarrays before the actual computation. In NumPy, text files can be accessed using the numpy.loadtxt and numpy.savetxt functions. The loadtxt function can be used to load the data from text files into ndarrays. NumPy also has separate functions to manipulate the data in binary files. The functions for reading and writing are numpy.load and numpy.save, respectively.

Sample NumPy programs

The NumPy array can be created from a list or tuple using the array method; this method can transform sequences of sequences into a two-dimensional array.

import numpy as np
x = np.array([4,432,21], int)
print x                            #Output [  4 432  21]
x2d = np.array( ((100,200,300), (111,222,333), (123,456,789)) )
print x2d

Output:
[  4 432  21]
[[100 200 300]
 [111 222 333]
 [123 456 789]]

Basic matrix arithmetic operations can be easily performed on two-dimensional arrays, as used in the following program. Basically these operations are applied element-wise, hence the operand arrays must be of equal size; if the sizes do not match, performing these operations will cause a runtime error. Consider the following example of arithmetic operations on one-dimensional arrays:

import numpy as np
x = np.array([4,5,6])
y = np.array([1,2,3])
print x + y                      # output [5 7 9]
print x * y                      # output [ 4 10 18]
print x - y                      # output [3 3 3]
print x / y                      # output [4 2 2]
print x % y                      # output [0 1 0]

There is a separate subclass named matrix to perform matrix operations. Let us understand matrix operations through the following example, which demonstrates the difference between array-based multiplication and matrix multiplication. The NumPy matrices are 2-dimensional, while arrays can be of any dimension.
import numpy as np
x1 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
x2 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
print "First 2-D Array: x1"
print x1
print "Second 2-D Array: x2"
print x2
print "Array Multiplication"
print x1*x2

mx1 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
mx2 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
print "Matrix Multiplication"
print mx1*mx2

Output:
First 2-D Array: x1
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Second 2-D Array: x2
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Array Multiplication
[[1 4 9]
 [1 4 9]
 [1 4 9]]
Matrix Multiplication
[[ 6 12 18]
 [ 6 12 18]
 [ 6 12 18]]

Following is a simple program to demonstrate the simple statistical functions given in NumPy:

import numpy as np
x = np.random.randn(10)   # Creates an array of 10 random elements
print x
mean = x.mean()
print mean
std = x.std()
print std
var = x.var()
print var

First sample output:
[ 0.08291261  0.89369115  0.641396   -0.97868652  0.46692439 -0.13954144
 -0.29892453  0.96177167  0.09975071  0.35832954]
0.208762357623
0.559388806817
0.312915837192

Second sample output:
[ 1.28239629  0.07953693 -0.88112438 -2.37757502  1.31752476  1.50047537
  0.19905071 -0.48867481  0.26767073  2.660184  ]
0.355946458357
1.35007701045
1.82270793415

The above programs are some simple examples of NumPy.

SciPy package

SciPy extends Python and NumPy support by providing advanced mathematical functions such as differentiation, integration, differential equations, optimization, interpolation, advanced statistical functions, equation solvers, and more. SciPy is written on top of the NumPy array framework. SciPy has utilized the arrays and the basic operations on the arrays provided in NumPy and extended them to cover most of the mathematical aspects regularly required by scientists and engineers for their applications. In this article we will cover examples of some basic functionality.

Optimization package

The optimization package in SciPy provides the facility to solve univariate and multivariate minimization problems. It provides solutions to minimization problems using a number of algorithms and methods. Minimization problems have a wide range of applications in science and commercial domains. Generally, we perform linear regression, search for a function's minimum and maximum values, find the root of a function, and do linear programming for such cases. All these functionalities are supported by the optimization package.

Interpolation package

A number of interpolation methods and algorithms are provided in this package as built-in functions. It provides the facility to perform univariate and multivariate interpolation, and one-dimensional and two-dimensional splines. We use univariate interpolation when data depends on one variable; if data depends on more than one variable then we use multivariate interpolation. Besides these functionalities, it also provides additional functionality for Lagrange and Taylor polynomial interpolators.

Integration and differential equations in SciPy

Integration is an important mathematical tool for scientific computations. The SciPy integration sub-package provides functionalities to perform numerical integration. SciPy provides a range of functions to perform integration on equations and data. It also has an ordinary differential equation integrator. It provides various functions to perform numerical integration using a number of methods from numerical analysis.

Stats module

The SciPy stats module contains functions for most of the probability distributions and a wide range of statistical functions.
Supported probability distributions include various continuous distributions, multivariate distributions, and discrete distributions. The statistical functions range from simple means to complex statistical concepts, including skewness, kurtosis, and the chi-square test, to name a few.

Clustering package and spatial algorithms in SciPy

Clustering analysis is a popular data mining technique having a wide range of applications in scientific and commercial domains. In the science domain, biology, particle physics, astronomy, life science, and bioinformatics are a few subjects widely using clustering analysis for problem solving. Clustering analysis is used extensively in computer science for computerized fraud detection, security analysis, image processing, and more. The clustering package provides functionality for K-means clustering, vector quantization, and hierarchical and agglomerative clustering functions.

The spatial class has functions to analyze the distance between data points using triangulations, Voronoi diagrams, and convex hulls of a set of points. It also has KDTree implementations for performing nearest-neighbor lookups.

Image processing in SciPy

SciPy provides support for performing various image processing operations including basic reading and writing of image files, displaying images, and simple image manipulation operations such as cropping, flipping, and rotating. It also has support for image filtering functions such as mathematical morphology, smoothing, denoising, and sharpening of images. Moreover, it supports various other operations such as image segmentation by labeling pixels corresponding to different objects, classification, and feature extraction, for example edge detection.

Sample SciPy programs

In the subsequent subsections we will discuss some example programs using SciPy modules and packages. We start with a simple program performing standard statistical computations. After this, we will discuss a program that finds a minimal solution using optimization. At last we will discuss image processing programs.

Statistics using SciPy

The stats module of SciPy has functions to perform simple statistical operations and various probability distributions. The following program demonstrates simple statistical calculations using the SciPy stats.describe function. This single function operates on an array and returns the number of elements, minimum value, maximum value, mean, variance, skewness, and kurtosis.

import scipy as sp
import scipy.stats as st
s = sp.randn(10)
n, min_max, mean, var, skew, kurt = st.describe(s)
print("Number of elements: {0:d}".format(n))
print("Minimum: {0:3.5f} Maximum: {1:2.5f}".format(min_max[0], min_max[1]))
print("Mean: {0:3.5f}".format(mean))
print("Variance: {0:3.5f}".format(var))
print("Skewness : {0:3.5f}".format(skew))
print("Kurtosis: {0:3.5f}".format(kurt))

Output:
Number of elements: 10
Minimum: -2.00080 Maximum: 0.91390
Mean: -0.55638
Variance: 0.93120
Skewness : 0.16958
Kurtosis: -1.15542

Optimization in SciPy

Generally, in mathematical optimization a non-convex function called the Rosenbrock function is used to test the performance of an optimization algorithm. The following program demonstrates the minimization problem on this function. The Rosenbrock function of N variables is given by the following equation, and it has a minimum value of 0 at xi = 1:

f(x) = sum over i = 1 to N-1 of [ 100*(x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ]
The program for the above function is:

import numpy as np
from scipy.optimize import minimize

# Definition of Rosenbrock function
def rosenbrock(x):
     return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

x0 = np.array([1, 0.7, 0.8, 2.9, 1.1])
res = minimize(rosenbrock, x0, method = 'nelder-mead', options = {'xtol': 1e-8, 'disp': True})

print(res.x)

Output:
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 516
         Function evaluations: 827
[ 1.  1.  1.  1.  1.]

The last line is the output of print(res.x), where all the elements of the array are 1.

Image processing using SciPy

The following two programs demonstrate the image processing functionality of SciPy. The first of these programs simply displays the standard test image widely used in the field of image processing, called Lena. The second program applies geometric transformations to this image: it performs cropping and rotation by 45 degrees.

The following program displays the Lena image using the matplotlib API. The imshow method renders the ndarray as an image and the show method displays the image.

from scipy import misc
l = misc.lena()
misc.imsave('lena.png', l)
import matplotlib.pyplot as plt
plt.gray()
plt.imshow(l)
plt.show()

The output of the above program is the Lena test image rendered in a matplotlib window.

The following program performs geometric transformations. It displays the transformed images along with the original image as a four-axis array.

import scipy
from scipy import ndimage
import matplotlib.pyplot as plt
import numpy as np

lena = scipy.misc.lena()
lx, ly = lena.shape
crop_lena = lena[lx/4:-lx/4, ly/4:-ly/4]
crop_eyes_lena = lena[lx/2:-lx/2.2, ly/2.1:-ly/3.2]
rotate_lena = ndimage.rotate(lena, 45)

# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].imshow(lena, cmap=plt.cm.gray)
axarr[0, 0].axis('off')
axarr[0, 0].set_title('Original Lena Image')
axarr[0, 1].imshow(crop_lena, cmap=plt.cm.gray)
axarr[0, 1].axis('off')
axarr[0, 1].set_title('Cropped Lena')
axarr[1, 0].imshow(crop_eyes_lena, cmap=plt.cm.gray)
axarr[1, 0].axis('off')
axarr[1, 0].set_title('Lena Cropped Eyes')
axarr[1, 1].imshow(rotate_lena, cmap=plt.cm.gray)
axarr[1, 1].axis('off')
axarr[1, 1].set_title('45 Degree Rotated Lena')

plt.show()

The output is a 2 x 2 grid of panels showing the original image, the cropped image, the cropped eyes, and the 45 degree rotated image.

SciPy and NumPy are the core of Python's support for scientific computing, as they provide solid functionality for numerical computing.

Symbolic computations using SymPy

Computerized computations performed over mathematical symbols, without evaluating or changing their meaning, are called symbolic computations. Generally, symbolic computing is also called computerized algebra, and such computerized systems are called computer algebra systems. The following subsection gives a brief introduction to SymPy.

Computer Algebra System (CAS)

Let us discuss the concept of a CAS. A CAS is a software or toolkit to perform computations on mathematical expressions using computers instead of doing it manually. In the beginning, using computers for these applications was called computer algebra; now this concept is called symbolic computing. CAS systems may be grouped into two types. The first is the general purpose CAS, and the second type is the CAS specific to a particular problem area. The general purpose systems are applicable to most areas of algebraic mathematics, while the specialized CAS are systems designed for a specific area such as group theory or number theory.
Symbolic computations using SymPy

Computerized computations performed on mathematical symbols, without evaluating or changing their meaning, are called symbolic computations. Symbolic computing is also commonly called computer algebra, and such computerized systems are called computer algebra systems. The following subsections give a brief introduction to SymPy.

Computer Algebra System (CAS)

Let us discuss the concept of a CAS. A CAS is a piece of software or a toolkit for performing computations on mathematical expressions using computers instead of doing them manually. In the beginning, using computers for these applications was called computer algebra; now the concept is called symbolic computing. CAS systems may be grouped into two types: general purpose CASs, and CASs specific to a particular problem area. General purpose systems are applicable to most areas of algebraic mathematics, while specialized CASs are designed for specific areas such as group theory or number theory. Most of the time, we prefer a general purpose CAS to manipulate mathematical expressions for scientific applications.

Features of a general purpose CAS

The desired features of a general purpose computer algebra system for scientific applications are as follows:

A user interface to manipulate mathematical expressions.
An interface for programming and debugging.
A simplifier, which is the most essential component of such a system, because these systems require the simplification of various mathematical expressions.
An exhaustive set of functions to perform the various mathematical operations required by any algebraic computation.
Efficient memory management, because most applications perform extensive computations.
Support for mathematical computations on high-precision numbers and large quantities.

A brief idea of SymPy

SymPy is an open source, Python-based implementation of a computer algebra system (CAS). The philosophy behind SymPy's development is to design and develop a CAS that has all the desired features, yet whose code is as simple as possible so that it is highly and easily extensible. It is written completely in Python and does not require any external library.

The basic idea behind using SymPy is the creation and manipulation of expressions. Using SymPy, the user represents mathematical expressions in the Python language using SymPy classes and objects. These expressions are composed of numbers, symbols, operators, functions, and so on. The functions are modules that perform a particular piece of mathematical functionality, such as logarithms or trigonometry.

The development of SymPy was started by Ondřej Čertík in August 2006. Since then, it has grown considerably, with contributions from hundreds of contributors. The library now consists of 26 different integrated modules. These modules can perform the computations required for basic symbolic arithmetic, calculus, algebra, discrete mathematics, quantum physics, plotting, and printing, with the option to export the output of the computations to LaTeX and other formats.

The capabilities of SymPy can be divided into two categories—core capabilities and advanced capabilities—as the SymPy library is divided into a core module and several advanced optional modules. The functionality supported by the various modules is as follows.

Core capabilities

The core capability module supports the basic functionality required by any mathematical algebra operation, including basic arithmetic such as multiplication, addition, subtraction, division, and exponentiation. It also supports the simplification of complex expressions and provides functionality for the expansion of series and symbols. The core module supports functions for trigonometry, hyperbolic functions, exponentials, roots of equations, polynomials, factorials and gamma functions, logarithms, and a number of special functions for B-splines, spherical harmonics, tensor functions, orthogonal polynomials, and more. There is also strong support for pattern matching operations in the core module. The core capabilities of SymPy further include the substitution functionality required by algebraic operations. It supports not only high-precision arithmetic over integers, rationals, and floating point numbers, but also the non-commutative variables and symbols required in polynomial operations.
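The following is a minimal sketch of some of these core capabilities—exact rational arithmetic, arbitrary-precision evaluation, and substitution. The expressions used here are arbitrary illustrations rather than examples taken from the text:

from sympy import Rational, Symbol, pi, sqrt

x = Symbol('x')

print(Rational(1, 3) + Rational(1, 6))   # exact arithmetic: 1/2
print(pi.evalf(30))                      # pi evaluated to 30 significant digits
print(sqrt(8))                           # automatically simplified to 2*sqrt(2)

expr = x**2 + 2*x + 1
print(expr.subs(x, Rational(1, 2)))      # substitution gives 9/4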
Polynomials

Various functions for performing polynomial operations belong to the polynomial module. These include basic operations such as division, greatest common divisor (GCD), least common multiple (LCM), and square-free factorization; the representation of polynomials with symbolic coefficients; special operations such as the computation of resultants; deriving trigonometric identities; partial fraction decomposition; and facilities for Gröbner bases over polynomial rings and fields.

Calculus

Various functionalities supporting the operations required by basic and advanced calculus are provided in this module. It supports limits through a limit function, as well as differentiation, integration, series expansion, differential equations, and the calculus of finite differences. SymPy also has special support for definite integrals and integral transforms. For differentiation, it supports numerical differentiation, composition of derivatives, and fractional derivatives.

Solving equations

Solver is the name of the SymPy module that provides equation-solving functionality. This module supports solving complex polynomials, finding the roots of polynomials, and solving systems of polynomial equations. There is a function for solving algebraic equations. It provides support not only for the solution of differential equations (including ordinary differential equations, some forms of partial differential equations, and initial and boundary value problems), but also for the solution of difference equations. In mathematics, a difference equation is also called a recurrence relation, that is, an equation that recursively defines a sequence or a multidimensional array of values.

Discrete math

Discrete mathematics includes those mathematical structures that are discrete in nature, rather than continuous as in calculus. It deals with integers, graphs, statements from logic theory, and so on. This module has full support for binomial coefficients, products, and summations. It also supports various functions from number theory, including residue theory, Euler's totient, partitions, and a number of functions dealing with prime numbers and their factorizations. SymPy also supports the creation and manipulation of logic expressions using symbolic and Boolean values.

Matrices

SymPy has strong support for various operations related to matrices and determinants. Matrices belong to the linear algebra category of mathematics. It supports the creation of matrices, basic matrix operations such as multiplication and addition, matrices of zeros and ones, the creation of random matrices, and operations on matrix elements. It also supports special functions, such as the computation of the Hessian matrix for a function, the Gram-Schmidt process on a set of vectors, and the computation of the Wronskian for a matrix of functions. It has full support for eigenvalues and eigenvectors, matrix inversion, and the solution of matrices and determinants. To compute the determinant of a matrix, it supports the Bareiss fraction-free algorithm and the Berkowitz algorithm, besides other methods. For matrices, it also supports nullspace calculation, cofactor expansion tools, derivative calculation for matrix elements, and calculation of the dual of a matrix.
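As a brief illustration of the solver and matrix functionality just described, the following sketch solves a simple algebraic equation and an ordinary differential equation, and performs a few basic matrix operations. The equations and matrix entries are arbitrary choices for demonstration and are not taken from the text:

from sympy import Symbol, Function, Eq, solve, dsolve, Matrix

x = Symbol('x')

# Algebraic equation: roots of x**2 - 5*x + 6 = 0
print(solve(Eq(x**2 - 5*x + 6, 0), x))                 # [2, 3]

# Ordinary differential equation: f''(x) + f(x) = 0
f = Function('f')
print(dsolve(Eq(f(x).diff(x, 2) + f(x), 0), f(x)))     # f(x) = C1*sin(x) + C2*cos(x)

# Basic matrix operations: determinant, inverse, and eigenvalues
m = Matrix([[1, 2], [3, 4]])
print(m.det())        # -2
print(m.inv())
print(m.eigenvals())  # {5/2 - sqrt(33)/2: 1, 5/2 + sqrt(33)/2: 1}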
Geometry

SymPy also has a module that supports various operations associated with two-dimensional (2-D) geometry. It supports the creation of various 2-D entities or objects, such as points, lines, circles, ellipses, polygons, triangles, rays, and segments. It also allows us to perform queries on these entities, such as the area of suitable objects (an ellipse, circle, or triangle) and the intersection points of lines. It supports other queries as well, such as tangency determination and finding the similarity and intersection of entities.

Plotting

There is a very good module that allows us to draw two-dimensional and three-dimensional plots. At present, the plots are rendered using the matplotlib package; other backends such as TextBackend, Pyglet, and textplot are also supported. It has a very good interactive interface, facilities for customization, and plotting of various geometric entities. The plotting module has the following functions:

Plotting of 2-D line plots
Plotting of 2-D parametric plots
Plotting of 2-D implicit and region plots
Plotting of 3-D plots of functions involving two variables
Plotting of 3-D line and surface plots

Physics

There is also a module for solving problems from the physics domain. It supports functionality for mechanics, including classical and quantum mechanics, as well as high-energy physics. It has functions supporting Pauli algebra and quantum harmonic oscillators in 1-D and 3-D, and it also has functionality for optics. There is a separate module that integrates unit systems into SymPy. This allows users to select a specific unit system for performing their computations and to convert between units. The unit systems are composed of units and constants for computations.

Statistics

The statistics module was introduced in SymPy to support the various statistical concepts required in mathematical computations. Apart from supporting various continuous and discrete statistical distributions, it also supports functionality related to symbolic probability. Generally, these distributions support functions for random number generation in SymPy.

Printing

SymPy has a module that provides full support for pretty-printing. Pretty-printing is the application of stylistic formatting to text content such as source code, text files, markup files, or similar content. This module produces the desired output by printing with ASCII and/or Unicode characters. It supports various printers, such as the LaTeX and MathML printers. It is also capable of producing source code in various programming languages, such as C, Python, or Fortran, and of producing content in markup languages such as HTML and XML.

SymPy modules

The following list gives the formal names of the modules discussed in the preceding paragraphs:

assumptions: the assumption engine
concrete: symbolic products and summations
core: the basic class structure: Basic, Add, Mul, Pow, and so on
functions: elementary and special functions
galgebra: geometric algebra
geometry: geometric entities
integrals: the symbolic integrator
interactive: interactive sessions (for example, IPython)
logic: Boolean algebra and theorem proving
matrices: linear algebra and matrices
mpmath: fast arbitrary-precision numerical math
ntheory: number-theoretical functions
parsing: Mathematica and Maxima parsers
physics: physical units and quantum mechanics
plotting: 2-D and 3-D plots using Pyglet
polys: polynomial algebra and factorization
printing: pretty-printing and code generation
series: symbolic limits and truncated series
simplify: rewriting expressions in other forms
solvers: algebraic, recurrence, and differential solvers
statistics: standard probability distributions
utilities: the test framework and compatibility code

There are numerous symbolic computing systems available in various mathematical toolkits.
There is some proprietary software, such as Maple and Mathematica, and there are open source alternatives as well, such as Singular and AXIOM. However, these products have their own scripting languages, it is difficult to extend their functionality, and they have slow development cycles. SymPy, on the other hand, is highly extensible, is designed and developed in Python, and is an open source API that supports a speedy development life cycle.

Simple exemplary programs

These are some very simple examples that give an idea of SymPy's capabilities. Each is less than ten lines of SymPy source code, and together they cover topics ranging from basic symbol manipulation to limits, differentiation, and integration. We can test the execution of these programs on SymPy Live, which runs SymPy online on Google App Engine and is available at http://live.sympy.org/.

Basic symbol manipulation

The following code defines three symbols and an expression on these symbols, and finally prints the expression.

import sympy

a = sympy.Symbol('a')
b = sympy.Symbol('b')
c = sympy.Symbol('c')
e = (a * b * b + 2 * b * a * b) + (a * a + c * c)
print(e)

Output:

a**2 + 3*a*b**2 + c**2

(Here, ** represents the power operation.)

Expression expansion in SymPy

The following program demonstrates the concept of expression expansion. It defines two symbols and a simple expression on these symbols, and finally prints the expression and its expanded form.

import sympy

a = sympy.Symbol('a')
b = sympy.Symbol('b')
e = (a + b) ** 4
print(e)
print(e.expand())

Output:

(a + b)**4
a**4 + 4*a**3*b + 6*a**2*b**2 + 4*a*b**3 + b**4

Simplification of an expression or formula

SymPy has facilities to simplify mathematical expressions. The following program simplifies two expressions and displays the results.

import sympy
from sympy import exp, simplify

x = sympy.Symbol('x')
a = 1/x + (x*exp(x) - 1)/x
print(simplify(a))
print(simplify((x ** 3 + x ** 2 - x - 1)/(x ** 2 + 2 * x + 1)))

Output:

exp(x)
x - 1

Simple integrations

The following program calculates the integrals of two simple functions.

import sympy
from sympy import integrate

x = sympy.Symbol('x')
print(integrate(x ** 3 + 2 * x ** 2 + x, x))
print(integrate(x / (x ** 2 + 2 * x), x))

Output:

x**4/4 + 2*x**3/3 + x**2/2
log(x + 2)

Summary

In this article, we have discussed the concepts, features, and selected sample programs of various scientific computing APIs and toolkits. The article started with a discussion of NumPy and SciPy. After covering NumPy, we discussed the concepts associated with symbolic computing and SymPy. In the remainder of the article, we discussed interactive computing and data analysis and visualization, along with their APIs and toolkits: IPython is the Python toolkit for interactive computing, and we also discussed the data analysis package pandas and the data visualization API named matplotlib.

Resources for Article: Further resources on this subject: Optimization in Python [article] How to do Machine Learning with Python [article] Bayesian Network Fundamentals [article]

Common QlikView script errors

Packt
22 Nov 2013
3 min read

(For more resources related to this topic, see here.) QlikView error messages displayed during the running of the script, during reload, or just after the script is run are key to understanding what errors are contained in your code. After an error is detected and the error dialog appears, review the error, and click on OK or Cancel on the Script Error dialog box. If you have the debugger open, click on Close, then click on Cancel on the Sheet Properties dialog. Re-enter the Script Editor and examine your script to fix the error. Errors can come up as a result of syntax, formula or expression errors, join errors, circular logic, or any number of issues in your script. The following are a few common error messages you will encounter when developing your QlikView script. The first one, illustrated in the following screenshot, is the syntax error we received when running the code that missed a comma after Sales. This is a common syntax error. It's a little bit cryptic, but the error is contained in the code snippet that is displayed. The error dialog does not exactly tell you that it expected a comma in a certain place, but with practice, you will realize the error quickly. The next error is a circular reference error. This error will be handled automatically by QlikView. You can choose to accept QlikView's fix of loosening one of the tables in the circular reference (view the data model in Table Viewer for more information on which table is loosened, or view the Document Properties dialog, Tables tab to find out which table is marked Loosely Coupled). Alternatively, you can choose another table to be loosely coupled in the Document Properties, Tables tab, or you can go back into the script and fix the circular reference with one of the methods. The following screenshot is a warning/error dialog displayed when you have a circular reference in a script: Another common issue is an unknown statement error that can be caused by an error in writing your script—missed commas, colons, semicolons, brackets, quotation marks, or an improperly written formula. In the case illustrated in the following screenshot, the error has encountered an unknown statement—namely, the Customers line that QlikView is attempting to interpret as Customers Load *…. The fix for this error is to add a colon after Customers in the following way: Customers: There are instances when a load script will fail silently. Attempting to store a QVD or CSV to a file that is locked by another user viewing it is one such error. Another example is when you have two fields with the same name in your load statement. The debugger can help you find the script lines in which the silent error is present. Summary In this article we learned about QlikView error messages displayed during the script execution. Resources for Article: Further resources on this subject: Meet QlikView [Article] Introducing QlikView elements [Article] Linking Section Access to multiple dimensions [Article]

Getting Started with Apache Spark DataFrames

Packt
22 Sep 2015
5 min read

In this article about Arun Manivannan's book Scala Data Analysis Cookbook, we will cover the following recipes:

Getting Apache Spark ML – a framework for large-scale machine learning
Creating a data frame from CSV

(For more resources related to this topic, see here.)

Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Pulling Spark into our project is just a matter of adding a few dependencies to our build.sbt file: spark-core, spark-sql, and spark-mllib:

Under a brand new folder (which will be our project root), we create a new file called build.sbt.
Next, let's add the Spark libraries to the project dependencies:

organization := "com.packt"

name := "chapter1-spark-csv"

scalaVersion := "2.10.4"

val sparkVersion = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)

resolvers ++= Seq(
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
)

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on. For this recipe, we skipped Streaming and GraphX and added the remaining three libraries.

There's more…

Apache Spark is a cluster computing platform that claims to run about 100 times faster than Hadoop (that's a mouthful). In our terms, we can consider it a means to run our complex logic over a massive amount of data at a blazingly high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but it also takes less time to write them in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a Delimiter Separated Values (DSV) file. The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks.
So, let's go ahead and add that to build.sbt. After adding the spark-csv dependency, our complete build.sbt looks as follows:

organization := "com.packt"

name := "chapter1-spark-csv"

scalaVersion := "2.10.4"

val sparkVersion = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "com.databricks" %% "spark-csv" % "1.0.3"
)

resolvers ++= Seq(
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
)

fork := true

Before we create the actual data frame, there are three steps that we ought to perform: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds all of the information for running this Spark cluster. For this recipe, we are running locally, and we intend to use only two cores in the machine—local[2]. We then create the Spark context and the SQL context from this configuration:

val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

For this recipe, we'll be running Spark in standalone mode. Now, let's load our pipe-separated file into an org.apache.spark.sql.DataFrame:

val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed.

Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and the options as a map. We can specify the same parameters that we specified earlier using a Map, like this:

val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")
val newStudents = sqlContext.load("com.databricks.spark.csv", options)

Summary

In this article, you learned about Apache Spark ML, a framework for large-scale machine learning. We then saw how to create a data frame from CSV with the help of example code.

Resources for Article: Further resources on this subject: Integrating Scala, Groovy, and Flex Development with Apache Maven [article] Ridge Regression [article] Reactive Data Streams [article]

Learning D3.js Mapping

Packt
08 May 2015
3 min read

What is D3.js?

D3.js (Data-Driven Documents) is a JavaScript library used for data visualization. D3.js is used to display graphical representations of information in a browser using JavaScript. Because they are, essentially, a collection of JavaScript code, graphical elements produced by D3.js can react to changes made on the client or server side. D3.js has seen major adoption as websites include more dynamic charts, graphs, infographics, and other forms of visualized data.

(For more resources related to this topic, see here.)

Why this book?

This book, by the authors Thomas Newton and Oscar Villarreal, explores the JavaScript library D3.js and its ability to help us create maps and amazing visualizations. You will no longer be confined to third-party tools in order to get a nice looking map. With D3.js, you can build your own maps and customize them as you please. This book will go from the basics of SVG and JavaScript to data trimming and modification with TopoJSON. Using D3.js to glue together these three key ingredients, we will create very attractive maps that cover many common use cases for maps, such as choropleths, data overlays on maps, and interactivity.

Key features

Dive into D3.js and apply its powerful data binding ability in order to create stunning visualizations
Learn the key concepts of SVG, JavaScript, CSS, and the DOM in order to project images onto the browser
Solve a wide range of problems faced while building interactive maps with this solution-based guide

Authors

Thomas Newton has 20 years of experience in the technical industry, working on everything from low-level system designs and data visualization to software design and architecture. Currently, he is creating data visualizations to solve analytical problems for clients.

Oscar Villarreal has been developing interfaces for the past 10 years, and most recently, he has been focusing on data visualization and web applications. In his spare time, he loves to write on his blog, oscarvillarreal.com.

In short

You will find a few books on D3.js, but they either require intermediate-level developers who already know how to use D3.js, or they cover a much wider range of D3 usage. This book, by contrast, is focused exclusively on mapping; it fully explores this core task and takes a solution-based approach. The recommendations and the wealth of knowledge that the authors have shared in the book are based on many years of experience and many projects delivered using D3.

What this book covers

Learn all the tools you need to create a map using D3
A high-level overview of Scalable Vector Graphics (SVG) presentation, with an explanation of how it operates and what elements it encompasses
Exploring D3.js—producing graphics from data
A step-by-step guide to building a map with D3
Getting started with interactivity in your D3 map visualizations
The most important aspects of map visualization in detail, via the use of TopoJSON
Assistance with the long-term maintenance of your D3 code base, and different techniques to keep it healthy over the lifespan of your project

Summary

So far, you should have an idea of what will be covered in the book. This book is carefully designed to allow the reader to jump between chapters based on what they are planning to get out of it. Every chapter is full of pragmatic examples that can easily provide the foundation for more complex work. The authors have explained, step by step, how each example works.
Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]

Predicting Sports Winners with Decision Trees and pandas

Packt
12 Aug 2015
6 min read

In this article by Robert Craig Layton, author of Learning Data Mining with Python, we will look at predicting the winner of games of the National Basketball Association (NBA) using a different type of classification algorithm—decision trees.

Collecting the data

The data we will be using is the match history data for the NBA, for the 2013-2014 season. The Basketball-Reference.com website contains a significant number of resources and statistics collected from the NBA and other leagues. Perform the following steps to download the dataset:

1. Navigate to http://www.basketball-reference.com/leagues/NBA_2014_games.html in your web browser.
2. Click on the Export button next to the Regular Season heading.
3. Download the file to your data folder (and make a note of the path).

This will download a CSV file containing the results of 1,230 games in the regular season of the NBA. We will load the file with the pandas library, which is an incredibly useful library for manipulating data. Python also contains a built-in library called csv that supports reading and writing CSV files; however, we will use pandas instead, as it provides more powerful functions to work with datasets.

For this article, you will need to install pandas. The easiest way to do that is to use pip3, which you may previously have used to install scikit-learn:

$ pip3 install pandas

Using pandas to load the dataset

We can load the dataset using the read_csv function in pandas as follows:

import pandas as pd

# data_filename is the path to the CSV file downloaded above
dataset = pd.read_csv(data_filename)

The result of this is a data frame, a data structure used by pandas. The pandas.read_csv function has parameters to fix some of the problems in the data, such as missing headings, which we can specify when loading the file:

dataset = pd.read_csv(data_filename, parse_dates=["Date"], skiprows=[0,])
dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts",
                   "Home Team", "HomePts", "OT?", "Notes"]

We can now view a sample of the data frame:

dataset.ix[:5]

Extracting new features

We extract our classes—1 for a home win, and 0 for a visitor win. We can specify this using the following code to extract those wins into a NumPy array:

dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values

The first two new features we want to create indicate whether each of the two teams won their previous game. This roughly approximates which team is currently playing well. We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether the team won the last time we saw it:

from collections import defaultdict

won_last = defaultdict(int)

We can then iterate over all the rows and update the current row with each team's last result (win or loss):

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    dataset.ix[index] = row

We then update our dictionary, still inside the loop, with each team's result (from this row) for the next time we see these teams:

    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
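Before fitting any model, it is worth checking the baseline: how often does the home team win at all? This puts the accuracy figures reported below into context. The following is a small sketch that is not part of the original recipe; the exact percentage depends on the season's data, but in an NBA season the home team typically wins somewhere around 58 percent of its games, so any classifier should be judged against the trivial "always predict a home win" strategy:

import numpy as np

n_games = len(dataset)
n_home_wins = dataset["HomeWin"].sum()
print("Home team wins {0:.1f}% of games".format(100 * n_home_wins / n_games))

# Accuracy of the trivial strategy that always predicts a home win
baseline_accuracy = np.mean(np.ones(n_games) == y_true)
print("Baseline accuracy: {0:.1f}%".format(100 * baseline_accuracy))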
Decision trees

Decision trees are a class of classification algorithm that work like a flow chart: they consist of a sequence of nodes, where the values of a sample's features are used to decide which node to go to next. We can use the DecisionTreeClassifier class to create a decision tree:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=14)

We now need to extract the dataset from our pandas data frame in order to use it with our scikit-learn classifier. We do this by specifying the columns we wish to use and using the values parameter of a view of the data frame:

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values

Decision trees are estimators and, therefore, have fit and predict methods. We can also use the cross_val_score method as before to get the average score:

import numpy as np
# In newer scikit-learn versions, cross_val_score lives in sklearn.model_selection
from sklearn.cross_validation import cross_val_score

scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This scores up to 56.1%; we are better off choosing randomly!

Predicting sports outcomes

We have a method for testing how accurate our models are, using the cross_val_score method, which allows us to try new features. For the first feature, we will create a feature that tells us whether the home team is generally better than the visitors, by checking whether it ranked higher in the previous season. To obtain the data, perform the following steps:

1. Head to http://www.basketball-reference.com/leagues/NBA_2013_standings.html.
2. Scroll down to Expanded Standings. This gives us a single list for the entire league.
3. Click on the Export link to the right of this heading.
4. Save the download in your data folder.
5. In your IPython Notebook, enter the following into a new cell. You'll need to ensure that the file was saved into the location pointed to by the data_folder variable:

import os

standings_filename = os.path.join(data_folder,
                                  "leagues_NBA_2013_standings_expanded-standings.csv")
standings = pd.read_csv(standings_filename, skiprows=[0, 1])

We then iterate over the rows and compare the teams' standings:

dataset["HomeTeamRanksHigher"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]

Between 2013 and 2014, a team was renamed, so inside the loop we adjust for that as follows:

    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"

Now, we can get the rankings for each team. We then compare them and update the feature in the row:

    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    dataset.ix[index] = row

Next, we use the cross_val_score function to test the result. First, we extract the dataset as before:

X_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values

Then, we create a new DecisionTreeClassifier and run the evaluation:

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This now scores up to 60.3%—even better than our previous result. Unleash the full power of Python machine learning with our 'Learning Data Mining with Python' book.

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

Sugandha Lahoti
05 Jan 2018
5 min read

We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market, in our second resolution. Now it’s time to think about the kind of person you want to be and the legacy you will leave behind. 3rd Resolution: Choose projects wisely and be mindful of their impact. Your work has real consequences. And your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes and assets clearly. The next most important factor is choosing the project team. 1. Seek out, learn from and work with a diverse group of people To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, but it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals like web developers, software programmers, data analysts, data administrators, game developers etc. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective. Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project’s success. These include UX designers, people with humanities background if you are building a product intended to participate in society (which most products often are), business development folks, who actually sell your product and bring revenue, marketing people, who are responsible for bringing your product to a much wider audience to name a few. Working with people of diverse skill sets will help market your product right and make it useful and interpretable to the target audience. In addition to working with a melange of people with diverse skill sets and educational background it is also important to work with people who think differently from you, and who have experiences that are different from yours to get a more holistic idea of the problems your project is trying to tackle and to arrive at a richer and unique set of solutions to solve those problems. 2. Educate yourself on ethics for data science As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making.  Here are some ways to develop a mind inclined to designing ethically-sound data science projects and models. Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford’s talk on The trouble with bias, Tricia Wang on The human insights missing from big data and Ethics & Data Science by Jeff Hammerbacher. Follow top influencers on social media and catch up with their blogs and about their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum among others. Take up courses which will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic Philosophy from your choice of University.   
Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop. We recommend browsing through The Stanford Encyclopedia of Philosophy, which is an online archive of peer-reviewed publication of original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics, a book by Peter Singer and The Elements of Moral Philosophy by James Rachels. Attend or follow upcoming conferences in the field of bringing transparency in socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled on February 23 and 24th, 2018 at New York University, NYC. We also have the 5th annual conference of FAT/ML, later in the year.  3. Question/Reassess your hypotheses before, during and after actual implementation Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Always ask yourself these questions after each of the above steps and compare them with the previous answers. What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact? What data are you using? Is the data-type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them? What techniques are you going to try? What algorithms are you going to implement? What would be its complexity? Is it interpretable and transparent? How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible? These pointers will help you evaluate your project goals from a customer and business point of view. Additionally, it will also help you in building efficient models which can benefit the society and your organization at large. With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job. All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018. “Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!

SAP HANA Architecture

Packt
20 Dec 2013
12 min read

(For more resources related to this topic, see here.)

Understanding the SAP HANA architecture

Architecture is what makes SAP HANA a game-changing, innovative technology. SAP HANA has been designed so well, architecture-wise, that it stands apart from the traditional databases available today. This section explains the various components of SAP HANA and their functionality.

Getting ready

Enterprise application requirements have become more demanding—complex reports with heavy computation over huge volumes of transactional data, as well as business data in other formats (both structured and semi-structured). Data is written or updated, and also read from the database, in parallel. Thus, the integration of both transactional and analytical data into a single database is required, and this is where SAP HANA comes in. Columnar storage exploits modern hardware and technology (multiple CPU cores, large main memory, and caches) to meet the requirements of enterprise applications. Apart from this, the database should also support procedural logic for tasks that cannot be completed with simple SQL.

How it works…

The SAP HANA database consists of several services (servers). The index server is the most important of these. The other servers are the name server, preprocessor server, statistics server, and XS Engine:

Index server: This server holds the actual data and the engines for processing the data. When SQL or MDX is fired against the SAP HANA system in the context of authenticated sessions and transactions, the index server takes care of these commands and processes them.

Name server: This server holds complete information about the system landscape and is responsible for the topology of the SAP HANA system. In a distributed system, SAP HANA instances run on multiple hosts. In this kind of setup, the name server knows where the components are running and how data is spread across the different servers.

Preprocessor server: This server comes into the picture during text data analysis. The index server uses the capabilities of the preprocessor server for text data analysis and searching. This helps to extract the information on which text search capabilities are based.

Statistics server: This server collects the data for the system monitor and helps you know the health of the SAP HANA system. The statistics server is responsible for collecting data related to the status, resource allocation/consumption, and performance of the SAP HANA system. Monitoring clients and the various alert monitors use the data collected by the statistics server. This server also provides a history of measurement data for further analysis.

XS Engine: The XS Engine allows external applications and application developers to access the SAP HANA system via XS Engine clients; for example, a web browser accesses SAP HANA apps built by application developers via HTTP. Application developers build applications by using the XS Engine, and users access the apps via HTTP using a web browser. The persistent model in the SAP HANA database is converted into a consumption model for clients to access via HTTP. This allows an organization to host system services that are part of the SAP HANA database (for example, a search service and a built-in web server that provides access to static content in the repository).

The following diagram shows the architecture of SAP HANA:

There's more...
Let us continue learning about the remaining components:

SAP Host Agent: According to the new approach from SAP, the SAP Host Agent should be installed on all machines that are related to the SAP landscape. It is used by the Adaptive Computing Controller (ACC) to manage the system and by the Software Update Manager (SUM) for automatic updates.

LM-structure: The LM-structure for SAP HANA contains information about the current installation details. This information is used by SUM during automatic updates.

SAP Solution Manager diagnostic agent: This agent provides all the data to SAP Solution Manager (SAP SOLMAN) to monitor the SAP HANA system. After SAP SOLMAN is integrated with the SAP HANA system, this agent provides information about the database at a glance, which includes the database state and general information about the system, such as alerts, CPU, memory, and disk usage.

SAP HANA Studio repository: This helps end users update SAP HANA Studio to higher versions; the SAP HANA Studio repository is the code that performs this process.

Software Update Manager for SAP HANA: This helps with automatic updates of SAP HANA from the SAP Marketplace and with patching the SAP Host Agent. It also allows distribution of the Studio repository to end users.

See also

http://help.sap.com/hana/SAP_HANA_Installation_Guide_en.pdf
SAP Notes: 1793303 and 1514967

Explaining IMCE and its components

We have seen the architecture of SAP HANA and its components. In this section, we will learn about the in-memory computing engine (IMCE), its components, and their functionality.

Getting ready

The SAP in-memory computing engine (formerly the Business Analytic Engine (BAE)) is the core engine for SAP's next-generation, high-performance, in-memory solutions, as it leverages technologies such as in-memory computing, columnar databases, massively parallel processing (MPP), and data compression to allow organizations to instantly explore and analyze large volumes of transactional and analytical data from across the enterprise in real time.

How it works...

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions. The SAP in-memory computing database delivers the following capabilities:

In-memory computing functionality with native support for row and columnar data stores, providing full ACID (atomicity, consistency, isolation, and durability) transactional capabilities
Integrated lifecycle management capabilities and data integration capabilities to access SAP and non-SAP data sources
SAP IMCE Studio, which includes tools for data modeling, data and lifecycle management, and data security

The SAP IMCE that resides at the heart of SAP HANA is an integrated database and calculation layer that allows the processing of massive quantities of real-time data in the main memory to provide immediate results from analysis and transactions. Like any standard database, the SAP IMCE not only supports industry standards such as SQL and MDX, but also incorporates a high-performance calculation engine that embeds procedural language support directly into the database kernel. This approach is designed to remove the need to read data from the database, process it, and then write data back to the database; that is, it processes the data near the database and returns the results. The IMCE is an in-memory, column-oriented database technology and a powerful calculation engine.
As the data resides in Random Access Memory (RAM), highly accelerated performance can be achieved compared to systems that read data from disks. The heart of the system lies within the IMCE, which allows us to create data and perform calculations on it. SAP IMCE Studio includes tools for data modeling activities, data and lifecycle management, and tools related to data security.

The following diagram shows the components of the IMCE alone:

There's more…

The SAP HANA database has the following two engines:

Column-based store: This engine stores huge amounts of relational data in column-optimized tables, which are aggregated and used in analytical operations.

Row-based store: This engine stores relational data in rows, similar to the storage mechanism of traditional database systems. The row store is more optimized for write operations and has a lower compression rate. Also, query performance is lower when compared to the column-based store.

The engine that is used to store data can be selected on a per-table basis at the time of creating a table. Tables in the row-based store are loaded at startup time. In the case of the column-based store, tables can be loaded either at startup or on demand, that is, during normal operation of the SAP HANA database.

Both engines share a common persistence layer, which provides data persistency that is consistent across both engines. Like a traditional database, SAP HANA has page management and logging. The changes made to the in-memory database pages are persisted through savepoints. These savepoints are written to the data volumes on the persistent storage, whose storage medium is hard drives. All transactions committed in the SAP HANA database are recorded by the logger of the persistency layer in a log entry written to the log volumes on the persistent storage. To get high I/O performance and low latency, the log volumes use flash storage technology.

The relational engines can be accessed through a variety of interfaces. The SAP HANA database supports SQL (JDBC/ODBC), MDX (ODBO), and BICS (SQLDBC). The calculation engine performs all the calculations in the database; no data moves into the application layer until the calculations are completed. It also contains the business functions library that is called by applications to perform calculations based on business rules and logic. The SAP HANA-specific SQLScript language is an extension of SQL that can be used to push down data-intensive application logic into the SAP HANA database for specific requirements.

Session management

This component creates and manages sessions and connections for the database clients. When a session is created, a set of parameters is maintained, such as the auto-commit setting or the current transaction isolation level. After establishing a session, database clients communicate with the SAP HANA database using SQL statements. The SAP HANA database treats all statements as transactions while processing them, and each new session is assigned to a new transaction.

Transaction manager

The transaction manager is the component that coordinates database transactions, takes care of controlling transaction isolation, and keeps track of running and closed transactions. The transaction manager informs the involved storage engines about running or closed transactions, so that they can execute the necessary actions when a transaction is committed or rolled back.
The transaction manager cooperates with the persistence layer to achieve atomic and durable transactions.

Client requests are analyzed and executed by a set of components summarized as request processing and execution control. Client requests are analyzed by a request parser and then dispatched to the responsible component. Transaction control statements are forwarded to the transaction manager, data definition statements are sent to the metadata manager, and object invocations are dispatched to the object store. Data manipulation statements are sent to the optimizer, which creates an optimized execution plan that is given to the execution layer.

The SAP HANA database also has built-in support for domain-specific models (such as for the financial planning domain), and it offers scripting capabilities that allow application-specific calculations to run inside the database. It has its own scripting language, named SQLScript, which is designed to enable optimizations and parallelization. SQLScript is based on side-effect free functions that operate on tables using SQL queries for set processing. The SAP HANA database also contains a component called the planning engine that allows financial planning applications to execute basic planning operations in the database layer. For example, while applying filters/transformations, a new version of a dataset is created as a copy of an existing one. Another example of a planning operation is the disaggregation operation, in which target values are distributed from higher to lower aggregation levels based on a distribution function.

Metadata manager

The metadata manager helps in accessing metadata. The SAP HANA database's metadata consists of a variety of objects, such as definitions of tables, views, and indexes, SQLScript function definitions, and object store metadata. All these types of metadata are stored in one common catalog for all the SAP HANA database stores. Metadata is stored in tables in the row store. SAP HANA features such as transaction support and multi-version concurrency control (MVCC) are also used for metadata management. In a distributed database system, central metadata is shared across the servers. The mechanism behind metadata storage and sharing is hidden from the components that use the metadata manager.

As row-based tables and columnar tables can be combined in one SQL statement, both the row and column engines must be able to consume the intermediate results created by the other. The main difference between the two engines is the way they process data: the row store operators process data in a row-at-a-time fashion, whereas column store operations (such as scan and aggregate) require the entire column to be available in contiguous memory locations. To exchange intermediate results, the row store provides results to the column store materialized as complete rows in memory, while the column store can expose results using the iterator interface needed by the row store.

Persistence layer

The persistence layer is responsible for the durability and atomicity of transactions. It ensures that the database is restored to the most recent committed state after a restart, and makes sure that transactions are either completely executed or completely rolled back. To achieve this in an efficient way, the persistence layer uses a combination of write-ahead logs, shadow paging, and savepoints. Moreover, the persistence layer also offers interfaces for writing and reading data.
It also contains SAP HANA's logger, which manages the transaction log.

Authorization manager

The authorization manager is invoked by other SAP HANA database components to check whether users have the required privileges to execute the requested operations. Privileges can be granted to users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, or execute, for data manipulation) on a specified object, such as a table, view, or SQLScript function. Analytic privileges represent filters or hierarchy drill-down limitations for analytic queries. Analytic privileges, such as granting access to values with a certain combination of dimension attributes, are supported in SAP HANA. Users are authenticated either by the SAP HANA database itself (logging in with a username and password), or authentication can be delegated to third-party external authentication providers, such as an LDAP directory.

See also

SAP HANA in-memory analytics and in-memory computing, available at http://scn.sap.com/people/vitaliy.rudnytskiy/blog/2011/03/22/time-to-update-your-sap-hana-vocabulary

Summary

This article explained the SAP HANA architecture and the IMCE in brief.

Resources for Article: Further resources on this subject: SAP HANA integration with Microsoft Excel [Article] Data Migration Scenarios in SAP Business ONE Application- part 2 [Article] Data Migration Scenarios in SAP Business ONE Application- part 1 [Article]

Visualization as a Tool to Understand Data

Packt
22 Sep 2014
23 min read

In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding. (For more resources related to this topic, see here.) In the last few decades, the quick growth in the volume of information we produce and the capacity of digital information storage have opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques. Along with the size, the types and categories of data have also evolved. Along with the typical and popular data domain in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science. Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of. For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results would give us our desired outcome. Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Inference and refining the analysis methods based on the inference is an iterative process that needs to be carried out several times until we think that a set of hypotheses is formed, or a clear question is asked for further analysis, or a question is answered with enough evidence. Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain beautiful insights about their datasets. The power of visualization resides not only in the ease of interpretation, but it also reveals visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline—validation, verification, analysis, and inference—to aid the data scientist. How have you visualized your data recently? If you still have not, it is okay, as this book will teach you exactly that. 
However, if you have had the opportunity to play with any kind of data already, I want you to take a moment and think about the techniques you have used to visualize your data so far. Make a list of them.

Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word clouds, parallel coordinates, isosurfaces, and maps somewhere in that list? If yes, then you are already familiar with some modern visualization techniques; and if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code.

The aim of this book is to introduce a Mathematica beginner to the data-analysis and visualization powerhouse built into Mathematica, and at the same time familiarize the reader with some of the modern visualization techniques that can be easily built with it. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization.

The importance of visualization

Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be argued to be visualizations, as they convey historical data through a visual medium. Map visualizations have been used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century are believed to have built the first visualizations of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). Therefore, it appears that many, since ancient times, have recognized the importance of visualization in giving some meaning to data.

To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes the data needs to be changed or transformed into the correct form that allows us to use a particular tool. Visualizing the data points ensures that we do not fall into any such trap. The following screenshot shows a set of points fitted with a polynomial:

Figure 1.1: Fitting a polynomial

In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit in trying to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (essentially a circle) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1. Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet:

Figure 1.2: Anscombe's quartet, generated using Mathematica

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book.
The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF.

Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how a simple data visualization like a plot can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and that give rise to the same linear model when a regression routine is run on them. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line onto the data. The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again.

These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose as the plotting examples we just saw (to investigate data and infer valuable insights), but on types of datasets other than simple point clouds.

Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive: the ability to view data from different perspectives (viewing angles) and at different resolutions. These features help the investigator understand both the micro- and macro-level behavior of their dataset.

Types of datasets

There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book.

Tables

The table is one of the most common data structures in Computer Science. You might have already encountered it in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following table as an example:

           | Attribute 1 | Attribute 2 | ...
    Item 1 |             |             |
    Item 2 |             |             |
    Item 3 |             |             |

When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For three-dimensional vectors or more, we could just increase the number of attributes accordingly.
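To make this concrete, here is a minimal Wolfram Language sketch (the data and the variable names are my own, not taken from the book) that stores a handful of two-dimensional vectors as rows of {x, y} and plots them:

    (* each row is one data item; the two attributes are the x and y coordinates *)
    vectors = {{1.0, 2.0}, {2.5, 0.5}, {3.0, 4.0}, {-1.0, 1.5}};

    (* view the table with row and column headings, then plot the points *)
    TableForm[vectors,
     TableHeadings -> {{"Item 1", "Item 2", "Item 3", "Item 4"}, {"x", "y"}}]
    ListPlot[vectors, AxesLabel -> {"x", "y"}]

Adding a third coordinate is simply a matter of appending a z value to each row.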
Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea to get introduced to them now. Here, we explain the general concepts.

Scalar fields

There are many kinds of scientific datasets out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas within their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers, and we will eventually learn how to create molecular visualizations and biomedical dataset exploration tools once we feel comfortable manipulating these datasets.

In practice, multidimensional data (just like the vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room. They would begin by measuring the geometry and shape of the room and placing temperature sensors at certain places. They would note the exact positions of those sensors relative to the room's coordinate system, and then they would be all set to start measuring the temperature.

Thus, the temperature of a room can be represented, in a discrete sense, by a set of points that represent the temperature sensor locations and the actual temperature at those points. We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (the temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D:

Figure 1.3: In practice, scalar fields are discrete and ordered

Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field has {x, y, z} values, along with a temperature value.

A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point (for 3D) as a row, but this time, an additional attribute for the scalar value will be added to the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique for analyzing scalar fields is to find the isocontours or isosurfaces of interest. For now, though, let's take a look at the kinds of application areas such analysis enables.

Instead of temperature, one could associate regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space, so a chemist would create a scalar field with charge values at specific points.
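As a minimal sketch of this idea (the data here is synthetic and the variable names are my own, not an example from the book), a 2D scalar field can be stored as rows of {x, y, T} and handed directly to Mathematica's built-in list-plotting functions:

    (* build a synthetic "temperature" field on a regular 2D grid; each row is {x, y, T} *)
    field = Flatten[
       Table[{x, y, Exp[-(x^2 + y^2)/4.]}, {x, -3, 3, 0.25}, {y, -3, 3, 0.25}],
       1];

    (* isocontours of the field, and a colour-mapped density view of the same data *)
    ListContourPlot[field, FrameLabel -> {"x", "y"}]
    ListDensityPlot[field, FrameLabel -> {"x", "y"}]

A 3D field would simply add a z coordinate to each row ({x, y, z, T}) and use a function such as ListContourPlot3D instead.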
Similarly, for an aerospace engineer it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica.

Time series

Another widely used data type is the time series. A time series is a sequence of data points, usually measured over a uniform interval of time. Time series arise in many fields, but in today's world they are best known for their applications in economics and finance. They are also frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own tool to view time series data.

Time series can be easily represented using tables. Each row of the time series table represents one point in the series, with one attribute denoting the timestamp (the time at which the data point was recorded) and the other attribute storing the actual data value. If the starting time and the time interval are known, we can drop the time attribute and simply store the data value in each row; the actual timestamp of each value can then be calculated from the initial time and the time interval.

Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data.

Graphs

Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find beautiful insights about the dataset. Technically, a graph can be stored as a table, but Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem and an active research area in computer science. A proper layout, along with suitable color maps and size distributions, can produce very useful outputs.

Text

The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific package for state-of-the-art text visualization methods.

Cartographic data

As mentioned before, map visualization is one of the oldest forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps provide an excellent way to contrast and compare different countries, cities, or even communities.

Cartographic data comes in various forms. A common form of a single data item includes a latitude, a longitude, a location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place, and the attribute may not be numerical, but rather something qualitative, like text.
Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us.

Mathematica as a tool for visualization

At this point, you might be wondering why Mathematica is suited to visualizing all the kinds of datasets mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user.

Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the notebook. This makes Mathematica's capabilities similar to those of many widely used visualization languages. Mathematica also provides a plethora of functions to combine these primitives and make them interactive.

Speaking of interactivity, Mathematica provides a suite of functions for interactively displaying any of its processes. Not only visualization: any function or code evaluation can be visualized interactively. This is particularly helpful when managing and visualizing big datasets.

Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Beyond these, we will learn to use the advanced functions that will let us build our own visualization tools.

Another interesting feature is the collection of built-in datasets that the software provides to its users. This gives the user a nice playground in which to experiment with different datasets and visualization functions.

From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited to dealing with petabytes or exabytes of data (and neither are many other popular visualization tools), one often needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototyping such tools because of its efficient and fast data-handling capabilities, along with its many convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions.

Another feature that Mathematica offers the data scientist is the ability to keep the whole workflow within one notebook. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite, with a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features, all inside a single notebook. This helps the user focus on the data instead of irrelevant details.

By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs.
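As a small taste of the interactivity described above, the following toy example (my own, not one from the book) uses the built-in Manipulate function to attach a slider to an ordinary plot:

    (* drag the slider for f to change the frequency of the sine curve in real time *)
    Manipulate[
     Plot[Sin[f x], {x, 0, 2 Pi}, PlotRange -> {-1.1, 1.1}],
     {f, 1, 10}]

Evaluating this cell produces a live slider inside the notebook, and the same pattern can wrap around far more elaborate visualizations.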
If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will have developed several visualization prototypes, each requiring only a few lines of code!

Getting started with Mathematica

We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, open a new notebook by navigating to File | New | Notebook, and try the following experiments.

Creating and selecting cells

In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on its rightmost bracket. To select multiple cells, press Ctrl + right mouse button in Windows or Linux (or cmd + right mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell:

Figure 1.4: Selecting and evaluating cells in Mathematica

We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V on a Mac), or by using the Edit menu. To delete cell(s), select them and press the Delete key.

Evaluating a cell

A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter; the cells will be evaluated one after the other, in the sequence in which they appear in the notebook.

To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article (the polar coordinates fitting and Anscombe's quartet examples), select each cell (or all of them), and evaluate them.

If we evaluate a cell that uses variables declared in a previous cell, and the previous cell has not been evaluated yet, we may get errors. It is also possible that Mathematica will treat the unevaluated variables as symbolic expressions; in that case, no error will be displayed, but the results will no longer be numeric.

Suppressing output from a cell

If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line that we want to leave out of the output.

Cell formatting

Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert a cell to other formats for convenient typesetting. To change the format of cell(s), select them, navigate to Format | Style in the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes | Basic Math Assistant. Note that evaluating a text cell has no effect and produces no output.

Commenting

We can write any comment in a text cell, as it will be ignored during the evaluation of our code.
However, if we would like to write a comment inside an input cell, we use the (* operator to open a comment and the *) operator to close it, as shown in the following code snippet:

    (* This is a comment *)

The shortcut Ctrl + / (cmd + / on a Mac) can also be used to comment or uncomment a chunk of code. This operation is available in the menu bar as well.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Aborting evaluation

We can abort the currently running evaluation of a cell by navigating to Evaluation | Abort Evaluation in the menu bar, or simply by pressing Alt + . (period). This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results, or a process that might use up the available memory and shut down the Mathematica kernel.

Further reading

The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing strongly. Michael Friendly, from York University, published a freely available paper titled Milestones in the History of Data Visualization: A Case Study in Statistical Historiography, which is an entertaining compilation of the history of visualization methods.

The Visual Display of Quantitative Information by Edward R. Tufte, published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. It is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader should consider reading this book for deeper insights.

Summary

In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of datasets that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package was discussed, and we will see these properties demonstrated throughout the book with beautiful visualizations. Finally, we have taken the first step towards writing code in Mathematica.

Resources for Article:

Further resources on this subject:

Driving Visual Analyses with Automobile Data (Python) [article]
Importing Dynamic Data [article]
Interacting with Data for Dashboards [article]


Implementing Deep Learning with Keras

Amey Varangaonkar
05 Dec 2017
4 min read
[box type="note" align="" class="" width=""]The following excerpt is from the title Deep Learning with Theano, Chapter 5 written by Christopher Bourez. The book offers a complete overview of Deep Learning with Theano, a Python-based library that makes optimizing numerical expressions and deep learning models easy on CPU or GPU. [/box] In this article, we introduce you to the highly popular deep learning library - Keras, which sits on top of the both, Theano and Tensorflow. It is a flexible platform for training deep learning models with ease. Keras is a high-level neural network API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed to make implementing deep learning models as fast and easy as possible for research and development. You can install Keras easily using conda, as follows: conda install keras When writing your Python code, importing Keras will tell you which backend is used: >>> import keras Using Theano backend. Using cuDNN version 5110 on context None Preallocating 10867/11439 Mb (0.950000) on cuda0 Mapped name None to device cuda0: Tesla K80 (0000:83:00.0) Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0) Using cuDNN version 5110 on context dev1 Preallocating 10867/11439 Mb (0.950000) on cuda1 Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0) If you have installed Tensorflow, it might not use Theano. To specify which backend to use, write a Keras configuration file, ~/.keras/keras.json: { "epsilon": 1e-07, "floatx": "float32", "image_data_format": "channels_last", "backend": "theano" } It is also possible to specify the Theano backend directly with the environment Variable: KERAS_BACKEND=theano python Note that the device used is the device we specified for Theano in the ~/.theanorc file. It is also possible to modify these variables with Theano environment variables: KERAS_BACKEND=theano THEANO_FLAGS=device=cuda,floatX=float32,mode=FAST_ RUN python Programming with Keras Keras provides a set of methods for data pre-processing and for building models. Layers and models are callable functions on tensors and return tensors. In Keras, there is no difference between a layer/module and a model: a model can be part of a bigger model and composed of multiple layers. Such a sub-model behaves as a module, with inputs/outputs. Let's create a network with two linear layers, a ReLU non-linearity in between, and a softmax output: from keras.layers import Input, Dense from keras.models import Model inputs = Input(shape=(784,)) x = Dense(64, activation='relu')(inputs) predictions = Dense(10, activation='softmax')(x) model = Model(inputs=inputs, outputs=predictions) The model module contains methods to get input and output shape for either one or multiple inputs/outputs, and list the submodules of our module: >>> model.input_shape (None, 784) >>> model.get_input_shape_at(0) (None, 784) >>> model.output_shape (None, 10) >>> model.get_output_shape_at(0) (None, 10) >>> model.name 'Sequential_1' >>> model.input /dense_3_input >>> model.output Softmax.0 >>> model.get_output_at(0) Softmax.0 >>> model.layers [<keras.layers.core.Dense object at 0x7f0abf7d6a90>, <keras.layers.core.Dense object at 0x7f0abf74af90>] In order to avoid specify inputs to every layer, Keras proposes a functional way of writing models with the Sequential module, to build a new module or model composed. 
The following definition builds exactly the same model as shown previously; input_dim specifies the input dimension of the first block, which would otherwise be unknown and generate an error:

    from keras.models import Sequential
    from keras.layers import Dense, Activation

    model = Sequential()
    model.add(Dense(units=64, input_dim=784, activation='relu'))
    model.add(Dense(units=10, activation='softmax'))

The model is considered a module or layer that can be part of a bigger model:

    model2 = Sequential()
    model2.add(model)
    model2.add(Dense(units=10, activation='softmax'))

Each module/model/layer can then be compiled and trained with data:

    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(data, labels)

Thus, we see that it is fairly easy to train a model in Keras. The simplicity and ease of use that Keras offers makes it a very popular choice of tool for deep learning.

If you think this article is useful, check out the book Deep Learning with Theano for interesting deep learning concepts and their implementation using Theano. For more information on the Keras library and how to train efficient deep learning models, make sure to check out our popular title Deep Learning with Keras.