How-To Tutorials - Data

First Steps

Packt
24 Mar 2014
7 min read
Installing matplotlib

Before experimenting with matplotlib, you need to install it. Here we introduce some tips to get matplotlib up and running without too much trouble.

How to do it...

We have three likely scenarios: you might be using Linux, OS X, or Windows.

Linux

Most Linux distributions have Python installed by default and provide matplotlib in their standard package list. So all you have to do is use your distribution's package manager to install matplotlib automatically. In addition to matplotlib, we highly recommend that you install NumPy, SciPy, and SymPy, as they are meant to work together. The following commands install the default packages available in different Linux distributions:

Ubuntu: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command:
sudo apt-get install python-matplotlib python-numpy python-scipy python-sympy

ArchLinux: The default Python packages are compiled for Python 3. In a command terminal, enter the following command:
sudo pacman -S python-matplotlib python-numpy python-scipy python-sympy
If you prefer using Python 2.7, replace python with python2 in the package names.

Fedora: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command:
sudo yum install python-matplotlib numpy scipy sympy

There are other ways to install these packages; in this article, we propose the simplest and most seamless ones.

Windows and OS X

Windows and OS X do not have a standard package system for software installation. We have two options: using a ready-made self-installing package or compiling matplotlib from source. The second option involves much more work and is worth the effort only if you want the latest, bleeding-edge version of matplotlib. Therefore, in most cases, using a ready-made package is the more pragmatic choice.

You have several choices for ready-made packages: Anaconda, Enthought Canopy, Algorete Loopy, and more. All these packages provide Python, SciPy, NumPy, matplotlib, and more (a text editor and fancy interactive shells) in one go. All these systems also install their own package manager, from which you can install or uninstall additional packages as you would on a typical Linux distribution. For the sake of brevity, we will provide instructions only for Enthought Canopy; all the other systems have extensive documentation online, so installing them should not be too much of a problem. So, let's install Enthought Canopy by performing the following steps:

Download the Enthought Canopy installer from https://www.enthought.com/products/canopy. You can choose the free Express edition. The website can guess your operating system and propose the right installer for you.
Run the Enthought Canopy installer. You do not need to be an administrator to install the package if you do not want to share the installed software with other users.
When installing, just click on Next to keep the defaults. You can find additional information about the installation process at http://docs.enthought.com/canopy/quick-start.html.

That's it! You will have Python 2.7, NumPy, SciPy, and matplotlib installed and ready to run.

Plotting one curve

The initial Hello World! example for plotting software is often a simple curve, and we will keep up with that tradition. It will also give you a rough idea of how matplotlib works.
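Whichever route you take, it is worth confirming that the installation actually works before moving on. The following minimal sanity-check script is an illustrative addition rather than part of the recipe (the file name check_install.py is arbitrary); it simply imports the packages and prints their versions, so any ImportError points to a missing package:

# check_install.py -- quick sanity check for the scientific Python stack
import matplotlib
import numpy
import scipy

# Each package exposes its version string as __version__.
print("matplotlib:", matplotlib.__version__)
print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)

Run it with python check_install.py from a command terminal.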
Getting ready

You need to have Python (either v2.7 or v3) and matplotlib installed. You also need a text editor (any text editor will do) and a command terminal to type and run commands.

How to do it...

Let's get started with one of the most common and basic graphs that any plotting software offers: curves. In a text file saved as plot.py, we have the following code:

import matplotlib.pyplot as plt
X = range(100)
Y = [value ** 2 for value in X]
plt.plot(X, Y)
plt.show()

Assuming that you have installed Python and matplotlib, you can now use Python to interpret this script. If you are not familiar with Python, this is indeed a Python script we have here! In a command terminal, run the script in the directory where you saved plot.py with the following command:

python plot.py

Doing so will open a window showing the curve Y = X ** 2 with X in the [0, 99] range. As you might have noticed, the window has several icons, some of which are as follows:

One icon opens a dialog, allowing you to save the graph as a picture file. You can save it as a bitmap picture or a vector picture.
Another icon allows you to translate and scale the graphics. Click on it and then move the mouse over the graph. Holding the left mouse button translates the graph according to the mouse movements, while holding the right mouse button modifies the scale of the graphics.
A third icon restores the graph to its initial state, canceling any translation or scaling you might have applied before.

How it works...

Assuming that you are not very familiar with Python yet, let's analyze the script demonstrated earlier. The first line tells Python that we are using the matplotlib.pyplot module. To save on a bit of typing, we make the name plt equivalent to matplotlib.pyplot; this is a very common practice that you will see in matplotlib code. The second line creates a list named X, with all the integer values from 0 to 99. The range function is used to generate consecutive numbers. You can run the interactive Python interpreter and type the command range(100) if you use Python 2, or the command list(range(100)) if you use Python 3; this will display the list of all the integer values from 0 to 99. In both versions, sum(range(100)) will compute the sum of the integers from 0 to 99. The third line creates a list named Y, with all the values from the list X squared. Building a new list by applying a function to each member of another list is a Python idiom named list comprehension. The list Y will contain the squared values of the list X in the same order, so Y will contain 0, 1, 4, 9, 16, 25, and so on. The fourth line plots a curve, where the x coordinates of the curve's points are given in the list X and the y coordinates are given in the list Y. Note that the names of the lists can be anything you like. The last line shows the result, which you will see in the window while running the script.

There's more...

So, what have we learned so far? Unlike plotting packages such as gnuplot, matplotlib is not a command interpreter specialized for the purpose of plotting. Unlike MATLAB, matplotlib is not an integrated environment for plotting either. matplotlib is a Python module for plotting. Figures are described with Python scripts, relying on a (fairly large) set of functions provided by matplotlib. Thus, the philosophy behind matplotlib is to take advantage of an existing language, Python.
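The save icon mentioned above is not the only way to get a picture out of matplotlib: the same result can be obtained programmatically with pyplot's savefig() function. The following small variation of plot.py is an illustrative extension, not part of the original script; the output file name curve.png is an arbitrary choice:

import matplotlib.pyplot as plt

X = range(100)
Y = [value ** 2 for value in X]

plt.plot(X, Y)
# savefig() infers the output format from the file extension (.png, .pdf, .svg, ...)
plt.savefig('curve.png', dpi=150)

Replacing plt.show() with plt.savefig() makes the script suitable for unattended use, for example when generating figures on a server without a display.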
The rationale is that Python is a complete, well-designed, general-purpose programming language. Combining matplotlib with other packages does not involve tricks and hacks, just Python code, and there are numerous Python packages for pretty much any task. For instance, to plot data stored in a database, you would use a database package to read the data and feed it to matplotlib. To generate a large batch of statistical graphics, you would use a scientific computing package such as SciPy along with Python's I/O modules. Thus, unlike many plotting packages, matplotlib is very orthogonal: it does plotting and only plotting. If you want to read inputs from a file or do some simple intermediary calculations, you will have to use Python modules and some glue code to make it happen. Fortunately, Python is a very popular language, easy to master, and with a large user base. Little by little, we will demonstrate the power of this approach.
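To make that last point concrete, here is a minimal sketch of such glue code. The file name my_data.txt and its format (one numeric value per line) are assumptions made for the example, not something prescribed by the recipe:

import matplotlib.pyplot as plt

# Assumed input format: one numeric value per line in my_data.txt.
with open('my_data.txt') as f:
    Y = [float(line) for line in f if line.strip()]

# With a single list, matplotlib uses the indices 0, 1, 2, ... as the X values.
plt.plot(Y)
plt.show()

The plotting part is unchanged; everything specific to the data source lives in ordinary Python code around it.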

Going Viral

Packt
19 Mar 2014
14 min read
Social media mining using sentiment analysis

People are highly opinionated. We hold opinions about everything from international politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions through written language. Practically speaking, this field allows us to measure, and thus harness, opinions.

Up until the last 40 years or so, opinion mining hardly existed. This is because opinions were elicited in surveys rather than in text documents, computers were not powerful enough to store or sort a large amount of information, and algorithms did not exist to extract opinion information from written language. The explosion of sentiment-laden content on the Internet, the increase in computing power, and advances in data mining techniques have turned social data mining into a thriving academic field and a crucial commercial domain.

Professor Richard Hamming famously pushed researchers to ask themselves, "What are the important problems in my field?" Researchers in the broad area of natural language processing (NLP) cannot help but list sentiment analysis as one such pressing problem. Sentiment analysis is not only a prominent and challenging research area, but also a powerful tool currently being employed in almost every business and social domain. This prominence is due, at least in part, to the centrality of opinions as both measures and causes of human behavior.

This article is an introduction to social data mining. For us, social data refers to data generated by people or by their interactions. More specifically, social data for the purposes of this article will usually refer to data in text form produced by people for other people's consumption. Data mining is a set of tools and techniques used to describe and make inferences about data. We approach social data mining with a potent mix of applied statistics and social science theory. As for tools, we utilize, and provide an introduction to, the statistical programming language R. The article covers important topics and the latest developments in the field of social data mining, with many references and resources for continued learning. We hope it will be of interest to an audience with a wide array of substantive interests from fields such as marketing, sociology, politics, and sales. We have striven to make it accessible enough to be useful for beginners while simultaneously directing researchers and practitioners already active in the field towards resources for further learning. Code and additional material will be available online at http://socialmediaminingr.com as well as on the authors' GitHub account, https://github.com/SocialMediaMininginR.

The state of communication

This section describes the fundamentally altered modes of social communication fostered by the Internet. The interconnected, social, rapid, and public exchange of information detailed here underlies the power of social data mining. Now more than ever before, information can go viral, a phrase first cited as early as 2004. By changing the manner in which we connect with each other, the Internet changed the way we interact: communication is now bi-directional and many-to-many. Networks are now self-organized, and information travels along every dimension, varying systematically depending on direction and purpose.
This new economy with ideas as currency has impacted nearly every person. More than ever, people rely on context and information before making decisions or purchases and, by extension, more and more on peer effects and interactions rather than centralized sources. The traditional modes of communication are represented mainly by radio and television, which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years for television to reach an audience of 50 million, but the Internet did it in just four years (Gallup).

Not only has the nature of communication changed, but also its scale. There were 50 pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The WWW is the largest, most widely used source of information, with nearly 2.4 billion users (Wikipedia). 70 percent of these users use it daily to both contribute and receive information, in order to learn about the world around them and to influence that same world, constantly organizing information around pieces that reflect their desires.

In today's connected world, many of us are members of at least one, if not more, social networking services. The influence and reach of social media enterprises such as Facebook is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly active users of its mobile products (Facebook key facts). Twitter has more than 200 million active users (Twitter blog). As communication tools, these services offer a global reach to huge multinational audiences, delivering messages almost instantaneously.

Connectedness and social media have altered the way we organize our communications. Today we have dramatically more friends and more friends of friends, and we can communicate with these higher-order connections faster and more frequently than ever before. It is difficult to ignore the abundance of mimicry (that is, copying or reposting) and repeated social interactions in our social networks. This mimicry is a result of virtual social interactions organized into reaffirming or oppositional feedback loops. We self-organize these interactions via (often preferential) attachments that form organic, shifting networks. There is little question of whether or not social media has already impacted your life and changed the manner in which you communicate. Our beliefs and perceptions of reality, as well as the choices we make, are largely conditioned by our neighbors in virtual and physical networks. When we need to make a decision, we seek out the opinions of others, and more and more of those opinions are provided by virtual networks.

Information bounce is the resonance of content within and between social networks, often powered by social media such as customer reviews, forums, blogs, microblogs, and other user-generated content. This notion represents a significant change from how information has traveled throughout history; individuals no longer need to rely exclusively on close ties within their physical social networks. Social media has made our close ties closer and the number of our weak ties exponentially greater. Beyond our denser and larger social networks is a general eagerness to incorporate information from other networks with similar interests and desires.
The increased access to networks of various types has, in fact, conditioned us to seek even more information; after all, ignoring available information would constitute irrational behavior. These fundamental changes to the nature and scope of communication are crucial due to the importance of ideas in today's economic and social interactions. Today, and in the future, ideas will be of central importance, especially those ideas that bounce and go viral. Ideas that go viral are those that resonate and spur on social movements, which may have political and social purposes or reshape businesses and allow companies such as Nike and Apple to produce outsized returns on capital. This article introduces readers to the tools necessary to measure ideas and opinions derived from social data at scale. Along the way, we'll describe strategies for dealing with Big Data.

What is Big Data?

People create 2.5 quintillion bytes (2.5 × 10^18), or nearly 2.3 million terabytes, of data every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of the data-sets themselves. Both factors create challenges and opportunities for data scientists. This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals, to name a few. This data is not only big but is growing at an increasing rate. The data used in this article, namely Twitter data, is no exception. Twitter was launched on March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days.

The size and scope of Big Data help us overcome some of the hurdles caused by its low density. For instance, even though each unique piece of social data may have little applicability to our particular task, these small bits of information quickly become useful as we aggregate them across thousands or millions of people. Like the proverbial bundle of sticks, none of which could support inferences alone, these small bits of information, when tied together, can be a powerful tool for understanding the opinions of the online populace.

The sheer scope of Big Data has other benefits as well. The size and coverage of many social data-sets create overlaps in time, space, and topic. This allows analysts to cross-reference socially generated data-sets against one another or against small-scale data-sets designed to examine niche questions. This type of cross-coverage can generate consilience (Osborne), the principle that evidence from independent, unrelated sources can converge to form strong conclusions. That is, when multiple sources of evidence are in agreement, the conclusion can be very strong even when none of the individual sources of evidence is very strong on its own.

A crucial characteristic of socially generated data is that it is opinionated, and this point, which underpins the usefulness of big social data for sentiment analysis, is novel. For the first time in history, interested parties can put their finger on the pulse of the masses, because the masses are frequently opining about what is important to them.
They opine with and for each other and anyone else who cares to listen. In sum, opinionated data is the great enabler of opinion-based research.

Human sensors and honest signals

Opinion data generated by humans in real time presents tremendous opportunities. However, big social data will only prove useful to the extent that it is valid. This section tackles head-on the extent to which socially generated data can be used to accurately measure individual and/or group-level opinions.

One potential indicator of the validity of socially generated data is the extent of its consumption for factual content. Online media has expanded significantly over the past 20 years. For example, online news is displacing print and broadcast. More and more Americans distrust mainstream media, with a majority (60 percent) now having little to no faith in traditional media to report news fully, accurately, and fairly. Instead, people are increasingly turning to the Internet to research, connect, and share opinions and views. This was especially evident during the 2012 election, where social media played a large role in information transmission (Gallup).

Politics is not the only realm affected by social Big Data. People are increasingly relying on the opinions of others to inform their consumption preferences. Consider the following figures: 91 percent of people report having gone into a store because of an online experience; 89 percent of consumers conduct research using search engines; 62 percent of consumers end up making a purchase in a store after researching it online; 72 percent of consumers trust online reviews as much as personal recommendations; and 78 percent of consumers say that posts made by companies on social media influence their purchases. If individuals are willing to use social data as a touchstone for decision making in their own lives, perhaps this is prima facie evidence of its validity.

Other Big Data thinkers point out that much of what people do online constitutes their genuine actions and intentions. The breadcrumbs left when people execute online transactions, send messages, or spend time on web pages constitute what Alex Pentland of MIT calls honest signals. These signals are honest insofar as they are actions taken by people with no subtext or secondary intent. Specifically, he writes the following:

"Those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy."

To paraphrase, Pentland finds some web-based data to be a valid measure of people's attitudes when that data is without subtext or secondary intent; this is what he calls data exhaust. In other words, actions are harder to fake than words. He cautions against taking people's online statements at face value, because they may be nothing more than cheap talk. Anthony Stefanidis of George Mason University also advocates for the use of social data mining. He speaks favorably about its reliability, noting that its size inherently creates a preponderance of evidence. This article takes neither the strong position of Pentland and honest signals nor that of Stefanidis and the preponderance of evidence. Instead, we advocate a blended approach of curiosity and creativity as well as some healthy skepticism.
Generally, we follow the attitude of Charles Handy (The Empty Raincoat, 1994), who described the steps to measurement during the Vietnam War as follows:

"The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide."

The social web may not consist of perfect data, but its value is tremendous if it is used properly and analyzed with care. Forty years ago, a social science study containing millions of observations was unheard of due to the time and cost associated with collecting that much information. The most successful efforts in social data mining will be by those who "measure (all) what is measurable, and make measurable (all) what is not so" (Rasinkinski, 2008). Ultimately, we feel that the size and scope of big social data, the fact that some of it consists of honest signals, and the fact that some of it can be validated with other data lend it validity. In another sense, the "proof is in the pudding": businesses, governments, and organizations are already using social media mining to good effect; thus, the data being mined must be at least moderately useful.

Another defining characteristic of big social data is the speed with which it is generated, especially when compared with traditional media channels. Social media platforms such as Twitter, and the web more generally, spread news in near-instant bursts. From the perspective of social media mining, this speed may be a blessing or a curse. On the one hand, analysts can keep up with very fast-moving trends and patterns, if necessary. On the other hand, fast-moving information is subject to mistakes or even abuse.

Following the tragic bombings in Boston, Massachusetts (April 15, 2013), Twitter was instrumental in citizen reporting and provided insight into the events as they unfolded. Law enforcement asked for and received help from the general public, facilitated by social media. For example, Reddit saw an overall peak in traffic when reports came in that the second suspect had been captured. Google Analytics reports that there were about 272,000 users on the site, with 85,000 in the news update thread alone. This was the only time in Reddit's history, other than the Obama AMA, that a thread beat the front page in the ratings (Reddit). The downside of this fast-paced, highly visible public search is that the masses can be incorrect. This is exactly what happened: users began to look at the details and photos posted and pieced together their own investigation, and as it turned out, the information was incorrect. This was a charged event and created an atmosphere that ultimately undermined the good intentions of many. Other efforts, such as those by governments (Wikipedia) and companies (Forbes) to post messages favorable to their position, are less well intentioned. Overall, we should be skeptical of tactical (that is, very real-time) uses of social media.

Summary

In this article, we introduced readers to the concepts of social media, sentiment analysis, and Big Data. We described how social media has changed the nature of interpersonal communication and the opportunities it presents for analysts of social data.
This article also made a case for the use of quantitative approaches to measure all that is measurable, and to make measurable that which is not.

Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [Article], Managing your social channels [Article], Social Media for Wordpress: VIP Memberships [Article]

Integrating with other Frameworks

Packt
14 Mar 2014
6 min read
Using NodeJS as a data provider

JavaScript has become a formidable language in its own right. Google's work on the V8 JavaScript engine has created something very performant and has enabled others to develop Node.js, and with it, the development of JavaScript on the server side. This article will take a look at how we can serve data using NodeJS, specifically using a framework known as express.

Getting ready

We will need to set up a simple project before we can get started:

Download and install NodeJS (http://nodejs.org/download/).
Create a folder nodejs for our project.
Create a file nodejs/package.json and fill it with the following contents:

{
  "name": "highcharts-cookbook-nodejs",
  "description": "An example application for using highcharts with nodejs",
  "version": "0.0.1",
  "private": true,
  "dependencies": {
    "express": "3.4.4"
  }
}

From within the nodejs folder, install our dependencies locally (that is, within the nodejs folder) using npm (the NodeJS package manager):

npm install

If we wanted to install packages globally, we could have instead done the following:

npm install -g

Create a folder nodejs/static, which will later contain our static assets (for example, a web page and our JavaScript).
Create a file nodejs/app.js, which will later contain our express application and data provider.
Create a file nodejs/bower.json to list our JavaScript dependencies for the page:

{
  "name": "highcharts-cookbook-chapter-8",
  "dependencies": {
    "jquery": "^1.9",
    "highcharts": "~3.0"
  }
}

Create a file nodejs/.bowerrc to configure where our JavaScript dependencies will be installed:

{ "directory": "static/js" }

How to do it...

Let's begin:

Create an example file nodejs/static/index.html for viewing our charts:

<html>
  <head>
  </head>
  <body>
    <div id='example'></div>
    <script src='./js/jquery/jquery.js'></script>
    <script src='./js/highcharts/highcharts.js'></script>
    <script type='text/javascript'>
      $(document).ready(function() {
        var options = {
          chart: {
            type: 'bar',
            events: {
              load: function () {
                var self = this;
                // Poll the server every second and refresh the first series.
                setInterval(function() {
                  $.getJSON('/ajax/series', function(data) {
                    var series = self.series[0];
                    series.setData(data);
                  });
                }, 1000);
              }
            }
          },
          title: {
            text: 'Using AJAX for polling charts'
          },
          series: [{
            name: 'AJAX data (series)',
            data: []
          }]
        };
        $('#example').highcharts(options);
      });
    </script>
  </body>
</html>

In nodejs/app.js, import the express framework:

var express = require('express');

Create a new express application:

var app = express();

Tell our application where to serve static files from:

app.use(express.static('static'));

Create a method to return data:

app.get('/ajax/series', function(request, response) {
  var count = 10,
      results = [];
  for (var i = 0; i < count; i++) {
    results.push({
      "y": Math.random() * 100
    });
  }
  response.json(results);
});

Listen on port 8888:

app.listen(8888);

Start our application:

node app.js

View the output at http://localhost:8888/index.html.

How it works...

Most of what we've done in our application is fairly simple: create an express instance, create request methods, and listen on a certain port. With express, we could also process different HTTP verbs like POST or DELETE. We can handle these methods by creating a new request method. In our example, we handled GET requests (that is, app.get), but in general, we can use app.VERB (where VERB is an HTTP verb).
In fact, we can also be more flexible about what our URLs look like: we can use JavaScript regular expressions as well. More information on the express API can be found at http://expressjs.com/api.html.

Using Django as a data provider

Django is likely one of the more robust Python frameworks, and certainly one of the oldest. As such, Django can be used to tackle a variety of different cases, and it has a lot of support and extensions available. This recipe will look at how we can leverage Django to provide data for Highcharts.

Getting ready

Download and install Python 2.7 (http://www.python.org/getit/).
Download and install Django (http://www.djangoproject.com/download/).
Create a new folder for our project, django.
From within the django folder, run the following to create a new project:

django-admin.py startproject example

Create a file django/bower.json to list our JavaScript dependencies:

{
  "name": "highcharts-cookbook-chapter-8",
  "dependencies": {
    "jquery": "^1.9",
    "highcharts": "~3.0"
  }
}

Create a file django/.bowerrc to configure where our JavaScript dependencies will be installed:

{ "directory": "example/static/js" }

Create a folder example/templates for any templates we may have.

How to do it...

To get started, follow the instructions below:

Create a folder example/templates, and include a file index.html as follows:

{% load staticfiles %}
<html>
  <head>
  </head>
  <body>
    <div class='example' id='example'></div>
    <script src='{% static "js/jquery/jquery.js" %}'></script>
    <script src='{% static "js/highcharts/highcharts.js" %}'></script>
    <script type='text/javascript'>
      $(document).ready(function() {
        var options = {
          chart: {
            type: 'bar',
            events: {
              load: function () {
                var self = this;
                setInterval(function() {
                  $.getJSON('/ajax/series', function(data) {
                    var series = self.series[0];
                    series.setData(data);
                  });
                }, 1000);
              }
            }
          },
          title: {
            text: 'Using AJAX for polling charts'
          },
          series: [{
            name: 'AJAX data (series)',
            data: []
          }]
        };
        $('#example').highcharts(options);
      });
    </script>
  </body>
</html>

Edit example/example/settings.py and include the following at the end of the file:

STATIC_URL = '/static/'
TEMPLATE_DIRS = (
    os.path.join(BASE_DIR, 'templates/'),
)
STATICFILES_DIRS = (
    os.path.join(BASE_DIR, 'static/'),
)

Create a file example/example/views.py and create a handler to show our page:

from django.shortcuts import render_to_response

def index(request):
    return render_to_response('index.html')

Edit example/example/views.py and create a handler to serve our data:

import json
from random import randint

from django.http import HttpResponse
from django.shortcuts import render_to_response

def index(request):
    return render_to_response('index.html')

def series(request):
    results = []
    for i in xrange(1, 11):
        results.append({
            'y': randint(0, 100)
        })
    json_results = json.dumps(results)
    return HttpResponse(json_results, mimetype='application/json')

Edit example/example/urls.py to register our URL handlers:

from django.conf.urls import patterns, include, url
from django.contrib import admin
admin.autodiscover()

import views

urlpatterns = patterns('',
    # Examples:
    # url(r'^$', 'example.views.home', name='home'),
    # url(r'^blog/', include('blog.urls')),
    url(r'^admin/', include(admin.site.urls)),
    url(r'^/?$', views.index, name='index'),
    url(r'^ajax/series/?$', views.series, name='series'),
)

Run the following command from the django folder to start the server:

python example/manage.py runserver

Observe the page by visiting http://localhost:8000.
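Either backend can be sanity-checked without a browser. The short script below is an illustrative addition, not part of either recipe; it is written for Python 3's standard library (the recipes themselves target Python 2.7), and the port is an assumption you should adjust: 8000 for the Django development server or 8888 for the Express app. It polls the /ajax/series endpoint once a second and prints the y values it receives, mirroring the chart's setInterval loop:

import json
import time
import urllib.request  # Python 3 standard library

URL = 'http://localhost:8000/ajax/series'  # use port 8888 for the NodeJS version

for _ in range(5):  # poll a handful of times
    with urllib.request.urlopen(URL) as response:
        data = json.loads(response.read().decode('utf-8'))
    # Each item is an object with a single 'y' value, as produced by both servers.
    print([round(point['y'], 1) for point in data])
    time.sleep(1)

Seeing ten fresh random values per request confirms that the data provider is behaving the way the chart expects.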

Using SHOW EXPLAIN with running queries

Packt
13 Mar 2014
5 min read
Getting ready

Import the ISFDB database, which is available under Creative Commons licensing.

How to do it...

Open a terminal window, launch the mysql command-line client, and connect to the isfdb database using the following statement:

mysql isfdb

Next, we open another terminal window and launch another instance of the mysql command-line client.

Run the following command in the first window:

ALTER TABLE title_relationships DROP KEY titles;

Next, in the first window, start the following example query:

SELECT titles.title_id AS ID,
  titles.title_title AS Title,
  authors.author_legalname AS Name,
  (SELECT COUNT(DISTINCT title_relationships.review_id)
     FROM title_relationships
     WHERE title_relationships.title_id = titles.title_id) AS reviews
FROM titles, authors, canonical_author
WHERE
  (SELECT COUNT(DISTINCT title_relationships.review_id)
     FROM title_relationships
     WHERE title_relationships.title_id = titles.title_id) >= 10
  AND canonical_author.author_id = authors.author_id
  AND canonical_author.title_id = titles.title_id
  AND titles.title_parent = 0;

Wait for at least a minute and then run the following query to look for the details of the query that we executed in step 4 and the QUERY_ID for that query:

SELECT INFO, TIME, ID, QUERY_ID
  FROM INFORMATION_SCHEMA.PROCESSLIST
  WHERE TIME > 60\G

Run SHOW EXPLAIN in the second window (replace id in the following command line with the numeric ID that we discovered in step 5):

SHOW EXPLAIN FOR id;

Run the following command in the second window to kill the query running in the first window (replace query_id in the following command line with the numeric QUERY_ID that we discovered in step 5):

KILL QUERY ID query_id;

In the first window, reverse the change we made in step 3 using the following command:

ALTER TABLE title_relationships ADD KEY titles (title_id);

How it works...

The SHOW EXPLAIN statement allows us to obtain information about how MariaDB executes a long-running statement. This is very useful for identifying bottlenecks in our database. The query in this article will execute efficiently only if it touches the indexes in our data. So, for demonstration purposes, we first sabotage the title_relationships table by removing the titles index. This causes our query to unnecessarily iterate through hundreds of thousands of rows and generally take far too long to complete.

While our sabotaged query is running, and after waiting for at least a minute, we switch to the other window and look for all queries that have been running for longer than 60 seconds. Our sabotaged query will likely be the only one in the output, and the ID and QUERY_ID are the last two items shown for it.

Next, we use the ID number to execute SHOW EXPLAIN for our query. Incidentally, our query looks up all titles in the database that have 10 or more reviews and displays the title, author, and the number of reviews that the title has. An easy-to-read version of this EXPLAIN is available at https://mariadb.org/ea/8v65g. Looking at rows 4 and 5 of the EXPLAIN output, it's easy to see why our query runs for so long: these two rows are dependent subqueries of the primary query (the first row).
For the first query, we see that 117044 rows will be searched, and then, for the two dependent subqueries, MariaDB searches through 83389 additional rows, twice. Ouch. If we were analyzing a slow query in the real world at this point, we would rewrite the query to avoid such an inefficient subquery, or we would add a KEY to our table to make the subquery efficient. If we're part of a larger development team, we could send the output of SHOW EXPLAIN and the query to the appropriate people to easily and accurately show them what the problem is with the query. In our case, we know exactly what to do: add back the KEY that we removed earlier. For fun, after adding back the KEY, we could rerun the query and the SHOW EXPLAIN command to see the difference that having the KEY in place makes. We'll have to be quick though, because with the KEY there, the query will only take a few seconds to complete (depending on the speed of our computer).

There's more...

The output of SHOW EXPLAIN is always accompanied by a warning. The purpose of this warning is to show us the command that is being run. After running SHOW EXPLAIN on a process ID, we simply issue SHOW WARNINGS\G and we will see what SQL statement that process ID is running. This is useful for very long-running commands that take a long time to execute after they start and then return at a time when we might no longer remember the command we started.

In the examples of this article, we're using "\G" as the statement terminator instead of the more common ";" so that the data fits the page better. We can use either one.

See also

The full documentation of the KILL QUERY ID command can be found at https://mariadb.com/kb/en/data-manipulation-kill-connectionquery/.
The full documentation of the SHOW EXPLAIN command can be found at https://mariadb.com/kb/en/show-explain/.

Summary

In this article, we saw the functionality of the SHOW EXPLAIN feature after altering the database with various queries. Further information regarding the SHOW EXPLAIN command can be found in the official documentation linked in the preceding section.

Resources for Article: Further resources on this subject: Installing MariaDB on Windows and Mac OS X [Article], A Look Inside a MySQL Daemon Plugin [Article], Visual MySQL Database Design in MySQL Workbench [Article]
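As a practical companion to this recipe, the monitoring steps can also be scripted. The following Python sketch is an illustrative addition, not from the recipe itself; it assumes the third-party PyMySQL package and a local MariaDB server whose credentials you will need to adjust. It finds statements that have been running for more than 60 seconds and prints the SHOW EXPLAIN output for each:

import pymysql

# Connect to the server (adjust host/user/password for your setup).
conn = pymysql.connect(host='localhost', user='root', password='', database='isfdb')

with conn.cursor() as cursor:
    # Find actively running queries that have exceeded 60 seconds.
    cursor.execute(
        "SELECT ID, QUERY_ID, TIME, INFO "
        "FROM INFORMATION_SCHEMA.PROCESSLIST "
        "WHERE COMMAND = 'Query' AND TIME > 60"
    )
    for process_id, query_id, seconds, info in cursor.fetchall():
        print(f"Process {process_id} (query {query_id}) running for {seconds}s")
        # Ask MariaDB how it is executing the long-running statement.
        cursor.execute(f"SHOW EXPLAIN FOR {process_id}")
        for row in cursor.fetchall():
            print(row)

conn.close()

A loop like this could be run periodically to flag slow queries for exactly the kind of analysis described above.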

Self-service reporting

Packt
11 Mar 2014
7 min read
Microsoft SQL Server 2012 Power View – Self-service Reporting

Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There will be some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use as well as deploying those reporting tools. However, once that is done, IT will be freed from creating reports so that they can work on other tasks. Instead, the people who know the data best, the business users, will be able to build the reports.

Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and the look of the report (the business requirements). The IT person may not know the data that well, so they will have to be educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting: the IT person may take away an incorrect assumption of what data the report should contain or how it should look. Then, the IT person goes back and creates the report. A week or so goes by and he shows the user the report. Then, they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took way longer than it should have.

This is where a self-service reporting tool such as Power View comes in. It is so intuitive and easy to use that most business users can start developing reports with it with little or no training. The interface is so visually appealing that it makes report writing fun. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than ever before.

In this article, we will cover the major features and functions of Power View, including the setup, various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Semantic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports.

Getting started

Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been seamlessly integrated and built directly into Excel 2013 and made available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface.
The following screenshot gives an example of the type of report you can build with Power View, which includes various types of visualizations:

Sales Dashboard

The following screenshot is another example of a Power View report that makes heavy use of slicers along with a bar chart and tables:

Promotion Dashboard

We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013.

PowerPivot

It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together that you can then analyze and aggregate all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View.

PowerPivot is implemented in an application window separate from Excel that gives you the ability to do things such as insert and delete columns, format text, hide columns from client tools, change column names, and add images. Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint. A PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data. Publishing will allow you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps:

Open the Excel file that contains the PowerPivot workbook.
Select the File tab on the ribbon.
If using Excel 2013, click on Save As, then click on Browse, and enter the SharePoint location of the PowerPivot Gallery. If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse.
Click on Save, and the file will then be uploaded to SharePoint and immediately be made available to others.

Saving files to the PowerPivot Gallery

A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file.

Business Intelligence Semantic Model

The Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that exposes one model for all end-user experiences. It is a hybrid model that offers two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formerly called VertiPaq), both of which are hosted in SQL Server Analysis Services (SSAS). The tabular data model provides its architecture and optimization in the same format as the data storage method used by PowerPivot, which uses an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook.
Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode to make it available for others to use. This provides a great way to create a self-service BI solution, then grow it into a department solution, and then into an enterprise solution, as shown in the following progression:

Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way.
Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (in which case the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time.
Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department. Security is added, and then the model is deployed to SSAS so the entire company can use it.

Summary

In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We also talked about PowerPivot and how it relates to Power View.

Resources for Article: Further resources on this subject: Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article], Microsoft SQL Server 2008 - Installation Made Easy [Article], Best Practices for Microsoft SQL Server 2008 R2 Administration [Article]

MongoDB data modeling

Packt
20 Feb 2014
5 min read
(For more resources related to this topic, see here.) MongoDB data modeling is an advanced topic. We will instead focus on a few key points regarding data modeling so that you will understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models. The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents while normalized models represent the same records with references between documents. This modeling decision will impact the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB. With normalized models, Pentaho Data Integration is used to resolve the reference joins. The following table provides a list of objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database.
MongoDB objects: Sample MongoDB object names
Database: pentaho
Collection: sessions, events, and sessions_events
Document: The sessions document fields (key-value pairs) are as follows: id: 512d200e223e7f294d13607, ip_address: 235.149.145.57, id_session: DF636FB593684905B335982BEA3D967B, browser: Chrome, date_session: 2013-02-20T15:47:49Z, duration_session: 18.24, referring_url: www.xyz.com
The events document fields (key-value pairs) are as follows: _id: 512d1ff2223e7f294d13c0c6, id_session: DF636FB593684905B335982BEA3D967B, event: Signup Free Offer
The sessions_events document fields (key-value pairs) are as follows: _id: 512d200e223e7f294d13607, ip_address: 235.149.145.57, id_session: DF636FB593684905B335982BEA3D967B, browser: Chrome, date_session: 2013-02-20T15:47:49Z, duration_session: 18.24, referring_url: www.xyz.com, event_data: [array] event: Visited Site, event: Signup Free Offer
This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection. Normalized models Because multiple clickstream events can occur within a single web session, we know that sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference. The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure: Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red. This normalized model would be an efficient data model if we expect the number of events per session to be very large for a couple of reasons. The first reason is that MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents.
The second reason is that query performance can be negatively impacted by large data arrays that contain thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small. Denormalized models The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside a single session document. Embedding is accomplished by declaring a field to hold either an array of values or embedded documents known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection as shown in the following figure: As you can see, we have the choice to keep the session and event data in separate collections, or alternatively, store both datasets inside a single collection. One important rule to keep in mind when you consider the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/. To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:
The MongoDB query language does not support joins between collections
The maximum document size is 16 megabytes
Very large data arrays can negatively impact query performance
In our sample database, the number of clickstream events per session is expected to be small—within a modest range of only one to 20 per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, to handle both normalized and denormalized data models. Summary This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho. Resources for Article: Further resources on this subject: Installing Pentaho Data Integration with MySQL [article] Integrating Kettle and the Pentaho Suite [article] Getting Started with Pentaho Data Integration [article]
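To make the normalized versus denormalized trade-off more concrete, here is a minimal Python sketch using the pymongo driver. It mirrors the pentaho collections described above; the connection string and the inserted values are illustrative assumptions, not part of the original example files.

# Minimal pymongo sketch contrasting the normalized and denormalized
# clickstream models described above; connection details are assumed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["pentaho"]

session = {
    "id_session": "DF636FB593684905B335982BEA3D967B",
    "ip_address": "235.149.145.57",
    "browser": "Chrome",
    "duration_session": 18.24,
}

# Normalized model: events live in their own collection and share id_session.
db.sessions.insert_one(dict(session))
db.events.insert_many([
    {"id_session": session["id_session"], "event": "Visited Site"},
    {"id_session": session["id_session"], "event": "Signup Free Offer"},
])

# The join has to happen on the client side (two round trips), because the
# MongoDB query language cannot join collections.
s = db.sessions.find_one({"id_session": session["id_session"]})
events = list(db.events.find({"id_session": s["id_session"]}))

# Denormalized model: events are embedded as subdocuments in an event_data
# array, so one query on the array contents returns sessions and events together.
db.sessions_events.insert_one(dict(session, event_data=[
    {"event": "Visited Site"},
    {"event": "Signup Free Offer"},
]))
signup_sessions = list(db.sessions_events.find({"event_data.event": "Signup Free Offer"}))

The last query works because MongoDB applies the dot-notation filter to each subdocument in the event_data array, which is exactly the embedded-array querying advantage described above.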

Understanding Process Variation with Control Charts

Packt
19 Feb 2014
10 min read
(For more resources related to this topic, see here.) Xbar-R charts and applying stages to a control chart As with all control charts, the Xbar-R charts are used to monitor process stability. Apart from generating the basic control chart, we will look at how we can control the output with a few options within the dialog boxes. Xbar-R stands for means and ranges; we use the means chart to estimate the population mean of a process and the range chart to observe how the population variation changes. For more information on control charts, see Understanding Statistical Process Control by Donald J. Wheeler and David S. Chambers. As an example, we will study the fill volumes of syringes. Five syringes are sampled from the process at hourly intervals; these are used to represent the mean and variation of that process over time. We will plot the means and ranges of the fill volumes across 50 subgroups. The data also includes a process change. This will be displayed on the chart by dividing the data into two stages. The charts for subgrouped data can use a worksheet set up in two formats. Here the data is recorded such that each row represents a subgroup. The columns are the sample points. The Xbar-S chart will use data in the other format where all the results are recorded in one column. The following screenshot shows the data with subgroups across the rows on the left, and the same data with subgroups stacked on the right: How to do it… The following steps will create an Xbar-R chart staged by the Adjustment column with all eight of the tests for special causes: Use the Open Worksheet command from the File menu to open the Volume.mtw worksheet. Navigate to Stat | Control Charts | Variables charts for subgroups. Then click on Xbar-R…. Change the drop down at the top of the dialog to Observations for a subgroup are in one row of columns:. Enter the columns Measure1 to Measure5 into the dialog box by highlighting all the measure columns in the left selection box and clicking on Select. Click on Xbar-R Options and navigate to the tab for Tests. Select all the tests for special causes. Select the Stages tab. Enter Adjustment in the Define Stages section. Click on OK in each dialog box. How it works… The R or range chart displays the variation over time in the data by plotting the range of measurements in a subgroup. The Xbar chart plots the means of the subgroups. The choice of layout of the worksheet is picked from the drop-down box in the main dialog box. The All observations for a chart are in one column: field is used for data stacked into columns. Means of subgroups and ranges are found from subgroups indicated in the worksheet. The Observations for a subgroup are in one row of columns: field will find means and ranges from the worksheet rows. The Xbar-S chart example shows us how to use the dialog box when the data is in a single column. The dialog boxes for both Xbar-R and Xbar-S work the same way. Tests for special causes are used to check the data for nonrandom events. The Xbar-R chart options give us control over the tests that will be used. The values of the tests can be changed from these options as well. The options from the Tools menu of Minitab can be used to set the default values and tests to use in any control chart. By using the option under Stages, we are able to recalculate the means and standard deviations for the pre and post change groups in the worksheet. Stages can be used to recalculate the control chart parameters on each change in a column or on specific values. 
A date column can be used to define stages by entering the date at which a stage should be started. There's more… Xbar-R charts are also available under the Assistant menu. The default display option for a staged control chart is to show only the mean and control limits for the final stage. Should we want to see the values for all stages, we would use the Xbar-R Options and Display tab. To place these values on the chart for all stages, check the Display control limit / center line labels for all stages box. See Xbar-S charts for a description of all the tabs within the Control Charts options. Using an Xbar-S chart Xbar-S charts are similar in use to Xbar-R. The main difference is that the variation chart uses standard deviation from the subgroups instead of the range. The choice between using Xbar-R or Xbar-S is usually made based on the number of samples in each subgroup. With smaller subgroups, the standard deviation estimated from these can be inflated. Typically, with less than nine results per subgroup, we see them inflating the standard deviation, which increases the width of the control limits on the charts. Automotive Industry Action Group (AIAG) suggests using the Xbar-R, which is greater than or equal to nine times the Xbar-S. Now, we will apply an Xbar-S chart to a slightly different scenario. Japan sits above several active fault lines. Because of this, minor earthquakes are felt in the region quite regularly. There may be several minor seismic events on any given day. For this example, we are going to use seismic data from the Advanced National Seismic System. All seismic events from January 1, 2013 to July 12, 2013 from the region that covers latitudes 31.128 to 45.275 and longitudes 129.799 to 145.269 are included in this dataset. This corresponds to an area that roughly encompasses Japan. The dataset is provided for us already but we could gather more up-to-date results from the following link: http://earthquake.usgs.gov/monitoring/anss/ To search the catalog yourself, use the following link: http://www.ncedc.org/anss/catalog-search.html We will look at seismic events by week that create Xbar-S charts of magnitude and depth. In the initial steps, we will use the date to generate a column that identifies the week of the year. This column is then used as the subgroup identifier. How to do it… The following steps will create an Xbar-S chart for the depth and magnitude of earthquakes. This will display the mean, and standard deviation of the events by week: Use the Open Worksheet command from the File menu to open the earthquake.mtw file. Go to the Data menu, click on Extract from Date/Time, and then click on To Text. Enter Date in the Extract from Date/time column: section. Type Week in the Store text column in: section. Check the selection for Week and click on OK to create the new column. Navigate to Stat | Control Charts | Variable charts for Subgroups and click on Xbar-S. Enter Depth and Mag into the dialog box as shown in the following screenshot and Week into the Subgroup sizes: field. Click on the Scale button, and select the option for Stamp. Enter Date in the Stamp columns section. Click on OK. Click on Xbar-S Options and then navigate to the Tests tab. Select all tests for special causes. Click on OK in each dialog box. How it works… Steps 1 to 4 build the Week column that we use as the subgroup. The extracts from the date/time options are fantastic for quickly generating columns based on dates. 
Days of the week, week of the year, month, or even minutes or seconds can all be separated from the date. Multiple columns can be entered into the control chart dialog box just as we have done here. Each column is then used to create a new Xbar-S chart. This lets us quickly create charts for several dimensions that are recorded at the same time. The use of the week column as the subgroup size will generate the control chart with mean depth and magnitude for each week. The scale options within control charts are used to change the display on the chart scales. By default, the x axis displays the subgroup number; changing this to display the date can be more informative when identifying the results that are out of control. Options to add axes, tick marks, gridlines, and additional reference lines are also available. We can also edit the axis of the chart after we have generated it by double-clicking on the x axis. The Xbar-S options are similar for all control charts; the tabs within Options give us control over a number of items for the chart. The following list shows us the tabs and the options found within each tab: Parameters: This sets the historical means and standard deviations; if using multiple columns, enter the first column mean, leave a space, and enter the second column mean Estimate: This allows us to specify subgroups to include or exclude in the calculations and change the method of estimating sigma Limits: This can be used to change where sigma limits are displayed or place on the control limits Tests: This allows us to choose the tests for special causes of the data and change the default values. The Using I-MR chart recipe details the options for the Tests tab. Stages: This allows the chart to be subdivided and will recalculate center lines and control limits for each stage Box Cox: This can be used to transform the data, if necessary Display: This has settings to choose how much of the chart to display. We can limit the chart to show only the last section of the data or split a larger dataset into separate segments. There is also an option to display the control limits and center lines for all stages of a chart in this option. Storage: This can be used to store parameters of the chart, such as means, standard deviations, plotted values, and test results There's more… The control limits for the graphs that are produced vary as the subgroup sizes are not constant; this is because the number of earthquakes varies each week. In most practical applications, we may expect to collect the same number of samples or items in a subgroup and hence have flat control limits. If we wanted to see the number of earthquakes in each week, we could use Tally from inside the Tables menu. This will display a result of counts per week. We could also store this tally back into the worksheet. The result of this tally could be used with a c-chart to display a count of earthquake events per week. If we wanted to import the data directly from the Advanced National Seismic System, then the following steps will obtain the data and prepare the worksheet for us: Follow the link to the ANSS catalog search at http://www.ncedc.org/anss/catalog-search.html. Enter 2013/01/01 in the Start date, time: field. Enter 2013/06/12 in the End date, time: field. Enter 3 in the Min magnitude: field. Enter 31.128 in the Min latitude field and 45.275 in the Max latitude field. Enter 129.799 in the Min longitude field and 145.269 in the Max longitude filed. 
Copy the data from the search results, excluding the headers, and paste it into a Minitab worksheet. Change the names of the columns to C1 Date, C2 Time, C3 Lat, C4 Long, C5 Depth, C6 Mag. The other columns, C7 to C13, can then be deleted. The Date column will have copied itself into Minitab as text; to convert this back to a date, navigate to Data | Change Data Type | Text to Date/Time. Enter Date in both the Change text columns: and Store date/time columns in: sections. In the Format of text columns: section, enter yyyy/mm/dd. Click on OK. To extract the week from the Date column, navigate to Data | Date/Time | Extract to Text. Enter 'Date' in the Extract from date/time column: section. Enter 'Week' in the Store text column in: field. Check the box for Week and click on OK.
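Minitab does all of these calculations for you, but it can help to see the arithmetic behind the Xbar-R center lines and control limits. The following Python sketch is only an illustration under stated assumptions: it uses the conventional control chart constants for subgroups of five (A2 of roughly 0.577, D3 of 0, D4 of roughly 2.114) and randomly generated data standing in for the Volume.mtw measurements, so check the constants against your own tables before relying on them.

# Rough sketch of the Xbar-R chart arithmetic for subgroups of size 5.
# The constants below are the standard tabulated values for n = 5.
import numpy as np

A2, D3, D4 = 0.577, 0.0, 2.114

rng = np.random.default_rng(1)
subgroups = rng.normal(loc=10.0, scale=0.2, size=(50, 5))  # stand-in for the fill volumes

xbar = subgroups.mean(axis=1)                        # points on the Xbar chart
r = subgroups.max(axis=1) - subgroups.min(axis=1)    # points on the R chart

xbar_bar = xbar.mean()   # center line of the Xbar chart
r_bar = r.mean()         # center line of the R chart

print("Xbar chart: CL=%.3f UCL=%.3f LCL=%.3f"
      % (xbar_bar, xbar_bar + A2 * r_bar, xbar_bar - A2 * r_bar))
print("R chart:    CL=%.3f UCL=%.3f LCL=%.3f" % (r_bar, D4 * r_bar, D3 * r_bar))

Staging, as described above, simply repeats this calculation separately for the subgroups before and after the process change.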

Query Performance Tuning

Packt
18 Feb 2014
5 min read
(For more resources related to this topic, see here.) Understanding how Analysis Services processes queries We need to understand what happens inside Analysis Services when a query is run. The two major parts of the Analysis Services engine are: The Formula Engine: This part processes MDX queries, works out what data is needed to answer them, requests that data from the Storage Engine, and then performs all calculations needed for the query. The Storage Engine: This part handles all the reading and writing of data, for example, during cube processing and fetching all the data that the Formula Engine requests when a query is run. When you run an MDX query, then, that query goes first to the Formula Engine, then to the Storage Engine, and then back to the Formula Engine before the results are returned back to you. Performance tuning methodology When tuning performance there are certain steps you should follow to allow you to measure the effect of any changes you make to your cube, its calculations or the query you're running: Always test your queries in an environment that is identical to your production environment, wherever possible. Otherwise, ensure that the size of the cube and the server hardware you're running on is at least comparable, and running the same build of Analysis Services. Make sure that no-one else has access to the server you're running your tests on. You won't get reliable results if someone else starts running queries at the same time as you. Make sure that the queries you're testing with are equivalent to the ones that your users want to have tuned. As we'll see, you can use Profiler to capture the exact queries your users are running against the cube. Whenever you test a query, run it twice; first on a cold cache, and then on a warm cache. Make sure you keep a note of the time each query takes to run and what you changed on the cube or in the query for that run. Clearing the cache is a very important step—queries that run for a long time on a cold cache may be instant on a warm cache. When you run a query against Analysis Services, some or all of the results of that query (and possibly other data in the cube, not required for the query) will be held in cache so that the next time a query is run that requests the same data it can be answered from cache much more quickly. To clear the cache of an Analysis Services database, you need to execute a ClearCache XMLA command. To do this in SQL Management Studio, open up a new XMLA query window and enter the following: <Batch > <ClearCache> <Object> <DatabaseID>Adventure Works DW 2012</DatabaseID> </Object> </ClearCache> </Batch> Remember that the ID of a database may not be the same as its name—you can check this by right-clicking on a database in the SQL Management Studio Object Explorer and selecting Properties. Alternatives to this method also exist: the MDX Studio tool allows you to clear the cache with a menu option, and the Analysis Services Stored Procedure Project (http://tinyurl.com/asstoredproc) contains code that allows you to clear the Analysis Services cache and the Windows File System cache directly from MDX. Clearing the Windows File System cache is interesting because it allows you to compare the performance of the cube on a warm and cold file system cache as well as a warm and cold Analysis Services cache. When the Analysis Services cache is cold or can't be used for some reason, a warm filesystem cache can still have a positive impact on query performance. 
After the cache has been cleared, before Analysis Services can answer a query it needs to recreate the calculated members, named sets, and other objects defined in a cube's MDX Script. If you have any reasonably complex named set expressions that need to be evaluated, you'll see some activity in Profiler relating to these sets being built and it's important to be able to distinguish between this and activity that's related to the queries you're actually running. All MDX Script related activity occurs between Execute MDX Script Begin and Execute MDX Script End events; these are fired after the Query Begin event but before the Query Cube Begin event for the query run after the cache has been cleared and there is one pair of Begin/End events for each command on the MDX Script. When looking at a Profiler trace you should either ignore everything between the first Execute MDX Script Begin event and the last Execute MDX Script End event or run a query that returns no data at all to trigger the evaluation of the MDX Script, for example: SELECT {} ON 0 FROM [Adventure Works] Designing for performance Many of the recommendations for designing cubes will improve query performance, and in fact the performance of a query is intimately linked to the design of the cube it's running against. For example, dimension design, especially optimizing attribute relationships, can have significant effect on the performance of all queries—at least as much as any of the optimizations. As a result, we recommend that if you've got a poorly performing query the first thing you should do is review the design of your cube to see if there is anything you could do differently. There may well be some kind of trade-off needed between usability, manageability, time-to-develop, overall 'elegance' of the design and query performance, but since query performance is usually the most important consideration for your users then it will take precedence. To put it bluntly, if the queries your users want to run don't run fast, your users will not want to use the cube at all!
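The cold-versus-warm cache discipline described above is easy to wrap in a small script. The sketch below is only an outline: clear_cache() and run_mdx() are placeholder functions you would wire up to whatever client you use to send the XMLA ClearCache batch and the MDX query; they are assumptions for illustration, not part of any specific library.

# Outline of a cold/warm cache timing harness for MDX queries.
# clear_cache() and run_mdx() are deliberately left as placeholders.
import time

def clear_cache(database_id):
    """Send the XMLA ClearCache batch shown earlier for the given database ID."""
    raise NotImplementedError

def run_mdx(query):
    """Execute an MDX query against the cube and discard the results."""
    raise NotImplementedError

def time_query(query, database_id, runs=3):
    results = []
    for i in range(runs):
        clear_cache(database_id)
        run_mdx("SELECT {} ON 0 FROM [Adventure Works]")  # evaluate the MDX Script only
        start = time.perf_counter()
        run_mdx(query)
        cold = time.perf_counter() - start
        start = time.perf_counter()
        run_mdx(query)                                    # same query again, on a warm cache
        warm = time.perf_counter() - start
        results.append((cold, warm))
        print("run %d: cold %.2fs, warm %.2fs" % (i + 1, cold, warm))
    return results

Keeping the cold and warm timings side by side for every change you make to the cube or the query gives you the baseline-and-iterate record recommended above.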

Processing Tweets with Apache Hive

Packt
18 Feb 2014
6 min read
(For more resources related to this topic, see here.) Extracting hashtags In this part and the following one, we'll see how to extract data efficiently from tweets such as hashtags and emoticons. We need to do it because we want to be able to know what the most discussed topics are, and also get the mood across the tweets. And then, we'll want to join that information to get people's sentiments. We'll start with hashtags; to do so, we need to do the following: Create a new hashtag table. Use a function that will extract the hashtags from the tweet string. Feed the hashtag table with the result of the extraction function. So, I have some bad news and good news: Bad news: Hive provides a lot of built-in user-defined functions, but unfortunately, it does not provide any function based on a regex pattern; we need to use a custom user-defined function to do that. This is not such bad news though, as you will learn how to do it. Good news: Hive provides an extremely efficient way to create a Hive table from an array. We'll then use the lateral view and the explode Hive UDF to do that. The following is the Hive-processing workflow that we are going to apply to our tweets: Hive-processing workflow The preceding diagram describes the workflow to be followed to extract the hashtags. The steps are basically as follows: Receive the tweets. Detect all the hashtags using the custom Hive user-defined function. Obtain an array of hashtags. Explode it and obtain a lateral view to feed our hashtags table. This kind of processing is really useful if we want to get a feeling for what the top tweeted topics are, and it is most of the time represented by a word cloud chart like the one shown in the following diagram: Topic word cloud sample Let's do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components: One to connect to Hive; you can easily copy and paste the connection component from the CH03_03_HIVE_FORMAT_DATA job One tHiveRow to add the custom Hive UDF to the Hive runtime classpath. The following would be the steps to create a new job: First, we will add the following context variable to our PacktContext group: Name Value custom_udf_jar PATH_TO_THE_JAR For Example: /Users/bahaaldine/here/is/the/jar/extractHashTags.jar This new context variable is just the path to the Hive UDF JAR file provided in the source file. Now, we can add the "add jar "+context.custom_udf_jar Hive query in our tHiveRow component to load the JAR file in the classpath when the job is being run. We use the add jar query so that Hive will load all the classes in the JAR file when the job starts, as shown in the following screenshot: Adding a Custom UDF JAR to Hive classpath. After the JAR file is loaded by the previous component, we need tHiveRow to register the custom UDF into the available UDF catalog. The custom UDF is a Java class with a bunch of methods that can be invoked from Hive-QL code. The custom UDF that we need is located in the org.talend.demo package of the JAR file and is named ExtractPattern. So we will simply add the "create temporary function extract_patterns as 'org.talend.demo.ExtractPattern'" configuration to the component. We use the create temporary function query to create a new extract_patterns function in the Hive UDF catalog and give the implementation class contained in our package. We need one tHiveRow to drop the hashtags table if it exists.
As we have done in the CH03_02_HIVE_CREATE_TWEET_TABLE job, just add the "DROP TABLE IF EXISTS hash_tags" drop statement to be sure that the table is removed when we relaunch the job. We need one tHiveRow to create the hashtags table. We are going to create a new table to store the hashtags. For the purpose of simplicity, we'll only store the minimum time and description information, as shown in the following table:
Name Type
hash_tags_id String
day_of_week String
day_of_month String
time String
month String
hash_tags_label String
The essential information here is the hash_tags_label column, which contains the hashtag name. With this knowledge, the following is our create table query: CREATE EXTERNAL TABLE hash_tags ( hash_tags_id string, day_of_week string, day_of_month string, time string, month string, hash_tags_label string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags' Finally, we need a tHiveRow component to feed the hashtags table. Here, we are going to use all the assets provided by the previous components, as shown in the following query: insert into table hash_tags select concat(formatted_tweets.day_of_week, formatted_tweets.day_of_month, formatted_tweets.time, formatted_tweets.month) as hash_id, formatted_tweets.day_of_week, formatted_tweets.day_of_month, formatted_tweets.time, formatted_tweets.month, hash_tags_label from formatted_tweets LATERAL VIEW explode( extract_patterns(formatted_tweets.content,'#(\\w+)') ) hashTable as hash_tags_label Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse all the hashtags in formatted_tweets.content based on the regex #(\w+). In Talend, all strings are Java string objects. That's why we need to escape each backslash here. Hive also needs its special characters to be escaped, which brings us to four backslashes in total. The extract_patterns command returns an array that we inject into the explode Hive UDF in order to obtain a list of objects. We then pass them to the lateral view statement, which creates a new on-the-fly view called hashTable with one column, hash_tags_label. Take a breath. We are almost done. If we go one level up, we will see that we select all the required columns for our new hash_tags table, concatenate data to build hash_id, and dynamically select a runtime-built column called hash_tags_label provided by the lateral view. Finally, all the selected data is inserted in the hash_tags table. We just need to run the job, and then, using the following query, we will check in Hive if the new table contains our hashtags: $ select * from hash_tags The following diagram shows the complete hashtags-extracting job structure: Hive processing job Summary By now, you should have a good overview of how to use Apache Hive features with Talend, from the ELT mode to the lateral view, by way of the custom Hive user-defined function. From the point of view of a use case, we have now reached the step where we need to reveal some added-value information from our Hive-processed data. Resources for Article: Further resources on this subject: Visualization of Big Data [article] Managing Files [article] Making Big Data Work for Hadoop and Solr [article]
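If you want to sanity-check the '#(\w+)' pattern outside of Hive and Talend, the same extraction is easy to reproduce in Python. This is only an illustration of the regex logic; the job itself relies on the Java org.talend.demo.ExtractPattern UDF described above.

# Quick check of the hashtag regex used by the extract_patterns UDF.
import re

HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(tweet):
    """Return the hashtag labels found in a tweet, without the leading '#'."""
    return HASHTAG.findall(tweet)

print(extract_hashtags("Loving #Hive and #Talend for #BigData processing"))
# ['Hive', 'Talend', 'BigData']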
Sizing and Configuring your Hadoop Cluster

Oli Huggins
16 Feb 2014
10 min read
This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly. Sizing your Hadoop cluster Hadoop's performance depends on multiple factors based on well-configured software layers and well-dimensioned hardware resources that utilize its CPU, memory, hard drive (storage I/O), and network bandwidth efficiently. Planning the Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be out of the scope of this book. This is what we are trying to make clearer in this section by providing explanations and formulas in order to help you to best estimate your needs. We will introduce a basic guideline that will help you to make your decision while sizing your cluster and answer some "How to plan" questions about the cluster's needs, such as the following: How to plan my storage? How to plan my CPU? How to plan my memory? How to plan the network bandwidth? While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently and determine the disk/memory capacity of each one. Hadoop uses a Master/Slave architecture and needs a lot of memory and CPU resources. It has two main components: JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster TaskTracker: This runs tasks on each node of the cluster To work efficiently, HDFS must have high-throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large blocks). This pattern defines one big read (or write) at a time with a block size of 64 MB, 128 MB, up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block replication. HDFS is itself based on a Master/Slave architecture with two main components: the NameNode / Secondary NameNode and DataNode components. These are critical components and need a lot of memory to store the file's meta information, such as attributes and file localization, directory structure, and names, and to process data. The NameNode component ensures that data blocks are properly replicated in the cluster. The second component, the DataNode component, manages the state of an HDFS node and interacts with its data blocks. It requires a lot of I/O for processing and data transfer. Typically, the MapReduce layer has two main prerequisites: input datasets must be large enough to fill a data block and be split into smaller and independent data chunks (for example, a 10 TB text file can be split into 40,960 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is that it should consider data locality, which means that the MapReduce code is moved to where the data lies, not the opposite (it is more efficient to move a few megabytes of code to be close to the data to be processed, than to move many data blocks over the network or the disk). This involves having a distributed storage system that exposes data locality and allows the execution of code on any storage node.
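As a quick illustration of the splitting arithmetic behind the first prerequisite, the short Python sketch below computes how many HDFS blocks, and therefore roughly how many map tasks, a file of a given size produces for a given block size; the file sizes are example values only.

# Illustration of how file size and HDFS block size determine the number of
# blocks (and, with the default input formats, roughly one mapper per block).
import math

def block_count(file_size_mb, block_size_mb=256):
    return math.ceil(file_size_mb / block_size_mb)

for size_mb, label in [(10 * 1024, "10 GB file"), (10 * 1024 * 1024, "10 TB file")]:
    for block_mb in (64, 128, 256):
        print("%s with %d MB blocks -> %d blocks"
              % (label, block_mb, block_count(size_mb, block_mb)))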
Concerning the network bandwidth, it is used at two instances: during the replication process and following a file write, and during the balancing of the replication factor when a node fails. The most common practice to size a Hadoop cluster is sizing the cluster based on the amount of storage required. The more data into the system, the more will be the machines required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity. Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster. Daily data input 100 GB Storage space used by daily data input = daily data input * replication factor = 300 GB HDFS replication factor 3 Monthly growth 5% Monthly volume = (300 * 30) + 5% =  9450 GB After one year = 9450 * (1 + 0.05)^12 = 16971 GB Intermediate MapReduce data 25% Dedicated space = HDD size * (1 - Non HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100) = 4 * (1 - (0.25 + 0.30)) = 1.8 TB (which is the node capacity) Non HDFS reserved space per disk 30% Size of a hard drive disk 4 TB Number of DataNodes needed to process: Whole first month data = 9.450 / 1800 ~= 6 nodes The 12th month data = 16.971/ 1800 ~= 10 nodes Whole year data = 157.938 / 1800 ~= 88 nodes Do not use RAID array disks on a DataNode. HDFS provides its own replication mechanism. It is also important to note that for every disk, 30 percent of its capacity should be reserved to non-HDFS use. It is easy to determine the memory needed for both NameNode and Secondary NameNode. The memory needed by NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by Secondary NameNode should be identical to NameNode. Then you can apply the following formulas to determine the memory amount: NameNode memory 2 GB - 4 GB Memory amount = HDFS cluster management memory + NameNode memory + OS memory Secondary NameNode memory 2 GB - 4 GB OS memory 4 GB - 8 GB HDFS memory 2 GB - 8 GB At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB It is also easy to determine the DataNode memory amount. But this time, the memory amount depends on the physical CPU's core number installed on each DataNode. DataNode process memory 4 GB - 8 GB Memory amount = Memory per CPU core * number of CPU's core + DataNode process memory + DataNode TaskTracker memory + OS memory DataNode TaskTracker memory 4 GB - 8 GB OS memory 4 GB - 8 GB CPU's core number 4+ Memory per CPU core 4 GB - 8 GB At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB Regarding how to determine the CPU and the network bandwidth, we suggest using the now-a-days multicore CPUs with at least four physical cores per CPU. The more physical CPU's cores you have, the more you will be able to enhance your job's performance (according to all rules discussed to avoid underutilization or overutilization). For the network switches, we recommend to use equipment having a high throughput (such as 10 GB) Ethernet intra rack with N x 10 GB Ethernet inter rack. Configuring your cluster correctly To run Hadoop and get a maximum performance, it needs to be configured correctly. But the question is how to do that. Well, based on our experiences, we can say that there is not one single answer to this question. 
The experiences gave us a clear indication that the Hadoop framework should be adapted for the cluster it is running on and sometimes also to the job. In order to configure your cluster correctly, we recommend running a Hadoop job(s) the first time with its default configuration to get a baseline. Then, you will check the resource's weakness (if it exists) by analyzing the job history logfiles and report the results (measured time it took to run the jobs). After that, iteratively, you will tune your Hadoop configuration and re-run the job until you get the configuration that fits your business needs. The number of mappers and reducer tasks that a job should use is important. Picking the right amount of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; the others give different guidelines. This is because mapper tasks often process a lot of data, and the result of those tasks are passed to the reducer tasks. Often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. Also, the correct number of reducers must also be considered. The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of jobs that can run in parallel on DataNode. In a Hadoop cluster, master nodes typically consist of machines where one machine is designed as a NameNode, and another as a JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin starting the HDFS daemons on the master node and DataNode daemons on all data nodes machines. Then, you start the MapReduce daemons: JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemon's pseudo formula: When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve 2 CPU cores on each DataNode for the HDFS and MapReduce daemons. While in a small and medium data context, you can reserve only one CPU core on each DataNode. Once you have determined the maximum mapper's slot numbers, you need to determine the reducer's maximum slot numbers. Based on our experience, there is a distribution between the Map and Reduce tasks on DataNodes that give good performance result to define the reducer's slot numbers the same as the mapper's slot numbers or at least equal to two-third mapper slots. Let's learn to correctly configure the number of mappers and reducers and assume the following cluster examples: Cluster machine Nb Medium data size Large data size DataNode CPU cores 8 Reserve 1 CPU core Reserve 2 CPU cores DataNode TaskTracker daemon 1 1 1 DataNode HDFS daemon 1 1 1 Data block size 128 MB 256 MB DataNode CPU % utilization 95% to 120% 95% to 150% Cluster nodes 20 40 Replication factor 2 3 We want to use the CPU resources at least 95 percent, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor range between 120 percent and 170 percent. 
Maximum mapper's slot numbers on one node in a large data context = number of physical cores - reserved core * (0.95 -> 1.5) Reserved core = 1 for TaskTracker + 1 for HDFS Let's say the CPU on the node will use up to 120% (with Hyper-Threading) Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2 rounded down to 7 Let's apply the 2/3 mappers/reducers technique: Maximum number of reducers slots = 7 * 2/3 = 5 Let's define the number of slots for the cluster: Mapper's slots: = 7 * 40 = 280 Reducer's slots: = 5 * 40 = 200 The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context as well and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by only opening one block. In the default Hadoop configuration (set to 2 by default), two mapper tasks are needed to process the same amount of data. This may be considered as a drawback because initializing one more mapper task and opening one more file takes more time. Summary In this article, we learned about sizing and configuring the Hadoop cluster for optimizing it for MapReduce. Resources for Article: Further resources on this subject: Hadoop Tech Page Hadoop and HDInsight in a Heartbeat Securing the Hadoop Ecosystem Advanced Hadoop MapReduce Administration
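To make the sizing and slot formulas above easier to rerun when your inputs change, here is a small Python calculator that simply restates them; the inputs are the example values used in the text, so substitute your own daily volume, growth rate, disk size, and core counts.

# The article's sizing formulas restated as a small calculator.
# All inputs are the example values used in the text.
import math

daily_input_gb = 100
replication = 3
monthly_growth = 0.05
non_hdfs_reserve = 0.30    # share of each disk reserved for non-HDFS use
intermediate_mr = 0.25     # share reserved for intermediate MapReduce data
disk_tb = 4

monthly_gb = daily_input_gb * replication * 30 * (1 + monthly_growth)           # ~9450 GB
after_year_gb = monthly_gb * (1 + monthly_growth) ** 12                          # ~16971 GB
node_capacity_gb = disk_tb * 1000 * (1 - (non_hdfs_reserve + intermediate_mr))   # 1800 GB

print("DataNodes for the first month :", math.ceil(monthly_gb / node_capacity_gb))     # 6
print("DataNodes for the 12th month  :", math.ceil(after_year_gb / node_capacity_gb))  # 10

# Map/reduce slots for the large-data example: 8 cores, 2 reserved,
# Hyper-Threading factor 1.2, 40 DataNodes.
cores, reserved, ht_factor, nodes = 8, 2, 1.2, 40
mapper_slots = int((cores - reserved) * ht_factor)   # 7, rounded down
reducer_slots = round(mapper_slots * 2 / 3)          # 5, using the 2/3 rule
print("Mapper slots in the cluster  :", mapper_slots * nodes)    # 280
print("Reducer slots in the cluster :", reducer_slots * nodes)   # 200

Note that the text rounds the mapper slots down and the reducer slots to the nearest whole number, which is what int() and round() reproduce here.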

Changing the Appearance

Packt
13 Feb 2014
14 min read
(For more resources related to this topic, see here.) Controlling appearance An ADF Faces application is a modern web application, so the technology used for controlling the look of the application is Cascading Style Sheets (CSS). The idea behind CSS is that the web page (in HTML) should contain only the structure and not information about appearance. All of the visual definitions must be kept in the style sheet, and the HTML file must refer to the style sheet. This means that the same web page can be made to look completely different by applying a different style sheet to it. The Cascading Style Sheets basics In order to change the appearance of your application, you need to understand some CSS basics. If you have never worked with CSS before, you should start by reading one of the many CSS tutorials available on the Internet. To start with, let's repeat some of the basics of CSS. The CSS layout instructions are written in the form of rules. Each rule is in the following form: selector { property: value; } The selector function identifies which part of the web page the rule applies to, and the property/value pairs define the styling to be applied to the selected parts. For example, the following rule defines that all <h1> elements should be shown in red font: h1 { color: red; } One rule can include multiple selectors separated by commas, and multiple property values separated by semicolons. Therefore, it is also a valid CSS to write the following line of code to get all the <h1>, <h2>, and <h3> tags shown in large, red font: h1, h2, h3 { color: red; font-size: x-large; } If you want to apply a style with more precision than just every level 1 header, you define a style class, which is just a selector starting with a period, as shown in the following line of code: .important { color: red; font-weight: bold } To use this selector in your HTML code, you use the keyword class inside an HTML tag. There are three ways of using a style class. They are as follows: Inside an existing tag: <h1> Inside the special <span> tag to style the text within a paragraph Inside a <div> tag to style a whole paragraph of text Here are examples of all the three ways: <h1 class="important">Important topic</h1>You <span class="important">must</span> remember this.<div class="important">Important tip</div> In theory, you can place your styling information directly in your HTML document using the <style> tag. In practice, however, you usually place your CSS instructions in a separate .css file and refer to it from your HTML file with a <link> tag, as shown in the following line of code: <link href="mystyle.css" rel="stylesheet" type="text/css"> Styling individual components The preceding examples can be applied to HTML elements, but styling can also be applied to JSF components. A plain JSF component could look like the following code with inline styling: <h:outputFormat value="hello" style="color:red;"/> It can also look like the line of code shown using a style class: <h:outputFormat value="hello" styleClass="important"/> ADF components use the inlineStyle attribute instead of just style as shown in the following line of code: <af:outputFormat value="hello" inlineStyle="color:red;"/> The styleClass attribute is the same, as shown in the following line of code: <af:outputFormat value="hello" styleClass="important"/> Of course, you normally won't be setting these attributes in the source code, but will be using the StyleClass and InlineStyle properties in the Property Inspector instead. 
In both HTML and JSF, you should only use StyleClass so that multiple components can refer to the same style class and will reflect any change made to the style. InlineStyle is rarely used in real-life ADF applications; it adds to the page size (the same styling is sent for every styled element), and it is almost impossible to ensure that every occurrence is changed when the styling requirements change—as they will. Building a style While you are working out the styles you need in your application, you can use the Style section in the JDeveloper Properties window to define the look of your page, as shown in the following screenshot. This section shows six small subtabs with icons for font, background, border/outline, layout, table/list, and media/animation. If you enter or select a value on any of these tabs, this value will be placed into the InlineStyle field as a correctly formatted CSS. When your items look the way you want, copy the value from the InlineStyle field to a style class in your CSS file and set the StyleClass property to point to that class. If the style discussed earlier is the styling you want for a highlighted label, create a section in your CSS file, as shown in the following code: .highlight {background-color:blue;} Then, clear the InlineStyle property and set the StyleClass property to highlight. Once you have placed a style class into your CSS file, you can use it to style the other components in exactly the same way by simply setting the StyleClass property. We'll be building the actual CSS file where you define these style classes. InlineStyle and ContentStyle Some JSF components (for example, outputText) are easy to style—if you set the font color, you'll see it take effect in the JDeveloper design view and in your application, as shown in the following screenshot: Other elements (for example, inputText) are harder to style. For example, if you want to change the background color of the input field, you might try setting the background color, as shown in the following screenshot: You will notice that this did not work the way you reasonably expected—the background behind both the label and the actual input field changes. The reason for this is that an inputText component actually consists of several HTML elements, and an inline style applies to the outermost element. In this case, the outermost element is an HTML <tr> (table row) tag, so the green background color applies to the entire row. To help mitigate this problem, ADF offers another styling option for some components: ContentStyle. If you set this property, ADF tries to apply the style to the content of a component—in the case of an inputText, ContentStyle applies to the actual input field, as shown in the following screenshot: In a similar manner, you can apply styling to the label for an element by setting the LabelStyle property. Unravelling the mysteries of CSS styling As you saw in the Input Text example, ADF components can be quite complex, and it's not always easy to figure out which element to style to achieve the desired result. To be able to see into the complex HTML that ADF builds for you, you need a support tool such as Firebug. Firebug is a Firefox extension that you can download by navigating to Tools | Add-ons from within Firefox, or you can go to http://getfirebug.com. 
When you have installed Firebug, you see a little Firebug icon to the far right of your Firefox window, as shown in the following screenshot: When you click on the icon to start Firebug, you'll see it take up the lower half of your Firefox browser window. Only run Firebug when you need it Firebug's detailed analysis of every page costs processing power and slows your browser down. Run Firebug only when you need it. Remember to deactivate Firebug, not just hide it. If you click on the Inspect button (with a little blue arrow, second from the left in the Firebug toolbar), you place Firebug in inspect mode. You can now point to any element on a page and see both the HTML element and the style applied to this element, as shown in the following screenshot: In the preceding example, the pointer is placed on the label for an input text, and the Firebug panels show that this element is styled with color: #0000FF. If you scroll down in the right-hand side Style panel, you can see other attributes such as font-family: Tahoma, font-size: 11px, and so on. In order to keep the size of the HTML page smaller so that it loads faster, ADF has abbreviated all the style class names to cryptic short names such as .x10. While you are styling your application, you don't want this abbreviation to happen. To turn it off, you need to open the web.xml file (in your View project under Web Content | WEB-INF). Change to the Overview tab if it is not already shown, and select the Application subtab, as shown in the following screenshot: Under Context Initialization Parameters, add a new parameter, as shown: Name: org.apache.myfaces.trinidad.DISABLE_CONTENT_COMPRESSION Value: true When you do this, you'll see the full human-readable style names in Firebug, as shown in the following screenshot: You notice that you now get more readable names such as .af_outputLabel. You might need this information when developing your custom skin. Conditional formatting Similar to many other properties, the style properties do not have to be set to a fixed value—you can also set them to any valid expression written in Expression Language (EL). This can be used to create conditional formatting. In the simplest form, you can use an Expression Language ternary operator, which has the form <boolean expression> ? <value if true > : <value if false>. For example, you could set StyleClass to the following line of code: #{bindings.Job.inputValue eq 'MANAGER' ? 'managerStyle' :  'nonManagerStyle'} The preceding expression means that if the value of the Job attribute is equal to MANAGER, use the managerStyle style class; if not, use the nonManagerStyle style class. Of course, this only works if these two styles exist in your CSS. Skinning overview An ADF skin is a collection of files that together define the look and feel of the application. To a hunter, skinning is the process of removing the skin from an animal, but to an ADF developer it's the process of putting a skin onto an application. All applications have a skin—if you don't change it, an application built with JDeveloper 12c uses some variation of the skyros skin. When you define a custom skin, you must also choose a parent skin among the skins JDeveloper offers. This parent skin will define the look for all the components not explicitly defined in your skin. Skinning capabilities There are a few options to change the style of the individual components through their properties. 
However, with your own custom ADF skin, you can globally change almost every visual aspect of every instance of a specific component. To see skinning in action, you can go to http://jdevadf.oracle.com/adf-richclient-demo. This site is a demonstration of lots of ADF features and components, and if you choose the Skinning header in the accordion to the right, you are presented with a tree of skinnable components, as shown in the following screenshot: You can click on each component to see a page where you can experiment with various ways of skinning the component. For example, you can select the very common InputText component to see a page with various representations of the input text components. On the left-hand side, you see a number of Style Selectors that are relevant for that component. For each selector, you can check the checkbox to see an example of what the component looks like if you change that selector. In the following example, the af|inputText:disabled::content selector is checked, thus setting its style to color: #00C0C0, as shown in the following screenshot: As you might be able to deduce from the af|inputText:disabled::content style selector, this controls what the content field of the input text component looks like when it is set to disabled—in the demo application, it is set to a bluish color with the color code #00C0C0. The example application shows various values for the selectors but doesn't really explain them. The full documentation of all the selectors can be found online, at http://jdevadf.oracle.com/adf-richclient-demo/docs/skin-selectors.html. If it's not there, search for ADF skinning selectors. On the menu in the demo application, you also find a Skin menu that you can use to select and test all the built-in skins. This application can also be downloaded and run on your own server. It could be found on the ADF download page at http://www.oracle.com/technetwork/developer-tools/adf/downloads/index.html, as shown in the following screenshot: Skinning recommendations If your graphics designer has produced sample screens showing what the application must look like, you need to find out which components you will use to implement the required look and define the look of these components in your skin. If you don't have a detailed guideline from a graphics designer, look for some guidelines in your organization; you probably have web design guidelines for your public-facing website and/or intranet. If you don't have any graphics guidelines, create a skin, as described later in this section, and choose to inherit from the latest skyros skin provided by Oracle. However, don't change anything—leave the CSS file empty. If you are a programmer, you are unlikely to be able to improve on the look that the professional graphics designers at Oracle in Redwood Shores have created. The skinning process The skinning process in ADF consists of the following steps: Create a skin CSS file. Optionally, provide images for your skin. Optionally, create a resource bundle for your skin. Package the skin in an ADF library. Import and use the skin in the application. In JDeveloper 11g Release 1 ( 11.1.1.x) and earlier versions, this was very much a manual process. Fortunately, from 11g Release 2, JDeveloper has a built-in skinning editor. Stand-alone skinning If you are running JDeveloper 11g Release 1, don't despair. Oracle is making a stand-alone skinning editor available, containing the same functionality that is built into the later versions of JDeveloper. 
You can give this tool to your graphic designers and let them build the skin ADF library without having to give them the complete JDeveloper product. ADF skinning is a huge topic, and Oracle delivers a whole manual that describes skinning in complete detail. This document is called Creating Skins with Oracle ADF Skin Editor and can be found at http://docs.oracle.com/middleware/1212/skineditor/ADFSG. Creating a skin project You should place your skin in a common workspace. If you are using the modular architecture, you create your skin in the application common workspace. If you are using the enterprise architecture, you create your enterprise common skin in the application common workspace and then possibly create application-specific adaptations of that skin in the application common workspaces. The skin should be placed in its own project in the common workspace. Logically, it could be placed in the common UI workspace, but because the skin will often receive many small changes during certain phases of the project, it makes sense to keep it separate. Remember that changing the skin only affects the visual aspects of the application, but changing the page templates could conceivably change the functionality. By keeping the skin in a separate ADF library, you can be sure that you do not need to perform regression testing on the application functionality after deploying a new skin. To create your skin, open the common workspace and navigate to File | New | Project. Choose the ADF ViewController project, give the project the name CommonSkin, and set the package to your application's package prefix followed by .skin (for example, com.dmcsol.xdm.skin).

Securing QlikView Documents

Packt
16 Jan 2014
6 min read
(For more resources related to this topic, see here.)
Making documents available to the correct users can be handled in several different ways, depending on your implementation and license structure. These methods are not mutually exclusive, and you may choose to implement a combination of them.

By license
If you only have named Document users, you can restrict access to documents by simply not granting users a license. If users do not have a Document license for a particular document, they may be able to see that document in AccessPoint, but they will not be able to open it. You will need to turn off any automatic allocation of licenses for both Document licenses and Named User licenses, or the system will simply override your security by allocating an available license and giving the user access to that document.

This only works for Document license users. Named User license holders cannot be locked out of a document this way, because they hold a license that allows them to open any number of documents, so they cannot be restricted. The fact that this is user based (a Document license can only be granted to a user, not a group) also means that there is no option to secure by a named group.

This is the most basic, least flexible, and least user-friendly way to implement security. While it will certainly stop users from getting access to documents, and it will work in either NTFS or DMS security mode, it can be frustrating for users to see a document that they think they can open, only to get a NO CAL error when they try to open it. The QlikView file will also need appropriate NTFS or DMS security so that users are able to access it. The easiest way to do this is to grant access to a group that all the users are in, or even to allow access to an Authenticated Users group.

Section Access
Section Access security is a very effective way of restricting a document to the correct set of users, because a user must actually be listed in the Section Access user list for the document even to be listed in AccessPoint for them. Additionally, if Section Access is in place, a user cannot connect by using a direct access URL, because they have no security access to the data. This method of securing documents works well in both NTFS and DMS security modes.

When using the NTLM (Windows authentication via Internet Explorer) authentication method, you can have group names listed in Section Access. However, when using alternative authentication, Section Access does not give us an option to secure by group. As with the license method discussed earlier, appropriate file security needs to be in place to allow all the users to access the QlikView file.
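Section Access itself is defined in the document's load script. The following minimal sketch shows the general shape of a Section Access table under NTLM authentication; the domain, user, and group names are made up purely for illustration, and a real script would normally load this table from a maintained source rather than inline:

// Minimal Section Access sketch; MYDOMAIN and the account names are invented.
// With NTLM authentication, NTNAME can also hold a Windows group name (for example, MYDOMAIN\QLIKUSERS).
SECTION ACCESS;
LOAD * INLINE [
    ACCESS, NTNAME
    ADMIN, MYDOMAIN\QVADMIN
    USER, MYDOMAIN\JSMITH
    USER, MYDOMAIN\QLIKUSERS
];

SECTION APPLICATION;
// The document's normal data load script follows here

Remember to reload the document after changing this table, and always keep at least one ADMIN account that you can still use to open the document.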
NTFS Access Control List (ACL)
NTFS (Microsoft's NT File System) security is the default method of securing access to files in a QlikView implementation. It works very well for installations where all the users are Windows users within the same domain or a set of trusted domains. In NTFS security mode, the Access Control List (ACL) of the QlikView file is used to decide which documents are listed for a particular user. This is a very straightforward way of securing access and will be very familiar to Windows system administrators. As with normal Windows file security, the security can be applied at the folder level, and Windows security groups can also be used. Between groups and folder security, very flexible levels of security can be applied.

By default, Internet Explorer and Google Chrome will pass the Kerberos/NTLM credentials through to sites in the local intranet zone. For other browsers, such as Safari on the iPad, the user will be prompted for a username and password. When a user connects to AccessPoint and their credentials are established, they are compared against the ACLs of all the files hosted by QVS. Only those files that the user has access to, either directly by name or by group membership, will be listed in AccessPoint.

Document Metadata Service (DMS)
For non-Windows users, QlikView provides a way of managing user access to files called the Document Metadata Service (DMS). DMS uses a .META file in the same folder as the .QVW file to store the Access Control List. The Windows ACL, which holds the permissions on the file, now becomes mostly irrelevant, because it is no longer used to authorize users; only the QlikView service account needs access to the file. It is a binary choice between using NTFS or DMS security on any one QlikView Server.

Enabling DMS
To enable DMS, we need to make a change to the server configuration. In the QlikView Management Console, on the Security tab of the QVS settings screen, we change Authorization to DMS authorization and then click on the Apply button. The QlikView Server service will need to be restarted for this change to take effect. Once the service has restarted, a new tab, Authorization, becomes available in the document properties:

Clicking on the + button to the right of this tab allows you to enter new details of Access, User Type, and specific Users and Groups. Access is set to either Always or Restricted. When Access is set to Always, the associated user or group has access at any time. If it is set to Restricted, you can specify a time range and specific days when the user or group has access. You can keep clicking on the + button to add as many sets of restricted times as needed for a user or group. The restrictions are additive; that is, if a user has access on Monday and Tuesday in one group of restrictions and on Thursday and Friday in another, they will have access on all four days.

The User Type is one of the following categories:
All Users: Essentially no security. Any user, including anonymous users, who can access the server will be able to access the file.
All Authenticated Users: For most implementations, this will also be all users; however, it will not give access to anonymous users. Section Access would typically be used to manage the security.
Named Users: This allows you to specify a list of named users and/or groups that will have specific access to the document. If Named Users is selected, a Manage Users button appears that allows you to specify those users and/or groups.

Summary
In this article, we have looked at several ways of securing QlikView documents: by license, using Section Access, utilizing NTFS ACLs, and implementing QlikView's DMS authorization.

Resources for Article:
Further resources on this subject: Common QlikView script errors [Article] Introducing QlikView elements [Article] Meet QlikView [Article]

Aspects of Data Manipulation in R

Packt
10 Jan 2014
6 min read
(For more resources related to this topic, see here.)
Factor variables in R
In any data analysis task, the majority of the time is dedicated to data cleaning and pre-processing; it is often said that about 80 percent of the effort is devoted to data cleaning before conducting the actual analysis. In real-world data we also frequently work with categorical variables. A variable that takes only a limited number of distinct values is known as a categorical variable, and in R it is known as a factor. Working with categorical variables in R is a bit technical, and in this article we have tried to demystify the process of dealing with them.

During data analysis, the factor variable sometimes plays an important role, particularly when studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. If we then take any subset of that factor variable, it still inherits all the levels from the original factor, and this feature sometimes creates confusion when interpreting results.

Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We could create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform the operation. In R, cut is a generic command for creating factor variables from numeric variables.

Split-apply-combine strategy
Data manipulation is an integral part of data cleaning and analysis. For large data it is often preferable to perform operations within subgroups of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large-scale data it requires a considerable amount of coding and eventually takes longer to process. With big data we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results back into a single output. Doing this split with base R alone is not efficient, and to overcome this limitation Wickham developed the plyr package, which efficiently implements the split-apply-combine strategy.

We often require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break a big problem into manageable pieces, perform the operation on each piece separately, and finally combine the output of each piece into a single output. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply, and the reduce step corresponds to combine. The map-reduce approach is primarily designed for a highly parallel environment where the work is done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see similarities between problems across subgroups that were previously unconnected. The same strategy appears in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.
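To make the cut command and the split-apply-combine idea concrete, here is a small self-contained sketch; the data frame, column names, and break points are invented purely for illustration, and the plyr package is assumed to be installed:

# Illustrative data only: made-up income observations in two regions
library(plyr)

income <- data.frame(
  region = c("North", "North", "North", "South", "South", "South", "North", "South"),
  amount = c(120, 250, 310, 90, 400, 220, 180, 150)
)

# cut() turns the numeric variable into a factor with labelled categories
income$band <- cut(income$amount,
                   breaks = c(0, 150, 300, Inf),
                   labels = c("low", "medium", "high"))

# Split-apply-combine with plyr: split by region, summarize each piece,
# and combine the per-group results into a single data frame
ddply(income, .(region), summarize,
      mean_amount = mean(amount),
      n           = length(amount))

The same group-wise summary could be written with base R (for example, using split and lapply followed by rbind), but plyr packages the split, apply, and combine steps into a single call.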
Reshaping a dataset
Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we often need to reorient it to perform certain types of analyses. A dataset's layout can be long or wide: in the long layout, multiple rows represent a single subject's record, whereas in the wide layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, so we need to be able to reshape the data fluently to meet the requirements of the analysis. Data reshaping is just a rearrangement of the form of the data; it does not change the content of the dataset. In this article, we will show different layouts of the same dataset and see how they can be transferred from one layout to another. The article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the contributed reshape package. The same package was later reimplemented under a new name, reshape2, which is much more time and memory efficient.

A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about a dataset, we think of a two-dimensional arrangement where a row represents the information for all the variables for one subject (a subject could be a person, typically the respondent in a survey) and a column represents the information for one characteristic across all subjects. In other words, rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play the role of an identifier, and the others are measured characteristics. For the purpose of reshaping, we can group the variables into two sets: identifier variables and measured variables.
Identifier variables: These help to identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is called a primary key, and it can be a single variable or a composite of multiple variables.
Measured variables: These are the characteristics whose information we took from a subject of interest. They can be qualitative, quantitative, or a mix of both.

Now, going beyond this typical structure, we can think of a layout that contains only identification variables and a value. The identification variables identify a subject along with which measured variable the value represents. In this new paradigm, each row represents one observation of one variable. This is termed melting, and it produces molten data. The difference between this new layout and the typical layout is that it now contains only the ID variables and a new column, value, which holds the value of that observation.

Summary
This article briefly explained factor variables, the split-apply-combine strategy, and reshaping a dataset in R.

Resources for Article:
Further resources on this subject: SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article] Working with Data Application Components in SQL Server 2008 R2 [Article] Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]


Meet QlikView

Packt
02 Jan 2014
8 min read
(For more resources related to this topic, see here.)
What is QlikView?
QlikView is a software tool developed by QlikTech, a company that was founded in Sweden in 1993 but is now headquartered in the United States. QlikView is a tool used for Business Intelligence, commonly abbreviated as BI. Business Intelligence is defined by Gartner, a leading industry analyst firm, as:
A general term that includes the applications, infrastructure, tools, and best practices that enable access to and analysis of information in order to improve and optimize a company's decision making and performance.
Following this definition, QlikView is a tool that provides access to information and enables data analysis, which in turn improves and optimizes business decision making and, with it, business performance.
Historically, Business Intelligence has been driven mainly by the IT departments of companies. IT departments were responsible for the entire life cycle of a Business Intelligence solution, from extracting the data to delivering the final reports, analyses, and dashboards. Although this model works well for distributing predefined static reports, most companies have come to realize that it does not meet the needs of their business users. Because IT closely controls the data and tools, users often experience long waits whenever new business questions arise that cannot be answered with the standard reports.
How does QlikView differ from traditional BI tools?
QlikTech prides itself on approaching Business Intelligence in a different way from what companies such as Oracle, SAP, and IBM (described by QlikTech as traditional BI vendors) offer. QlikTech seeks to put the tools in the hands of business users, allowing them to be self-sufficient because they can carry out their own analyses.
Independent industry analyst firms have also noticed this different approach. In 2011, Gartner created a subcategory for Data Discovery tools in its annual market evaluation, the Magic Quadrant for Business Intelligence platforms. QlikView was the standard-bearer in this new category of BI tools.
QlikTech prefers to describe its product as a Business Discovery tool rather than a Data Discovery tool. It maintains that discovering things about the business is much more important than discovering data. The following diagram illustrates this paradigm. Source: QlikTech
Besides the difference in who uses the tool (IT users versus business users), there are some other capabilities that differentiate QlikView from other solutions.
Associative user experience
The main difference between QlikView and other BI solutions is the associative user experience. While traditional BI solutions use predefined paths to navigate and explore data, QlikView allows users to take any route they wish to carry out their analysis. This results in a much more intuitive way of navigating the data. QlikTech describes this as "working the way the human mind works". An example is shown in the following image.
Whereas in a typical BI solution we would have to start by selecting a Region and then step down the defined hierarchical path, in QlikView we can choose any entry point we like: Region, State, Product, or Salesperson. As we navigate the data, we are presented only with the information related to our selection, and for our next selection we can take any path we wish. Navigation is infinitely flexible.
Additionally, the QlikView user interface lets us see which data is associated with our selection. For example, the following screenshot (from the QlikTech demo document called What's New in QlikView 11) shows a QlikView dashboard in which two values are selected. In the Quarter field, the value Q3 is selected, and in the Sales Reps field, Cart Lynch is selected. We can see this because those values are shown in green, which means they have been selected. When a selection is made, the interface automatically updates not only to show the data associated with this new query, but also the data that is not associated with the selection. Associated data appears with a white background, while non-associated data has a gray background.
Sometimes the associations can be quite obvious; it is no surprise that the third quarter of the year is associated with the months of July, August, and September. On other occasions, however, we come across less obvious associations, such as the fact that Cart Lynch has not sold any products in Germany or Spain. This extra information, which is not visible in traditional BI tools, can be of great value because it offers a new starting point for data exploration.
Technology
QlikView's main technological differentiator is that it uses an in-memory data model; that is, all the information the user interacts with is stored in RAM instead of on disk. Because RAM is much faster than disk, response times are very short, producing a very fluid user experience. In a later section of this chapter, we will go a little deeper into the technology behind QlikView.
Adoption
There is another difference between QlikView and traditional BI solutions, and it lies in the way QlikView is implemented within a company. While traditional BI solutions are typically implemented top-down (IT selects a BI tool for the whole company), QlikView usually takes a bottom-up adoption path. Business users in a single department implement it locally, and its use spreads from there.
QlikView can be downloaded free of charge for personal use. This version is called QlikView Personal Edition, or PE. Documents created in QlikView Personal Edition can be opened by users with a full license of the software or published through QlikView Server. The limitation is that, apart from some documents enabled for PE by QlikTech, a QlikView Personal Edition user cannot open documents created by another user or on another machine; sometimes they cannot even open their own documents if those documents have been opened and saved by another user or server instance.
Often, a business user will decide to download QlikView to see whether it can solve a business problem. When other users within the department see the software, they become increasingly enthusiastic about the tool, and each of them downloads the program. In order to share documents, they decide to buy a few licenses for the department. Then other departments start to take notice as well, and QlikView gains traction within the organization. Shortly afterwards, IT and company management start to take notice too, which eventually leads to the adoption of QlikView across the whole company. QlikView facilitates every step of this process, scaling from a deployment on a single personal computer to organization-wide deployments with thousands of users. The following image illustrates this growth within an organization:
As QlikView's popularity and track record within the organization grow, it gains more and more visibility at the enterprise level. Although the adoption path described above is probably the most common scenario, it is no longer unusual for a company to opt for a top-down, enterprise-wide QlikView implementation from the outset.