How-To Tutorials

7018 Articles
David Resseguie
31 Mar 2015
6 min read

Hacking toys with IFTTT and Spark

Open up even the simplest of toys and you’ll often be amazed at the number of interesting electronic components inside. This is especially true of many of the otherwise “throw away” toys found in fast food kids’ meals. I’ve tried to make a habit of salvaging as many parts as possible from such toys so I can use them in future projects. (And I recommend you do the same!) But what if we could use the toy itself as a basis for a new project? In this post, we’ll look at one example of how we can Internet-enable a simple LED lantern toy using a wireless Spark Core device and the powerful IFTTT service.

This particular LED lantern is operated by a standard on-off switch, and inside is a single LED, three coin batteries, and a simple switch mechanism for connecting and disconnecting power. Like many fast-food premiums, the lantern uses “tamper proof” triangular screws. If you don’t have the appropriate bit, you can usually make do with a small straight edge screwdriver. In addition to screws, some toys are also glued or sonic welded together, which makes them difficult to open without damaging the plastic beyond repair. Not shown in this photo is a small plastic piece that holds all the components in place.

To programmatically control our lantern, we want to remove the batteries and run jumper cables to a pin on our microcontroller instead. Here is an exposed view after also removing the switch mechanism and attaching female-male jumper cables to the positive and negative leads of the LED.

The next step is to hook our lantern up to the Spark Core. We chose the Spark Core for this project for two primary reasons. First, the Spark’s small size is well suited to toy hacking, especially for projects where you want to completely embed the electronics inside the finished product. Second, there is already a Spark channel on IFTTT that allows us to remotely trigger actions. More on that later! But before we go too far, let’s test our Spark setup to be sure we can power the LED.
Run the jumper cable from the positive lead to pin D0 and the negative lead to GND. Now let’s write a simple Spark application that turns the LED on and off. Using Spark’s Web IDE, flash the following program onto your Spark Core. This will cause the LED to blink on and off at one-second intervals.

    // The LED's positive lead is wired to pin D0
    int led = D0;

    void setup() {
        pinMode(led, OUTPUT);
    }

    void loop() {
        digitalWrite(led, HIGH);  // LED on
        delay(1000);              // wait one second
        digitalWrite(led, LOW);   // LED off
        delay(1000);
    }

But to really make our project useful, we need to hook it up to the Internet and respond to remote triggers for controlling the LED. IFTTT (pronounced like “gift” without the “g”) is a web-based service for connecting a variety of other online services and devices through “recipes”. An IFTTT recipe is of the form “If [this] then [that]”. The services that can be combined to fill in those blanks are called “channels”. IFTTT has dozens of channels to pick from, including email, SMS, Twitter, etc. But especially important to us: there is a Spark channel that allows Spark devices to serve as both triggers and actuators. For this project, we’ll set up our Spark as an actuator that turns on the LED when the “if this” condition is met.

To trigger our lantern, we could use any number of IFTTT channels, but for simplicity, let’s connect it up to the Yo smartphone app. Yo is a (rather silly) app that just lets you send a “yo” message to friends. The Yo channel for IFTTT allows you to trigger recipes by Yo-ing IFTTT. Load the app onto your smartphone and add IFTTT as a contact by clicking the + button and typing “IFTTT” in the username field.

If you haven’t already done so, create an IFTTT account and go to the “Channels” tab to activate the Yo and Spark channels. In both cases, you’ll have to log in to your respective accounts and authorize IFTTT. The process is straightforward and the IFTTT website walks you through it. Once you’ve done this, you’re ready to create your first recipe.
Click the “Create a Recipe” button found on the “My Recipes” tab. IFTTT will walk you through setting up both the trigger and action. For the “if this” condition, select your Yo channel and the “You Yo IFTTT” trigger. For the “then that” action, select the Spark channel and the “Publish an event” action. Name the event (I just used “yo”) and select the “private event” option. (It doesn’t matter what you enter as the data field; we’re just going to ignore it anyway.) Name your recipe and click “Create Recipe” to finish the process. Your new recipe will now show up in your personal recipe list.

Now we need to modify our Spark code to listen for our “yo” events. Back in the Spark Web IDE, change the code to the following. Instead of turning the LED on and off in the loop() function, we register an event listener using Spark.subscribe() and turn the LED on for five seconds inside the callback function.

    int led = D0;

    void setup() {
        // Listen for "yo" events published under your own account
        Spark.subscribe("yo", yoHandler, MY_DEVICES);
        pinMode(led, OUTPUT);
    }

    void loop() {}  // nothing to do here; the work happens in the handler

    // Called whenever a "yo" event arrives; the event data is ignored
    void yoHandler(const char *event, const char *data) {
        digitalWrite(led, HIGH);
        delay(5000);  // keep the LED lit for five seconds
        digitalWrite(led, LOW);
    }

Once you’ve flashed this update to your Spark, it’s time to test it out! Be sure the Spark is breathing cyan (meaning it has a connection to the Spark cloud; flashing cyan means it is still connecting) and then use your smartphone to Yo IFTTT. The LED should light up for five seconds, then turn back off and wait again for the next “yo” event. Note that the “yo” events will be broadcast to all your Spark devices if you have more than one, so you could set up multiple hacked toys and send your greetings to several people at once. And if you choose to use public events, you could even trigger events for family and friends around the world.

All that’s left to do is package up the lantern by screwing everything back together.
For a more permanent solution, instead of running the wires out to the external Spark, you could carefully fit the Spark and a small LiPo battery inside the lantern as well.

I hope this post has inspired you to give new life to broken or disposable toys you have around the house. If you build something really cool, I’d love to see it. Consider sharing your project on the hackster.io Spark community.

About the author

David Resseguie is a member of the Computational Sciences and Engineering Division at Oak Ridge National Laboratory and lead developer for Sensorpedia. His interests include human computer interaction, Internet of Things, robotics, data visualization, and STEAM education. His current research focus is on applying social computing principles to the design of information sharing systems.
Alex Browne
31 Mar 2015
9 min read

Performing hand-written digit recognition with GoLearn

In this step-by-step post, you'll learn how to do basic recognition of hand-written digits using GoLearn, a machine learning library for Go. I'll assume you are already comfortable with Go and have a basic understanding of machine learning. To learn Go, I recommend the interactive tutorial. And to learn about machine learning, I recommend Andrew Ng's Machine Learning course on Coursera. All of the code for this tutorial is available on GitHub.

Installation & Set Up

To follow along with this post, you will need to install:

  • Go version 1.2 or later
  • The GoLearn package

Also, make sure that you follow the instructions for setting up your Go work environment. In particular, you will need to have the GOPATH environment variable pointing to a directory where all of your Go code will reside.

Project Structure

Now is a good time to set up the directory where your code for this project will reside. Somewhere in your $GOPATH/src, create a new directory and call it whatever you want. I recommend $GOPATH/src/github.com/your-github-username/golearn-digit-recognition. Our basic project structure is going to look like this:

    golearn-digit-recognition/
        data/
            mnist_train.csv
            mnist_test.csv
        main.go

The data directory is where we'll put our training and test data, and our program is going to consist of a single file: main.go.

Getting the Training Data

As I mentioned, in this post we're going to be using GoLearn to recognize hand-written digits. The training data we'll use comes from the popular MNIST handwritten digit database. I've already split the data into training and test sets and formatted it in the way GoLearn expects. You can simply download the CSV files and put them in your data directory:

  • Training Data
  • Test Data

The data consists of a series of 28x28 pixel grayscale images and labels for the corresponding digit (0-9). 28x28 = 784, so there are 784 features. In the CSV files, the pixels are labeled pixel0-pixel783.
Each pixel can take on a value between 0 and 255, where 0 is white and 255 is black. There are 5,000 rows in the training data, and 500 in the test data.

Writing the Code

Without further ado, let's write a simple program to detect hand-written digits. Open up the main.go file in your favorite text editor and add the following lines:

    package main

    import (
        "fmt"

        "github.com/sjwhitworth/golearn/base"
    )

    func main() {
        // Load and parse the data from csv files
        fmt.Println("Loading data...")
        trainData, err := base.ParseCSVToInstances("data/mnist_train.csv", true)
        if err != nil {
            panic(err)
        }
        testData, err := base.ParseCSVToInstances("data/mnist_test.csv", true)
        if err != nil {
            panic(err)
        }

        // Go refuses to compile unused variables. This placeholder keeps the
        // partial program runnable; delete it once both variables are used.
        _, _ = trainData, testData
    }

The ParseCSVToInstances function reads the CSV file and converts it into "Instances," which is simply a data structure that GoLearn can understand and manipulate. You should run the program with go run main.go to make sure everything works so far.

Next, we're going to create a linear Support Vector Classifier, which is a type of Support Vector Machine where the output is the probability that the input belongs to some class. In our case, there are 10 possible classes representing the digits 0 through 9, so our SVC will consist of 10 SVMs, each of which outputs the probability that the input belongs to a certain class. The SVC will then simply output the class with the highest probability.

Modify main.go by importing the linear_models package from golearn:

    import (
        // ...
        "github.com/sjwhitworth/golearn/linear_models"
    )

Then add the following lines:

    func main() {
        // ...

        // Create a new linear SVC with some good default values
        classifier, err := linear_models.NewLinearSVC("l1", "l2", true, 1.0, 1e-4)
        if err != nil {
            panic(err)
        }

        // Don't output information on each iteration
        base.Silent()

        // Train the linear SVC
        fmt.Println("Training...")
        classifier.Fit(trainData)
    }

You can read more about the different parameters for the SVC here. I found that these parameters give pretty good results. After we've created the classifier, training it is as simple as calling classifier.Fit(). Now might be a good time to run go run main.go again to make sure everything compiles and works as expected. If you want to see some details about what's going on with the classifier, comment out or remove the base.Silent() line.

Finally, we can test the accuracy of our SVC by making predictions on the test data and then comparing our predictions to the expected output. GoLearn makes it really easy to do this. Just modify main.go as follows:

    package main

    import (
        // ...
        "github.com/sjwhitworth/golearn/evaluation"
        // ...
    )

    func main() {
        // ...

        // Make predictions for the test data
        fmt.Println("Predicting...")
        predictions, err := classifier.Predict(testData)
        if err != nil {
            panic(err)
        }

        // Get a confusion matrix and print out some accuracy stats for our predictions
        confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
        if err != nil {
            panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
        }
        fmt.Println(evaluation.GetSummary(confusionMat))
    }

After making the predictions for our test data, we use the evaluation package to quickly get some stats about the accuracy of our classifier. You should run the program again with go run main.go.
If everything works correctly, you should see output that looks something like this:

    Loading data...
    Training...
    Predicting...
    Reference Class  True Positives  False Positives  True Negatives  Precision  Recall  F1 Score
    ---------------  --------------  ---------------  --------------  ---------  ------  --------
    6                42              4                447             0.9130     0.8571  0.8842
    5                31              15               444             0.6739     0.7561  0.7126
    8                37              7                445             0.8409     0.7708  0.8043
    7                47              5                440             0.9038     0.8545  0.8785
    2                51              6                434             0.8947     0.8500  0.8718
    3                35              9                448             0.7955     0.8140  0.8046
    1                50              5                443             0.9091     0.9615  0.9346
    4                48              4                441             0.9231     0.8727  0.8972
    0                41              3                455             0.9318     0.9762  0.9535
    9                49              11               434             0.8167     0.8909  0.8522
    Overall accuracy: 0.8620

That's about an 86% accuracy. Not too bad! And all it took was a few lines of code!

Summary

If you want to do even better, try playing around with the parameters for the SVC or use a different classifier. GoLearn has support for linear and logistic regression, K nearest neighbor, neural networks, and more!

About the author

Alex Browne is a recent college grad living in Raleigh NC with 4 years of professional software experience. He does software contract work to make ends meet, and spends most of his free time learning new things and working on various side projects. He is passionate about open source technology and has plans to start his own company.
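The per-class columns in the GoLearn summary output follow from the standard definitions of precision, recall, and F1 score. As a quick sanity check, here is a minimal sketch that recomputes the row for digit 6. It is written in Python purely for the arithmetic; the function name is illustrative and not part of GoLearn, and it assumes the false-negative count is whatever remains of the 500 test rows once the other three cells are subtracted.

```python
def class_metrics(tp, fp, tn, total):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    # Assumption: every test row lands in exactly one of the four cells,
    # so the false negatives are the remainder.
    fn = total - tp - fp - tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Row for digit 6 in the summary: TP=42, FP=4, TN=447, out of 500 test rows
precision, recall, f1 = class_metrics(42, 4, 447, 500)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.9130 0.8571 0.8842
```

The same reasoning explains the overall accuracy: the true positives across all ten classes sum to 431, and 431 / 500 = 0.8620.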
Packt
31 Mar 2015
16 min read

Dealing with Legacy Code

In this article by Arun Ravindran, author of the book Django Design Patterns and Best Practices, we will discuss the following topics:

  • Reading a Django code base
  • Discovering relevant documentation
  • Incremental changes versus full rewrites
  • Writing tests before changing code
  • Legacy database integration

It sounds exciting when you are asked to join a project. Powerful new tools and cutting-edge technologies might await you. However, quite often, you are asked to work with an existing, possibly ancient, codebase.

To be fair, Django has not been around for that long. However, projects written for older versions of Django are sufficiently different to cause concern. Sometimes, having the entire source code and documentation might not be enough.

If you are asked to recreate the environment, then you might need to fumble with the OS configuration, database settings, and running services locally or on the network. There are so many pieces to this puzzle that you might wonder how and where to start.

Knowing the Django version used in the code is a key piece of information. As Django evolved, everything from the default project structure to the recommended best practices has changed. Therefore, identifying which version of Django was used is vital to understanding the code.

Change of Guards

Sitting patiently on the ridiculously short beanbags in the training room, the SuperBook team waited for Hart. He had convened an emergency go-live meeting. Nobody understood the "emergency" part since go live was at least 3 months away.

Madam O rushed in holding a large designer coffee mug in one hand and a bunch of printouts of what looked like project timelines in the other. Without looking up she said, "We are late so I will get straight to the point. In the light of last week's attacks, the board has decided to summarily expedite the SuperBook project and has set the deadline to end of next month. Any questions?"

"Yeah," said Brad, "Where is Hart?" Madam O hesitated and replied, "Well, he resigned. Being the head of IT security, he took moral responsibility of the perimeter breach." Steve, evidently shocked, was shaking his head. "I am sorry," she continued, "But I have been assigned to head SuperBook and ensure that we have no roadblocks to meet the new deadline."

There was a collective groan. Undeterred, Madam O took one of the sheets and began, "It says here that the Remote Archive module is the most high-priority item in the incomplete status. I believe Evan is working on this."

"That's correct," said Evan from the far end of the room. "Nearly there," he smiled at others, as they shifted focus to him. Madam O peered above the rim of her glasses and smiled almost too politely. "Considering that we already have an extremely well-tested and working Archiver in our Sentinel code base, I would recommend that you leverage that instead of creating another redundant system."

"But," Steve interrupted, "it is hardly redundant. We can improve over a legacy archiver, can't we?" "If it isn't broken, then don't fix it," replied Madam O tersely. "He is working on it," said Brad, almost shouting, "What about all that work he has already finished?"

"Evan, how much of the work have you completed so far?" asked O, rather impatiently. "About 12 percent," he replied, looking defensive. Everyone looked at him incredulously. "What? That was the hardest 12 percent," he added.

O continued the rest of the meeting in the same pattern. Everybody's work was reprioritized and shoe-horned to fit the new deadline. As she picked up her papers, readying to leave, she paused and removed her glasses.

"I know what all of you are thinking... literally. But you need to know that we had no choice about the deadline. All I can tell you now is that the world is counting on you to meet that date, somehow or other." Putting her glasses back on, she left the room.

"I am definitely going to bring my tinfoil hat," said Evan loudly to himself.

Finding the Django version

Ideally, every project will have a requirements.txt or setup.py file at the root directory, and it will have the exact version of Django used for that project. Let's look for a line similar to this:

    Django==1.5.9

Note that the version number is specified exactly (rather than as Django>=1.5.9), which is called pinning. Pinning every package is considered a good practice since it reduces surprises and makes your build more deterministic.

Unfortunately, there are real-world codebases where the requirements.txt file was not updated or is even completely missing. In such cases, you will need to probe for various tell-tale signs to find out the exact version.

Activating the virtual environment

In most cases, a Django project would be deployed within a virtual environment. Once you locate the virtual environment for the project, you can activate it by jumping to that directory and running the activate script for your OS. For Linux, the command is as follows:

    $ source venv_path/bin/activate

Once the virtual environment is active, start a Python shell and query the Django version as follows:

    $ python
    >>> import django
    >>> print(django.get_version())
    1.5.9

The Django version used in this case is Version 1.5.9.

Alternatively, you can run the manage.py script in the project to get a similar output:

    $ python manage.py --version
    1.5.9

However, this option would not be available if the legacy project source snapshot was sent to you in an undeployed form. If the virtual environment (and packages) was also included, then you can easily locate the version number (in the form of a tuple) in the __init__.py file of the Django directory. For example:

    $ cd envs/foo_env/lib/python2.7/site-packages/django
    $ cat __init__.py
    VERSION = (1, 5, 9, 'final', 0)
    ...
If all these methods fail, then you will need to go through the release notes of the past Django versions to determine the identifiable changes (for example, the AUTH_PROFILE_MODULE setting was deprecated in Version 1.5) and match them to your legacy code. Once you pinpoint the correct Django version, you can move on to analyzing the code.

Where are the files? This is not PHP

One of the most difficult ideas to get used to, especially if you are from the PHP or ASP.NET world, is that the source files are not located in your web server's document root directory, which is usually named wwwroot or public_html. Additionally, there is no direct relationship between the code's directory structure and the website's URL structure.

In fact, you will find that your Django website's source code is stored in an obscure path such as /opt/webapps/my-django-app. Why is this? Among many good reasons, it is often more secure to move your confidential data outside your public webroot. This way, a web crawler would not be able to accidentally stumble into your source code directory.

Starting with urls.py

Even if you have access to the entire source code of a Django site, figuring out how it works across various apps can be daunting. It is often best to start from the root urls.py URLconf file since it is literally a map that ties every request to the respective views.

With normal Python programs, I often start reading from the start of its execution, say, from the top-level main module or wherever the __main__ check idiom starts. In the case of Django applications, I usually start with urls.py since it is easier to follow the flow of execution based on the various URL patterns a site has.

In Linux, you can use the following find command to locate the settings.py file and the corresponding line specifying the root urls.py:

    $ find . -iname settings.py -exec grep -H 'ROOT_URLCONF' {} \;
    ./projectname/settings.py:ROOT_URLCONF = 'projectname.urls'

    $ ls projectname/urls.py
    projectname/urls.py

Jumping around the code

Reading code sometimes feels like browsing the web without the hyperlinks. When you encounter a function or variable defined elsewhere, you will need to jump to the file that contains that definition. Some IDEs can do this automatically for you as long as you tell them which files to track as part of the project.

If you use Emacs or Vim instead, then you can create a TAGS file to quickly navigate between files. Go to the project root and run a tool called Exuberant Ctags as follows:

    $ find . -iname "*.py" -print | etags -

This creates a file called TAGS that contains the location information, that is, where every syntactic unit such as a class or function is defined. In Emacs, you can find the definition of the tag under your cursor (or point, as it is called in Emacs) using the M-. command.

While using a tag file is extremely fast for large code bases, it is quite basic and is not aware of a virtual environment (where most definitions might be located). An excellent alternative is to use the elpy package in Emacs. It can be configured to detect a virtual environment. Jumping to the definition of a syntactic element is done using the same M-. command. However, the search is not restricted to the tag file. So, you can even jump to a class definition within the Django source code seamlessly.

Understanding the code base

It is quite rare to find legacy code with good documentation. Even if you do, the documentation might be out of sync with the code in subtle ways that can lead to further issues. Often, the best guide to understanding the application's functionality is the executable test cases and the code itself.

The official Django documentation has been organized by versions at https://docs.djangoproject.com.
On any page, you can quickly switch to the corresponding page in the previous versions of Django with a selector on the bottom right-hand section of the page.

In the same way, documentation for any Django package hosted on readthedocs.org can also be traced back to its previous versions. For example, you can select the documentation of django-braces all the way back to v1.0.0 by clicking on the selector on the bottom left-hand section of the page.

Creating the big picture

Most people find it easier to understand an application if you show them a high-level diagram. While this is ideally created by someone who understands the workings of the application, there are tools that can create very helpful high-level depictions of a Django application.

A graphical overview of all models in your apps can be generated by the graph_models management command, which is provided by the django-extensions package. In the resulting diagram, the model classes and their relationships can be understood at a glance.

Figure: Model classes used in the SuperBook project connected by arrows indicating their relationships

This visualization is actually created using PyGraphviz. It can get really large for projects of even medium complexity. Hence, it might be easier if the applications are logically grouped and visualized separately.

PyGraphviz Installation and Usage

If you find the installation of PyGraphviz challenging, then don't worry, you are not alone. Recently, I faced numerous issues while installing it on Ubuntu, ranging from Python 3 incompatibility to incomplete documentation. To save you time, I have listed the steps that worked for me to reach a working setup.
On Ubuntu, you will need the following packages installed to install PyGraphviz:

    $ sudo apt-get install python3.4-dev graphviz libgraphviz-dev pkg-config

Now activate your virtual environment and run pip to install the development version of PyGraphviz directly from GitHub, which supports Python 3:

    $ pip install git+http://github.com/pygraphviz/pygraphviz.git#egg=pygraphviz

Next, install django-extensions and add it to your INSTALLED_APPS. Now, you are all set.

Here is a sample usage to create a GraphViz dot file for just two apps and to convert it to a PNG image for viewing:

    $ python manage.py graph_models app1 app2 > models.dot
    $ dot -Tpng models.dot -o models.png

Incremental change or a full rewrite?

Often, you would be handed legacy code by the application owners in the earnest hope that most of it can be used right away or after a couple of minor tweaks. However, reading and understanding a huge and often outdated code base is not an easy job. Unsurprisingly, most programmers prefer to work on greenfield development.

In the best case, the legacy code ought to be easily testable, well documented, and flexible enough to work in modern environments so that you can start making incremental changes in no time. In the worst case, you might recommend discarding the existing code and going for a full rewrite. Or, as is commonly decided, the short-term approach would be to keep making incremental changes while a parallel long-term effort gets underway for a complete reimplementation.

A general rule of thumb when making such decisions: if the cost of rewriting the application and maintaining the new application is lower than the cost of maintaining the old application over time, then a rewrite is recommended. Care must be taken to account for all the factors, such as time taken to get new programmers up to speed, the cost of maintaining outdated hardware, and so on.
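The rule of thumb above boils down to a single inequality. The following is a minimal sketch of that comparison, not a costing method from the book: the function name, parameters, and every figure are hypothetical, and real estimates would need to fold in the extra factors just mentioned (onboarding time, outdated hardware, and so on).

```python
def should_rewrite(rewrite_cost, new_upkeep_per_year, old_upkeep_per_year, years):
    """Return True when rewriting plus maintaining the new application costs
    less than maintaining the old application over the same period."""
    cost_of_rewrite = rewrite_cost + new_upkeep_per_year * years
    cost_of_keeping = old_upkeep_per_year * years
    return cost_of_rewrite < cost_of_keeping


# Hypothetical figures in arbitrary cost units (person-months, say)
print(should_rewrite(rewrite_cost=24, new_upkeep_per_year=6,
                     old_upkeep_per_year=12, years=3))  # False: 42 is not below 36
print(should_rewrite(rewrite_cost=24, new_upkeep_per_year=6,
                     old_upkeep_per_year=12, years=5))  # True: 54 is below 60
```

With these made-up numbers the answer flips between a three-year and a five-year horizon, which is why the "over time" part of the rule matters as much as the upfront cost.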
Sometimes, the complexity of the application domain becomes a huge barrier against a rewrite, since a lot of knowledge learnt in the process of building the older code gets lost. Often, this dependency on the legacy code is a sign of poor design in the application, such as failing to externalize the business rules from the application logic.

The worst form of a rewrite you can probably undertake is a conversion, or a mechanical translation from one language to another without taking any advantage of the existing best practices. In other words, you lose the opportunity to modernize the code base by removing years of cruft.

Code should be seen as a liability, not an asset. As counter-intuitive as it might sound, if you can achieve your business goals with a lesser amount of code, you have dramatically increased your productivity. Having less code to test, debug, and maintain can not only reduce ongoing costs but also make your organization more agile and flexible to change.

Code is a liability not an asset. Less code is more maintainable.

Irrespective of whether you are adding features or trimming your code, you must not touch your working legacy code without tests in place.

Write tests before making any changes

In the book Working Effectively with Legacy Code, Michael Feathers defines legacy code as, simply, code without tests. He elaborates that with tests, one can modify the behavior of the code quickly and verifiably. In the absence of tests, it is impossible to gauge if the change made the code better or worse.

Often, we do not know enough about legacy code to confidently write a test. Michael recommends writing tests that preserve and document the existing behavior, which are called characterization tests.

Unlike the usual approach of writing tests, while writing a characterization test, you will first write a failing test with a dummy output, say X, because you don't know what to expect. When the test harness fails with an error, such as "Expected output X but got Y", you will change your test to expect Y. So, now the test will pass, and it becomes a record of the code's existing behavior.

Note that we might record buggy behavior as well. After all, this is unfamiliar code. Nevertheless, writing such tests is necessary before we start changing the code. Later, when we know the specifications and code better, we can fix these bugs and update our tests (not necessarily in that order).

Step-by-step process to writing tests

Writing tests before changing the code is similar to erecting scaffolding before the restoration of an old building. It provides a structural framework that helps you confidently undertake repairs.

You might want to approach this process in a stepwise manner as follows:

  1. Identify the area you need to make changes to. Write characterization tests focusing on this area until you have satisfactorily captured its behavior.
  2. Look at the changes you need to make and write specific test cases for those. Prefer smaller unit tests to larger and slower integration tests.
  3. Introduce incremental changes and test in lockstep. If tests break, then try to analyze whether it was expected. Don't be afraid to break even the characterization tests if that behavior is something that was intended to change.

If you have a good set of tests around your code, then you can quickly find the effect of changing your code.

On the other hand, if you decide to rewrite by discarding your code but not your data, then Django can help you considerably.

Legacy databases

There is an entire section on legacy databases in the Django documentation, and rightly so, as you will run into them many times. Data is more important than code, and databases are the repositories of data in most enterprises.

You can modernize a legacy application written in other languages or frameworks by importing its database structure into Django.
As an immediate advantage, you can use the Django admin interface to view and change your legacy data. Django makes this easy with the inspectdb management command, which looks as follows: $ python manage.py inspectdb > models.py This command, if run while your settings are configured to use the legacy database, can automatically generate the Python code that would go into your models file. Here are some best practices if you are using this approach to integrate to a legacy database: Know the limitations of Django ORM beforehand. Currently, multicolumn (composite) primary keys and NoSQL databases are not supported. Don't forget to manually clean up the generated models, for example, remove the redundant 'ID' fields since Django creates them automatically. Foreign Key relationships may have to be manually defined. In some databases, the auto-generated models will have them as integer fields (suffixed with _id). Organize your models into separate apps. Later, it will be easier to add the views, forms, and tests in the appropriate folders. Remember that running the migrations will create Django's administrative tables (django_* and auth_*) in the legacy database. In an ideal world, your auto-generated models would immediately start working, but in practice, it takes a lot of trial and error. Sometimes, the data type that Django inferred might not match your expectations. In other cases, you might want to add additional meta information such as unique_together to your model. Eventually, you should be able to see all the data that was locked inside that aging PHP application in your familiar Django admin interface. I am sure this will bring a smile to your face. Summary In this article, we looked at various techniques to understand legacy code. Reading code is often an underrated skill. But rather than reinventing the wheel, we need to judiciously reuse good working code whenever possible. Resources for Article: Further resources on this subject: So, what is Django? 
[article] Adding a developer with Django forms [article] Introduction to Custom Template Filters and Tags [article]
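
To make the characterization-test workflow from this article concrete, here is a minimal sketch using Python's standard unittest module (Django's own TestCase builds on it). The function legacy_shipping_cost and its rules are hypothetical stand-ins for unfamiliar legacy code, not anything from a real project. Each expected value is what the failing test reported, not what a specification says.

```python
import unittest


def legacy_shipping_cost(weight_kg):
    """Hypothetical legacy function whose specification we do not know."""
    if weight_kg <= 0:
        return 0
    if weight_kg <= 5:
        return 5
    return 5 + (weight_kg - 5) * 2


class ShippingCostCharacterizationTest(unittest.TestCase):
    def test_light_parcel(self):
        # First written with a deliberately wrong guess; the failure message
        # reported 5, so the test now records 5 as the existing behavior.
        self.assertEqual(legacy_shipping_cost(3), 5)

    def test_heavy_parcel(self):
        self.assertEqual(legacy_shipping_cost(8), 11)

    def test_negative_weight(self):
        # Possibly a bug (a negative weight is silently accepted), but we
        # record it as is until we know the intended behavior.
        self.assertEqual(legacy_shipping_cost(-3), 0)


# Run the suite programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    ShippingCostCharacterizationTest
)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a later change is meant to reject negative weights, test_negative_weight is expected to break, and updating it is part of the change.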

Packt
30 Mar 2015
8 min read

GUI Components in Qt 5

This article by Symeon Huang, author of the book Qt 5 Blueprints, explains the typical, basic GUI components in Qt 5.

Design UI in Qt Creator

Qt Creator is the official IDE for Qt application development, and we're going to use it to design the application's UI. First, let's create a new project:

1. Open Qt Creator.
2. Navigate to File | New File or Project.
3. Choose Qt Widgets Application.
4. Enter the project's name and location. In this case, the project's name is layout_demo.

You may wish to follow the wizard and keep the default values. After the project is created, Qt Creator generates the skeleton of the project based on your choices. UI files are under the Forms directory. When you double-click on a UI file, Qt Creator switches to the integrated Designer: the mode selector should have Design highlighted, and the main window should contain several sub-windows that let you design the user interface. Here we can design the UI by dragging and dropping.

Qt Widgets

Drag three push buttons from the widget box (widget palette) into the frame of MainWindow in the center. The default text displayed on these buttons is PushButton, but you can change the text if you want by double-clicking on the button. In this case, I changed them to Hello, Hola, and Bonjour, respectively. Note that this operation doesn't affect the objectName property; to keep things neat and easy to find, we need to change objectName as well. The right-hand side of the UI contains two windows: the upper-right section includes Object Inspector and the lower-right section includes Property Editor. Select a push button, and you can easily change its objectName in Property Editor. For the sake of convenience, I changed these buttons' objectName properties to helloButton, holaButton, and bonjourButton, respectively.
Save the changes and click on Run on the left-hand side panel; Qt Creator builds the project automatically and then runs it, as shown in the following screenshot:

In addition to the push button, Qt provides lots of commonly used widgets. There are buttons such as tool button, radio button, and checkbox, and advanced views such as list, tree, and table. Of course, there are input widgets too: line edit, spin box, font combo box, date and time edit, and so on. Other useful widgets such as progress bar, scroll bar, and slider are also in the list. Besides, you can always subclass QWidget and write your own.

Layouts

A quick way to delete a widget is to select it and press the Delete key. However, some widgets, such as the menu bar, status bar, and toolbar, can't be selected, so we have to right-click on them in Object Inspector and delete them. Since they are not needed in this example, it's safe to remove them for good. Okay, let's understand what needs to be done after the removal. You may want to keep all these push buttons on the same horizontal axis. To do this, perform the following steps:

1. Select all the push buttons, either by clicking on them one by one while keeping the Ctrl key pressed or by drawing an enclosing rectangle containing all the buttons.
2. Right-click and select Lay out | Lay Out Horizontally. The keyboard shortcut for this is Ctrl + H.
3. Resize the horizontal layout and adjust its layoutSpacing by selecting it and dragging any of the points around the selection box until it fits best.

Hmm…! You may have noticed that the text of the Bonjour button is longer than that of the other two buttons, so it should be wider than the others. How do you do this? You can change the horizontal layout object's layoutStretch property in Property Editor. This value indicates the stretch factors of the widgets inside the horizontal layout; they are laid out in proportion. Change it to 3,3,4, and there you are.
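
To see what "in proportion" means for the factors 3,3,4, here is a small arithmetic sketch in Python. It is a simplified model and not Qt's actual layout algorithm: the function name distribute_width and the 500-pixel width are assumptions for illustration, and spacing, margins, and size hints are ignored.

```python
def distribute_width(total_width, stretch_factors):
    """Split total_width among widgets in proportion to their stretch factors.

    Simplified model only: a real layout engine also accounts for spacing,
    margins, and each widget's size hints.
    """
    total_stretch = sum(stretch_factors)
    if total_stretch == 0:
        # No factor is set: this sketch falls back to equal shares.
        return [total_width / len(stretch_factors)] * len(stretch_factors)
    return [total_width * s / total_stretch for s in stretch_factors]


# Hello, Hola, and Bonjour buttons sharing a hypothetical 500-pixel-wide layout
widths = distribute_width(500, [3, 3, 4])
print(widths)  # [150.0, 150.0, 200.0]
```

With these numbers, the Bonjour button receives four tenths of the width, and the other two buttons receive three tenths each.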
The stretched size definitely won't be smaller than the minimum size hint. This is also how a zero stretch factor works alongside nonzero factors: the widget keeps its minimum size, rather than causing a division-by-zero error.

Now, drag Plain Text Edit just below, and not inside, the horizontal layout. Obviously, it would be neater if we could extend the plain text edit's width. However, we don't have to do this manually. In fact, we can change the layout of the parent, MainWindow. That's it! Right-click on MainWindow, and then navigate to Lay out | Lay Out Vertically. Wow! All the child widgets are automatically extended to the inner boundary of MainWindow, and they are kept in a vertical order. You'll also find Layout settings in the centralWidget property, which work exactly like those of the previous horizontal layout.

The last thing to make this application halfway decent is to change the title of the window. MainWindow is not the title you want, right? Click on MainWindow in the object tree. Then, scroll down its properties to find windowTitle. Name it whatever you want. In this example, I changed it to Greeting. Now, run the application again and you will see that it looks like the following screenshot:

Qt Quick Components

Since Qt 5, Qt Quick has evolved to version 2.0, which delivers a dynamic and rich experience. Its language is QML, which is basically an extended version of JavaScript using a JSON-like format. To create a simple Qt Quick application based on Qt Quick Controls 1.2, follow these steps:

1. Create a new project named HelloQML.
2. Select Qt Quick Application instead of Qt Widgets Application that we chose previously.
3. Select Qt Quick Controls 1.2 when the wizard navigates you to Select Qt Quick Components Set.
4. Edit the file main.qml under the root of Resources file, qml.qrc, that Qt Creator has generated for our new Qt Quick project.

Let's see how the code should look.
    import QtQuick 2.3
    import QtQuick.Controls 1.2

    ApplicationWindow {
        visible: true
        width: 640
        height: 480
        title: qsTr("Hello QML")

        // Menu bar with a single File | Exit item
        menuBar: MenuBar {
            Menu {
                title: qsTr("File")
                MenuItem {
                    text: qsTr("Exit")
                    shortcut: "Ctrl+Q"
                    onTriggered: Qt.quit()
                }
            }
        }

        // Plain text item, centered in the window
        Text {
            id: hw
            text: qsTr("Hello World")
            font.capitalization: Font.AllUppercase
            anchors.centerIn: parent
        }

        // Label anchored just above the text item
        Label {
            anchors { bottom: hw.top; bottomMargin: 5; horizontalCenter: hw.horizontalCenter }
            text: qsTr("Hello Qt Quick")
        }
    }

If you have ever touched Java or Python, the first two lines won't look too unfamiliar. They simply import Qt Quick and Qt Quick Controls, and the number after each is the version of the library. The body of this QML source file is really in JSON style, which enables you to understand the hierarchy of the user interface through the code. Here, the root item is ApplicationWindow, which is basically the same thing as QMainWindow in Qt/C++. When you run this application on Windows, you can barely find the difference between the Text item and the Label item. But on some platforms, or when you change the system font and/or its colour, you'll find that Label follows the font and colour scheme of the system while Text doesn't.

Run this application, and you'll see a menu bar, a text, and a label in the application window, exactly what we wrote in the QML file:

You may miss the Design mode for traditional Qt/C++ development. Well, you can still design a Qt Quick application in Design mode! Click on Design in the mode selector when you edit the main.qml file. Qt Creator will switch to Design mode, where you can drag and drop UI components with the mouse:

Almost all widgets you use in a Qt Widgets application can be found here in a Qt Quick application.
Moreover, you can use other modern widgets, such as the busy indicator, in Qt Quick, while there's no counterpart in a Qt Widgets application. However, QML is a declarative language whose performance is poorer than that of C++. Therefore, more and more developers choose to write the UI with Qt Quick in order to deliver a better visual style, while keeping core functions in Qt/C++.

Summary

In this article, we had a brief look at various GUI components of Qt 5 and focused on the Design mode in Qt Creator. Two small examples served as Qt-style "Hello World" demonstrations.

Resources for Article:

Further resources on this subject:
  • Code interlude – signals and slots [article]
  • Program structure, execution flow, and runtime objects [article]
  • Configuring Your Operating System [article]

Packt
30 Mar 2015
7 min read

Geocoding Address-based Data

In this article by Kurt Menke, GISP, Dr. Richard Smith Jr., GISP, Dr. Luigi Pirelli, and Dr. John Van Hoesen, GISP, authors of the book Mastering QGIS, we'll have a look at how to geocode address-based data using QGIS and MMQGIS.

Geocoding addresses has many applications, such as mapping the customer base for a store, members of an organization, public health records, or incidence of crime. Once mapped, the points can be used in many ways to generate information. For example, they can be used as inputs to generate density surfaces, linked to parcels of land, and characterized by socio-economic data. They may also be an important component of a cadastral information system.

An address geocoding operation typically involves the tabular address data and a street network dataset. The street network needs to have attribute fields for address ranges on the left- and right-hand side of each road segment. You can geocode within QGIS using a plugin named MMQGIS (http://michaelminn.com/linux/mmqgis/). MMQGIS has many useful tools. For geocoding, we will use the tools found in MMQGIS | Geocode. There are two tools there, Geocode CSV with Google/OpenStreetMap and Geocode from Street Layer, as shown in the following screenshot. The first tool allows you to geocode a table of addresses using either the Google Maps API or the OpenStreetMap Nominatim web service. This tool requires an Internet connection but no local street network data, as the web services provide the street network. The second tool requires a local street network dataset with address range attributes to geocode the address data:

How address geocoding works

The basic mechanics of address geocoding are straightforward. The street network GIS data layer has attribute columns containing the address ranges on both the even and odd side of every street segment. In the following example, you can see a piece of the attribute table for the Streets.shp sample data.
The columns LEFTLOW, LEFTHIGH, RIGHTLOW, and RIGHTHIGH contain the address ranges for each street segment:

In the following example, we are looking at Easy Street. On the odd side of the street, the addresses range from 101 to 199. On the even side, they range from 102 to 200. If you wanted to map 150 Easy Street, QGIS would assume that the address is located halfway down the even side of that block. Similarly, 175 Easy Street would be on the odd side of the street, three quarters of the way down the block. Address geocoding assumes that the addresses are evenly spaced along the linear network. QGIS should place the address point very close to its actual position, but due to variability in lot sizes, not every address point will be perfectly positioned.

Now that you've learned the basics, let's work through an example. Here we will geocode addresses using web services. The output will be a point shapefile containing all the attribute fields found in the source Addresses.csv file.

An example – geocoding using web services

Here are the steps for geocoding the Addresses.csv sample data using web services:

  • Load the Addresses.csv and the Streets.shp sample data into QGIS Desktop.
  • Open Addresses.csv and examine the table. These are addresses of municipal facilities. Notice that the street address (for example, 150 Easy Street) is contained in a single field. There are also fields for the city, state, and country. Since both Google and OpenStreetMap are global services, it is wise to include such fields so that the services can narrow down the geography.
  • Install and enable the MMQGIS plugin.
  • Navigate to MMQGIS | Geocode | Geocode CSV with Google/OpenStreetMap. The Web Service Geocode dialog window will open.
  • Select Input CSV File (UTF-8) by clicking on Browse… and locating the delimited text file on your system.
  • Select the address fields by clicking on the drop-down menu and identifying the Address Field, City Field, State Field, and Country Field fields.
    MMQGIS may identify some or all of these fields by default if they are named with logical names such as Address or State.
  • Choose the web service.
  • Name the output shapefile by clicking on Browse….
  • Name Not Found Output List by clicking on Browse…. Any records that are not matched will be written to this file. This allows you to easily see and troubleshoot any unmapped records.
  • Click on OK. The status of the geocoding operation can be seen in the lower-left corner of QGIS. The word Geocoding will be displayed, followed by the number of records that have been processed.

The output will be a point shapefile and a CSV file listing the addresses that were not matched.

Two additional attribute columns will be added to the output address point shapefile: addrtype and addrlocat. These fields provide information on how the web geocoding service obtained the location, which may be useful for accuracy assessment.

  • addrtype is the Google <type> element or the OpenStreetMap class attribute. This indicates what kind of address type this is (highway, locality, museum, neighborhood, park, place, premise, route, train_station, university, and so on).
  • addrlocat is the Google <location_type> element or the OpenStreetMap type attribute. This indicates the relationship of the coordinates to the addressed feature (approximate, geometric center, node, relation, rooftop, way interpolation, and so on).

If the web service returns more than one location for an address, the first of the locations will be used as the output feature.

Use of this plugin requires an active Internet connection. Google places both rate and volume restrictions on the number of addresses that can be geocoded within various time limits. You should visit the Google Geocoding API website (http://code.google.com/apis/maps/documentation/geocoding/) for more details, current information, and Google's terms of service. Geocoding via these web services can be slow.
If you don't get the desired results with one service, try the other.

Geocoding operations rarely have 100% success. Street names in the street shapefile must match the street names in the CSV file exactly. Any discrepancies between the name of a street in the address table and in the street attribute table will lower the geocoding success rate. The following image shows the results of geocoding addresses via street address ranges. The addresses are shown with the street network used in the geocoding operation:

Geocoding is often an iterative process. After the initial geocoding operation, you can review the Not Found CSV file. If it's empty, then all the records were matched. If it has records in it, compare them with the attributes of the streets layer. This will help you determine why those records were not mapped. It may be due to inconsistencies in the spelling of street names. It may also be due to a street centerline layer that is not as current as the addresses. Once the errors have been identified, they can be corrected by editing the data or by obtaining a different street centerline dataset. The geocoding operation can be re-run on those unmatched addresses. This process can be repeated until all records are matched. Use the Identify tool to inspect the mapped points, and the roads, to ensure that the operation was successful. Never take a GIS operation for granted. Check your results with a critical eye.

Summary

This article introduced you to the process of address geocoding using QGIS and the MMQGIS plugin.

Resources for Article:

Further resources on this subject:
  • Editing attributes [article]
  • How Vector Features are Displayed [article]
  • QGIS Feature Selection Tools [article]
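
As a recap of the address-range interpolation described under "How address geocoding works", here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not the code MMQGIS or QGIS actually runs: the function names and the segment coordinates are hypothetical, and the street segment is treated as a single straight line.

```python
def interpolate_address(house_number, low, high):
    """Return how far along a street segment an address falls (0.0 to 1.0).

    Assumes addresses are evenly spaced between low and high on one side
    of the street, which is the assumption range-based geocoding makes.
    """
    if not low <= house_number <= high:
        raise ValueError("house number is outside this segment's range")
    if (house_number - low) % 2 != 0:
        raise ValueError("house number is on the other side of the street")
    if high == low:
        return 0.0
    return (house_number - low) / (high - low)


def point_along(start, end, fraction):
    """Linearly interpolate a point between two (x, y) segment endpoints."""
    (x0, y0), (x1, y1) = start, end
    return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))


# Easy Street: even side 102 to 200, odd side 101 to 199
print(round(interpolate_address(150, 102, 200), 2))  # 0.49
print(round(interpolate_address(175, 101, 199), 2))  # 0.76

# Hypothetical segment endpoints in map units
print(point_along((0.0, 0.0), (100.0, 0.0), 0.5))  # (50.0, 0.0)
```

The two fractions match the article's description: 150 lands roughly halfway down the even side, and 175 lands about three quarters of the way down the odd side.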

Packt
30 Mar 2015
12 min read

Getting Started with Intel Galileo

In this article by Onur Dundar, author of the book Home Automation with Intel Galileo, we will see how to develop home automation examples using the Intel Galileo development board along with the existing home automation sensors and devices. In the book, a good review of Intel Galileo will be provided, which will teach you to develop native C/C++ applications for Intel Galileo.

After a good introduction to Intel Galileo, we will review home automation's history, concepts, technology, and current trends. When we have an understanding of home automation and the supporting technologies, we will develop some examples on two main concepts of home automation: energy management and security. We will build some examples under energy management using electrical switches, light bulbs and switches, as well as temperature sensors. For security, we will use motion, water leak sensors, and a camera to create some examples. For all the examples, we will develop simple applications with C and C++. Finally, when we are done building good and working examples, we will work on supporting software and technologies to create more user-friendly home automation software.

In this article, we will take a look at the Intel Galileo development board, which will be the device that we will use to build all our applications; also, we will configure our host PC environment for software development.

The following are the prerequisites for this article:

  • A Linux PC for development purposes. All our work has been done on an Ubuntu 12.04 host computer, for this article and others as well. (If you use newer versions of Ubuntu, you might encounter problems with some things in this article.)
  • An Intel Galileo (Gen 2) development board with its power adapter.
  • A USB-to-TTL serial UART converter cable; the suggested cable is TTL-232R-3V3 to connect to the Intel Galileo Gen 2 board and your host system.
    You can see an example of a USB-to-TTL serial UART cable at http://www.amazon.com/GearMo%C2%AE-3-3v-Header-like-TTL-232R-3V3/dp/B004LBXO2A. If you are going to use Intel Galileo Gen 1, you will need a 3.5 mm jack-to-UART cable. You can see the mentioned cable at http://www.amazon.com/Intel-Galileo-Gen-Serial-cable/dp/B00O170JKY/.
  • An Ethernet cable connected to your modem or switch in order to connect Intel Galileo to the local network of your workplace.
  • A microSD card. Intel Galileo supports microSD cards up to 32 GB storage.

Introducing Intel Galileo

The Intel Galileo board is the first in a line of Arduino-certified development boards based on Intel x86 architecture. It is designed to be hardware and software pin-compatible with Arduino shields designed for the Uno R3.

Arduino is an open source physical computing platform based on a simple microcontroller board, together with a development environment for writing software for the board. Arduino can be used to develop interactive objects by taking inputs from a variety of switches or sensors and controlling a variety of lights, motors, and other physical outputs.

The Intel Galileo board is based on the Intel Quark X1000 SoC, a 32-bit Intel Pentium processor-class system on a chip (SoC). In addition to Arduino-compatible I/O pins, Intel Galileo inherited a mini PCI Express slot, a 10/100 Mbps Ethernet RJ45 port, and USB 2.0 host and client I/O ports from the PC world.

The Intel Galileo Gen 1 USB host is a micro USB slot. In order to use a Gen 1 USB host with USB 2.0 cables, you will need an OTG (On-The-Go) cable. You can see an example cable at http://www.amazon.com/Cable-Matters-2-Pack-Micro-USB-Adapter/dp/B00GM0OZ4O.

Another good feature of the Intel Galileo board is that it has open source hardware designed together with its software. Hardware design schematics and the bill of materials (BOM) are distributed on the Intel website.
Intel Galileo runs on a custom embedded Linux operating system, and its firmware, bootloader, as well as kernel source code can be downloaded from https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=23171. Another helpful URL to identify, locate, and ask questions about the latest changes in the software and hardware is the open source community at https://communities.intel.com/community/makers.

Intel delivered two versions of the Intel Galileo development board, called Gen 1 and Gen 2. At the moment, only Gen 2 versions are available. There are some hardware changes in Gen 2 as compared to Gen 1. You can see both versions in the following image:

The first board (on the left-hand side) is the Intel Galileo Gen 1 version and the second one (on the right-hand side) is Intel Galileo Gen 2.

Using Intel Galileo for home automation

As mentioned in the previous section, Intel Galileo supports various sets of I/O peripherals. Arduino sensor shields and USB and mini PCI-E devices can be used to develop and create applications. Intel Galileo can be expanded with the help of I/O peripherals, so we can manage the sensors needed to automate our home.

When we take a look at the existing home automation modules in the market, we can see that preconfigured hubs or gateways manage these modules to automate homes. A hub or a gateway is programmed to send and receive data to/from home automation devices. Similarly, with the help of a Linux operating system running on Intel Galileo and the support of multiple I/O ports on the board, we will be able to manage home automation devices.

We will implement new applications or will port existing Linux applications to connect home automation devices. Connecting to the devices will enable us to collect data as well as receive and send commands to these devices. Being able to send and receive commands to and from these devices will make Intel Galileo a gateway or a hub for home automation.
It is also possible to develop simple home automation devices with the help of the existing sensors. The board's pinout lets us connect sensors to the board, read and write data to them, and build a device of our own.

Finally, the power of open source and Linux on Intel Galileo will enable you to reuse the developed libraries for your projects. It can also be used to run existing open source projects on technologies such as Node.js and Python on the board together with our C application. This will help you to add more features and extend the board's capability, for example, serving a web user interface easily from Intel Galileo with Node.js.

Intel Galileo – hardware specifications

The Intel Galileo board is an open source hardware design. The schematics, Cadence Allegro board files, and BOM can be downloaded from the Intel Galileo web page.

In this section, we will just take a look at some key hardware features for future reference, to understand the hardware capability of Intel Galileo in order to make better decisions on software design.

Intel Galileo is an embedded system with the required RAM and flash storage included on the board to boot it and run without any additional hardware.
The following table shows the features of Intel Galileo:

Processor features
  • 1 Core 32-bit Intel Pentium processor-compatible ISA Intel Quark SoC X1000
  • 400 MHz
  • 16 KB L1 Cache
  • 512 KB SRAM
  • Integrated real-time clock (RTC)

Storage
  • 8 MB NOR Flash for firmware and bootloader
  • 256 MB DDR3; 800 MT/s
  • SD card, up to 32 GB
  • 8 KB EEPROM

Power
  • 7 V to 15 V
  • Power over Ethernet (PoE) requires you to install the PoE module

Ports and connectors
  • USB 2.0 host (standard type A), client (micro USB type B)
  • RJ45 Ethernet
  • 10-pin JTAG for debugging
  • 6-pin UART
  • 6-pin ICSP
  • 1 mini-PCI Express slot
  • 1 SDIO

Arduino compatible headers
  • 20 digital I/O pins
  • 6 analog inputs
  • 6 PWMs with 12-bit resolution
  • 1 SPI master
  • 2 UARTs (one shared with the console UART)
  • 1 I2C master

Intel Galileo – software specifications

Intel delivers prebuilt images and binaries along with its board support package (BSP) to download the source code and build all related software with your development system.

The running operating system on Intel Galileo is Linux; sometimes, it is called Yocto Linux because of the Linux filesystem, cross-compiled toolchain, and kernel images created by the Yocto Project's build mechanism.

The Yocto Project is an open source collaboration project that provides templates, tools, and methods to help you create custom Linux-based systems for embedded products, regardless of the hardware architecture.

The following diagram shows the layers of the Intel Galileo development board:

Intel Galileo is an embedded Linux product; this means you need to compile your software on your development machine with the help of a cross-compiled toolchain or software development kit (SDK). A cross-compiled toolchain/SDK can be created using the Yocto Project; we will go over the instructions in the following sections. The toolchain includes the necessary compiler and linker for Intel Galileo to compile and build C/C++ applications for the Intel Galileo board.
The binary created on your host with the Intel Galileo SDK will not work on the host machine, since it is created for a different architecture. With the help of the C/C++ APIs and libraries provided with the Intel Galileo SDK, you can build any C/C++ native application for Intel Galileo as well as port any existing native application (without a graphical user interface) to run on Intel Galileo.

Intel Galileo doesn't have a graphics processing unit (GPU). You can still use OpenCV-like libraries, but the performance of matrix operations is so poor on the CPU compared to systems with a GPU that it is not wise to perform complex image processing on Intel Galileo.

Connecting and booting Intel Galileo

We can now proceed to power up Intel Galileo and connect to its terminal. Before going forward with the board connection, you need to install a modem control program on your host system in order to connect to Intel Galileo through its UART interface; here we use minicom.

Minicom is a text-based modem control and terminal emulation program for Unix-like operating systems. If you are not comfortable with text-based applications, you can use graphical serial terminals such as CuteCom or GtkTerm.

To start with Intel Galileo, perform the following steps:

  • Install minicom:

        $ sudo apt-get install minicom

  • Attach the USB end of your 6-pin TTL cable and start minicom for the first time with the -s option:

        $ sudo minicom -s

  • Before going into the setup details, check that the device is connected to your host. In our case, the serial device is /dev/ttyUSB0 on our host system. You can check your host's device messages (dmesg) to see the connected USB device.
  • When you start minicom with the -s option, it will prompt you. From minicom's Configuration menu, select Serial port setup to set the values, as follows:
  • After setting up the serial device, select Exit to go to the terminal. This will prompt you with the booting sequence and launch the Linux console when the Intel Galileo serial device is connected and powered up.
  • Next, complete the connections on Intel Galileo. Connect the TTL-232R cable to your Intel Galileo board's UART pins, which are just next to the Ethernet port. Make sure that you have connected the cables correctly. The black wire of the TTL cable is the ground connection, and the ground pin is marked on the Intel Galileo board's UART header.
  • We are ready to power up Intel Galileo. After you plug the power cable into the board, you will see the Intel Galileo board's boot sequence on the terminal. When the booting process is completed, it will prompt you to log in; log in as the root user, for which no password is needed.
  • The final prompt will be as follows; we are in the Intel Galileo Linux console, where you can use the basic Linux commands that already exist on the board to explore the Intel Galileo filesystem:

        Poky 9.0.2 (Yocto Project 1.4 Reference Distro) 1.4.2   clanton

        clanton login: root
        root@clanton:~#

Your board will now look like the following image:

Connecting to Intel Galileo via Telnet

If you have connected Intel Galileo to a local network with an Ethernet cable, you can use Telnet to connect to it without using a serial connection, after performing some simple steps:

  • Run the following commands on the Intel Galileo terminal:

        root@clanton:~# ifup eth0
        root@clanton:~# ifconfig
        root@clanton:~# telnetd

    The ifup command brings the Ethernet interface up, and the telnetd command starts the Telnet daemon. You can check the assigned IP address with the ifconfig command.
  • From your host system, run the following command with your Intel Galileo board's IP address to start a Telnet session with Intel Galileo:

        $ telnet 192.168.2.168

Summary

In this article, we learned how to use the Intel Galileo development board, its software, and system development environment. It takes some time to get used to all the tools if you are not used to them.
A little practice with Eclipse is very helpful to build applications and make remote connections or to write simple applications on the host console with a terminal and build them.

Let's go through all the points we have covered in this article. First, we read some general information about Intel Galileo and why we chose Intel Galileo, with some good reasons being Linux and the existing I/O ports on the board. Then, we saw some more details about Intel Galileo's hardware and software specifications and understood how to work with them. I believe understanding the internal workings of Intel Galileo in building a Linux image and a kernel is a good practice, leading us to customize and run more tools on Intel Galileo.

Finally, we learned how to develop applications for Intel Galileo. First, we built an SDK and set up the development environment. There were more instructions about how to deploy the applications on Intel Galileo over a local network as well. Then, we finished up by configuring the Eclipse IDE to quicken the development process for future development.

In the next article, we will learn about home automation concepts and technologies.

Resources for Article:

Further resources on this subject:
  • Hardware configuration [article]
  • Our First Project – A Basic Thermometer [article]
  • Pulse width modulator [article]
Packt
30 Mar 2015
33 min read

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

In this article by Chandramani Tiwary, author of the book Learning Apache Mahout, we will discuss some core concepts of machine learning and the steps of building a logistic regression classifier in Mahout.

The purpose of this article is to understand the core concepts of machine learning. We will focus on understanding the steps involved in resolving different types of problems, and on the application areas of machine learning. In particular, we will cover the following topics:

  • Supervised learning
  • Unsupervised learning
  • The recommender system
  • Model efficacy

A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using data. For example, a machine learning system can learn to classify different species of flowers, or to group related news items together to form categories such as news, sports, politics, and so on. For each of these tasks, the corresponding algorithm looks at the data and tries to learn from it.

Supervised learning

Supervised learning deals with training algorithms with labeled data, that is, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression.
Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam, or whether a customer is going to renew a subscription (such as a postpaid telecom subscription) or not. The target variable is discrete and has a predefined set of values. Regression deals with a target variable that is continuous. For example, when we need to predict house prices, the target variable, price, is continuous and doesn't have a predefined set of values. In order to solve a given supervised learning problem, one has to perform the following steps.

Determine the objective

The first major step is to define the objective of the problem. Typical questions that need to be answered are: what are the class labels, what prediction accuracy is acceptable, how far in the future is a prediction required, and is insight more important than classification accuracy? For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, and the objective would call for insights into the reasons for the churn and a prediction of churn at least three months in advance.

Decide the training data

After the objective of the problem has been defined, the next step is to decide what training data should be used. The choice of training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. For the problem of churn analysis, different data points collected about a customer, such as product usage, support cases, and so on, together with a target label for whether a customer has churned or is active, form the training data.
Churn analytics is a major problem area for a lot of business domains such as banking, financial services, and insurance (BFSI), telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of a term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer.

Create and clean the training set

The next step in a machine learning project is to gather and clean the dataset. If possible, it is advisable to use all the data available rather than a sample. If a sample is used, it needs to be representative of the real-world data. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. A set of input rows and corresponding target labels is gathered from data sources such as warehouses, logs, or operational database systems. Cleaning the data for quality purposes forms part of this process. The inclusion criteria for the training data should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example, including only customers whose accounts are at least six months old.

Feature extraction

In this step, we determine and create the feature set from the training data. Features or predictor variables are representations of the training data that are used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. Feature extraction requires a good understanding of the data and is aided by domain expertise.
For example, for churn analytics, we use demographic information from the CRM, product adoption (phone usage in the case of telecom), age of the customer, and payment and subscription history as the features for the model. The number of features extracted should be neither too large nor too small; feature extraction is more art than science, and an optimum feature representation is usually reached after some iterations. Typically, the dataset is constructed such that each row corresponds to one observation of the outcome. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer.

Train the models

We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If a model can be interpreted easily, it is called a white box model, for example a decision tree or logistic regression; if it cannot be interpreted easily, it is called a black box model, for example a support vector machine (SVM). If the objective is to gain insight, a white box model such as a decision tree or logistic regression can be used, and if robust prediction is the criterion, then algorithms such as neural networks or support vector machines can be used. While training a model, there are a few techniques that we should keep in mind, such as bagging and boosting.

Bagging

Bootstrap aggregating, also known as bagging, is a technique where the original dataset is sampled S times to make S new datasets. The datasets are the same size as the original.
Each dataset is built by randomly selecting examples from the original with replacement. By with replacement, we mean that you can select the same example more than once. As a result, some values are repeated in the new dataset, and some values from the original won't be present in it. Bagging helps in reducing the variance of a model and can be used to train different models from the same original dataset. The final conclusion is arrived at after considering the output of each model. For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows:

Sample 1: a, b, c, c, e, f, g, h
Sample 2: a, b, c, d, d, f, g, h
Sample 3: a, b, c, c, e, f, h, h
Sample 4: a, b, c, e, e, f, g, h
Sample 5: a, b, b, e, e, f, g, h

As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. For prediction, each model provides an output; let's assume the classes are yes and no. The final outcome is the class with the maximum votes. If three models say yes and two say no, then the final prediction is class yes.

Boosting

Boosting is a technique similar to bagging. In both boosting and bagging, the same type of classifier is used throughout. In boosting, however, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, and gives greater weight to examples that were misclassified by the previous classifier. In this way, boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in how it calculates the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the equal weights used in bagging.
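The sampling and voting steps described above can be sketched in a few lines of Python. This is only an illustration, not how Mahout or any particular library implements bagging or boosting: the function names are made up for this example, the model predictions are hard-coded rather than produced by trained classifiers, and the boosting weights are hypothetical values chosen to show how a weighted vote can differ from a majority vote.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Bagging-style vote: every model counts equally."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Boosting-style vote: each prediction is scaled by its model's weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


rng = random.Random(42)  # fixed seed so that the run is repeatable
data = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Five bootstrap samples, each the same size as the original dataset.
samples = [bootstrap_sample(data, rng) for _ in range(5)]
for i, sample in enumerate(samples, start=1):
    print(f"Sample {i}: {', '.join(sorted(sample))}")

# Hard-coded predictions standing in for five trained models.
votes = ["yes", "yes", "yes", "no", "no"]
bagged = majority_vote(votes)

# Hypothetical weights: the two "no" models are assumed to be more accurate.
boosted = weighted_vote(votes, [0.1, 0.1, 0.1, 0.4, 0.4])

print("Majority vote:", bagged)
print("Weighted vote:", boosted)
```

With equal weights, the three yes votes win; with the hypothetical weights, the two heavily weighted no votes outweigh them, which is the practical difference between the two ways of combining classifiers.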
The weights assigned to the classifier outputs in boosting are based on the performance of each classifier in the previous iteration.

Validation

After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample, called the validation set. We will discuss a couple of them here.

Holdout-set validation

One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and the test set to validate it. The actual split varies from case to case, but a common choice is 70 percent train and 30 percent test. It is also not uncommon to create three sets: train, test, and validation. The train and test sets are created from data across all the time periods considered, but the validation set is created from the most recent data.

K-fold cross validation

Another approach is to divide the data into k equal-sized folds or parts, and then use k-1 of them for training and one for testing. The process is repeated k times so that each fold is used as a validation set once, and the metrics are collected over all the runs. A common choice is k = 10, which is called 10-fold cross-validation.

Evaluation

The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how well the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate from the validation sample, or by cross-validation. There are standard metrics to evaluate a classifier against. There are also a few things to keep in mind while training a classifier.

Bias-variance trade-off

The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem.
We train different models using the same technique; for example, we build different decision trees using the different training datasets available. Bias measures how far off, in general, a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is systematically incorrect when predicting the output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets. Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance.

The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left plot is a fit with high variance; the model is very flexible, and the error on the training set is low, but the prediction on unseen data will have a much higher degree of error than on the training set. The bottom right plot is an optimum fit with a good trade-off between bias and variance. The model explains the data well and will perform in a similar way on unseen data too.

If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely.
High variance and low bias lead to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias lead to under-fitting of data.

Function complexity and amount of training data

The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required grows with the complexity of the data and of the learning task at hand. For example, if the features in the data have low interaction and are few in number, we could train a model with a small amount of data. In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with a higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indication.

Dimensionality of the input space

A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques such as feature selection and dimensionality reduction can be used for this.

Noise in data

The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model.
In practice, there are several approaches to alleviate noise in the data; the first is to identify and remove the noisy training examples prior to training the supervised learning algorithm, and the second is to use an early stopping criterion to prevent over-fitting.

Unsupervised learning

Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks such as cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems.

Cluster analysis

Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. A real-life example could be clustering movies according to various attributes such as genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects. These could be items we encounter in day-to-day life, such as movies or songs grouped according to taste, or users grouped by their interests in terms of demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and appreciate the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research. It contains 150 observations of five variables: sepal length, sepal width, petal length, petal width, and species. The first plot shows petal length against petal width. Each color represents a different species. The second plot shows the groups identified by clustering the data. Looking at the plots, we can see that petal length against petal width clearly separates the species of the Iris flower, and that clustering groups flowers of the same species together.
Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps, each of which is discussed in the sections ahead:

Objective
Feature representation
Algorithm for clustering
A stopping criterion

Objective

What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of e-commerce site and we want to group the customers together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on, or are we interested in grouping them by some other attribute, such as their purchasing behavior? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data.

Feature representation

As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to reflect the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transactions and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the documents.

Feature normalization

To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized.

Row normalization

The objective of normalizing rows is to make the objects to be clustered comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Some organizations are very large and some are very small, but the objective is to capture the e-mailing behavior irrespective of the size of the organization. In this scenario, we need to figure out a way to normalize the rows representing each organization so that they can be compared.
In this case, dividing by the user count of each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise.

Column normalization

The range of data across columns varies across datasets. The unit could be different, or the range of the columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here.

Rescaling

The simplest method is to rescale the range of the features so that features with different ranges become comparable. The aim is to scale the range to [0, 1] or [−1, 1]. For the range [0, 1], the rescaled value is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here x is the original value and x' is the rescaled value.

Standardization

Feature standardization gives the values of each feature in the data zero mean and unit variance. We first calculate the mean and standard deviation of each feature and subtract the mean from each value of the feature. Then, we divide the mean-subtracted values of each feature by its standard deviation:

$$X_s = \frac{X - \operatorname{mean}(X)}{\operatorname{std}(X)}$$

A notion of similarity and dissimilarity

Once we have the objective defined, it leads to the idea of similarity and dissimilarity of objects or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise, to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by a distance measure. A distance measure, as the name suggests, is used to measure the distance between two objects or data points.

Euclidean distance measure

The Euclidean distance measure is the most commonly used and intuitive distance measure. For two points a and b with n coordinates each:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Squared Euclidean distance measure

The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects.
The equation to calculate the squared Euclidean measure is shown here:

$$d(a, b) = \sum_{i=1}^{n} (a_i - b_i)^2$$

Manhattan distance measure

The Manhattan distance measure is defined as the sum of the absolute differences of the coordinates of two points, that is, the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. In general:

$$d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$$

Cosine distance measure

The cosine distance measure is based on the angle between two points, treated as vectors from the origin. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as the angle gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when the points are close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions:

$$d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Tanimoto distance measure

The Tanimoto distance measure, like the cosine distance measure, accounts for the angle between two points, as well as the relative distance between the points:

$$d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b}$$

Apart from the standard distance measures, we can also define our own distance measure. A custom distance measure can be explored when the existing ones are not able to measure the similarity between items.

Algorithm for clustering

The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options, and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering.

A stopping criterion

We need to know when to stop the clustering process.
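As a rough illustration of the column normalization techniques and distance measures described earlier in this section, here is a minimal pure-Python sketch. It is not Mahout's implementation. The function names are invented for this example, the cosine and Tanimoto formulas follow the usual textbook definitions, and the code assumes non-constant feature columns and non-zero vectors so that no division by zero occurs.

```python
import math


def rescale(values):
    """Min-max rescaling of one feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Give one feature column zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def tanimoto_distance(a, b):
    """One minus the Tanimoto coefficient of a and b."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


p1, p2 = [1.0, 2.0], [4.0, 6.0]
print("rescaled:         ", rescale([2.0, 4.0, 6.0]))
print("standardized:     ", standardize([2.0, 4.0, 6.0]))
print("euclidean:        ", euclidean(p1, p2))
print("squared euclidean:", squared_euclidean(p1, p2))
print("manhattan:        ", manhattan(p1, p2))
print("cosine (opposite):", cosine_distance([1.0, 0.0], [-1.0, 0.0]))
print("tanimoto (same):  ", tanimoto_distance(p1, p1))
```

For the points (1, 2) and (4, 6), the coordinate differences are 3 and 4, so the Euclidean distance is 5, the squared Euclidean distance is 25, and the Manhattan distance is 7. Two vectors pointing in opposite directions give the maximum cosine distance of 2.0.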
The stopping criterion could be decided in different ways: one way is to stop when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is to stop when the density of the clusters has stabilized, and a third way is based upon the number of iterations, for example, stopping the algorithm after 100 iterations. The stopping criterion depends upon the algorithm used, the goal being to stop when we have good enough clusters.

Logistic regression

Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class. It is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, relatively easy to implement, and easy to interpret. Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y. The learning task is to estimate P(Y|X), the conditional probability of Y occurring given X. A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model directly learns the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X), and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models learn the distribution of the data, that is, they model how the data is actually generated. Discriminative models, however, don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y. This is not possible with discriminative models.
Some common examples of generative and discriminative models are as follows:

Generative: naïve Bayes, latent Dirichlet allocation
Discriminative: logistic regression, SVM, neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data. Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example, a linear equation of the predictor variable:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

We then define a cost function and tweak the weights; in this case, $\theta_0$ and $\theta_1$ are tweaked to minimize or maximize the cost function by using an optimization algorithm. For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it:

$$h_\theta(x) = f(\theta^T x), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x is a matrix of features, and $\theta$ is the vector of weights. The next step is to define the cost function, which measures the difference between predicted and actual values. The objective of the optimization algorithm here is to find $\theta$, that is, to fit the regression coefficients so that the difference between predicted and actual target values is minimized. We will use gradient descent as the optimization algorithm. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector $\theta$ once we meet the stopping criterion.
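To make the hypothesis and the gradient descent loop more concrete, here is a minimal sketch in plain Python. Several things in it are assumptions made for illustration and do not come from the article: the cost function is taken to be the average log loss (a common choice for logistic regression), the tiny dataset is made up, the first feature of every row is a constant 1.0 acting as the intercept, and the learning rate, tolerance, and iteration limit are arbitrary. It is not Mahout's implementation.

```python
import math


def sigmoid(z):
    """The logistic function f(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """Hypothesis h(x) = f(theta . x): the predicted probability of class 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def gradient_descent(rows, labels, rate=0.1, tolerance=1e-6, max_iterations=10000):
    """Batch gradient descent on the average log loss (assumed cost function)."""
    theta = [0.0] * len(rows[0])
    for _ in range(max_iterations):
        # Gradient of the average log loss: mean of (prediction - label) * feature.
        gradient = [0.0] * len(theta)
        for x, y in zip(rows, labels):
            error = predict(theta, x) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj / len(rows)
        # Step in the direction of the negative gradient.
        step = [rate * g for g in gradient]
        theta = [t - s for t, s in zip(theta, step)]
        # Stop once the weights change by less than the tolerance.
        if max(abs(s) for s in step) < tolerance:
            break
    return theta


# Made-up data: each row is [1.0 (intercept), feature value]; labels are 0 or 1.
rows = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
labels = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(rows, labels)
print("weights:", theta)
print("P(y=1 | x=1):", predict(theta, [1.0, 1.0]))
print("P(y=1 | x=6):", predict(theta, [1.0, 6.0]))
```

The loop stops either when the largest weight update falls below the tolerance or when the iteration limit is reached. The labels deliberately overlap (x = 3 is labeled 1 and x = 4 is labeled 0) so that the weights settle at finite values instead of growing without bound.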
The stopping criterion is met when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations. Logistic regression falls into the category of white box techniques and can be interpreted. Features or variables are of two major types, categorical and continuous, defined as follows:

Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
Continuous variable: This is a variable that can take on any value between its minimum value and maximum value, or range. For example, variables such as age, price, and so on are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent. The optimization algorithm described previously, gradient descent, uses the whole dataset on each update. This is fine for a small dataset of, say, 100 examples, but with billions of data points containing thousands of features, it is unnecessarily expensive in terms of computational resources. An alternative is to update the weights using only one instance at a time. This is known as stochastic gradient descent. Stochastic gradient descent is an example of an online learning algorithm. It is known as an online learning algorithm because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression algorithm using Mahout. We will also discuss both command line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with the book. It is present in the learningApacheMahout/data/chapter4 directory.
If you wish to download the data yourself, it is available from the UCI repository, which hosts many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

cd $HOME
mkdir bank_data
cd bank_data

Download the data into the bank_data directory:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

unzip bank-additional.zip
cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited and the values are enclosed in double quotes ("). It also has a header line with the column names. We will use sed to preprocess the data. sed is a powerful stream editor available on Linux, and the command to use it is as follows:

sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and to remove " are as follows:

sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
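For readers who want to check what the two sed substitutions do, the same transformation can be sketched in Python. The sample text below is a made-up fragment in the same semicolon-delimited, quoted style as the file, not a copy of the real dataset, and the function name is invented for this example.

```python
def clean_line(line):
    """Mimic the two sed commands: ';' becomes ',' and double quotes are removed."""
    return line.replace(";", ",").replace('"', "")


# Made-up sample in the same style as bank-additional-full.csv.
sample = '"age";"job";"marital"\n56;"housemaid";"married"\n'

cleaned = "".join(clean_line(line) for line in sample.splitlines(keepends=True))
print(cleaned)
```

Like the sed commands, this is a plain text substitution rather than real CSV parsing, so it is only safe as long as no field value itself contains a semicolon, a comma, or a double quote.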
The following table shows the various input variables along with their types:

Column name | Description | Variable type
age | This represents the age of the client | Numeric
job | This represents the type of job, for example, entrepreneur, housemaid, management | Categorical
marital | This represents the marital status | Categorical
education | This represents the education level | Categorical
default | States whether the client has defaulted on credit | Categorical
housing | States whether the client has a housing loan | Categorical
loan | States whether the client has a personal loan | Categorical
contact | States the contact communication type | Categorical
month | States the last contact month of the year | Categorical
day_of_week | States the last contact day of the week | Categorical
duration | States the last contact duration, in seconds | Numeric
campaign | This represents the number of contacts | Numeric
pdays | This represents the number of days that passed since the last contact | Numeric
previous | This represents the number of contacts performed before this campaign | Numeric
poutcome | This represents the outcome of the previous marketing campaign | Categorical
emp.var.rate | States the employment variation rate (quarterly indicator) | Numeric
cons.price.idx | States the consumer price index (monthly indicator) | Numeric
cons.conf.idx | States the consumer confidence index (monthly indicator) | Numeric
euribor3m | States the euribor three month rate (daily indicator) | Numeric
nr.employed | This represents the number of employees (quarterly indicator) | Numeric

Model building via command line

Mahout provides a command-line implementation of logistic regression, which we will use first to build a model. Logistic regression does not have a MapReduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.
Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command arguments as follows:

mahout split --help

We need to remove the first line before running the split command, as the file contains a header line and the split command doesn't make any special allowances for header lines; the header would land on an arbitrary line in one of the split files. We first remove the header line from the input_bank_data.csv file:

sed -i '1d' input_bank_data.csv
mkdir input_bank
cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in the local mode:

export MAHOUT_LOCAL=TRUE
mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This will create two datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run both on Hadoop and on the local file system. For the current execution, it runs in the local mode on the local file system and splits the data into two sets, 70 percent as train in the train_data directory and 30 percent as test in the test_data directory.
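The idea behind this step, setting the header aside, randomly selecting a percentage of the rows as test data, and keeping the rest as training data, can be sketched in Python as follows. This is a conceptual illustration only and makes no claim about how the Mahout split command selects rows internally; the function name, the toy rows, and the fixed seed are all assumptions made for this example.

```python
import random


def split_rows(rows, test_pct, rng):
    """Randomly select test_pct percent of rows as test data; the rest is train."""
    test_size = round(len(rows) * test_pct / 100)
    test_indices = set(rng.sample(range(len(rows)), test_size))
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Toy file contents: a header line followed by ten made-up data rows.
header = "x,y"
rows = [f"{i},{i % 2}" for i in range(10)]

rng = random.Random(7)  # fixed seed so that the run is repeatable
train_rows, test_rows = split_rows(rows, 30, rng)

# The header is kept out of the split and added back to both files afterwards.
train_file = [header] + train_rows
test_file = [header] + test_rows

print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Keeping the header out of the random selection is what guarantees that it ends up as the first line of both outputs rather than on an arbitrary line of one of them.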
Next, we restore the header line to the train and test files as follows. Note the \n at the end of the header, which places it on a line of its own, and that the last column is named y, the target variable used in the training command:

sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv

Train the model command line option

Let's have a look at some important and commonly used parameters and their descriptions:

mahout trainlogistic --help

--help "print this list"
--quiet "be extra quiet"
--input "input directory from where to get the training data"
--output "output directory to store the model"
--target "the name of the target variable"
--categories "the number of target categories to be considered"
--predictors "a list of predictor variables"
--types "a list of predictor variable types (numeric, word or text)"
--passes "the number of times to pass over the input data"
--lambda "the amount of coefficient decay to use"
--rate "the learning rate"
--noBias "do not include a bias term"
--features "the number of internal hashed features to use"

mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2

We pass the input filename and the output folder name, identify the target variable name using the --target option, the predictors using the --predictors option, and the type of each predictor using the --types option.
Numeric predictors are represented using 'n', and categorical variables are represented using 'w'. The learning rate passed using --rate is used by gradient descent to determine the step size for each descent. We pass the maximum number of passes over the data as 100 and the number of categories as 2.

The output is given below. It represents 'y', the target variable, as a sum of predictor variables multiplied by coefficients or weights. As we have not included the --noBias option, we see the intercept term in the equation:

y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin.
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous

Interpreting the output

The output of the trainlogistic command is an equation representing the sum of all predictor variables multiplied by their respective coefficients. Each coefficient gives the change in the log-odds of the outcome for a one-unit increase in the corresponding feature or predictor variable. Odds are the ratio of the probability that an event occurs to the probability that it does not occur. The log-odds are the natural logarithm of the odds. Let's take an example to understand it better. Let's assume that the probability of some event E occurring is 75 percent:

P(E) = 75% = 75/100 = 3/4

The probability of E not happening is as follows:

1 - P(E) = 25% = 25/100 = 1/4

The odds in favor of E occurring are P(E)/(1-P(E)) = 3:1, and the odds against it are 1:3. This shows that the event is three times more likely to occur than not to occur. The log-odds would be ln(3), which is approximately 1.0986.

For example, in the equation above, the coefficient of age is -131.624, so a unit increase in age decreases the log-odds of the client subscribing to a term deposit by 131.624. Similarly, the coefficient of cons.conf.idx is -990.322, so a unit increase in cons.conf.idx decreases the log-odds by 990.322.
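As a quick numeric check of the odds example for an event with a probability of 75 percent, the following Python sketch computes the odds and the log-odds. It assumes the natural logarithm, which is the convention logistic regression coefficients follow; the function names are made up for this illustration.

```python
import math


def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit of p)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Inverse of log_odds: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


p_event = 0.75
print(odds(p_event))                                  # odds of 3 to 1
print(round(log_odds(p_event), 4))                    # ln(3)
print(probability_from_log_odds(log_odds(p_event)))   # back to about 0.75
```

The last function shows why a coefficient is read as a change in log-odds: adding the coefficient to the log-odds and mapping the result back gives the new probability.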
In this interpretation, the change for each coefficient is measured while keeping the other variables at the same value.

Testing the model

After the model is trained, it's time to test the model's performance by using a validation set. Mahout has the runlogistic command for this; its options can be listed as follows:

mahout runlogistic --help

We run the following command on the command line:

mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model

AUC = 0.59
confusion: [[25189.0, 2613.0], [424.0, 606.0]]
entropy: [[NaN, NaN], [-45.3, -7.1]]

To get the scores for each instance, we use the --scores option as follows:

mahout runlogistic --scores --input train_data/input_bank_data.csv --model model

To test the model on the test data, we pass the test file created during the split process as follows:

mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model

AUC = 0.60
confusion: [[10743.0, 1118.0], [192.0, 303.0]]
entropy: [[NaN, NaN], [-45.2, -7.5]]

Prediction

Mahout doesn't have an out-of-the-box command for using a logistic regression model to predict new samples. Note that the new samples won't have the target label y; that is the value we need to predict. There is a way to work around this, though: we can use mahout runlogistic to generate predictions by adding a dummy y column as the target variable and filling it with arbitrary values. The runlogistic command expects the target variable to be present, hence the dummy column. We can then get the predicted score using the --scores option.

Summary

In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout.

Packt
30 Mar 2015
28 min read

PostgreSQL – New Features

In this article by Jayadevan Maymala, author of the book PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used, data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. We will also look at a few key features that are part of the PostgreSQL 9.4 release, and cover a couple of useful extensions.

Interesting data types

We will start with the data types. PostgreSQL has all the common data types we see in databases. These include:

- The number data types (smallint, integer, bigint, decimal, numeric, real, and double precision)
- The character data types (varchar, char, and text)
- The binary data types
- The date/time data types (including date, timestamp without time zone, and timestamp with time zone)
- The BOOLEAN data type

However, this is all standard fare. Let's start off by looking at the RANGE data type.

RANGE

This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases.

Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range for a category of cars can start at $15,000 at the lower end and go up to $40,000 at the upper end.

We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly.

Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees.

These are some use cases where we can consider range types.
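To see what a range value buys us in a use case such as meeting-room booking, here is a small Python sketch that models a time slot as a pair of bounds with an overlap test. It only illustrates the concept and is not how PostgreSQL implements range types; the TimeRange class, the slot helper, the half-open bounds, and the bookings are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimeRange:
    """A half-open time slot: lower is included, upper is excluded."""
    lower: datetime
    upper: datetime

    def overlaps(self, other):
        # Two slots overlap when each one starts before the other ends.
        return self.lower < other.upper and other.lower < self.upper


def slot(start, end, day="2014-07-30"):
    """Build a TimeRange from HH:MM strings on a single (assumed) day."""
    fmt = "%Y-%m-%d %H:%M"
    return TimeRange(datetime.strptime(f"{day} {start}", fmt),
                     datetime.strptime(f"{day} {end}", fmt))


# Hypothetical bookings: room number -> reserved slots.
bookings = {
    1: [slot("09:00", "10:00")],
    2: [slot("10:00", "10:30"), slot("11:00", "12:00")],
}
wanted = slot("10:15", "10:45")
free_rooms = [room for room, slots in bookings.items()
              if not any(s.overlaps(wanted) for s in slots)]
print(free_rooms)
```

Treating the slot as a single value with one overlap test is the convenience a range column provides, instead of comparing separate from and to columns by hand.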
Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the meeting room booking and shift use cases, we can consider tsrange or tstzrange (if we want to capture the time zone as well). It makes sense to explore the possibility of using range data types in most scenarios that involve the following:

- From and to timestamps/dates for room reservations
- Lower and upper limits for price/discount ranges
- Scheduling jobs
- Timesheets

Let's now look at an example. We have three meeting rooms. The rooms can be booked, and the entries for reservations go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? We will look at this with and without the range data type:

CREATE TABLE rooms(id serial, descr varchar(50));

INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));

CREATE TABLE room_book (id serial , room_id integer, from_time timestamp, to_time timestamp , res tsrange);

INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');

INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)');

PostgreSQL has the OVERLAPS operator.
This can be used to get all the reservations that overlap with the period for which we wanted to book a room:

SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00');

If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL:

SELECT id FROM rooms EXCEPT

We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)';

Do look up GiST indexes to improve the performance of queries that use range operators. Another way of achieving the same is to use the following command:

SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time;

Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing bracket (] or )) has a similar effect on the upper value. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that for a discrete type such as int4range, PostgreSQL converts every range to the canonical [) form. This is why int4range(3,5,'[]') is shown with a lower bound of 3 and an upper bound of 6, and int4range(3,5,'()') is shown as [4,5):

SELECT int4range(3,5,'[)') lowerincl ,int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl , int4range(3,5,'[)') upperexcl;

 lowerincl | bothincl | bothexcl | upperexcl
-----------+----------+----------+-----------
 [3,5)     | [3,6)    | [4,5)    | [3,5)

Using network address types

The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and MAC addresses. Let's look at a few use cases. When we have a website that is open to the public, a number of users from different parts of the world access it.
We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP address is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on.

The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on.

While data types such as VARCHAR or BIGINT can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows:

- inet: This data type can be used to store an IPv4 or IPv6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask.
- cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example:

CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr);
CREATE TABLE

INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32');
INSERT 0 1

INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24');
ERROR: invalid cidr value: "192.168.64.2/24"
LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...
                                                             ^
DETAIL: Value has bits set to right of mask.
INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24');
INSERT 0 1

SELECT * FROM nettb;

 id |     intclmn     |    cidrclmn
----+-----------------+-----------------
  1 | 192.168.64.2    | 192.168.64.2/32
  2 | 192.168.64.2/24 | 192.168.64.0/24

Let's also look at a couple of useful operators available with network address types. Does an IP address fall in a subnet? This can be figured out using <<=, as shown here:

SELECT id,intclmn FROM nettb ;

 id |   intclmn
----+--------------
  1 | 192.168.64.2
  3 | 192.168.12.2
  4 | 192.168.13.2
  5 | 192.168.12.4

SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24';

 id |   intclmn
----+--------------
  3 | 192.168.12.2
  5 | 192.168.12.4

SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32';

 id |   intclmn
----+--------------
  3 | 192.168.12.2

The operator used in the preceding commands checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, the greater than or equal to operator, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store MAC addresses in different formats.

hstore for key-value pairs

A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (such as Cassandra) or to the simpler and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose.
It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead.

Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of their nationality). However, there could be other attributes that vary. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern.

Such attributes are typically modeled using an entity-attribute-value (EAV) approach, with a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same.

Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes:

CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore);

INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" ');

Now, let's try retrieving all customers with their SSN, as shown here:

SELECT first_name, dynamic_attributes FROM customer WHERE dynamic_attributes ?
'ssn';

 first_name | dynamic_attributes
------------+----------------------
 Michael    | "ssn"=>"123-465-798"
 Smith      | "ssn"=>"129-465-798"
 James      | "ssn"=>"No data"

Also, those with a specific SSN:

SELECT first_name,dynamic_attributes FROM customer WHERE dynamic_attributes -> 'ssn'= '123-465-798';

 first_name | dynamic_attributes
------------+----------------------
 Michael    | "ssn"=>"123-465-798"

If we want to get records that do not contain a specific SSN, just use the following condition:

WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798'

Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following output:

 first_name | dynamic_attributes
------------+------------------------------------------------------------------------
 Ram        | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers"

As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available for each data type. Note that hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently.

json/jsonb

JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. Version 9.4 adds one more data type, jsonb, which is very similar to json. The jsonb data type stores data in binary format.
It also removes white spaces (which are insignificant) and avoids duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one we should use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice.

Now, let's look at an example. Assume that we are doing a proof-of-concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes:

CREATE TABLE items (
   item_id serial,
   details json
);

Now, we will add records.
All DVDs go into one record, books go into another, and so on:

INSERT INTO items (details) VALUES ('{
  "DVDs" :[
    {"Name":"The Making of Thunderstorms", "Types":"Educational",
     "Age-group":"5-10","Produced By":"National Geographic"
    },
    {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror",
     "Certificate":"A", "Director":"Dracula","Actors":
       [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}]
    },
    {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense",
     "Certificate":"A", "Director": "Jonathan Lynn","Actors":
       [{"Name":"Joe "},{"Name":"Marissa"}] }]
}' );

A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions:

SELECT details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype FROM items;

SELECT details->'DVDs' dvds, pg_typeof(details->'DVDs') datatype FROM items;

Note the difference between ->> and ->. We are using the pg_typeof function to clearly see the data type returned by each operator. Both return the JSON object field; the first operator (->>) returns it as text and the second operator (->) returns it as JSON.

Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement:

WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items)
SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a WHERE a->>'Name'='Meena';

This returns the record for the movie in which Meena acted. We used one more function and a couple of operators. The json_array_elements function expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs.
We also used the WITH clause to define a common table expression (CTE), a named temporary result set that ceases to exist as soon as the query is over. In the next part, we extracted the elements of the Actors array from each DVD. Then, we checked whether the Name element is equal to Meena.

XML

PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is sometimes referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process.

PostgreSQL must be built with libxml support (./configure --with-libxml) for the XML features to work. After installing such a build, we restart the cluster. There is no need to reinitialize the database cluster.

Inserting and verifying XML data

Now, let's take a look at what we can do with the xml data type in PostgreSQL:

CREATE TABLE tbl_xml(id serial, docmnt xml);

INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml');

INSERT INTO tbl_xml (docmnt) SELECT query_to_xml( 'SELECT now()',true,false,'') ;

SELECT xml_is_well_formed_document(docmnt::text), docmnt FROM tbl_xml;

First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML.
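The difference between arbitrary text and a well-formed XML document can also be illustrated outside the database. The following Python sketch uses the standard library parser, not PostgreSQL's libxml-based check, so it is only an analogy for what xml_is_well_formed_document reports; the sample document is a hypothetical stand-in for the output of query_to_xml.

```python
import xml.etree.ElementTree as ET


def is_well_formed_document(text):
    """Return True if text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Plain text is not a document: it has no root element.
print(is_well_formed_document("Not xml"))

# Hypothetical document shaped like a one-row query result.
sample = "<table><row><now>2014-07-29T16:55:00</now></row></table>"
print(is_well_formed_document(sample))
```

The first call reports False and the second reports True, which mirrors the distinction between the two rows inserted into tbl_xml.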
Generating XML files for table definitions and data

We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append _and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows:

SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml WHERE id = 2;

               xpath
------------------------------------
 {2014-07-29T16:55:00.781533+05:30}

Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and from an update/write perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data.

pgbadger

Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on.

Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, and which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a memory configuration issue (for example, work_mem is too low) or have not written the query right.
For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows:

log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
log_lock_waits = on
track_activity_query_size = 2048

It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at the shell prompt:

pgbench -i pgp

This creates a few tables in the pgp database. We can log in to psql (database pgp) and check:

\dt

            List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres

Now, we can run pgbench to generate load on the database with the following command:

pgbench -c 5 -T10 pgp

The -T option passes the duration, in seconds, for which pgbench should continue execution, -c passes the number of clients, and pgp is the database. At the shell prompt, execute:

wget https://github.com/dalibo/pgbadger/archive/master.zip

Once the file is downloaded, unzip the file using the following command:

unzip master.zip

Use cd to change to the pgbadger-master directory as follows:

cd pgbadger-master

Execute the following command:

./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log -o myoutput.html

Replace the log file name in the command with the actual name. It will generate a myoutput.html file.
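To get a feel for why the log_line_prefix setting matters, here is a minimal Python sketch that pulls statement durations out of log lines written with the prefix '%t [%p]: [%l-1] user=%u,db=%d '. It is a toy illustration and not how pgbadger parses logs; the regular expression and the three sample lines are assumptions made for this example.

```python
import re

# One named group per field of the assumed prefix, then the duration.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+ \S+) "          # %t, e.g. 2014-07-31 10:15:02 IST
    r"\[(?P<pid>\d+)\]: "                    # [%p]:
    r"\[(?P<line>\d+)-1\] "                  # [%l-1]
    r"user=(?P<user>[^,]*),db=(?P<db>\S*) "  # user=%u,db=%d
    r"LOG:\s+duration: (?P<ms>[\d.]+) ms"
)

# Hypothetical log lines, not taken from a real server.
sample_log = [
    "2014-07-31 10:15:02 IST [4242]: [1-1] user=postgres,db=pgp "
    "LOG:  duration: 0.215 ms  statement: SELECT 1;",
    "2014-07-31 10:15:03 IST [4242]: [2-1] user=postgres,db=pgp "
    "LOG:  duration: 12.500 ms  statement: UPDATE pgbench_accounts SET abalance = 0;",
    "2014-07-31 10:15:04 IST [4243]: [1-1] user=postgres,db=pgp "
    "LOG:  connection authorized: user=postgres database=pgp",
]

matches = [LINE_RE.match(line) for line in sample_log]
durations = [float(m.group("ms")) for m in matches if m]
slowest = max(durations)
print(durations, slowest)
```

A predictable prefix is what lets a tool attribute each statement to a time, process, user, and database; the connection line is skipped here because it carries no duration.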
The HTML file generated will have a wealth of information about what happened in the cluster with great charts/tables. In fact, it takes quite a bit of time to go through the report. It includes, for example, a chart that shows the distribution of queries based on execution time, along with a large number of other performance metrics.

If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and the most frequent queries under the Top drop-down list are the right places to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries.

If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore.

The logging configuration used may create huge log files in systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen.

With this, we covered most of the interesting data types, an interesting extension, and a must-use tool from the PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4.

Features over time

Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database.

Interesting features in 9.4

Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy).

Keeping the buffer ready

As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples.
In a data warehouse, the Extract, Transform, Load (ETL) process, which usually happens once a day, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in slower response times. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve the user experience.

In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously: ensuring that the fares and availability data for the most frequently searched routes are in the buffer can provide a better user experience. This also applies to an e-commerce site selling products. If the product/price/inventory data is always available in memory, it can be retrieved very fast.

You must use PostgreSQL 9.4 for trying out the code in the following sections.

So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table.

We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes.
We will then repeat the exercise, but use pg_prewarm before executing SELECT count(*). At psql, execute the following:

CREATE EXTENSION pg_prewarm;
CREATE TABLE myt(id SERIAL, name VARCHAR(40));
INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name');

Now, stop the server using pg_ctl at the shell prompt:

pg_ctl stop -m immediate

Clean OS buffers using the following command at the shell prompt (we will need to use sudo to do this):

echo 1 > /proc/sys/vm/drop_caches

The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command:

SELECT COUNT(*) FROM myt;
Time: 333.115 ms

We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay.

SELECT COUNT(*) FROM myt;
 count
-------
 10000
(1 row)
Time: 7.002 ms

Better recoverability

A new parameter called recovery_min_apply_delay has been added in 9.4. This goes into the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to, say, 5 minutes, and then the standby will replay a transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave.
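The replay rule described above can be written down as a one-line comparison. The following Python sketch only models that rule (replay once the standby clock is at least the configured delay past the commit time); it is not PostgreSQL code, and the function name and timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta


def ready_to_replay(commit_time, standby_now, apply_delay):
    """True once the standby clock is at least apply_delay past the commit."""
    return standby_now >= commit_time + apply_delay


delay = timedelta(hours=1)
dropped_at = datetime(2014, 7, 31, 10, 0)   # hypothetical accidental DROP TABLE

# 40 minutes later the change is still held back, so replay can be stopped.
print(ready_to_replay(dropped_at, datetime(2014, 7, 31, 10, 40), delay))

# One hour after the commit, the standby is allowed to apply the change.
print(ready_to_replay(dropped_at, datetime(2014, 7, 31, 11, 0), delay))
```

The window between the two calls is the time we have to notice the mistake and stop the replay.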
Easy-to-change parameters

An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry goes to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql:

    \! more postgresql.auto.conf
    # Do not edit this file manually!
    # It will be overwritten by ALTER SYSTEM command.
    work_mem = '12MB'

We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated.

Logical decoding and consumption of changes

Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode them, and possibly replay them on a remote server. Replication no longer has to be all-or-nothing. As of now, we still have to do a lot of work to decode and move the changes.

Two parameter changes are necessary to set this up: max_replication_slots (set to at least 1) and wal_level (set to logical). Both take effect only after a server restart. Then, we can connect to a database and create a slot as follows:

    SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding');

The first parameter is the name we give to our slot and the second parameter is the plugin to be used.
test_decoding is the sample plugin available, which converts WAL entries into text representations. To see it at work, we first make a couple of changes:

    INSERT INTO myt(id) values (4);
    INSERT INTO myt(name) values ('abc');

Now, we will try retrieving the entries:

    SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL);

This function lets us take a look at the changes without consuming them, so the changes can be accessed again. To consume the changes, we use the following function instead:

    SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL);

This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed.

Summary

In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects.

Packt
27 Mar 2015
14 min read

Understanding and Creating Simple SSRS Reports

In this article by Deepak Agarwal and Chhavi Aggarwal, authors of the book Microsoft Dynamics AX 2012 R3 Reporting Cookbook, we will cover the following topics: Grouping in a report Adding ranges to a report Deploying a report Creating a menu item for a report Creating a report using a query in Warehouse Management (For more resources related to this topic, see here.) Reports are a basic necessity for any business process, as they aid in making critical decisions by analyzing all the data together in a customized manner. Reports can be fetched in many types, such as ad-hoc, analytical, transactional, general statements, and many more by using images, pie charts, and many other graphical representations. These reports help the user to undertake required actions. Microsoft SQL Reporting Services (SSRS) is the basic primary reporting tool of Dynamics AX 2012 R2 and R3. This article will help you to understand the development of SSRS reports in AX 2012 R3 by developing and designing reports using simple steps. These steps have further been detailed into simpler and smaller recipes. In this article, you will design a report using queries with simple formatting, and then deploy the report to the reporting server to make it available for the user. This is made easily accessible inside the rich client. Reporting overview Microsoft SQL Server Reporting Services (SSRS) is the most important feature of Dynamics AX 2012 R2 and R3 reporting. It is the best way to generate analytical, high user scale, transactional, and cost-effective reports. SSRS reports offer ease of customization of reports so that you can get what you want to see. SSRS provides a complete reporting platform that enables the development, design, deployment, and delivery of interactive reports. SSRS reports use Visual Studio (VS) to design and customize reports. They have extensive reporting capabilities and can easily be exported to Excel, Word, and PDF formats. 
Dynamics AX 2012 has extensive reporting capabilities such as Excel, Word, Power Pivot, Management Reporter, and most importantly, SSRS reports. While there are many ways to generate reports, SSRS remains the most prominent way to generate analytical and transactional reports. SSRS reports were first integrated in AX 2009, and in AX 2012 they have replaced the legacy reporting system.

SSRS reports can be developed using classes and queries. In this article, we will discuss query-based reports. In query-based reports, a query is used as the data source to fetch the data from Dynamics AX 2012 R3. We add the grouping and ranges in the query to filter the data. We use the auto design reporting feature to create a report, which is then deployed to the reporting server. After deploying the report, a menu item is attached to the report in Dynamics AX 2012 R3 so that the user can display the report from AX.

Through the recipes in this article, we will build a vendor master report. This report will list all the vendors under each vendor group. It will use the query data source to fetch data from Dynamics AX and subsequently create an auto design-based report. So that this report can be accessed from a rich client, it will then be deployed to the reporting server and attached to a menu item in AX.

Here are some important links to get started with this article:

Install Reporting Services extensions from https://technet.microsoft.com/en-us/library/dd362088.aspx.
Install Visual Studio Tools from https://technet.microsoft.com/en-us/library/dd309576.aspx.
Connect Microsoft Dynamics AX to the new Reporting Services instance by visiting https://technet.microsoft.com/en-us/library/hh389773.aspx.
Before you install the Reporting Services extensions, see https://technet.microsoft.com/en-us/library/ee355041.aspx.

Grouping in reports

Grouping means arranging records that share a common value together. Grouping data simplifies the structure of the report and makes it more readable.
It also helps you to find details, if required. We can group the data in the query as well as in the auto design node in Visual Studio. In this recipe, we will structure the report by grouping the VendorMaster report based on the VendorGroup to make the report more readable. How to do it... In this recipe, we will add fields under the grouping node of the dataset created earlier in Visual Studio. The fields that have been added in the grouping node will be added and shown automatically in the SSRS report. Go to Dataset and select the VendGroup field. Drag and drop it to the Groupings node under the VendorMaster auto design. This will automatically create a new grouping node and add the VendGroup field to the group. Each grouping has a header row where even fields that don't belong to the group but need to be displayed in the grouped node can be added. This groups the record and also acts like a header, as seen in the following screenshot: How it works… Grouping can also be done based on multiple fields. Use the row header to specify the fields that must be displayed in the header. A grouping can be added manually but dragging and dropping prevents a lot of tasks such as setting the row header. Adding ranges to the report Ranges are very important and useful while developing an SSRS report in AX 2012 R3. They help to show only limited data, which is filtered based on given ranges, in the report. The user can filter the data in a report on the basis of the field added as a range. The range must be specified in the query. In this recipe, we will show how we can filter the data and use a query field as a range. How to do it... In this recipe, we will add the field under the Ranges node in the query that we made in the previous recipe. By adding the field as a range, you can now filter the data on the basis of VendGroup and show only the limited data in the report. Open the PKTVendorDetails query in AOT. 
Drag the VendGroup and Blocked fields to the Ranges node in AOT and save your query. In the Visual Studio project, right-click on Datasets and select Refresh. Under the parameter node, VendorMaster_DynamicParameter collectively represents any parameter that will be added dynamically through the ranges. This parameter must be set to true to make additional ranges available during runtime. This adds a Select button to the report dialog, which the user can use to specify additional ranges other than what is added. Right-click on the VendorMaster auto design and select Preview. The preview should display the range that was added in the query. Click on the Select button and set the VendGroup value to 10. Click on the OK button, and then select the Report tab, as shown in the following screenshot: Save your changes and rebuild the report from Solution Explorer. Then, deploy the solution. How it works… The report dialog uses the query service UI builder to translate the ranges and to expose additional ranges through the query. Dynamic parameter: The dynamic parameter unanimously represents all the parameters that are added at runtime. It adds the Select button to the dialog from where the user can invoke an advanced query filter window. From this filter window, more ranges and sorting can be added. The dynamic parameter is available per dataset and can be enabled or disabled by setting up the Dynamic Filters property to True or False. The Report Wizard in AX 2012 still uses Morphx reports to auto-create reports using the wizard. The auto report option is available on every form that uses a new AX SSRS report. Deploying a report SSRS, being a server side solution, needs to deploy reports in Dynamics AX 2012 R3. Until the reports are deployed, the user will not be able to see them or the changes made in them, neither from Visual Studio nor from the Dynamics AX rich client. Reports can be deployed in multiple ways and the developer must make this decision. 
In this recipe, we will show you how we can deploy reports using the following: Microsoft Dynamics AX R3 Microsoft Visual Studio Microsoft PowerShell Getting ready In order to deploy reports, you must have the permission and rights to deploy them to SQL Reporting Services. You must also have the permission to access the reporting manager configuration. Before deploying reports using Microsoft PowerShell, you must ensure that Windows PowerShell 2.0 is installed. How to do it... Microsoft Dynamics AX R3 supports the following ways to deploy SSRS reports. Location of deployment For each of the following deployment locations, let's have a look at the steps that need to be followed: Microsoft Dynamics AX R3: Reports can be deployed individually from a developer workspace in Microsoft Dynamics AX. SSRS reports can be deployed by using the developer client in Microsoft Dynamics AX R3. In AOT, expand the SSRS Reports node, expand the Reports node, select the particular report that needs to be deployed, expand the selected report node, right-click on the report, and then select and click on Deploy Element. The developer can deploy as many reports as need to be deployed, but individually. Reports can be deployed for all the translated languages. Microsoft Visual Studio: Individual reports can be deployed using Visual Studio. Open Visual Studio. In Solution Explorer, right-click on the reporting project that contains the report that you want to deploy, and click on Deploy. The reports are deployed for the neutral (invariant) language only. Microsoft PowerShell: This is used to deploy the default reports that exist within Microsoft Dynamics AX R3. Open Windows PowerShell and by using this, you can deploy multiple reports at the same time. Visit http://msdn.microsoft.com/en-us/library/dd309703.aspx for details on how to deploy reports using PowerShell. To verify whether a report has been deployed, open the report manager in the browser and open the Dynamics AX folder. 
The PKTVendorDetails report should be found in the list of reports. You can find the report manager URL from System administration | Setup | Business intelligence | Reporting Services | Report servers. The report can be previewed from Reporting Services also. Open Reporting Services and click on the name of the report to preview it. How it works Report deployment is the process of actually moving all the information related to a report to a central location, which is the server, from where it can be made available to the end user. The following list indicates the typical set of actions performed during deployment: The RDL file is copied to the server. The business logic is placed in the server location in the format of a DLL. Deployment ensures that the RDL and business logic are cross-referenced to each other. The Morphx IDE from AX 2009 is still available. Any custom reports that are designed can be imported. This support is only for the purpose of backward compatibility. In AX 2012 R3, there is no concept of Morphx reports. Creating a menu item for a report The final step of developing a report in AX 2012 R3 is creating a menu item inside AX to make it available for users to open from the UI end. This recipe will tell you how to create a new menu item for a report and set the major properties for it. Also, it will teach you to add this menu item to a module to make it available for business users to access this report. How to do it... You can create the new menu item under the Menu Item node in AOT. In this recipe, the output menu item is created and linked with the menu item with SSRS report. Go to AOT | Menu Items | Output, right-click and select New Menu Item. Name it PKTVendorMasterDetails and set the properties as highlighted in the following screenshot: Open the Menu Item to run the report. A dialog appears with the Vendor hold and Group ranges added to the query, followed by a Select button. 
The Select button is similar to the Morphx reports option where the user can specify additional conditions. To disable the Select option, go to the Dynamic Filters property in the dataset of the query and set it to False. The report output should then appear.

How it works…

The report viewer in Dynamics AX is actually a form with an embedded browser control. The browser constructs the report URL at runtime and navigates to it. Unlike in AX 2009, a report that is rendering doesn't hold up the use of AX. Instead, the user can use the other parts of the application while the report is rendering. This is particularly beneficial for the end users as they can proceed with other tasks as the report executes.

The permission setup is important as it helps in controlling the access to a report. However, SSRS reports inherit user permission from the AX setup itself.

Creating a report using a query in Warehouse Management

In Dynamics AX 2012 R3, Warehouse Management is a new module. In the earlier versions of AX (2012 or R2), there was a single module for Inventory and Warehouse Management. However, in AX R3, there is a separate module. AX queries are the simplest and fastest way to create SSRS reports in Microsoft Dynamics AX R3. In this recipe, we will develop an SSRS report on Warehouse Management.

In AX R3, Warehouse Management is integrated with bar-coding devices such as RF-SMART, which supports purchase and receiving processes: picking, packing and shipping, transferring and stock counts, issuing materials for production orders, and reporting production as well. AX R3 also supports the workflow for the Warehouse Management module, which is used to optimize picking, packing, and loading of goods for delivery to customers.

Getting ready

To work through this recipe, Visual Studio must be installed on your system to design and deploy the report.
You must have the permission to access all the rights of the reporting server, and reporting extensions must be installed. How to do it... Similar to other modules, Warehouse Management also has its tables with the "WHS" prefix. We start the recipe by creating a query, which consists of WHSRFMenuTable and WHSRFMenuLine as the data source. We will provide a range of Menus in the query. After creating a query, we will create an SSRS report in Visual Studio and use that query as the data source and will generate the report on warehouse management. Open AOT, add a new query, and name it PKTWarehouseMobileDeviceMenuDetails. Add a WHSRFMenuTable table. Go to Fields and set the Dynamics property to Yes. Add a WHSRFMenuLine table and set the Relation property to Yes. This will create an auto relation that will inherit from table relation node. Go to Fields and set the Dynamics property to Yes. Now open Visual Studio and add a new Dynamics AX report model project. Name it PKTWarehouseMobileDeviceMenuDetails. Add a new report to this project and name it PKTWarehouseMobileDeviceDetails. Add a new dataset and name it MobileDeviceDetails. Select the PKTWarehouseMobileDeviceMenuDetails query in the Dataset property. Select all fields from both tables. Click on OK. Now drag and drop this dataset in the design node. It will automatically create an auto design. Rename it MobileMenuDetails. In the properties, set the layout property to ReportLayoutStyleTemplate. Now preview your report. How it works When we start creating an SSRS report, VS must be connected with Microsoft Dynamics AX R3. If the Microsoft Dynamics AX option is visible in Visual Studio while creating the new project, then the reporting extensions are installed. Otherwise, we need to install the reporting extensions properly. Summary This article helps you to walk through the basis of SSRS reports and create a simple report using queries. It will also help you understand the basic characteristics of reports. 

Packt
27 Mar 2015
17 min read

Puppet and OS Security Tools

In this article, Jason Slagle, author of the book Learning Puppet Security, covers using Puppet to manage SELinux and auditd. We have learned a lot so far about using Puppet to secure your systems, as well as how to use it to make groups of systems more secure. However, in all of that, we've not yet covered some of the basic OS-level functions that are available to secure a system. In this article, we'll review several of those functions.

SELinux is a powerful tool in the security arsenal. Most administrators' experience with it is along the lines of "how can I turn that off?" This is born out of frustration with the poor documentation about the tool, as well as the tedious nature of the configuration. While Puppet cannot help you with the documentation (which is getting better all the time), it can help you with some of the other challenges that SELinux can bring, that is, ensuring that the proper contexts and policies are in place on the systems being managed.

In this article, we'll cover the following topics related to OS-level security tools:

A brief introduction to SELinux and auditd
The built-in Puppet support for SELinux
Community modules for SELinux
Community modules for auditd

At the end of this article, you should have enough skills so that you no longer need to disable SELinux. However, if you still need to do so, it is certainly possible to do via the modules presented here.

Introducing SELinux and auditd

During the course of this article, we'll explore the SELinux framework for Linux and see how to automate it using Puppet. As part of the process, we'll also review auditd, the logging and auditing framework for Linux. Using Puppet, we can automate the configuration of these often-neglected security tools, and even move the configuration of these tools for various services to the modules that configure those services.
The SELinux framework

SELinux is a security system for Linux originally developed by the United States National Security Agency (NSA). It is an in-kernel protection mechanism designed to provide Mandatory Access Controls (MACs) to the Linux kernel.

SELinux isn't the only MAC framework for Linux. AppArmor is an alternative MAC framework included in the Linux kernel since Version 2.6.36. We chose to implement SELinux since it is the default framework used under Red Hat Linux, which we're using for our examples. More information on AppArmor can be found at http://wiki.apparmor.net/index.php/Main_Page.

These access controls work by confining processes to the minimal amount of file and network access that the processes require to run. By doing this, the controls limit the amount of collateral damage that can be done by a process that becomes compromised.

SELinux was first merged into the Linux mainline kernel for the 2.6.0 release. It was introduced into Red Hat Enterprise Linux with Version 4, and into Ubuntu in Version 8.04. With each successive release of the operating systems, support for SELinux grows, and it becomes easier to use.

SELinux has a couple of core concepts that we need to understand to properly configure it. The first are the concepts of types and contexts. A type in SELinux is a grouping of similar things. Files used by Apache may be httpd_sys_content_t, for instance, which is a type that all content served by HTTP would have. The httpd process itself is of type httpd_t. These types are applied to objects, which represent discrete things, such as files and ports, and become part of the context of that object. The context of an object represents the object's user, role, type, and optionally data on multilevel security. For this discussion, the type is the most important component of the context.
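To make the structure of a context concrete, here is a small shell sketch that splits a context string into its parts. The sample value is the kind of label that ps -Z prints for the sshd process; the variable names are made up for illustration.

```shell
#!/bin/sh
# Sample SELinux context (user:role:type:level), hard-coded for illustration.
context='system_u:system_r:sshd_t:s0-s0:c0.c1023'

# Split on ':'; the last variable receives the remainder, so the
# multilevel security range (which itself contains ':') stays intact.
IFS=: read -r se_user se_role se_type se_level <<EOF
$context
EOF

printf 'user=%s role=%s type=%s level=%s\n' \
  "$se_user" "$se_role" "$se_type" "$se_level"
```

The third field is the type, which is the part that most policy rules key on.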
Using a policy, we grant access from the subject, which represents a running process, to various objects that represent files, network ports, memory, and so on. We do that by creating a policy that allows a subject to have access to the types it requires to function. SELinux has three modes that it can operate in. The first of these modes is disabled. As the name implies, the disabled mode runs without any SELinux enforcement. The second mode is called permissive. In permissive mode, SELinux will log any access violations, but will not act on them. This is a good way to get an idea of where you need to modify your policy, or tune Booleans to get proper system operations. The final mode, enforcing, will deny actions that do not have a policy in place. Under Red Hat Linux variants, this is the default SELinux mode. By default, Red Hat 6 runs SELinux with a targeted policy in enforcing mode. This means, that for the targeted daemons, SELinux will enforce its policy by default. An example is in order here, to explain this well. So far, we've been operating with SELinux disabled on our hosts. The first step in experimenting with SELinux is to turn it on. We'll set it to permissive mode at first, while we gather some information. To do this, after starting our master VM, we'll need to modify the SELinux configuration and reboot. While it's possible to change from enforcing mode to either permissive or disabled mode without a reboot, going back requires us to reboot. Let's edit the /etc/sysconfig/selinux file and set the SELINUX variable to permissive on our puppetmaster. Remember to start the vagrant machine and SSH in as it is necessary. Once this is done, the file should look as follows: Once this is complete, we need to reboot. To do so, run the following command: sudo shutdown -r now Wait for the system to come back online. Once the machine is back up and you SSH back into it, run the getenforce command. 
It should return permissive, which means SELinux is running, but not enforced.

Now, we can make sure our master is running and take a look at its context. If it's not running, you can start the service with the sudo service puppetmaster start command.

Now, we'll use the -Z flag on the ps command to examine the SELinux context. Many commands, such as ps and ls, use the -Z flag to display SELinux data. We'll go ahead and run the following command to view the SELinux data for the running puppetmaster:

    ps -efZ|grep puppet

When you do this, you'll see output such as the following:

    unconfined_u:system_r:initrc_t:s0 puppet 1463     1 1 11:41 ? 00:00:29 /usr/bin/ruby /usr/bin/puppet master

If you take a look at the first part of the output line, you'll see that Puppet is running in the unconfined_u:system_r:initrc_t context. This is actually somewhat of a bug and a result of the Puppet policy on CentOS 6 being out of date. We should actually be running under the system_u:system_r:puppetmaster_t:s0 context, but the policy is for a much older version of Puppet, so it runs unconfined.

Let's take a look at the sshd process to see what it looks like also. To do so, we'll just grep for sshd instead:

    ps -efZ|grep sshd

The output is as follows:

    system_u:system_r:sshd_t:s0-s0:c0.c1023 root 1206 1 0 11:40 ? 00:00:00 /usr/sbin/sshd

This is the more traditional output one would expect. The sshd process is running under the system_u:system_r:sshd_t context. This corresponds to the system user, the system role, and the sshd type. The user and role are SELinux constructs that enable role-based access controls. SELinux users do not map to system users, but allow us to set a policy based on the SELinux user object. The unconfined user we saw previously is a user that is not subject to enforcement.

Now, we can take a look at some objects.
Doing a ls -lZ /etc/ssh command results in the following: As you can see, each of the files belongs to a context that includes the system user, as well as the object role. They are split among the etc type for configuration files and the sshd_key type for keys. The SSH policy allows the sshd process to read both of these file types. Other policies, say, for NTP, would potentially allow the ntpd process to read the etc types, but it would not be able to read the sshd_key files. This very fine-grained control is the power of SELinux. However, with great power comes very complex configuration. Configuration can be confusing to set up, if it doesn't happen correctly. For instance, with Puppet, the wrong type can potentially impact the system if not dealt with. Fortunately, in permissive mode, we will log data that we can use to assist us with this. This leads us into the second half of the system that we wish to discuss, which is auditd. In the meantime, there is a bunch of information on SELinux available on its website at http://selinuxproject.org/page/Main_Page. There's also a very funny, but informative, resource available describing SELinux at https://people.redhat.com/duffy/selinux/selinux-coloring-book_A4-Stapled.pdf. The auditd framework for audit logging SELinux does a great job at limiting access to system components; however, reporting what enforcement took place was not one of its objectives. Enter the auditd. The auditd is an auditing framework developed by Red Hat. It is a complete auditing system using rules to indicate what to audit. This can be used to log SELinux events, as well as much more. Under the hood, auditd has hooks into the kernel to watch system calls and other processes. Using the rules, you can configure logging for any of these events. For instance, you can create a rule that monitors writes to the /etc/passwd file. This would allow you to see if any users were added to the system. 
We can also add monitoring of files, such as lastlog and wtmp, to monitor login activity. We'll explore this example later when we configure auditd.

To quickly see how a rule works, we'll manually configure a quick rule that will log when the wtmp file is modified. This will add some system logging around users logging in. To do this, let's edit the /etc/audit/audit.rules file and add the following lines:

    -w /var/log/wtmp -p wa -k logins
    -w /etc/passwd -p wa -k password

Let's take a look at what the preceding lines do. Both lines start with the -w clause, which indicates the file that we are monitoring. Second, we have the -p clause, which sets the file operations we monitor. In this case, it is write and attribute-change operations. Finally, with the -k entries, we're setting a keyword that is logged and can be filtered on. These lines should go at the end of the file. Once that's done, reload auditd with the following command:

    sudo service auditd restart

Once this is complete, go ahead and open another SSH session, and then simply log back out. Once this is done, take a look at the /var/log/audit/audit.log file. You should see content like the following:

    type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
    type=SYSCALL msg=audit(1416795420.057:485): arch=c000003e syscall=2 success=yes exit=7 a0=7fa983c446aa a1=1 a2=2 a3=8 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"

There are tons of fields in this output, including the SELinux context, the user ID, and so on.
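Since each audit record is a series of name=value pairs, individual fields are easy to pull out with standard tools. The following shell sketch does this for one record, shortened from the sample above. The get_field helper is hypothetical and assumes that values contain no spaces, which holds for the fields used here but not for every audit record.

```shell
#!/bin/sh
# One audit record, shortened from the sample log; on a real system this
# would be read from /var/log/audit/audit.log.
record='type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 ppid=1206 pid=2202 auid=500 uid=0 gid=0 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"'

# get_field NAME RECORD: print the value of NAME=... from a record.
# Hypothetical helper: puts each pair on its own line, keeps the one whose
# name matches exactly, and strips surrounding double quotes.
get_field() {
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | tr -d '"'
}

auid=$(get_field auid "$record")
key=$(get_field key "$record")
exe=$(get_field exe "$record")
printf 'auid=%s key=%s exe=%s\n' "$auid" "$key" "$exe"
```

On a live system, the ausearch tool that ships with auditd can do this kind of filtering directly; for example, ausearch -k logins returns the records tagged with the logins key.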
Of interest is the auid, which is the audit user ID. On commands run via the sudo command, this will still contain the user ID of the user who called sudo. This is a great way to log commands performed via sudo. Auditd also logs SELinux failures. They get logged under the type AVC. These access vector cache logs will be placed in the auditd log file when a SELinux violation occurs. Much like SELinux, auditd is somewhat complicated. The intricacies of it are beyond the scope of this book. You can get more information at http://people.redhat.com/sgrubb/audit/. SELinux and Puppet Puppet has direct support for several features of SELinux. There are two native Puppet types for SELinux: selboolean and selmodule. These types support setting SELinux Booleans and installing SELinux policy modules. SELinux Booleans are variables that impact on how SELinux behaves. They are set to allow various functions to be permitted. For instance, you set a SELinux Boolean to true to allow the httpd process to access network ports. SELinux modules are groupings of policies. They allow policies to be loaded in a more granular way. The Puppet selmodule type allows Puppet to load these modules. The selboolean type The targeted SELinux policy that most distributions use is based on the SELinux reference policy. One of the features of this policy is the use of Boolean variables that control actions of the policy. There are over 200 of these Booleans on a Red Hat 6-based machine. We can investigate them by installing the policycoreutils-python package on the operating system. You can do this by executing the following command: sudo yum install policycoreutils-python Once installed, we can run the semanage boolean -l command to get a list of the Boolean values, along with their descriptions. The output of this will look as follows: As you can see, there exists a very large number of settings that can be reconfigured, simply by setting the appropriate Boolean value. 
The selboolean Puppet type supports managing these Boolean values. The type is fairly simple, accepting the following parameters:

Parameter: Description
name: This contains the name of the Boolean to be set. It defaults to the title.
persistent: This sets whether to write the value to disk for the next boot.
provider: This is the provider for the type. Usually, the default getsetsebool value is accepted.
value: This contains the value of the Boolean, on or off.

Usage of this type is rather simple. We'll show an example that sets the puppetmaster_use_db Boolean to on. If we were using the SELinux Puppet policy, this would allow the master to talk to a database. For our use, it's a simple unused variable that we can use for demonstration purposes. As a reminder, the SELinux policy for Puppet on CentOS 6 is outdated, so setting the Boolean does not impact the version of Puppet we're running. It does, however, serve to show how a Boolean is set. To do this, we'll create a sample role and profile for our puppetmaster. This is something that would likely exist in a production environment to manage the configuration of the master. In this example, we'll simply build a small profile and role for the master. Let's start with the profile. Copy over the profiles module we've slowly been building up, and let's add a puppetmaster.pp profile. To do so, edit the profiles/manifests/puppetmaster.pp file and make it look as follows:

    class profiles::puppetmaster {
      selboolean { 'puppetmaster_use_db':
        value      => on,
        persistent => true,
      }
    }

Then, we'll move on to the role. Copy the roles module, edit the roles/manifests/puppetmaster.pp file there, and make it look as follows:

    class roles::puppetmaster {
      include profiles::puppetmaster
    }

Once this is done, we can apply it to our host. Edit the /etc/puppet/manifests/site.pp file.
We'll apply the puppetmaster role to the puppetmaster machine, as follows:

    node 'puppet.book.local' {
      include roles::puppetmaster
    }

Now, we'll run Puppet and get the output as follows:

As you can see, it set the value to on when run. Using this method, we can set any of the SELinux Boolean values we need for our system to operate properly. More information on SELinux Booleans, including how to obtain a list of them, can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Booleans.html.

The selmodule type

The other native type inside Puppet is a type to manage SELinux modules. Modules are compiled collections of SELinux policy. They're loaded into the kernel using the semodule command. This Puppet type provides support for this mechanism. The available parameters are as follows:

Parameter: Description
name: This contains the name of the module. It defaults to the title.
ensure: This is the desired state, present or absent.
provider: This specifies the provider for the type. It should be semodule.
selmoduledir: This is the directory that contains the module to be installed.
selmodulepath: This provides the complete path to the module to be installed if it is not present in selmoduledir.
syncversion: This sets whether to resync the module if a new version is found, similar to ensure => latest.

Using Puppet, we can take our compiled module and serve it onto the system. We can then use the selmodule type to ensure that it gets installed on the system. This lets us centrally manage the module with Puppet. We'll see an example where this module compiles a policy and then installs it, so we won't show a specific example here. Instead, we'll move on to talk about the last SELinux-related component in Puppet.

File parameters for SELinux

The final built-in support for SELinux comes in the form of the file type.
The file type parameters are as follows:

Parameter: Description
selinux_ignore_defaults: By default, Puppet will use the matchpathcon function to set the context of a file. Setting this to true overrides that behavior.
selrange: This sets the SELinux range component. We've not really covered this. It's not used in most mainstream distributions at the time this book was written.
selrole: This sets the SELinux role on the file.
seltype: This sets the SELinux type on the file.
seluser: This sets the SELinux user on the file.

Usually, if you place files in the correct location (the expected location for a service) on the filesystem, Puppet will get the SELinux properties correct via its use of the matchpathcon function. This function (which also has a matching utility) applies a default context based on the policy settings. Setting the context manually is used in cases where you're storing data outside the normal location. For instance, you might be storing web data under the /opt directory. The preceding types and providers provide the basics that allow you to manage SELinux on a system. We'll now take a look at a couple of community modules that build on these types and create a more in-depth solution.

Summary

This article looked at what SELinux and auditd are, and gave a brief example of how they can be used. We looked at what they can do, and how they can be used to secure your systems. After this, we looked at the specific support for SELinux in Puppet. We looked at the two built-in types to support it, as well as the parameters on the file type. Then, we took a look at one of the several community modules for managing SELinux. Using this module, we can store the policies as text instead of compiled blobs.

Resources for Article:

Further resources on this subject:
The anatomy of a report processor [Article]
Module, Facts, Types and Reporting tools in Puppet [Article]
Designing Puppet Architectures [Article]
Packt
27 Mar 2015
31 min read

An Overview of Horizon View Architecture and its Components

In this article by Peter von Oven and Barry Coombs, authors of the book Mastering VMware Horizon 6, we will introduce you to the architecture and architectural components that make up the core VMware Horizon solution, concentrating on the virtual desktop elements of Horizon with Horizon View Standard. This article will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. We will discuss the role of each of the Horizon View components, explain how they fit into the overall infrastructure and the benefits they bring, and then take a deep dive into how Horizon View works.

Introducing the key Horizon components

To start with, we are going to introduce, at a high level, the main components that make up the Horizon View product. All of the VMware Horizon components described are included as part of the licensed product, and the features that are available to you depend on whether you have the View Standard Edition, the Advanced Edition, or the Enterprise Edition. Horizon licensing also includes ESXi and vCenter licensing to support the ability to deploy the core hosting infrastructure. You can deploy as many ESXi hosts and vCenter Servers as you require to host the desktop infrastructure. The key elements of Horizon View are outlined in the following diagram:

In the next section, we are going to start drilling down deeper into the architecture of how these high-level components fit together and how they work.

A high-level architectural overview

In this article, we will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. The Horizon View architecture is pretty straightforward to understand, as its foundations lie in the standard VMware vSphere products (ESXi and vCenter).
So, if you have the necessary skills and experience of working with this platform, then you are already halfway there. Horizon View builds on the vSphere infrastructure, taking advantage of some of the features of the ESXi hypervisor and vCenter Server. Horizon View requires adding a number of virtual machines to perform the various View roles and functions. An overview of the View architecture is shown in the following diagram:

View components run as applications that are installed on the Microsoft Windows Server operating system, so they could actually run on physical hardware as well. However, there are a great number of benefits available when you run them as virtual machines, such as delivering HA and DR, as well as the typical cost savings that can be achieved through virtualization. The following sections will cover each of these roles/components of the View architecture in greater detail.

The Horizon View Connection Server

The Horizon View Connection Server, sometimes referred to as Connection Broker or View Manager, is the central component of the View infrastructure. Its primary role is to connect a user to their virtual desktop by performing user authentication and then delivering the appropriate desktop resources based on the user's profile and user entitlement. When logging on to your virtual desktop, it is the Connection Server that you are communicating with.

How does the Connection Server work?

A user typically connects to their virtual desktop from their device by launching the View Client. Once the View Client has launched, the user enters the address details of the View Connection Server, which in turn responds by asking them to provide their network login details (their Active Directory (AD) domain username and password). It's worth noting that Horizon View now supports different AD function levels.
These are detailed in the following screenshot:

These credentials are authenticated with AD and, if successful, the user is able to continue the logon process. Depending on what they are entitled to, the user could see a launch screen that displays a number of different desktop shortcuts available for login. These desktops represent the desktop pools that the user has been entitled to use. A pool is basically a collection of virtual desktops; for example, it could be a pool for the marketing department where the desktops contain specific applications/software for that department. Once the user is authenticated, the View Manager makes a call to the vCenter Server to create a virtual desktop machine, and then vCenter makes a call to View Composer (if you are using linked clones) to start the build process of the virtual desktop if there is not one already available. Once built, the virtual desktop is displayed/delivered within the View Client window, using the chosen display protocol (PCoIP or RDP). The process is described in detail in the following diagram:

There are other ways to deploy VDI solutions that do not require a connection broker and allow a user to connect directly to a virtual desktop. In fact, there might be a specific use case for doing this, such as having a large number of branches, where having local infrastructure allows trading to continue in the event of a WAN outage or poor network communication with the branch. VMware has a solution for what's referred to as a "Brokerless View": the VMware Horizon View Agent Direct-Connection Plugin. However, don't forget that, in a Horizon View environment, the View Connection Server provides greater functionality and does much more than just connecting users to desktops. The Horizon View Connection Server runs as an application on a Windows Server, which in turn could be either a physical or a virtual machine.
Running as a virtual machine has many advantages; for example, it means that you can easily add high-availability features, which are key when you consider that you could potentially have hundreds of virtual user desktops running on a single host server. Along with managing the connections for the users, the Connection Server also works with vCenter Server to manage the virtual desktop machines. For example, when using linked clones and powering on virtual desktops, these tasks might be initiated by the Connection Server, but they are executed at the vCenter Server level.

Minimum requirements for the Connection Server

To install the View Connection Server, you need to meet the following minimum requirements to run on physical or virtual machines:

Hardware requirements: The following screenshot shows the hardware required:

Supported operating systems: The View Connection Server must be installed on one of the following operating systems:

The Horizon View Security Server

Horizon View Security Server is another instance and another version of the View Connection Server but, this time, it sits within your DMZ so that you can allow end users to securely connect to their virtual desktop machine from an external network or the Internet. You cannot install the View Security Server on the same machine that is already running as a View Connection Server or any of the other Horizon View components.

How does the Security Server work?

The user login process at the start is the same as when using a View Connection Server for internal access, but now we have added an extra security layer with the Security Server. The idea is that users can access their desktop externally without first needing a VPN connection to the network. The process is described in detail in the following diagram:

The Security Server is paired with a View Connection Server, and the pairing is configured by the use of a one-time password during installation.
It's a bit like pairing your phone's Bluetooth with the hands-free kit in your car. When the user logs in from the View Client, they access the View Connection Server, which in turn authenticates the user against AD. If the View Connection Server is configured as a PCoIP gateway, then it will pass the connection and addressing information to the View Client. This connection information will allow the View Client to connect to the View Security Server using PCoIP. This is shown in the diagram by the green arrow (1). The View Security Server will then forward the PCoIP connection to the virtual desktop machine (2), creating the connection for the user. The virtual desktop machine is displayed/delivered within the View Client window (3) using the chosen display protocol (PCoIP or RDP).

The Horizon View Replica Server

The Horizon View Replica Server, as the name suggests, is a replica or copy of a View Connection Server that is used to enable high availability for your Horizon View environment. Having a replica of your View Connection Server means that, if the Connection Server fails, users are still able to connect to their virtual desktop machines. If you are not using a load balancer, you will need to change the IP address or update the DNS record to point to the Replica Server.

How does the Replica Server work?

So, the first question is, what actually gets replicated? The View Connection Broker stores all its information relating to the end users, desktop pools, virtual desktop machines, and other View-related objects in an Active Directory Application Mode (ADAM) database. Then, using the Lightweight Directory Access Protocol (LDAP), in a method similar to the one AD uses for replication, this View information gets copied from the original View Connection Server to the Replica Server.
As both the Connection Server and the Replica Server are now identical to each other, if your Connection Server fails, then you essentially have a backup that steps in and takes over so that end users can still continue to connect to their virtual desktop machines. Just like with the other components, you cannot install the Replica Server role on the same machine that is running as a View Connection Server or any of the other Horizon View components.

Persistent or nonpersistent desktops

In this section, we are going to talk about the different types of desktop assignments that can be deployed with Horizon View; these could also potentially have an impact on storage requirements, and also on the way in which desktops are provisioned to the end users. One of the questions that always gets asked is whether to have a dedicated (persistent) or a floating (nonpersistent) desktop assignment. Desktops can either be individual virtual machines, which are dedicated to a user on a 1:1 basis (as we have in a physical desktop deployment, where each user effectively has their own desktop), or a user has a new, vanilla desktop that gets provisioned, personalized, and then assigned at the time of login and can be chosen at random from a pool of available desktops. This is the model that is used to build the user's desktop. The two options are described in more detail as follows:

• Persistent desktop: Users are allocated a desktop that retains all of their documents, applications, and settings between sessions. The desktop is statically assigned the first time that the user connects and is then used for all subsequent sessions. No other user is permitted access to the desktop.
• Nonpersistent desktop: Users might be connected to different desktops from the pool each time that they connect. Environmental or user data does not persist between sessions and is delivered as the user logs on to their desktop. The desktop is refreshed or reset when the user logs off.
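The difference between the two assignment models described above can be sketched in a few lines of Python. This is a toy model only, not the Horizon View API: the DesktopPool class, its methods, and the desktop names are all hypothetical, and the refresh that happens at logoff is reduced to returning the desktop to the free list.

```python
class DesktopPool:
    """Toy model of a desktop pool (hypothetical; not a Horizon View API)."""

    def __init__(self, desktops, persistent):
        self.free = list(desktops)    # desktops not currently assigned
        self.persistent = persistent  # True = dedicated, False = floating
        self.assigned = {}            # user -> desktop

    def connect(self, user):
        """Return the user's desktop, assigning a free one if needed."""
        if user in self.assigned:
            return self.assigned[user]
        if not self.free:
            raise RuntimeError("no desktops available in pool")
        desktop = self.free.pop(0)
        self.assigned[user] = desktop
        return desktop

    def logoff(self, user):
        """Dedicated desktops stay assigned; floating ones go back to the pool."""
        if self.persistent:
            return
        desktop = self.assigned.pop(user, None)
        if desktop is not None:
            self.free.append(desktop)  # stands in for the refresh/reset step
```

With a single-desktop floating pool, a second user can connect as soon as the first logs off, because the desktop returns to the pool. With a single-desktop dedicated pool, the second user is refused, because the desktop stays assigned to the first user between sessions.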
In most use cases, a nonpersistent configuration is the best option. The key reason is that, in this model, you don't need to build all the desktops upfront for each user. You only need to power on a virtual desktop as and when it's required. All users start with the same basic desktop, which then gets personalized before delivery. This helps with concurrency rates. For example, you might have 5,000 people in your organization, but only 2,000 ever log in at the same time; therefore, you only need to have 2,000 virtual desktops available. Otherwise, you would have to build a desktop for each one of the 5,000 users that might ever log in, resulting in more server infrastructure and certainly a lot more storage capacity. We will talk about storage in the next section. The other thing that we often see some confusion over is the difference between dedicated and floating desktops, and how linked clones fit in. Just to make it clear, linked clones and full clones are not what we are talking about when we refer to dedicated and floating desktops. Cloning operations refer to how a desktop is built, whereas the terms persistent and nonpersistent refer to how a desktop is assigned to a user. Dedicated and floating desktops are purely about user assignment and whether they have a dedicated desktop or one allocated from a pool on demand. Linked clones and full clones are features of Horizon View, which uses View Composer to create a desktop image for each user from a master or parent image. This means that, regardless of having a floating or dedicated desktop assignment, the virtual desktop machine could still be a linked or full clone. So, here's a summary of the benefits: It is operationally efficient. All users start from a single or smaller number of desktop images. Organizations reduce the amount of image and patch management. It is efficient storage-wise.
The amount of storage required to host the nonpersistent desktop images will be smaller than keeping separate instances of unique user desktops. In the next section, we are going to cover an in-depth overview of Horizon View Composer and linked clones, and the advantages the technology delivers.

Horizon View Composer and linked clones

One of the main reasons a virtual desktop project fails to deliver, or doesn't even get out of the starting blocks, is the heavy infrastructure cost driven by storage requirements. The storage requirements are often seen as a huge cost burden, which can be attributed to the fact that people are approaching this in the same way they would approach a physical desktop environment's requirements. This would mean that each user gets their own dedicated virtual desktop and the hard disk space that comes with it, albeit a virtual disk; this then gets scaled out for the entire user population, so each user is allocated a virtual desktop with some storage. Let's take an example. If you had 1,000 users and allocated 250 GB per user's desktop, you would need 1,000 * 250 GB = 250 TB for the virtual desktop environment. That's a lot of storage just for desktops, and the cost to deploy this amount of storage in the data center could render the project cost-ineffective compared to physical desktop deployments. A new approach to deploying storage for a virtual desktop environment is needed, and this is where linked clone technology comes into play. In a nutshell, linked clones are designed to reduce the amount of disk space required, and to simplify the deployment and management of images to multiple virtual desktop machines, making it a centralized and much easier process.

Linked clone technology

Starting at a high level, a clone is a copy of an existing or parent virtual machine.
This parent virtual machine (VM) is typically your gold build from which you want to create new virtual desktop machines. When a clone is created, it becomes a separate, new virtual desktop machine with its own unique identity. This process is not unique to Horizon View; it's actually a function of vSphere and vCenter, and in the case of Horizon View, we add in another component, View Composer, to manage the desktop images. There are two types of clones that we can deploy, a full clone or a linked clone. We will explain the difference in the next sections.

Full clones

As the name implies, a full clone disk is an exact, full-sized copy of the parent machine. Once the clone has been created, the virtual desktop machine is unique, with its own identity, and has no links back to the parent virtual machine from which it was cloned. It can operate as a fully independent virtual desktop in its own right and is not reliant on its parent virtual machine. However, as it is a full-sized copy, be aware that it will take up the same amount of storage as its parent virtual machine, which leads back to our discussion earlier in this article about storage capacity requirements. Using a full clone will require larger amounts of storage capacity and will possibly lead to higher infrastructure costs. Before you completely dismiss the idea of using full clone virtual desktop machines, there are some use cases that rely on this model. For example, if you use VMware Mirage to deliver a base layer or application layer, it only works today with full clone, dedicated Horizon View virtual desktop machines. If you have software developers, then they probably need to install specialist tools and test code onto a desktop, and therefore need to "own" their desktop. Or perhaps the applications that you run in your environment need a dedicated desktop due to the way the applications are licensed.
Linked clones Having now discussed full clones, we are going to talk about deploying virtual desktop machines with linked clones. In a linked clone deployment, a delta disk is created and then used by the virtual desktop machine to store the data differences between its own operating system and the operating system of its parent virtual desktop machine. Unlike the full clone method, the linked clone is not a full copy of the virtual disk. The term linked clone refers to the fact that the linked clone will always look to its parent in order to operate, as it continues to read from the replica disk. Basically, the replica is a copy of a snapshot of the parent virtual desktop machine. The linked clone itself could potentially grow to the same size as the replica disk if you allow it to. However, you can set limits on how big it can grow, and should it start to get too big, then you can refresh the virtual desktops that are linked to it. This essentially starts the cloning process again from the initial snapshot. Immediately after a linked clone virtual desktop is deployed, the difference between the parent virtual machine and the newly created virtual desktop machine is extremely small and therefore reduces the storage capacity requirements compared to that of a full clone. This is how linked clones are more space-efficient than their full clone brothers. The underlying technology behind linked clones is more like a snapshot than a clone, but with one key difference: View Composer. With View Composer, you can have more than one active snapshot linked to the parent virtual machine disk. This allows you to create multiple virtual desktop images from just one parent. Best practice would be to deploy an environment with linked clones so as to reduce the storage requirements. However, as we previously mentioned, there are some use cases where you will need to use full clones. 
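To make the capacity difference between the two clone types concrete, here is a rough sizing sketch in Python. It reuses the earlier example of 1,000 users with a 250 GB desktop disk; the 0.5 GB delta disk per desktop and the single replica are assumptions chosen purely for illustration, and both function names are hypothetical. Real delta disks grow over time until the desktop is refreshed.

```python
def full_clone_storage_gb(users, disk_gb):
    """Every full clone is a full-sized copy of the parent disk."""
    return users * disk_gb


def linked_clone_storage_gb(users, replica_gb, delta_gb, replicas=1):
    """Linked clones share the read-only replica(s); each adds only a delta disk."""
    return replicas * replica_gb + users * delta_gb


USERS = 1000
DISK_GB = 250    # per-desktop disk size from the earlier example
DELTA_GB = 0.5   # assumed average delta disk size (illustrative only)

full_gb = full_clone_storage_gb(USERS, DISK_GB)
linked_gb = linked_clone_storage_gb(USERS, DISK_GB, DELTA_GB)
```

Under these assumptions, full clones need 250,000 GB (250 TB), while linked clones need one 250 GB replica plus 500 GB of delta disks, 750 GB in total. The exact saving depends heavily on how large the delta disks are allowed to grow and on how many replicas the pool design requires.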
One thing to be aware of, which still relates to storage, is that rather than capacity, we are now talking about performance. All linked clone virtual desktops are going to be reading from one replica and will therefore drive a high number of Input/Output Operations Per Second (IOPS) on the storage where the replica lives. Depending on your desktop pool design, you are fairly likely to have more than one replica, as you would typically have more than one data store. This in turn depends on the number of users, which will drive the design of the solution. In Horizon View, you are able to choose the location where the replica lives. One of the recommendations is that the replica should sit on fast storage such as a local SSD. Alternative solutions would be to deploy some form of storage acceleration technology to drive the IOPS. Horizon View also has its own integrated solution called View Storage Accelerator (VSA) or Content Based Read Cache (CBRC). This feature allows you to allocate up to 2 GB of memory from the underlying ESXi host server that can be used as a cache for the most commonly read blocks. As we are talking about booting up desktop operating systems, the same blocks are required by every desktop; as these can be retrieved from memory, the process is accelerated. Another solution is View Composer Array Integration (VCAI), which allows the process of building linked clones to be offloaded to the storage array and its native snapshot mechanism rather than taking CPU cycles from the host server. There are also a number of third-party solutions that address the storage performance bottleneck, such as Atlantis Computing and their ILIO product, Nutanix, Nimble, and Tintri, to name a few. In the next section, we will take a deeper look at how linked clones work.

How do linked clones work?
The first step is to create your master virtual desktop machine image, which should contain not only the operating system, core applications, and settings, but also the Horizon View Agent components. This virtual desktop machine will become your parent VM or your gold image. This image can now be used as a template to create any new subsequent virtual desktop machines. The gold image or parent image cannot be a VM template. An overview of the linked clone process is shown in the following diagram:   Once you have created the parent virtual desktop or gold image (1), you then take a snapshot (2). When you create your desktop pool, this snapshot is selected and will become the replica (3) and will be set to be read-only. Each virtual desktop is linked back to this replica; hence the term linked clone. When you start creating your virtual desktops, you create linked clones that are unique copies for each user. Try not to create too many snapshots for your parent VM. I would recommend having just a handful, otherwise this could impact the performance of your desktops and make it a little harder to know which snapshot is which. What does View Composer build? During the image building process, and once the replica disk has been created, View Composer creates a number of other virtual disks, including the linked clone (operating system disk) itself. These are described in the following sections. Linked clone disk Not wanting to state the obvious, the main disk that gets created is the linked clone disk itself. This linked clone disk is basically an empty virtual disk container that is attached to the virtual desktop machine as the user logs in and the desktop starts up. This disk will start off small in size, but will grow over time, depending on the block changes that are requested from the replica disk by the virtual desktop machine's operating system. 
These block changes are stored in the linked clone disk, and this disk is sometimes referred to as the delta disk, or differential disk, due to the fact that it stores all the delta changes that the desktop operating system requests from the parent VM. As mentioned before, the linked clone disk can grow to the maximum size, equal to the parent VM but, following best practice, you would never let this happen. Typically, you can expect the linked clone disk to only increase to a few hundred MBs. We will cover this in the Linked clone processes section later. The replica disk is set as read-only and is used as the primary disk. Any writes and/or block changes that are requested by the virtual desktop are written/read directly from the linked clone disk. It is a recommended best practice to allocate tier-1 storage, such as local SSD drives, to host the replica, as all virtual desktops in the cluster will be referencing this single read-only VMDK file as their base image. Keeping it high in the stack improves performance, by reducing the overall storage IOPS required in a VDI workload. As we mentioned at the start of this section, storage costs are seen as being expensive for VDI. Linked clones reduce the burden of storage capacity but they do drive the requirement to derive a huge amount of IOPS from a single LUN. Persistent disk or user data disk The persistent disk feature of View Composer allows you to configure a separate disk that contains just the user data and user settings, and not the operating system. This allows any user data to be preserved when you update or make changes to the operating system disk, such as a recompose action. It's worth noting that the persistent disk is referenced by the VM name and not username, so bear this in mind if you want to attach the disk to another VM. This disk is also used to store the user's profile. 
With this in mind, you need to size it accordingly, ensuring that it is large enough to store any user profile type data such as Virtual Desktop Assessments.

Disposable disk

With the disposable disk option, Horizon View creates what is effectively a temporary disk that gets deleted every time the user powers off their virtual desktop machine. If you think about how the Windows desktop operating system operates and the files it creates, there are several files that are used on a temporary basis. Temporary Internet files and the Windows pagefile are two such examples. As these are only temporary files, why would you want to keep them? With Horizon View, these types of files are redirected to the disposable disk and then deleted when the VM is powered off. Horizon View provides the option to have a disposable disk for each virtual desktop. These are files that you don't want to store on the main operating system disk, as they would consume unnecessary disk space. For example, files on the disposable disk are things such as the pagefile, Windows system temporary files, and VMware log files. Note that here we are talking about temporary system files and not user files. A user's temporary files are still stored on the user data disk so that they can be preserved. Many applications use the Windows temp folder to store installation CAB files, which can be referenced post-installation. Having said that, you might want to delete the temporary user data to reduce the desktop image size, in which case you could ensure that the user's temporary files are directed to the disposable disk.

Internal disk

Finally, we have the internal disk.
The internal disk is used to store important configuration information, such as the computer account password, that would be needed to join the virtual desktop machine back to the domain if you refreshed the linked clones. It is also used to store Sysprep and QuickPrep configuration details. In terms of disk space, the internal disk is relatively small, averaging around 20 MB. By default, the user will not see this disk in Windows Explorer, as it contains important configuration information that you wouldn't want them to delete.

Understanding the linked clone process

There are several complex steps performed by View Composer and View Manager when a user launches a virtual desktop session. So, what's the process to build a linked clone desktop, and what goes on behind the scenes? When a user logs into Horizon View and requests a desktop, View Manager, using vCenter and View Composer, will create a virtual desktop machine. This process is described in the following sections.

Creating and provisioning a new desktop

An entry for the virtual desktop machine is created in the Active Directory Application Mode (ADAM) database before it is put into provisioning mode:
1. The linked clone virtual desktop machine is created by View Composer.
2. A machine account is created in AD with a randomly generated password.
3. View Composer checks for a replica disk and creates one if one does not already exist.
4. A linked clone is created by the vCenter Server API call from View Composer.
5. An internal disk is created to store the configuration information and machine account password.

Customizing the desktop

Now that you have a newly created, linked clone virtual desktop machine, the next phase is to customize it. The customization steps are as follows:
1. The virtual desktop machine is switched to customization mode.
2. The virtual desktop machine is customized by vCenter Server using the customizeVM_Task command and is joined to the domain with the information you entered in the View Manager console.
3. The linked clone virtual desktop is powered on.
4. The View Composer Agent on the linked clone virtual desktop machine starts up for the first time and joins the machine to the domain, using the NetJoinDomain command and the machine account password that was created on the internal disk.
5. The linked clone virtual desktop machine is now Sysprep'd. Once complete, View Composer tells View Agent that customization has finished, and View Agent tells View Manager that the customization process has finished.
6. The linked clone virtual desktop machine is powered off and a snapshot is taken.
7. The linked clone virtual desktop machine is marked as provisioned and is now available for use.

When a linked clone virtual desktop machine is powered on with the View Composer Agent running, the agent tracks any changes that are made to the machine account password. In many AD environments, the machine account password is changed periodically. If the View Composer Agent detects a password change, it updates the machine account password on the internal disk that was created with the linked clone. This is important because, during a refresh operation, the linked clone virtual desktop machine is reverted to the snapshot taken after customization. Because the current password is held on the internal disk, the agent is able to reset the machine account password to the latest one.

The linked clone process is depicted in the following diagram:

Additional features and functions of linked clones

There are a number of other management functions that you can perform on a linked clone disk from View Composer; these are outlined in this section and are needed in order to deliver the ongoing management of the virtual desktop machines.
Recomposing a linked clone Recomposing a linked clone virtual desktop machine or desktop pool allows you to perform updates to the operating system disk, such as updating the image with the latest patches, or software updates. You can only perform updates on the same version of an operating system, so you cannot use the recompose feature to migrate from one operating system to another, such as going from Windows XP to Windows 7. As we covered in the What does View Composer Build? section, we have separate disks for items such as user's data. These disks are not affected during a recompose operation, so all user-specific data on them is preserved. When you initiate the recompose operation, View Composer essentially starts the linked clone building process over again; thus, a new operating system disk is created, which then gets customized and a snapshot, such as the ones shown in the preceding sections, is taken. During the recompose operation, the MAC addresses of the network interface and the Windows SID are not preserved. There are some management tools and security-type solutions that might not work due to this change. However, the UUID will remain the same. The recompose process is described in the following steps: View Manager puts the linked clone into maintenance mode. View Manager calls the View Composer resync API for the linked clones being recomposed, directing View Composer to use the new base image and the snapshot. If there isn't a replica for the base image and snapshot yet, in the target datastore for the linked clone, View Composer creates the replica in the target datastore (unless a separate datastore is being used for replicas, in which case a replica is created in the replica datastore). View Composer destroys the current OS disk for the linked clone and creates a new OS disk linked to the new replica. The rest of the recompose cycle is identical to the customization phase of the provisioning and customization cycles. 
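As a rough illustration of what a recompose preserves and what it replaces, the following Python sketch models a linked clone as a small record. It is not View Composer code: the LinkedClone class, its fields, and the recompose function are hypothetical names invented for this example, and setting the new OS disk delta to zero is a simplification.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LinkedClone:
    """Hypothetical model of a linked clone; not a View Composer type."""
    name: str
    replica: str                 # replica the OS disk is linked to
    os_disk_delta_mb: int        # delta accumulated on the OS disk
    user_files: Tuple[str, ...]  # contents of the persistent (user data) disk


def recompose(clone: LinkedClone, new_replica: str) -> LinkedClone:
    """Return the clone as it looks after a recompose.

    The old OS disk is destroyed and a new one is linked to the new replica
    (modeled here as a delta of 0 MB). The persistent disk is not touched,
    so user_files carries over unchanged.
    """
    return replace(clone, replica=new_replica, os_disk_delta_mb=0)


before = LinkedClone("desktop-01", "replica-v1", 850, ("report.docx", "profile.dat"))
after = recompose(before, "replica-v2")
print(after)
```

The point of the sketch is only the shape of the operation: the replica reference and OS disk change, while the user data disk is carried across.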
The following diagram shows a graphical representation of the recompose process. Before the process begins, the first thing you need to do is update your Gold Image (1) with the patch updates or new applications you want to deploy to the virtual desktops. As described in the preceding steps, the snapshot is then taken (2) to create the new replica, Replica V2 (3). The existing OS disk is destroyed, but the User Data disk (4) is maintained during the recompose process:

Refreshing a linked clone

By carrying out a refresh of the linked clone virtual desktop, you are effectively reverting it to its initial state, when its original snapshot was taken after it had completed the customization phase. This process only applies to the operating system disk and no other disks are affected. An example use case for refresh operations would be refreshing a nonpersistent desktop two hours after logoff, to return it to its original state and make it available for the next user.

The refresh process performs the following tasks:
1. The linked clone virtual desktop is switched into maintenance mode.
2. View Manager reverts the linked clone virtual desktop to the snapshot taken after customization was completed (vdm-initial-checkpoint).
3. The linked clone virtual desktop starts up, and View Composer Agent detects whether the machine account password needs to be updated. If the password on the internal disk is newer than the one in the registry, the agent will update the machine account password using the one on the internal disk.

One of the reasons why you would perform a refresh operation is if the linked clone OS disk starts to become bloated. As we previously discussed, the OS-linked clone disk could grow to the full size of its parent image. This means it would be taking up more disk space than is really necessary, which defeats the objective of linked clones. The refresh operation effectively resets the linked clone to a small delta between it and its parent image.
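The password check performed after a refresh can be sketched in a few lines of Python. This is only an illustration of the comparison described above, not the View Composer Agent's implementation: the MachinePassword class and the password_after_refresh function are hypothetical names, and the timestamps are made-up sample values.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MachinePassword:
    """Hypothetical record of a machine account password and when it was set."""
    value: str
    changed_at: datetime


def password_after_refresh(registry: MachinePassword,
                           internal_disk: MachinePassword) -> MachinePassword:
    """Pick the password to use once the OS disk has been reverted.

    After the revert, the registry holds the password from the snapshot.
    If the copy on the internal disk is newer, it wins.
    """
    if internal_disk.changed_at > registry.changed_at:
        return internal_disk
    return registry


# Sample values: the password was rotated after the snapshot was taken.
snapshot_pw = MachinePassword("pw-at-customization", datetime(2015, 1, 1))
rotated_pw = MachinePassword("pw-after-rotation", datetime(2015, 2, 1))

chosen = password_after_refresh(registry=snapshot_pw, internal_disk=rotated_pw)
print(chosen.value)
```

Because the internal disk is not reverted by the refresh, it is the one place where the most recent password is guaranteed to survive.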
The following diagram shows a representation of the refresh operation:   The linked clone on the left-hand side of the diagram (1) has started to grow in size. Refreshing reverts it back to the snapshot as if it was a new virtual desktop, as shown on the right-hand side of the diagram (2). Rebalancing operations with View Composer The rebalance operation in View Composer is used to evenly distribute the linked clone virtual desktop machines across multiple datastores in your environment. You would perform this task in the event that one of your datastores was becoming full while others have ample free space. It might also help with the performance of that particular datastore. For example, if you had 10 virtual desktop machines in one datastore and only two in another, then running a rebalance operation would potentially even this out and leave you with six virtual desktop machines per datastore. You must use the View Administrator console to initiate the rebalance operation in View Composer. If you simply try to vMotion any of your virtual desktop machines, then View Composer will not be able to keep track of them. On the other hand, if you have six virtual desktop machines on one datastore and seven on another, then it is highly likely that initiating a rebalance operation will have no effect, and no virtual desktop machines will be moved, as doing so has no benefit. A virtual desktop machine will only be moved to another datastore if the target datastore has significantly more spare capacity than the source. The rebalance process is described in the following steps: The linked clone is switched to maintenance mode. Virtual machines to be moved are identified based on the free space in the available datastores. The operating system disk and persistent disk are disconnected from the virtual desktop machine. The detached operating system disk and persistent disk are moved to the target datastore. The virtual desktop machine is moved to the target datastore. 
The operating system disk and persistent disk are reconnected to the linked clone virtual desktop machine. View Composer resynchronizes the linked clone virtual desktop machines. View Composer checks for the replica disk in the datastore and creates one if one does not already exist as per the provisioning steps covered earlier in this article. As per the recompose operation, the operating system disk for the linked clone gets deleted and a new one is created and then customized. The following diagram shows the rebalance operation:   Summary In this article, we discussed the Horizon View architecture and the different components that make up the complete solution. We covered the key technologies, such as how linked clones work to optimize storage. Resources for Article: Further resources on this subject: Importance of Windows RDS in Horizon View [article] Backups in the VMware View Infrastructure [article] Design, Install, and Configure [article]

Packt
27 Mar 2015
21 min read

System Center Reporting

This article by the lead author Samuel Erskine, along with the co-authors Dieter Gasser, Kurt Van Hoecke, and Nasira Ismail, of the book Microsoft System Center Reporting Cookbook, discusses the drivers of organizational reporting and the general requirements for planning business-valued reports, the steps for planning the inputs your report data sources depend on, how you plan to view a report, the components of the System Center product, and preparing your environment for self-service Business Intelligence (BI).

A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. In this article, we will cover the following recipes:
• Understanding the goals of reporting
• Planning and optimizing dependent data inputs
• Planning report outputs
• Understanding the reporting schemas of System Center components
• Configuring Microsoft Excel for System Center data analysis

Understanding the goals of reporting

This recipe discusses the drivers of organizational reporting and the general requirements for planning business-valued reports.

Getting ready

To prepare for this recipe you need to be ready to make a difference with all the rich data available to you in the area of reporting. This may require a mindset change; be prepared.

How to do it...

The key to successfully identifying what needs to be reported is a clear understanding of what you or the report requestor is trying to measure and why. Reporting is driven by a number of organizational needs, which may fall into one or more of these sample categories:
• Information to support a business case
• Audit and compliance driven request
• Budget planning and forecasting
• Current operational service level

These categories are examples of the business needs which you must understand. Understanding the business needs of the report increases the value of the report.
For example, let us expand on and map the preceding business scenarios to the System Center Product using the following table:

Business/organizational objective | Objective details | System Center Product
Information to support a business case | Provide a count of computers out of warranty to justify the request to buy additional computers. | System Center Configuration Manager
Audit and compliance driven request | Provide the security compliance state of all Windows servers. Provide a list of attempted security breaches by month. | System Center Configuration Manager, System Center Operations Manager
Budget planning and forecasting | How much storage should we plan to invest in next year's budget based on the last 3 years' usage data? | System Center Operations Manager
Operational Service Level | How many incidents were resolved without second tier escalation? | System Center Service Manager

In a majority of cases for System Center administrators, the requestor does not provide the business objective. Use the preceding table as an example to guide your understanding of a report request.

How it works...

Reporting is a continual life cycle that begins with a request for information and should ultimately satisfy a real need. The typical life cycle of a request is illustrated in the following figure:

The life cycle stages are:
• Report conception
• Report request
• Report creation
• Report enhancement/retirement

The recipe focuses on the report conception stage. This stage is the most important stage of the life cycle. This is due to the fact that a report with a clear business objective will deliver the following:
• Focused activities: A report that does have a clear objective will reduce the risk of wasted effort usually associated with unclear requirements.
• Direct or indirect business benefit: The reports you create, for example using System Center data, ultimately should benefit the business.
An additional benefit to this stage of report planning is knowing when a report is no longer required. This would reduce the need to manage and support a report that has no value or use.

Planning and optimizing dependent data inputs

A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. This recipe discusses and provides steps for planning for the inputs your report data source(s) depends on.

Getting ready

Review the Understanding the goals of reporting recipe as a primer for this recipe.

How to do it...

The inputs of reports depend on the type of output you intend to produce and the definition of the accepted fields in the data source. An example is a report that would provide a total count of computers in a System Center Configuration Manager environment. This report will require an input field which stores a numeric value for computers in the database.

Here are the recommended steps you must take to prepare and optimize the data inputs for a report:
1. Identify the data source or sources.
2. Document the source data type properties.
3. Document the process used to populate the data sources (manual or automated process).
4. Agree the authoritative source if there is more than one source for the same data.
5. Identify and document the relationship between sources.
6. Document steps 1 to 5.

The following table provides a practical example of the steps for a report on the total count of computers by the Windows operating system. Workgroup computers and computers not in the Active Directory domain are out of scope of this report request.

Report input type | Details | Notes
Data source | Asset Database | Populated manually by the purchase order team
Data source | Active Directory | Automatically populated. Orchestrator runbook performs a scheduled clean-up of disabled objects
Data source | System Center Configuration Manager | Requires an agent and currently not used to manage servers
Authoritative source | Active Directory | Based on the report scope
Data source relationship | Microsoft System Center Configuration Manager is configured to discover all systems in the Active Directory domain | Alternative source for the report using the All Systems collection

Plan to document the specific fields you need from the authoritative data source. For example, use a table similar to the following:

Required data | Description
Computer name | The fully qualified domain name of the computer
Operating system | Friendly operating system name
Operating system environment | Server or workstation
Date created in data source | Date the computer joined the domain
Last logon date | Date the computer last updated the attributes in Active Directory

The steps provided walk through an example of identifying input sources and the fields you plan to use in a requested report.

Optimizing Report Inputs

Once the required data for your reports has been identified and documented, you must test it for validity and consistency. Data sources which are populated by automated processes tend to be less prone to consistency errors. Conversely, data sources based on manual entry are prone to errors (for example, misspellings when typing text into forms used to populate the data source). Here are typical recommended practices for improving consistency in manual and automated system populated data sources:

Automated (for example, agent based):
• Implement agent health check and remediation.
• Include last agent update information in reports.

Manual entry:
• Avoid free text fields, except description or notes.
• Use a list picker.
• Implement mandatory constraints on required fields (for example, a request for an e-mail address should only accept the right format for e-mail addresses).

How it works...
The reports you create and manage are only as accurate as the original data source. There may be one or more sources available for a report. The process discussed in this recipe provides steps on how to narrow down the list of requirements. The list must include the data source and the specific data fields which contain the data for the proposed report(s). These input fields are populated by manual processes, automated processes, or a combination of both. The final part of the recipe discussed an example of how to optimize the inputs you select. These steps will assist in answering one of the typical questions often raised about reports: "Can we trust this information?" The answer, if you have performed these steps, will be "Yes, and this is why and how."

Planning report outputs

The preceding recipe, Planning and optimizing dependent data inputs, discussed what you need for a report. This recipe builds on the preceding recipes with a focus on how you plan to view a report (output).

Getting ready

Plan to review the Understanding the goals of reporting and Planning and optimizing dependent data inputs recipes.

How to do it...

The type of report output depends on the input you query from the target data source(s). Typically, the output type is defined by the requestor of the report and may be in one or more of these formats:
• List of items (tables)
• Charts (2D, 3D, and formats supported by the reporting program)
• Geographic representation
• Dials and gauges
• A combination of all the listed formats

Here is an example of the steps you must perform to plan and agree the reporting output(s):
1. Request the target format from the initiator of the report.
2. Check the data source supports the requested output.
3. Create a sample dataset from the source.
4. Create a sample output in the requestor's format(s).
5. Agree a final format or combination of formats with the requestor.
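The sample dataset and sample output steps can be mocked up before any real data source is touched. The following Python sketch is one possible way to do that, under stated assumptions: the CSV column names, the computer names, and both helper functions are invented for illustration and do not come from any System Center schema.

```python
import csv
import io
from collections import Counter

# Hand-made sample dataset; every name and value here is an assumption.
SAMPLE_CSV = """computer_name,operating_system,environment
pc-001.example.local,Windows 7 Enterprise,workstation
pc-002.example.local,Windows 8.1 Pro,workstation
srv-001.example.local,Windows Server 2012 R2,server
pc-003.example.local,Windows 7 Enterprise,workstation
"""


def count_by_operating_system(csv_text: str) -> Counter:
    """Count computers per operating system in the sample dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["operating_system"] for row in reader)


def render_table(counts: Counter) -> str:
    """Render the counts as a simple list-of-items (table) output."""
    lines = ["Operating system | Computers"]
    for name, total in sorted(counts.items()):
        lines.append(f"{name} | {total}")
    return "\n".join(lines)


counts = count_by_operating_system(SAMPLE_CSV)
print(render_table(counts))
```

A mock-up like this gives the requestor something concrete to agree on, and it exposes missing fields early without querying the production source.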
The steps to plan the output of reports are illustrated in the following figure:

These are the basic minimal steps you must perform to plan for outputs.

How it works...

The steps in this recipe are focused on scoping the output of the report. The scope provides you with the following:
• Ensuring the output is defined before working on a large set of data
• Validating that the data source can support the requested output
• Avoiding scope creep, because the output is agreed and signed off

The objective is to ensure that the request can be satisfied based on what is available and not what is desired. The process also provides an additional benefit of identifying any gaps in data before embarking on the actual report creation.

There's more...

When planning report outputs, you may not always have access to the actual source data. The recommended practice is not to work directly with the original source, even if this is possible, to avoid negatively impacting the source during the planning stage. In either case, there are other options available to you. One of these options is using a spreadsheet program such as Microsoft Excel.

Mock up using Excel

An approach to testing and validating report outputs is the use of Microsoft Excel. You can create a representation of the input source data including the data type (numbers, text, and formula). The data can either be a sample you create yourself or an extract from the original source of the data. The added benefit is that the spreadsheet can serve as a part of the portfolio of documentation for the report.

Understanding the reporting schemas of System Center components

The reporting schema of the System Center product is specific to each component. The components of the System Center product are listed in the following table:

System Center component | Description
Configuration Manager | This is configuration life cycle management. It is primarily targeted at client management; however, this is not a technical limitation, and it is also used to manage servers. This component provides configuration management capabilities, which include but are not limited to deploying operating systems, performing hardware and software inventory, and performing application life cycle management.
Data Protection Manager | This component delivers the capabilities to provide continual protection (backup and recovery) services for servers and clients.
Orchestrator | This is the automation component of the product. It is a platform to connect the different vendor products in a heterogeneous environment in order to provide task automation and business-process automation.
Operations Manager | This component provides data center and client monitoring. Monitoring and remediation are performed at the component and deep application levels.
Service Manager | This provides IT service management capabilities. The capabilities are aligned with the Information Technology Infrastructure Library (ITIL) and the Microsoft Operations Framework (MOF).
Virtual Machine Manager | This is the component to manage virtualization. The capabilities span the management of private, public, and hybrid clouds.

This recipe discusses the reporting capabilities of each of these components.

Getting ready

You must have a fully deployed configuration of one or more of the System Center product components. Your deployment must include the reporting option provided for the specific component.

How to do it...

The reporting capability for all the System Center components is rooted in their use of Microsoft SQL databases. The reporting databases for each of the components are listed in the following table:

System Center component | Default installation reporting database | Additional information
Configuration Manager | CM_<Site Code> | There is one database for each Configuration Manager site.
Data Protection Manager | DPMDB_<DPM Server Name> | This is the default database for the DPM server. Additional information is written to the Operations Manager database if this optional integration is configured.
Orchestrator | Orchestrator | This is the default name when you install Orchestrator.
Operations Manager | OperationsManagerDW | You must install the reporting components to create and populate this database.
Service Manager | DWDataMart | This is the default reporting database. You have the option to configure two additional databases known as OMDataMart and CMDataMart. Additionally, SQL Analysis Services creates a database called DWASDataBase that uses DWDataMart as a source.
Virtual Machine Manager | VirtualManagerDB | This is the default database for the VMM server. Additional information is written to the Operations Manager database if this optional integration is configured.

Use the steps in the following sections to view the schema of the reporting database of each of the System Center components.

Configuration Manager

Use the following steps:
1. Identify the database server and instance of the Configuration Manager site.
2. Use Microsoft SQL Server Management Studio (MSSMS) to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Configuration Manager database.
4. Navigate to Databases | CM_<site code> | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot.

Data Protection Manager

Use the following steps:
1. Identify the database server and SQL instance of the Data Protection Manager environment.
2. Use MSSMS to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Data Protection Manager database.
4. Navigate to Databases | DPMDB_<Server Name> | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Data Protection Manager component. Note that not all the views are shown in the screenshot.

Orchestrator

Use the following steps:
1. Identify the database server and instance of the Orchestrator server.
2. Use MSSMS to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Orchestrator database.
4. Navigate to Databases | Orchestrator | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Orchestrator component.

Operations Manager

Use the following steps:
1. Identify the database server and instance of the Operations Manager management group.
2. Use MSSMS to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Operations Manager data warehouse reporting database.
4. Navigate to Databases | OperationsManagerDW | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Operations Manager component. Note that not all the views are listed in the screenshot.

Service Manager

Use the following steps:
1. Identify the database server and instance of the Service Manager data warehouse management group.
2. Use MSSMS to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Service Manager data warehouse database.
4. Navigate to Databases | DWDataMart | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Service Manager component. Note that not all the views are listed in the screenshot.

Virtual Machine Manager

Perform the following steps:
1. Identify the database server and instance of the Virtual Machine Manager server.
2. Use MSSMS to connect to the database server.
3. You must connect with a user account with the appropriate permission to view the Virtual Machine Manager database.
4. Go to Databases | VirtualManagerDB | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Virtual Machine Manager component. Note that not all the views are listed in the screenshot.

How it works...

The procedure provided is a simplified approach to gain a baseline of what may seem rather complicated if you are new to or have limited experience with SQL databases. The view for each respective component is a consistent representation of the data that you can retrieve by writing reports. Each view is created from one or more tables in the database. The recommended practice is to target report construction at the views, as Microsoft ensures that these views remain consistent even when the underlying tables change.

An example of how to understand the schema is as follows. Imagine the task of preparing a meal for dinner. The meal will require ingredients and a process to prepare it. Then, you will need to present the output on a plate. The following table provides a comparison of this scenario to the respective schema:

Attributes of the meal | Attributes of the schema
Raw ingredients | Database tables
Packed single or combined ingredients available from a supermarket shelf | SQL Server views that retrieve data from one or a combination of tables
Preparing the meal | Writing SQL queries using the views; use one view or a combination (join) of views
Presenting the meal on a plate | The report(s) in various formats

In addition to using MSSMS, as described earlier, Microsoft supplies schema information for the components in the online documentation. This information is specific for each product and varies in the depth of the content. The See also section of this recipe provides useful links to the available information published for the schemas.

There's more...
It is important to understand the schema for the System Center components, but equally important are the processes that populate the databases. The data population process differs by component, but the results are the same (data is automatically inserted into the respective reporting database). The schema is a map to find the data, but the information available is provided by the agents and processes that transfer the information into the databases. Components with multiple databases System Center Service Manager and Operations Manager have a similar architecture. The data is initially written to the operational database and then transferred to the data warehouse. The operational database information is typically what is available to view in the console. The operational information is, however, not the best candidate for reporting, as this is constantly changing. Additionally, performing queries against the operational database can result in performance issues. You may view the schema of these databases using a process similar to the one described earlier, but this is not recommended for reporting purposes. See also The official online documentation for the schema is updated when Microsoft makes changes to the product, and it should be a point for reference at http://technet.microsoft.com/en-US/systemcenter/bb980621. Configuring Microsoft Excel for System Center data analysis This recipe is focused on preparing your environment for self-service Business Intelligence (BI). Getting ready Self-service BI in Microsoft Excel is made available by enabling or installing add-ins. You must download the add-ins from their respective official sites: Power Query: Download Microsoft Power Query for Excel from http://www.microsoft.com/en-gb/download/details.aspx?id=39379. PowerPivot: PowerPivot is available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. 
Power View: Power View is also available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013. Power Maps: At the time of writing this article, this add-in can be downloaded from the Microsoft website. Power Map Preview for Excel 2013 can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=38395. How to do it... The tasks discussed in this recipe are as follows: Installing the Power Query add-in Installing the Power Maps add-in Enabling PowerPivot and Power View in Microsoft Excel Installing the Power Query add-in The Power Query add-in must be installed using an MSI installer package that is available at Microsoft Download Center. The installer deploys the bits and enables the add-in in your Excel installation. The functionality in this add-in is regularly improved by Microsoft. Search for Microsoft Power Query for Excel in Download Center for the latest version. The add-in can be downloaded for 32-bit and 64-bit Microsoft Excel versions. Follow these steps to install the Power Query add-in: Review the system requirements on the download page and update your system if required. Note that when you initiate the setup, you may be prompted to install additional components if you do not have all the requirements installed. Right-click on the MSI installer and click on Install. Click on Next on the Welcome page. Accept the License Agreement and click on Next. Accept the default or click on Change to select the destination installation folder. Click on Next. On the Ready to Install Microsoft Power Query for Excel page, click on Install. The installation progress is displayed. Click on Finish on the Installation Completed page. The Power Query tab is available on the Excel ribbon after this installation. Installing the Power Map add-in The Power Map add-in must be installed using an executable (.exe) installer package that is available at Microsoft Download Center. 
The functionality in this add-in also is regularly improved by Microsoft. Search for Microsoft Power Map for Excel in the Download Center for the latest version. Follow these steps to install the Power Map add-in: Review the system requirements on the download page and update your system if required. Double-click on the EXE installer (Microsoft Power Map Preview for Excel) and click on Yes if you get the User Access Control dialog prompt. When prompted to install Visual C++ 2013 Runtime Libraries (x86), click on Close under Install. Check to agree to the terms and click on Install. Click on Next on the Welcome page. Click on the I Agree radio button on the License Agreement page, and then click on Next. Accept the default folder or click on Browse to select a different destination installation folder. Make your selection on who the installation should be made available to: Everyone or Just me. Click on Next. Click on Next. On the Confirm Installation page, click on Next. The installation progress is displayed. Click on Close on the Installation Completed page. The Power Map task will be made available in the Insert tab on the Excel ribbon after this installation. Enabling PowerPivot and Power View in Microsoft Excel Perform the following steps in Microsoft Excel to enable PowerPivot and Power View: In the File menu, select Options. In the Add-Ins tab, select COM Add-Ins from the Manage: dropdown at the bottom and click on the Go... button, as shown in this screenshot: Select the Power add-ins from the list of Add-Ins available, as shown in the following screenshot: Click on OK to complete the procedure of enabling add-ins in Microsoft Excel. After you've enabled the required add-ins, the different types of add-in tasks and tabs should be available on the Excel ribbon, as shown in this screenshot: This procedure can be used to enable or disable all the available Excel add-ins. 
You are now ready to explore System Center data, create queries, and perform analysis on the data. How it works... The add-ins for Microsoft Excel provide additional functionality to gather and analyze System Center data. Wizards can be added, interfaces can be made available to combine different sources, and a common language, Data Analysis Expressions (DAX), can be made available for calculations and for performing different forms of visualization. The steps discussed in this recipe are required for the use of the Power BI features and functionality in Microsoft Excel. You followed the steps to install Power Query and Power Map, and you enabled PowerPivot and Power View. These add-ins provide the foundation for self-service Business Intelligence using Microsoft Excel. See also Enhanced functionality and integrations are available when you use Microsoft SQL Server or SharePoint; these are not discussed in this article. Refer to http://office.microsoft.com for additional information on them. Summary In this article, we covered the goals of reporting and how to plan and optimize dependent data inputs. We also discussed the planning of report outputs, the reporting schemas of System Center components, and configuring Microsoft Excel for System Center data analysis. Resources for Article: Further resources on this subject: Adding and Importing Configuration items in System Center 2012 Service Manager [article] Mobility [article] Upgrading from Previous Versions [article]

Packt
27 Mar 2015
10 min read

Storm for Real-time High Velocity Computation

In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics: what is possible with data analysis; real-time analytics and why it is becoming the need of the hour; and why Storm, with its power of high-speed distributed computation. We will get you to think about some interesting problems along the lines of Air Traffic Controller (ATC), credit card fraud detection, and so on. First and foremost, you will understand what big data is. Big data is the buzzword of the software industry, but in reality it is much more than the buzz: it really is a huge amount of data.

What is big data?

Big data is commonly characterized by volume, veracity, variety, and velocity. Volume, velocity, and variety are described as follows: Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (for example, converting 12 terabytes of tweets created each day into improved product sentiment analysis, or converting 350 billion annual meter readings to better predict power consumption). Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinizing 5 million trade events created each day to identify potential fraud, or analyzing 500 million call detail records daily in real time to predict customer churn faster). Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitoring hundreds of live video feeds from surveillance cameras to target points of interest, or exploiting the 80 percent data growth in images, videos, and documents to improve customer satisfaction).
Now that I have described big data, let's have a quick look at where this data is generated and how it comes into existence. The following figure shows a quick snapshot of what can happen in one second in the world of the Internet and social media. We need the power to process all this data at the same rate at which it is generated in order to gain meaningful insight from it, as shown: The power of computation comes with the Storm and Cassandra combination. This technological combination lets us cater to the following use cases: credit card fraud detection, security breaches, bandwidth allocation, machine failures, supply chain, personalized content, and recommendations.

Get acquainted with a few problems that require a distributed computing solution

Let's do a deep dive and identify some of the problems that require distributed solutions.

Real-time business solution for credit or debit card fraud detection

Let's get acquainted with the problem depicted in the following figure. When we make a transaction using plastic money and swipe our debit or credit card for payment, the bank has less than 5 seconds to validate or reject the transaction. Within those 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; at the issuing bank, the entire fuzzy logic for accepting or declining the transaction has to be computed; and the result has to travel back over the secure network: Challenges such as network latency and delay can be optimized to some extent, but to complete the preceding transaction in less than 5 seconds, one has to design an application that is able to churn through a considerable amount of data and generate results in 1 to 2 seconds.

Aircraft Communications Addressing and Reporting System

This is another typical use case that cannot be implemented without a reliable real-time processing system in place.
These systems use satellite communication (SATCOM) and, as per the following figure, gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on that data in real time. Let's take the example from the figure in the preceding case. If a flight encounters really hazardous weather, say, electrical storms on a route, that information is sent through satellite links and voice or data gateways to the air controller, which detects it in real time and raises alerts to divert all other flights passing through that area.

Healthcare

This is another very important domain where real-time analytics over high-volume, high-velocity data equips healthcare professionals with accurate information in real time so that they can take informed, life-saving actions. The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from the historic patient database, the drug database, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the collated data. This data can be used to further generate reports and alerts to aid healthcare professionals in real time.

Other applications

There are a variety of other applications where the power of real-time computing can either optimize processes or help people make informed decisions. It has become a great utility and aid in the following industries: manufacturing, application performance monitoring, customer relationship management, transportation, and network optimization.

Complexity of existing solutions

Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore what options we have for processing the vast amounts of data being generated at a very fast pace.
The Hadoop solution

The Hadoop solution is a tried, tested, and proven solution in the industry, in which we use MapReduce jobs in a clustered setup to execute jobs and generate results. MapReduce is a programming paradigm in which we process large data sets using a mapper function that processes a key-value pair and generates intermediate output, again in the form of key-value pairs. A reduce function then operates on the mapper output, merges the values associated with the same intermediate key, and generates the result. In the preceding figure, we demonstrate the simple word count MapReduce job where: There is a huge big data store, which can go up to petabytes and zettabytes. Blocks of the input data are split and replicated onto the nodes in the Hadoop cluster. Each mapper job counts the number of words in the data blocks allocated to it. Once the mapper is done, the words (which are actually the keys) and the counts are sent to the reducers. The reducers combine the mapper output, and the results are generated. Hadoop, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility for real-time use cases.

A custom solution

Here we talk about a solution of the kind Twitter used before the advent of Storm. A simplified version of the problem is that you need a real-time count of the tweets by each user; Twitter solved the problem with the mechanism shown in the following figure: Here is how the preceding mechanism works: They created a fire hose or queue onto which all the tweets are pushed. A set of worker nodes reads from the queue, deciphers the tweet JSON, and maintains the count of tweets by each user. At the first set of workers, the tweets are distributed equally amongst the workers, so they are sharded randomly.
These workers push their first-level counts into the next set of queues. A second level of workers picks from these queues. Here the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker. Then the counts are dumped into the data store. The drawbacks of the queue-worker solution are as follows: it is very complex and specific to the use case; redeployment and reconfiguration is a huge task; scaling is very tedious; and the system is not fault tolerant.

Paid solution

This is always an option. A lot of big companies have invested in products that let us do this kind of computing, but they come at a heavy license cost. A few solutions to name are from companies such as IBM, Oracle, Vertica, and Gigaspace.

Open real-time processing tools

There are a few other technologies that have traits and features similar to Apache Storm, such as S4 from Yahoo, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which could be used for real-time workloads. So, after evaluating all these options, we still find Storm to be the best open source candidate to handle these use cases.

Storm persistence

Storm processes streaming data at a very high velocity. Cassandra complements Storm's ability to process by providing support for writing to and reading from NoSQL at a very high rate. There are a variety of APIs available for connecting with Cassandra. In general, the APIs we are talking about are wrappers written over the core Thrift API, which offer various CRUD operations over the Cassandra cluster using programmer-friendly packages. Thrift protocol: The most basic and core of all the APIs for access to Cassandra, it is the RPC protocol that provides a language-neutral interface and thus offers the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood.
It is simple to use and provides basic functionality out of the box, such as ring discovery and native access. Complex features such as retry, connection pooling, and so on are not supported out of the box. A variety of libraries have extended Thrift and added these much-required features; we would like to touch upon a few widely used ones in this article. Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it essentially cannot offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features that are available out of the box: It has an implementation of connection pooling. It has a ring discovery feature with the add-on of automatic failover support. It has a retry option for downed hosts in the Cassandra ring. Datastax Java Driver: This is a recent addition to the stack of client access options for Cassandra and hence gels well with newer versions of Cassandra. Here are its salient features: connection pooling, reconnection policies, load balancing, and cursor support. Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than others.
Let's have a look at its credentials to see where it qualifies: It supports all Hector functions and is much easier to use. It promises better connection pooling than Hector. It has better failover handling than Hector. It provides some out-of-the-box database-like features (now that's big news). At the API level, it provides functionality called Recipes, which offers parallel all-row query execution, messaging queue functionality, an object store, and pagination. It has numerous frequently required utilities, such as a JSON writer and a CSV importer.

Summary

In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of the existing solutions, and the Cassandra client APIs available for Storm persistence.

Resources for Article: Further resources on this subject: Deploying Storm on Hadoop for Advertising Analysis [article] An overview of architecture and modeling in Cassandra [article] Getting Up and Running with Cassandra [article]
Packt
26 Mar 2015
25 min read

Testing with the Android SDK

In this article by the author, Paul Blundell, of the book, Learning Android Application Testing, we learn to start digging a bit deeper to recognize the building blocks available to create more useful tests. We will be covering the following topics: Common assertions View assertions Other assertion types Helpers to test User Interfaces Mock objects Instrumentation TestCase class hierarchies Using external libraries We will be analyzing these components and showing examples of their use when applicable. The examples in this article are intentionally split from the original Android project that contains them. This is done to let you concentrate and focus only on the subject being presented, though the complete examples in a single project can be downloaded as explained later. Right now, we are interested in the trees and not the forest. Along with the examples presented, we will be identifying reusable common patterns that will help you in the creation of tests for your own projects. (For more resources related to this topic, see here.) The demonstration application A very simple application has been created to demonstrate the use of some of the tests in this article. The source for the application can be downloaded from XXXXXXXXXXXXX. The following screenshot shows this application running: When reading the explanation of the tests in this article, at any point, you can refer to the demo application that is provided in order to see the test in action. The previous simple application has a clickable link, text input, click on a button and a defined layout UI, we can test these one by one. Assertions in depth Assertions are methods that check for a condition that can be evaluated. If the condition is not met, the assertion method will throw an exception, thereby aborting the execution of the test. The JUnit API includes the class Assert. This is the base class of all the TestCase classes that hold several assertion methods useful for writing tests. 
These inherited methods test for a variety of conditions and are overloaded to support different parameter types. They can be grouped into the following sets, depending on the condition checked: assertEquals, assertTrue, assertFalse, assertNull, assertNotNull, assertSame, assertNotSame, and fail. The condition tested is pretty obvious and is easily identifiable by the method name. Perhaps the ones that deserve some attention are assertEquals() and assertSame(). The former, when used on objects, asserts that both objects passed as parameters are equal by calling the objects' equals() method. The latter asserts that both parameters refer to the same object. If, in some case, equals() is not implemented by the class, then assertEquals() and assertSame() will do the same thing. When one of these assertions fails inside a test, an AssertionFailedError is thrown, and this indicates that the test has failed.

Occasionally, during the development process, you might need to create a test that you are not implementing at that precise time. However, you want to flag that the creation of the test was postponed. In such cases, you can use the fail() method, which always fails and uses a custom message that indicates the condition:

public void testNotImplementedYet() {
    fail("Not implemented yet");
}

Still, there is another common use for fail() that is worth mentioning. If we need to test whether a method throws an exception, we can surround the code with a try-catch block and force a fail if the exception was not thrown. For example:

public void testShouldThrowException() {
    try {
        MyFirstProjectActivity.methodThatShouldThrowException();
        fail("Exception was not thrown");
    } catch (Exception ex) {
        // do nothing
    }
}

JUnit4 has the annotation @Test(expected=Exception.class), and this supersedes the need for using fail() when testing exceptions. With this annotation, the test will only pass if the expected exception is thrown.
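The try/fail idiom can also be factored into a small reusable helper. The following is a minimal plain-Java sketch of that idea, not part of JUnit or the Android SDK; the class name ExpectedExceptionSketch, the ThrowingBlock interface, and the expectThrows method are hypothetical names chosen only for illustration:

```java
final class ExpectedExceptionSketch {

    /** Hypothetical functional interface for a block of code that may throw. */
    interface ThrowingBlock {
        void run() throws Exception;
    }

    /**
     * Runs the block and returns the thrown exception if it has the expected type.
     * Throws AssertionError if nothing is thrown or if a different type is thrown.
     */
    static <T extends Throwable> T expectThrows(Class<T> expected, ThrowingBlock block) {
        try {
            block.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError(
                    "Expected " + expected.getName() + " but got " + t, t);
        }
        // Reaching this point means the block completed normally.
        throw new AssertionError(
                "Expected " + expected.getName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        // Integer.parseInt rejects non-numeric input with NumberFormatException.
        NumberFormatException ex = expectThrows(
                NumberFormatException.class,
                () -> Integer.parseInt("not a number"));
        System.out.println("Caught: " + ex.getClass().getSimpleName());
    }
}
```

Unlike a bare catch (Exception ex) block, this sketch also fails when an exception of an unexpected type is thrown, and it hands the caught exception back so that its message can be checked.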
Custom message

It is worth knowing that all assert methods provide an overloaded version that includes a custom String message. Should the assertion fail, this custom message will be printed by the test runner instead of a default message. The premise behind this is that, sometimes, the generic error message does not reveal enough details, and it is not obvious how the test failed. This custom message can be extremely useful to easily identify the failure once you are looking at the test report, so it's highly recommended as a best practice to use this version. The following is an example of a simple test that uses this recommendation:

public void testMax() {
    int a = 10;
    int b = 20;

    int actual = Math.max(a, b);

    String failMsg = "Expected: " + b + " but was: " + actual;
    assertEquals(failMsg, b, actual);
}

In the preceding example, we can see another practice that would help you organize and understand your tests easily. This is the use of explicit names for variables that hold the actual values. There are other libraries available that have better default error messages and also a more fluent interface for testing. One of these that is worth looking at is Fest (https://code.google.com/p/fest/).

Static imports

Though basic assertion methods are inherited from the Assert base class, some other assertions need specific imports. To improve the readability of your tests, there is a pattern to statically import the assert methods from the corresponding classes. Using this pattern, instead of having:

public void testAlignment() {
    int margin = 0;
    ...
    android.test.ViewAsserts.assertRightAligned(editText, button, margin);
}

We can simplify it by adding the static import:

import static android.test.ViewAsserts.assertRightAligned;

public void testAlignment() {
    int margin = 0;
    assertRightAligned(editText, button, margin);
}

View assertions

The assertions introduced earlier handle a variety of types as parameters, but they are only intended to test simple conditions or simple objects. For example, we have assertEquals(short expected, short actual) to test short values, assertEquals(int expected, int actual) to test integer values, assertEquals(Object expected, Object actual) to test any Object instance, and so on. Usually, while testing user interfaces in Android, you will need more sophisticated methods, which are mainly related to Views. In this respect, Android provides a class with plenty of assertions in android.test.ViewAsserts (see http://developer.android.com/reference/android/test/ViewAsserts.html for more details), which test relationships between Views and their absolute and relative positions on the screen. These methods are also overloaded to provide different conditions. Among the assertions, we can find the following: assertBaselineAligned: This asserts that two Views are aligned on their baseline; that is, their baselines are on the same y location. assertBottomAligned: This asserts that two Views are bottom aligned; that is, their bottom edges are on the same y location. assertGroupContains: This asserts that the specified group contains a specific child once and only once. assertGroupIntegrity: This asserts the specified group's integrity. The child count should be >= 0 and each child should be non-null. assertGroupNotContains: This asserts that the specified group does not contain a specific child. assertHasScreenCoordinates: This asserts that a View has a particular x and y position on the visible screen.
assertHorizontalCenterAligned: This asserts that the test View is horizontally center-aligned with respect to the reference View. assertLeftAligned: This asserts that two Views are left aligned; that is, their left edges are on the same x location. An optional margin can also be provided. assertOffScreenAbove: This asserts that the specified View is above the visible screen. assertOffScreenBelow: This asserts that the specified View is below the visible screen. assertOnScreen: This asserts that a View is on the screen. assertRightAligned: This asserts that two Views are right aligned; that is, their right edges are on the same x location. An optional margin can also be specified. assertTopAligned: This asserts that two Views are top aligned; that is, their top edges are on the same y location. An optional margin can also be specified. assertVerticalCenterAligned: This asserts that the test View is vertically center-aligned with respect to the reference View.

The following example shows how you can use ViewAsserts to test the user interface layout:

public void testUserInterfaceLayout() {
    int margin = 0;
    View origin = mActivity.getWindow().getDecorView();
    assertOnScreen(origin, editText);
    assertOnScreen(origin, button);
    assertRightAligned(editText, button, margin);
}

The assertOnScreen method uses an origin to start looking for the requested Views. In this case, we are using the top-level window decor View. If, for some reason, you don't need to go that high in the hierarchy, or if this approach is not suitable for your test, you may use another root View in the hierarchy, for example View.getRootView(), which, in our concrete example, would be editText.getRootView().

Even more assertions

If the assertions reviewed previously do not seem to be enough for your tests' needs, there is still another class included in the Android framework that covers other cases.
This class is MoreAsserts (http://developer.android.com/reference/android/test/MoreAsserts.html). These methods are also overloaded to support different parameter types. Among the assertions, we can find the following: assertAssignableFrom: This asserts that an object is assignable to a class. assertContainsRegex: This asserts that an expected Regex matches any substring of the specified String. It fails with the specified message if it does not. assertContainsInAnyOrder: This asserts that the specified Iterable contains precisely the elements expected, but in any order. assertContainsInOrder: This asserts that the specified Iterable contains precisely the elements expected, but in the same order. assertEmpty: This asserts that an Iterable is empty. assertEquals: This is for some Collections not covered in JUnit asserts. assertMatchesRegex: This asserts that the specified Regex exactly matches the String and fails with the provided message if it does not. assertNotContainsRegex: This asserts that the specified Regex does not match any substring of the specified String, and fails with the provided message if it does. assertNotEmpty: This asserts that some Collections not covered in JUnit asserts are not empty. assertNotMatchesRegex: This asserts that the specified Regex does not exactly match the specified String, and fails with the provided message if it does. checkEqualsAndHashCodeMethods: This is a utility used to test the equals() and hashCode() results at once. This tests whether equals() that is applied to both objects matches the specified result. 
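The difference between the "contains" and "matches" regex assertions comes down to substring matching versus whole-string matching. As a minimal sketch of that distinction, using only java.util.regex (the class RegexAssertSketch and its two methods are hypothetical illustrations, not the MoreAsserts implementation):

```java
import java.util.regex.Pattern;

final class RegexAssertSketch {

    /** "Contains" semantics: the regex may match any substring of the input. */
    static boolean containsRegex(String regex, String actual) {
        return Pattern.compile(regex).matcher(actual).find();
    }

    /** "Matches" semantics: the regex must match the entire input. */
    static boolean matchesRegex(String regex, String actual) {
        return Pattern.compile(regex).matcher(actual).matches();
    }

    public static void main(String[] args) {
        // (?i:...) enables case-insensitive matching inside the group.
        String regex = "(?i:ERROR)";

        System.out.println(containsRegex(regex, "Internal error: null input")); // true
        System.out.println(matchesRegex(regex, "Internal error: null input"));  // false
        System.out.println(matchesRegex(regex, "Error"));                       // true
        System.out.println(containsRegex(regex, "CAPITALIZE THIS TEXT"));       // false
    }
}
```

In a test, the "contains" variants suit checks such as "the output must not mention an error anywhere", while the "matches" variants suit checks on the exact shape of a whole value.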
The following test checks for an error during the invocation of the capitalization method called via a click on the UI button: @UiThreadTest public void testNoErrorInCapitalization() { String msg = "capitalize this text"; editText.setText(msg);   button.performClick();   String actual = editText.getText().toString(); String notExpectedRegexp = "(?i:ERROR)"; String errorMsg = "Capitalization error for " + actual; assertNotContainsRegex(errorMsg, notExpectedRegexp, actual); } If you are not familiar with regular expressions, invest some time and visit http://developer.android.com/reference/java/util/regex/package-summary.html because it will be worth it! In this particular case, we are looking for the word ERROR contained in the result with a case-insensitive match (setting the flag i for this purpose). That is, if for some reason, capitalization doesn't work in our application, and it contains an error message, we can detect this condition with the assertion. Note that because this is a test that modifies the user interface, we must annotate it with @UiThreadTest; otherwise, it won't be able to alter the UI from a different thread, and we will receive the following exception: INFO/TestRunner(610): ----- begin exception ----- INFO/TestRunner(610): android.view.ViewRoot$CalledFromWrongThreadException: Only the original thread that created a view hierarchy can touch its views. INFO/TestRunner(610):     at android.view.ViewRoot.checkThread(ViewRoot.java:2932) [...] INFO/TestRunner(610):     at android.app. Instrumentation$InstrumentationThread.run(Instrumentation.java:1447) INFO/TestRunner(610): ----- end exception ----- The TouchUtils class Sometimes, when testing UIs, it is helpful to simulate different kinds of touch events. These touch events can be generated in many different ways, but probably android.test.TouchUtils is the simplest to use. This class provides reusable methods to generate touch events in test cases that are derived from InstrumentationTestCase. 
The featured methods allow a simulated interaction with the UI under test. The TouchUtils class provides the infrastructure to inject the events using the correct UI or main thread, so no special handling is needed, and you don't need to annotate the test using @UiThreadTest. TouchUtils supports the following: clicking on a View and releasing it, tapping on a View (touching it and quickly releasing), long-clicking on a View, dragging the screen, and dragging Views. The following test represents a typical usage of TouchUtils:

public void testListScrolling() {
    listView.scrollTo(0, 0);

    TouchUtils.dragQuarterScreenUp(this, activity);
    int actualItemPosition = listView.getFirstVisiblePosition();

    assertTrue("Wrong position", actualItemPosition > 0);
}

This test does the following: it repositions the list at the beginning to start from a known condition, scrolls the list, and checks the first visible position to see that it was correctly scrolled. Even the most complex UIs can be tested in that way, and it would help you detect a variety of conditions that could potentially affect the user experience.

Mock objects

We have seen the mock objects provided by the Android testing framework, and evaluated the concerns about not using real objects to isolate our tests from the surrounding environment. Martin Fowler calls these two styles the classical and mockist Test-driven Development dichotomy in his great article Mocks aren't stubs, which can be read online at http://www.martinfowler.com/articles/mocksArentStubs.html. Independent of this discussion, we are introducing mock objects as one of the available building blocks because, sometimes, using mock objects in our tests is recommended, desirable, useful, or even unavoidable. The Android SDK provides the following classes in the subpackage android.test.mock to help us: MockApplication: This is a mock implementation of the Application class.
All methods are non-functional and throw UnsupportedOperationException. MockContentProvider: This is a mock implementation of ContentProvider. All methods are non-functional and throw UnsupportedOperationException. MockContentResolver: This is a mock implementation of the ContentResolver class that isolates the test code from the real content system. All methods are non-functional and throw UnsupportedOperationException. MockContext: This is a mock context class, and this can be used to inject other dependencies. All methods are non-functional and throw UnsupportedOperationException. MockCursor: This is a mock Cursor class that isolates the test code from real Cursor implementation. All methods are non-functional and throw UnsupportedOperationException. MockDialogInterface: This is a mock implementation of the DialogInterface class. All methods are non-functional and throw UnsupportedOperationException. MockPackageManager: This is a mock implementation of the PackageManager class. All methods are non-functional and throw UnsupportedOperationException. MockResources: This is a mock Resources class. All of these classes have non-functional methods that throw UnsupportedOperationException when used. If you need to use some of these methods, or if you detect that your test is failing with this Exception, you should extend one of these base classes and provide the required functionality. MockContext overview This mock can be used to inject other dependencies, mocks, or monitors into the classes under test. Extend this class to provide your desired behavior, overriding the correspondent methods. The Android SDK provides some prebuilt mock Context objects, each of which has a separate use case. The IsolatedContext class In your tests, you might find the need to isolate the Activity under test from other Android components to prevent unwanted interactions. 
This can be a complete isolation, but sometimes complete isolation goes too far: for your Activity to still run correctly, some connection with the system is required.

For those cases, the Android SDK provides android.test.IsolatedContext, a mock Context that not only prevents interaction with most of the underlying system but also satisfies the needs of interacting with other packages or components such as Services or ContentProviders.

Alternate route to file and database operations

In some cases, all we need is to be able to provide an alternate route to the file and database operations. For example, if we are testing the application on a real device, we perhaps don't want to affect the existing database but use our own testing data.

Such cases can take advantage of another class that is not part of the android.test.mock subpackage but is part of android.test instead, that is, RenamingDelegatingContext.

This class lets us alter operations on files and databases by applying a prefix that is specified in the constructor. All other operations are delegated to the delegating Context, which you must also specify in the constructor.

Suppose our Activity under test uses a database we want to control, probably introducing specialized content or fixture data to drive our tests, and we don't want to use the real files. In this case, we create a RenamingDelegatingContext that specifies a prefix, and our unchanged Activity will use this prefix to create any files.

For example, if our Activity tries to access a file named birthdays.txt, and we provide a RenamingDelegatingContext that specifies the prefix test, then this same Activity will access the file testbirthdays.txt instead when it is being tested.

The MockContentResolver class

The MockContentResolver class implements all methods in a non-functional way and throws the exception UnsupportedOperationException if you attempt to use them.
The reason for this class is to isolate tests from the real content.

Let's say your application uses a ContentProvider class to feed your Activity information. You can create unit tests for this ContentProvider using ProviderTestCase2, which we will be analyzing shortly, but when we try to produce functional or integration tests for the Activity against ContentProvider, it's not so evident which test case to use. The most obvious choice is ActivityInstrumentationTestCase2, mainly if your functional tests simulate user experience, because you might need the sendKeys() method or similar methods, which are readily available on these tests.

The first problem you might encounter then is that it's unclear where to inject a MockContentResolver in your test to be able to use test data with your ContentProvider. There's no way to inject a MockContext either.

The TestCase base class

This is the base class of all other test cases in the JUnit framework. It implements the basic methods that we were analyzing in the previous examples (setUp()). The TestCase class also implements the junit.framework.Test interface, meaning it can be run as a JUnit test.

Your Android test cases should always extend TestCase or one of its descendants.

The default constructor

All test cases require a default constructor because, sometimes, depending on the test runner used, this is the only constructor that is invoked, and it is also used for serialization.

According to the documentation, this method is not intended to be used by "mere mortals" without calling setName(String name).
Therefore, to appease the Gods, a common pattern is to use a default test case name in this constructor and invoke the given name constructor afterwards:

    public class MyTestCase extends TestCase {
        public MyTestCase() {
            this("MyTestCase Default Name");
        }

        public MyTestCase(String name) {
            super(name);
        }
    }

The given name constructor

This constructor takes a name as an argument to label the test case. It will appear in test reports and would be of much help when you try to identify where failed tests have come from.

The setName() method

There are some classes that extend TestCase that don't provide a given name constructor. In such cases, the only alternative is to call setName(String name).

The AndroidTestCase base class

This class can be used as a base class for general-purpose Android test cases. Use it when you need access to Android resources, databases, or files in the filesystem. Context is stored as a field in this class, which is conveniently named mContext and can be used inside the tests if needed, or the getContext() method can be used too.

Tests based on this class can start more than one Activity using Context.startActivity().

There are various test cases in Android SDK that extend this base class:

  • ApplicationTestCase<T extends Application>
  • ProviderTestCase2<T extends ContentProvider>
  • ServiceTestCase<T extends Service>

When using the AndroidTestCase Java class, you inherit some base assertion methods that can be used; let's look at these in more detail.

The assertActivityRequiresPermission() method

The signature for this method is as follows:

    public void assertActivityRequiresPermission(String packageName, String className, String permission)

Description

This assertion method checks whether the launching of a particular Activity is protected by a specific permission.
It takes the following three parameters:

  • packageName: This is a string that indicates the package name of the activity to launch
  • className: This is a string that indicates the class of the activity to launch
  • permission: This is a string with the permission to check

The Activity is launched, and a SecurityException is expected whose error message mentions that the required permission is missing. The actual instantiation of an activity is not handled by this assertion, and thus, an Instrumentation is not needed.

Example

This test checks the requirement of the android.Manifest.permission.WRITE_EXTERNAL_STORAGE permission, which is needed to write to external storage, in the MyContactsActivity Activity:

    public void testActivityPermission() {
        String pkg = "com.blundell.tut";
        String activity = pkg + ".MyContactsActivity";
        String permission = android.Manifest.permission.WRITE_EXTERNAL_STORAGE;
        assertActivityRequiresPermission(pkg, activity, permission);
    }

Always use the constants that describe the permissions from android.Manifest.permission, not the strings, so if the implementation changes, your code will still be valid.

The assertReadingContentUriRequiresPermission method

The signature for this method is as follows:

    public void assertReadingContentUriRequiresPermission(Uri uri, String permission)

Description

This assertion method checks whether reading from a specific URI requires the permission provided as a parameter.

It takes the following two parameters:

  • uri: This is the Uri that requires a permission to query
  • permission: This is a string that contains the permission to query

If a SecurityException that mentions the specified permission is thrown, the assertion passes.
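The logic that these permission assertions describe can be sketched in plain Java. This is only an illustration under stated assumptions, not the Android SDK implementation: the class PermissionAssertionSketch and its method assertRequiresPermission are hypothetical names, and the protected operation is modelled as a Runnable instead of a real Activity launch or ContentProvider query.

```java
/**
 * Hypothetical stand-in for the permission assertions described above.
 * It is NOT the Android SDK code; it only mirrors the described behavior:
 * run a protected operation and expect a SecurityException whose message
 * mentions the permission.
 */
final class PermissionAssertionSketch {
    private PermissionAssertionSketch() {}

    static void assertRequiresPermission(Runnable protectedAction, String permission) {
        try {
            protectedAction.run();
        } catch (SecurityException expected) {
            String message = expected.getMessage();
            // The exception must name the permission, otherwise it was
            // thrown for some unrelated reason.
            if (message == null || !message.contains(permission)) {
                throw new AssertionError(
                        "SecurityException did not mention " + permission + ": " + message);
            }
            return; // The operation was protected as expected.
        }
        // No exception at all means the operation was not protected.
        throw new AssertionError("Expected a SecurityException mentioning " + permission);
    }
}
```

The sketch fails in two distinct cases: when no SecurityException is thrown at all, and when one is thrown but names a different permission. Both would point to a misconfigured permission declaration.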
Example

This test tries to read contacts and verifies that the correct SecurityException is generated:

    public void testReadingContacts() {
        Uri uri = ContactsContract.AUTHORITY_URI;
        String permission = android.Manifest.permission.READ_CONTACTS;
        assertReadingContentUriRequiresPermission(uri, permission);
    }

The assertWritingContentUriRequiresPermission() method

The signature for this method is as follows:

    public void assertWritingContentUriRequiresPermission(Uri uri, String permission)

Description

This assertion method checks whether inserting into a specific Uri requires the permission provided as a parameter.

It takes the following two parameters:

  • uri: This is the Uri that requires a permission to query
  • permission: This is a string that contains the permission to query

If a SecurityException that mentions the specified permission is thrown, the assertion passes.

Example

This test tries to write to Contacts and verifies that the correct SecurityException is generated:

    public void testWritingContacts() {
        Uri uri = ContactsContract.AUTHORITY_URI;
        String permission = android.Manifest.permission.WRITE_CONTACTS;
        assertWritingContentUriRequiresPermission(uri, permission);
    }

Instrumentation

Instrumentation is instantiated by the system before any of the application code is run, thereby allowing monitoring of all the interactions between the system and the application.

As with many other Android application components, instrumentation implementations are described in the AndroidManifest.xml under the tag <instrumentation>. However, with the advent of Gradle, this has now been automated for us, and we can change the properties of the instrumentation in the app's build.gradle file.
The AndroidManifest file for your tests will be automatically generated:

    defaultConfig {
        testApplicationId 'com.blundell.tut.tests'
        testInstrumentationRunner "android.test.InstrumentationTestRunner"
    }

The values mentioned in the preceding code are also the defaults if you do not declare them, meaning that you don't have to have any of these parameters to start writing tests.

The testApplicationId attribute defines the name of the package for your tests. As a default, it is the package name of your application under test + tests. You can declare a custom test runner using testInstrumentationRunner. This is handy if you want to have tests run in a custom way, for example, parallel test execution.

There are also many other parameters in development, and I would advise you to keep an eye on the Google Gradle plugin website (http://tools.android.com/tech-docs/new-build-system/user-guide).

The ActivityMonitor inner class

As mentioned earlier, the Instrumentation class is used to monitor the interaction between the system and the application or the Activities under test. The inner class Instrumentation.ActivityMonitor allows the monitoring of a single Activity within an application.
Example

Let's pretend that we have a TextView in our Activity that holds a URL and has its auto link property set:

    <TextView
        android:id="@+id/link"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:text="@string/home"
        android:autoLink="web" />

If we want to verify that, when clicked, the hyperlink is correctly followed and some browser is invoked, we can create a test like this:

    public void testFollowLink() {
        IntentFilter intentFilter = new IntentFilter(Intent.ACTION_VIEW);
        intentFilter.addDataScheme("http");
        intentFilter.addCategory(Intent.CATEGORY_BROWSABLE);

        Instrumentation inst = getInstrumentation();
        ActivityMonitor monitor = inst.addMonitor(intentFilter, null, false);
        TouchUtils.clickView(this, linkTextView);
        monitor.waitForActivityWithTimeout(3000);
        int monitorHits = monitor.getHits();
        inst.removeMonitor(monitor);

        assertEquals(1, monitorHits);
    }

Here, we will do the following:

  • Create an IntentFilter for intents that would open a browser.
  • Add a monitor to our Instrumentation based on the IntentFilter class.
  • Click on the hyperlink.
  • Wait for the activity (hopefully the browser).
  • Verify that the monitor hits were incremented.
  • Remove the monitor.

Using monitors, we can test even the most complex interactions with the system and other Activities. This is a very powerful tool to create integration tests.

The InstrumentationTestCase class

The InstrumentationTestCase class is the direct or indirect base class for various test cases that have access to Instrumentation.
This is the list of the most important direct and indirect subclasses:

  • ActivityTestCase
  • ProviderTestCase2<T extends ContentProvider>
  • SingleLaunchActivityTestCase<T extends Activity>
  • SyncBaseInstrumentation
  • ActivityInstrumentationTestCase2<T extends Activity>
  • ActivityUnitTestCase<T extends Activity>

The InstrumentationTestCase class is in the android.test package, and extends junit.framework.TestCase, which extends junit.framework.Assert.

The launchActivity and launchActivityWithIntent methods

These utility methods are used to launch Activities from a test. If the Intent is not specified using the second option, a default Intent is used:

    public final T launchActivity(String pkg, Class<T> activityCls, Bundle extras)

The template class parameter T is used in activityCls and as the return type, limiting its use to Activities of that type.

If you need to specify a custom Intent, you can use the following method, which takes an intent parameter instead:

    public final T launchActivityWithIntent(String pkg, Class<T> activityCls, Intent intent)

The sendKeys and sendRepeatedKeys methods

While testing Activities' UI, you will face the need to simulate interaction with QWERTY-based keyboards or DPAD buttons to send keys to complete fields, select shortcuts, or navigate throughout the different components.

This is what the different sendKeys and sendRepeatedKeys methods are used for.

There is one version of sendKeys that accepts integer key values. They can be obtained from constants defined in the KeyEvent class.
For example, we can use the sendKeys method in this way:

    public void testSendKeyInts() {
        requestMessageInputFocus();
        sendKeys(
                KeyEvent.KEYCODE_H,
                KeyEvent.KEYCODE_E,
                KeyEvent.KEYCODE_E,
                KeyEvent.KEYCODE_E,
                KeyEvent.KEYCODE_Y,
                KeyEvent.KEYCODE_DPAD_DOWN,
                KeyEvent.KEYCODE_ENTER);
        String actual = messageInput.getText().toString();

        assertEquals("HEEEY", actual);
    }

Here, we are sending the H, E (three times), and Y letter keys, followed by the DPAD_DOWN and ENTER keys, using their integer representations to the Activity under test.

Alternatively, we can create a string by concatenating the keys we desire to send, discarding the KEYCODE prefix, and separating them with spaces that are ultimately ignored:

    public void testSendKeyString() {
        requestMessageInputFocus();

        sendKeys("H 3*E Y DPAD_DOWN ENTER");
        String actual = messageInput.getText().toString();

        assertEquals("HEEEY", actual);
    }

Here, we did exactly the same as in the previous test but we used the String "H 3*E Y DPAD_DOWN ENTER". Note that every key in the String can be prefixed by a repeating factor followed by * and the key to be repeated. We used 3*E in our previous example, which is the same as E E E, that is, three times the letter E.

If sending repeated keys is what we need in our tests, there is also another alternative that is precisely intended for these cases:

    public void testSendRepeatedKeys() {
        requestMessageInputFocus();

        sendRepeatedKeys(
                1, KeyEvent.KEYCODE_H,
                3, KeyEvent.KEYCODE_E,
                1, KeyEvent.KEYCODE_Y,
                1, KeyEvent.KEYCODE_DPAD_DOWN,
                1, KeyEvent.KEYCODE_ENTER);
        String actual = messageInput.getText().toString();

        assertEquals("HEEEY", actual);
    }

This is the same test implemented in a different manner. The repetition number precedes each key.
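To make the string format concrete, here is a minimal plain-Java sketch of how a key sequence such as "H 3*E Y DPAD_DOWN ENTER" could be expanded into KeyEvent constant names. It is an illustration only and makes no claim about how the SDK parses the string internally; the class KeySequenceSketch and its expand method are hypothetical and do not touch any Android API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Android SDK. */
final class KeySequenceSketch {
    private KeySequenceSketch() {}

    /**
     * Expands a sendKeys-style string into the names of the KeyEvent
     * constants it refers to, honoring the optional "N*" repeat prefix.
     */
    static List<String> expand(String keysSequence) {
        List<String> names = new ArrayList<>();
        // Keys are separated by whitespace, which carries no other meaning.
        for (String token : keysSequence.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // An empty input string yields one empty token.
            }
            int repeat = 1;
            String key = token;
            int star = token.indexOf('*');
            if (star >= 0) {
                // "3*E" means the key E repeated three times.
                repeat = Integer.parseInt(token.substring(0, star));
                key = token.substring(star + 1);
            }
            for (int i = 0; i < repeat; i++) {
                // The string form omits the KEYCODE prefix, so add it back.
                names.add("KEYCODE_" + key);
            }
        }
        return names;
    }
}
```

Calling KeySequenceSketch.expand("H 3*E Y DPAD_DOWN ENTER") yields the same seven key names, in the same order, that testSendKeyInts passes as integer constants.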
The runTestOnUiThread helper method

The runTestOnUiThread method is a helper method used to run portions of a test on the UI thread. We used this inside the method requestMessageInputFocus() so that we can set the focus on our EditText before waiting for the application to be idle, using Instrumentation.waitForIdleSync(). Also, the runTestOnUiThread method is declared to throw Throwable, so we have to deal with this case:

    private void requestMessageInputFocus() {
        try {
            runTestOnUiThread(new Runnable() {
                @Override
                public void run() {
                    messageInput.requestFocus();
                }
            });
        } catch (Throwable throwable) {
            fail("Could not request focus.");
        }
        instrumentation.waitForIdleSync();
    }

Alternatively, as we have discussed before, to run a test on the UI thread, we can annotate it with @UiThreadTest. However, sometimes, we need to run only parts of the test on the UI thread because other parts of it are not suitable to run on that thread, for example, database calls, or we are using other helper methods that provide the infrastructure themselves to use the UI thread, for example, the TouchUtils methods.

Summary

We investigated the most relevant building blocks and reusable patterns to create our tests. Along this journey, we:

  • Understood the common assertions found in JUnit tests
  • Explained the specialized assertions found in the Android SDK
  • Explored Android mock objects and their use in Android tests

Now that we have all the building blocks, it is time to start creating more and more tests to acquire the experience needed to master the technique.

Packt
26 Mar 2015
6 min read

Subscribing to a report

In this article by Johan Yu, the author of Salesforce Reporting and Dashboards, we get acquainted with the components used when working with reports on the Salesforce platform. Subscribing to a report is a new feature in Salesforce introduced in the Spring 2015 release. When you subscribe to a report, you will get a notification on weekdays, daily, or weekly, when the report meets the criteria defined. You just need to subscribe to the reports that you care about most.

Subscribing to a report is not the same as the report's Schedule Future Run option. Scheduling a report for a future run keeps e-mailing you the report content at the specified frequency, without any conditions. When you subscribe to a report, you receive a notification only when the report output meets the criteria you have defined. A subscription does not send you the report content, just an alert that the report you subscribed to meets the conditions specified.

You do not need any additional permission to subscribe to a report; your administrator can enable or disable this feature for the entire organization. By default, this feature will be turned on for customers using the Salesforce Spring 2015 release. If you are an administrator for the organization, you can check out this feature by navigating to Setup | Customize | Reports & Dashboards | Report Notification | Enable report notification subscriptions for all users.

Besides receiving notifications via e-mail, you can also opt for Salesforce1 notifications and posts to Chatter feeds, and execute a custom action.

Report Subscription

To subscribe to a report, you need to define a set of conditions to trigger the notifications. Here is what you need to understand before you subscribe to a report:

  • When: Every time conditions are met or only the first time conditions are met.
  • Conditions: An aggregate can be a record count or a summarize field. Then define the operator and value you want the aggregate to be compared to. The summarize field means a field that you use in that report to summarize its data as average, smallest, largest, or sum. You can add multiple conditions, but at this moment, they can only be combined with AND.
  • Schedule frequency: Schedule weekday, daily, weekly, and the time the report will be run.
  • Actions:
      • E-mail notifications: You will get e-mail alerts when conditions are met.
      • Posts to Chatter feeds: Alerts will be posted to your Chatter feed.
      • Salesforce1 notifications: Alerts in your Salesforce1 app.
      • Execute a custom action: This will trigger a call to an Apex class. You will need a developer to write Apex code for this.
  • Active: This is a checkbox used to activate or disable the subscription. Disable it when you only need to unsubscribe temporarily; deleting the subscription removes all the settings defined.

The following screenshot shows the conditions set in order to subscribe to a report:

Monitoring a report subscription

How can you know whether you have subscribed to a report? When you open the report and see the Subscribe button, it means you are not subscribed to that report.

Once you configure the report to subscribe, the button label will turn to Edit Subscription. However, an Edit Subscription button does not guarantee that you will get alerts when the report meets the criteria, because the subscription may not be active; remember the Active checkbox described above.
To see all the reports you subscribe to at a glance, as long as you have the View Setup and Configuration permission, navigate to Setup | Jobs | Scheduled Jobs, and look for Type as Reporting Notification, as shown in this screenshot:

Hands-on – subscribing to a report

Here is our next use case: you would like to get a notification in your Salesforce1 app, an e-mail notification, and a post on your Chatter feed once the Closed Won opportunity amount for the month has reached $50,000. Salesforce should check the report daily, but instead of getting this notification daily, you want to get it only once a week or month; otherwise, it will be disturbing.

Creating reports

Make sure you set the report with the correct filter, set Close Date as This Month, and summarize the Amount field, as shown in the following screenshot:

Subscribing

Click on the Subscribe button and fill in the following details:

  • Type as Only the first time conditions are met
  • Conditions:
      • Aggregate as Sum of Amount
      • Operator as Greater Than or Equal
      • Value as 50000
  • Schedule:
      • Frequency as Every Weekday
      • Time as 7AM
  • In Actions, select:
      • Send Salesforce1 Notification
      • Post to Chatter Feed
      • Send Email Notification
  • In Active, select the checkbox

Testing and saving

The good thing about this feature is the ability to test without waiting until the scheduled date or time. Click on the Save & Run Now button. Here is the result:

Salesforce1 notifications

Open your Salesforce1 mobile app, look for the notification icon, and notice a new alert from the report you subscribed to, as shown in this screenshot:

If you click on the notification, it will take you to the report that is shown in the following screenshot:

Chatter feed

Since you selected the Post to Chatter Feed action, the same alert will go to your Chatter feed as well.
Clicking on the link in the Chatter feed will open the same report in your Salesforce1 mobile app or from the web browser, as shown in this screenshot:

E-mail notification

The last action we've selected for this exercise is to send an e-mail notification. The following screenshot shows how the e-mail notification would look:

Limitations

The following limitations are observed while subscribing to a report:

  • You can set up to five conditions per report, and no OR logic conditions are possible
  • You can subscribe to up to five reports, so use it wisely

Summary

In this article, you became familiar with components when working with reports on the Salesforce platform. We saw different report formats and the uniqueness of each format. We continued discussions on adding various types of charts to the report with point-and-click effort and no code; all of this can be done within minutes.

We saw how to add filters to reports to customize our reports further, including using Filter Logic, Cross Filter, and Row Limit for tabular reports. We walked through managing and customizing custom report types, including how to hide unused report types and report type adoption analysis.

In the last part of this article, we saw how easy it is to subscribe to a report and define criteria.