How-To Tutorials


31 Mar 2015

6 min read

Hacking toys with IFTTT and Spark


Open up even the simplest of toys and you’ll often be amazed at the number of interesting electronic components inside. This is especially true of many of the otherwise “throw away” toys found in fast food kids’ meals. I’ve tried to make a habit of salvaging as many parts as possible from such toys so I can use them in future projects. (And I recommend you do the same!) But what if we could use the toy itself as the basis for a new project? In this post, we’ll look at one example of how we can Internet-enable a simple LED lantern toy using a wireless Spark Core device and the powerful IFTTT service.

This particular LED lantern is operated by a standard on-off switch, and inside is a single LED, three coin batteries, and a simple switch mechanism for connecting and disconnecting power. Like many fast-food premiums, the lantern uses “tamper proof” triangular screws. If you don’t have the appropriate bit, you can usually make do with a small straight-edge screwdriver. In addition to screws, some toys are also glued or sonically welded together, which makes them difficult to open without damaging the plastic beyond repair. Not shown in this photo is a small plastic piece that holds all the components in place.

To programmatically control our lantern, we want to remove the batteries and run jumper cables to a pin on our microcontroller instead. Here is an exposed view after also removing the switch mechanism and attaching female-male jumper cables to the positive and negative leads of the LED.

The next step is to hook our lantern up to the Spark Core. We chose the Spark Core for this project for two primary reasons. First, the Spark’s size is very conducive to toy hacking, especially for projects where you want to completely embed the electronics inside the finished product. Second, there is already a Spark channel on IFTTT that allows us to remotely trigger actions. More on that later! But before we go too far, let’s test our Spark setup to be sure we can power the LED.
Run the jumper cable from the positive lead to pin D0 and the negative lead to GND. Now let’s write a simple Spark application that turns the LED on and off. Using Spark’s Web IDE, flash the following program onto your Spark Core. This will cause the LED to blink on and off in one-second intervals.

    int led = D0;  // pin wired to the LED's positive lead

    void setup() {
      pinMode(led, OUTPUT);
    }

    void loop() {
      digitalWrite(led, HIGH);  // LED on
      delay(1000);
      digitalWrite(led, LOW);   // LED off
      delay(1000);
    }

But to really make our project useful, we need to hook it up to the Internet and respond to remote triggers for controlling the LED. IFTTT (pronounced like “gift” without the “g”) is a web-based service for connecting a variety of other online services and devices through “recipes”. An IFTTT recipe is of the form “If [this] then [that]”. The services that can be combined to fill in those blanks are called “channels”. IFTTT has dozens of channels to pick from, including email, SMS, Twitter, etc. But especially important to us: there is a Spark channel that allows Spark devices to serve as both triggers and actuators. For this project, we’ll set up our Spark as an actuator that turns on the LED when the “if this” condition is met.

To trigger our lantern, we could use any number of IFTTT channels, but for simplicity, let’s connect it up to the Yo smartphone app. Yo is a (rather silly) app that just lets you send a “yo” message to friends. The Yo channel for IFTTT allows you to trigger recipes by Yo-ing IFTTT. Load the app onto your smartphone and add IFTTT as a contact by clicking the + button and typing “IFTTT” in the username field.

If you haven’t already done so, create an IFTTT account and go to the “Channels” tab to activate the Yo and Spark channels. In both cases, you’ll have to log in to your respective accounts and authorize IFTTT. The process is straightforward, and the IFTTT website walks you through it. Once you’ve done this, you’re ready to create your first recipe.
Click the “Create a Recipe” button found on the “My Recipes” tab. IFTTT will walk you through setting up both the trigger and action. For the “if this” condition, select your Yo channel and the “You Yo IFTTT” trigger. For the “then that” action, select the Spark channel and the “Publish an event” action. Name the event (I just used “yo”) and select the “private event” option. (It doesn’t matter what you enter as the data field; we’re just going to ignore it anyway.) Name your recipe and click “Create Recipe” to finish the process. Your new recipe will now show up in your personal recipe list.

Now we need to modify our Spark code to listen for our “yo” events. Back in the Spark Web IDE, change the code to the following. Instead of turning the LED on and off in the loop() function, we register an event listener using Spark.subscribe() and turn the LED on for five seconds inside the callback function.

    int led = D0;

    void setup() {
      // Listen for "yo" events published by our own account only
      Spark.subscribe("yo", yoHandler, MY_DEVICES);
      pinMode(led, OUTPUT);
    }

    void loop() {}

    // Called for each "yo" event; the event data is ignored
    void yoHandler(const char *event, const char *data) {
      digitalWrite(led, HIGH);
      delay(5000);
      digitalWrite(led, LOW);
    }

Once you’ve flashed this update to your Spark, it’s time to test it out! Be sure the Spark is breathing cyan (meaning it has a connection to the Spark cloud) and then use your smartphone to Yo IFTTT. The LED should light up for five seconds, then turn back off and wait again for the next “yo” event.

Note that the “yo” events will be broadcast to all your Spark devices if you have more than one, so you could set up multiple hacked toys and send your greetings to several people at once. And if you choose to use public events, you could even trigger events for family and friends around the world.

All that’s left to do is package up the lantern by screwing everything back together.
For a more permanent solution, instead of running the wires out to the external Spark, you could carefully fit the Spark and a small LiPo battery inside the lantern as well.

I hope this post has inspired you to give new life to broken or disposable toys you have around the house. If you build something really cool, I’d love to see it. Consider sharing your project on the hackster.io Spark community.

About the author

David Resseguie is a member of the Computational Sciences and Engineering Division at Oak Ridge National Laboratory and lead developer for Sensorpedia. His interests include human computer interaction, Internet of Things, robotics, data visualization, and STEAM education. His current research focus is on applying social computing principles to the design of information sharing systems.


Alex Browne

31 Mar 2015

9 min read

Performing hand-written digit recognition with GoLearn


In this step-by-step post, you'll learn how to do basic recognition of hand-written digits using GoLearn, a machine learning library for Go. I'll assume you are already comfortable with Go and have a basic understanding of machine learning. To learn Go, I recommend the interactive tutorial. And to learn about machine learning, I recommend Andrew Ng's Machine Learning course on Coursera. All of the code for this tutorial is available on GitHub.

Installation & Set Up

To follow along with this post, you will need to install:

- Go version 1.2 or later
- The GoLearn package

Also, make sure that you follow these instructions for setting up your Go work environment. In particular, you will need to have the GOPATH environment variable pointing to a directory where all of your Go code will reside.

Project Structure

Now is a good time to set up the directory where your code for this project will reside. Somewhere in your $GOPATH/src, create a new directory and call it whatever you want. I recommend $GOPATH/src/github.com/your-github-username/golearn-digit-recognition. Our basic project structure is going to look like this:

    golearn-digit-recognition/
        data/
            mnist_train.csv
            mnist_test.csv
        main.go

The data directory is where we'll put our training and test data, and our program is going to consist of a single file: main.go.

Getting the Training Data

As I mentioned, in this post we're going to be using GoLearn to recognize hand-written digits. The training data we'll use comes from the popular MNIST handwritten digit database. I've already split the data into training and test sets and formatted it in the way GoLearn expects. You can simply download the CSV files and put them in your data directory:

- Training Data
- Test Data

The data consists of a series of 28x28 pixel grayscale images and labels for the corresponding digit (0-9). 28x28 = 784, so there are 784 features. In the CSV files, the pixels are labeled pixel0-pixel783.
Each pixel can take on a value between 0 and 255, where 0 is white and 255 is black. There are 5,000 rows in the training data, and 500 in the test data.

Writing the Code

Without further ado, let's write a simple program to detect hand-written digits. Open up the main.go file in your favorite text editor and add the following lines:

    package main

    import (
        "fmt"

        "github.com/sjwhitworth/golearn/base"
    )

    func main() {
        // Load and parse the data from csv files
        fmt.Println("Loading data...")
        trainData, err := base.ParseCSVToInstances("data/mnist_train.csv", true)
        if err != nil {
            panic(err)
        }
        testData, err := base.ParseCSVToInstances("data/mnist_test.csv", true)
        if err != nil {
            panic(err)
        }

        // Temporary: Go rejects unused variables, so reference both
        // until the later steps actually use them.
        _, _ = trainData, testData
    }

The ParseCSVToInstances function reads the CSV file and converts it into "Instances," which is simply a data structure that GoLearn can understand and manipulate. You should run the program with go run main.go to make sure everything works so far.

Next, we're going to create a linear Support Vector Classifier, which is a type of Support Vector Machine where the output is the probability that the input belongs to some class. In our case, there are 10 possible classes representing the digits 0 through 9, so our SVC will consist of 10 SVMs, each of which outputs the probability that the input belongs to a certain class. The SVC will then simply output the class with the highest probability. Modify main.go by importing the linear_models package from golearn:

    import (
        // ...
        "github.com/sjwhitworth/golearn/linear_models"
    )

Then add the following lines:

    func main() {
        // ...

        // Create a new linear SVC with some good default values
        classifier, err := linear_models.NewLinearSVC("l1", "l2", true, 1.0, 1e-4)
        if err != nil {
            panic(err)
        }

        // Don't output information on each iteration
        base.Silent()

        // Train the linear SVC
        fmt.Println("Training...")
        classifier.Fit(trainData)
    }

You can read more about the different parameters for the SVC here. I found that these parameters give pretty good results.
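To make the "one classifier per digit, pick the highest output" idea concrete, here is a minimal sketch in Python rather than Go. It is not GoLearn's implementation: the `score` and `predict` helpers and the tiny two-feature models are made up for illustration, and each per-class output is treated as a plain linear score.

```python
# Illustrative one-vs-rest prediction. The models below are invented
# toy values, not weights learned by GoLearn.

def score(weights, bias, features):
    """Linear score for one class: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(models, features):
    """Return the label whose per-class model gives the highest score."""
    return max(models, key=lambda label: score(*models[label], features))

# Hypothetical models for three classes over two features.
models = {
    0: ([1.0, -1.0], 0.0),
    1: ([-1.0, 1.0], 0.0),
    2: ([0.5, 0.5], -1.0),
}

print(predict(models, [2.0, 0.5]))  # class 0 scores highest here
print(predict(models, [0.0, 3.0]))  # class 1 scores highest here
```

With real MNIST data there would be ten such models, each with 784 weights, one per pixel.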
After we've created the classifier, training it is as simple as calling classifier.Fit(). Now might be a good time to run go run main.go again to make sure everything compiles and works as expected. If you want to see some details about what's going on with the classifier, comment out or remove the base.Silent() line.

Finally, we can test the accuracy of our SVC by making predictions on the test data and then comparing our predictions to the expected output. GoLearn makes it really easy to do this. Just modify main.go as follows:

    package main

    import (
        // ...
        "github.com/sjwhitworth/golearn/evaluation"
        // ...
    )

    func main() {
        // ...

        // Make predictions for the test data
        fmt.Println("Predicting...")
        predictions, err := classifier.Predict(testData)
        if err != nil {
            panic(err)
        }

        // Get a confusion matrix and print out some accuracy stats for our predictions
        confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
        if err != nil {
            panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
        }
        fmt.Println(evaluation.GetSummary(confusionMat))
    }

After making the predictions for our test data, we use the evaluation package to quickly get some stats about the accuracy of our classifier. You should run the program again with go run main.go. If everything works correctly, you should see output that looks something like this:

    Loading data...
    Training...
    Predicting...
    Reference Class  True Positives  False Positives  True Negatives  Precision  Recall  F1 Score
    ---------------  --------------  ---------------  --------------  ---------  ------  --------
    6                42              4                447             0.9130     0.8571  0.8842
    5                31              15               444             0.6739     0.7561  0.7126
    8                37              7                445             0.8409     0.7708  0.8043
    7                47              5                440             0.9038     0.8545  0.8785
    2                51              6                434             0.8947     0.8500  0.8718
    3                35              9                448             0.7955     0.8140  0.8046
    1                50              5                443             0.9091     0.9615  0.9346
    4                48              4                441             0.9231     0.8727  0.8972
    0                41              3                455             0.9318     0.9762  0.9535
    9                49              11               434             0.8167     0.8909  0.8522
    Overall accuracy: 0.8620

That's about 86% accuracy. Not too bad! And all it took was a few lines of code!
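If you want to see where the numbers in that summary come from, the short Python sketch below recomputes one row by hand. It uses the counts printed for digit 6 and the standard definitions of precision, recall, and F1. The false-negative count does not appear in the output, so deriving it from the 500 test rows is an assumption of this sketch.

```python
# Recompute the summary statistics for the digit-6 row of the output above.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

TOTAL = 500                 # rows in the test set
tp, fp, tn = 42, 4, 447     # counts printed for digit 6
fn = TOTAL - tp - fp - tn   # not printed; derived from the total

p = precision(tp, fp)
r = recall(tp, fn)
print(round(p, 4), round(r, 4), round(f1(p, r), 4))

# Overall accuracy is the share of test rows predicted correctly,
# that is, the sum of the true positives over all ten digits.
true_positives = [42, 31, 37, 47, 51, 35, 50, 48, 41, 49]
accuracy = sum(true_positives) / TOTAL
print(round(accuracy, 4))
```

The rounded values agree with the 0.9130, 0.8571, 0.8842, and 0.8620 figures in the output.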
Summary

If you want to do even better, try playing around with the parameters for the SVC or use a different classifier. GoLearn has support for linear and logistic regression, k-nearest neighbors, neural networks, and more!

About the author

Alex Browne is a recent college grad living in Raleigh, NC with 4 years of professional software experience. He does software contract work to make ends meet, and spends most of his free time learning new things and working on various side projects. He is passionate about open source technology and has plans to start his own company.


Packt

31 Mar 2015

16 min read

Dealing with Legacy Code


In this article by Arun Ravindran, author of the book Django Best Practices and Design Patterns, we will discuss the following topics:

- Reading a Django code base
- Discovering relevant documentation
- Incremental changes versus full rewrites
- Writing tests before changing code
- Legacy database integration

It sounds exciting when you are asked to join a project. Powerful new tools and cutting-edge technologies might await you. However, quite often, you are asked to work with an existing, possibly ancient, codebase. To be fair, Django has not been around for that long. However, projects written for older versions of Django are sufficiently different to cause concern. Sometimes, having the entire source code and documentation might not be enough.

If you are asked to recreate the environment, then you might need to fumble with the OS configuration, database settings, and running services locally or on the network. There are so many pieces to this puzzle that you might wonder how and where to start. Knowing which Django version the code uses is a key piece of information. As Django evolved, everything from the default project structure to the recommended best practices has changed, so identifying the version is a vital step in understanding the code.

Change of Guards

Sitting patiently on the ridiculously short beanbags in the training room, the SuperBook team waited for Hart. He had convened an emergency go-live meeting. Nobody understood the "emergency" part since go live was at least 3 months away.

Madam O rushed in holding a large designer coffee mug in one hand and a bunch of printouts of what looked like project timelines in the other. Without looking up she said, "We are late so I will get straight to the point. In the light of last week's attacks, the board has decided to summarily expedite the SuperBook project and has set the deadline to end of next month. Any questions?"

"Yeah," said Brad, "Where is Hart?"

Madam O hesitated and replied, "Well, he resigned. Being the head of IT security, he took moral responsibility for the perimeter breach." Steve, evidently shocked, was shaking his head. "I am sorry," she continued, "But I have been assigned to head SuperBook and ensure that we have no roadblocks to meet the new deadline."

There was a collective groan. Undeterred, Madam O took one of the sheets and began, "It says here that the Remote Archive module is the most high-priority item in the incomplete status. I believe Evan is working on this."

"That's correct," said Evan from the far end of the room. "Nearly there," he smiled at others, as they shifted focus to him.

Madam O peered above the rim of her glasses and smiled almost too politely. "Considering that we already have an extremely well-tested and working Archiver in our Sentinel code base, I would recommend that you leverage that instead of creating another redundant system."

"But," Steve interrupted, "it is hardly redundant. We can improve over a legacy archiver, can't we?"

"If it isn't broken, then don't fix it," replied Madam O tersely.

"He is working on it," said Brad, almost shouting. "What about all that work he has already finished?"

"Evan, how much of the work have you completed so far?" asked O, rather impatiently.

"About 12 percent," he replied, looking defensive. Everyone looked at him incredulously. "What? That was the hardest 12 percent," he added.

O continued the rest of the meeting in the same pattern. Everybody's work was reprioritized and shoe-horned to fit the new deadline. As she picked up her papers, readying to leave, she paused and removed her glasses. "I know what all of you are thinking... literally. But you need to know that we had no choice about the deadline. All I can tell you now is that the world is counting on you to meet that date, somehow or other." Putting her glasses back on, she left the room.

"I am definitely going to bring my tinfoil hat," said Evan loudly to himself.

Finding the Django version

Ideally, every project will have a requirements.txt or setup.py file at the root directory, and it will have the exact version of Django used for that project. Let's look for a line similar to this:

    Django==1.5.9

Note that the version number is stated exactly (rather than as Django>=1.5.9), which is called pinning. Pinning every package is considered a good practice since it reduces surprises and makes your build more deterministic.

Unfortunately, there are real-world codebases where the requirements.txt file was not updated or is even completely missing. In such cases, you will need to probe for various tell-tale signs to find out the exact version.

Activating the virtual environment

In most cases, a Django project would be deployed within a virtual environment. Once you locate the virtual environment for the project, you can activate it by jumping to that directory and running the activate script for your OS. For Linux, the command is as follows:

    $ source venv_path/bin/activate

Once the virtual environment is active, start a Python shell and query the Django version as follows:

    $ python
    >>> import django
    >>> print(django.get_version())
    1.5.9

The Django version used in this case is Version 1.5.9. Alternatively, you can run the manage.py script in the project to get a similar output:

    $ python manage.py --version
    1.5.9

However, this option would not be available if the legacy project source snapshot was sent to you in an undeployed form. If the virtual environment (and packages) was also included, then you can easily locate the version number (in the form of a tuple) in the __init__.py file of the Django directory. For example:

    $ cd envs/foo_env/lib/python2.7/site-packages/django
    $ cat __init__.py
    VERSION = (1, 5, 9, 'final', 0)
    ...
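When you have many legacy projects to triage, the manual checks above can be scripted. The following is a minimal sketch, not part of Django or pip: the helper names `pinned_django_version` and `version_tuple_to_string` are hypothetical, the sample requirements text is made up, and only exact pins of the form Django==X.Y.Z are recognized.

```python
import re

# Matches an exactly pinned requirement such as "Django==1.5.9".
# Range specifiers such as "Django>=1.5.9" are deliberately not matched.
PIN = re.compile(r"^\s*django\s*==\s*([0-9][0-9A-Za-z.]*)", re.IGNORECASE)

def pinned_django_version(requirements_text):
    """Return the pinned Django version string, or None if it is not pinned."""
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PIN.match(line)
        if match:
            return match.group(1)
    return None

def version_tuple_to_string(version):
    """Simplified: turn a tuple like (1, 5, 9, 'final', 0) into '1.5.9'."""
    return ".".join(str(part) for part in version[:3])

# Made-up requirements file contents for illustration.
sample = "South==0.8.4\nDjango==1.5.9  # pinned\nPillow>=2.0\n"
print(pinned_django_version(sample))
print(version_tuple_to_string((1, 5, 9, "final", 0)))
```

The tuple helper ignores pre-release markers such as 'alpha' or 'rc', so treat its result as a starting point rather than an exact replica of django.get_version().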
If all these methods fail, then you will need to go through the release notes of past Django versions to determine the identifiable changes (for example, the AUTH_PROFILE_MODULE setting was deprecated in Version 1.5) and match them to your legacy code. Once you pinpoint the correct Django version, you can move on to analyzing the code.

Where are the files? This is not PHP

One of the most difficult ideas to get used to, especially if you are from the PHP or ASP.NET world, is that the source files are not located in your web server's document root directory, which is usually named wwwroot or public_html. Additionally, there is no direct relationship between the code's directory structure and the website's URL structure.

In fact, you will find that your Django website's source code is stored in an obscure path such as /opt/webapps/my-django-app. Why is this? Among many good reasons, it is often more secure to move your confidential data outside your public webroot. This way, a web crawler would not be able to accidentally stumble into your source code directory.

Starting with urls.py

Even if you have access to the entire source code of a Django site, figuring out how it works across various apps can be daunting. It is often best to start from the root urls.py URLconf file since it is literally a map that ties every request to the respective views.

With normal Python programs, I often start reading from the start of its execution, say, from the top-level main module or wherever the __main__ check idiom starts. In the case of Django applications, I usually start with urls.py since it is easier to follow the flow of execution based on the various URL patterns a site has.

In Linux, you can use the following find command to locate the settings.py file and the corresponding line specifying the root urls.py:

    $ find . -iname settings.py -exec grep -H 'ROOT_URLCONF' {} \;
    ./projectname/settings.py:ROOT_URLCONF = 'projectname.urls'

    $ ls projectname/urls.py
    projectname/urls.py

Jumping around the code

Reading code sometimes feels like browsing the web without the hyperlinks. When you encounter a function or variable defined elsewhere, you will need to jump to the file that contains that definition. Some IDEs can do this automatically for you as long as you tell them which files to track as part of the project.

If you use Emacs or Vim instead, then you can create a TAGS file to quickly navigate between files. Go to the project root and run a tool called Exuberant Ctags as follows:

    find . -iname "*.py" -print | etags -

This creates a file called TAGS that records where every syntactic unit, such as a class or function, is defined. In Emacs, you can jump to the definition of the tag under your cursor (or point, as it is called in Emacs) using the M-. command.

While using a tag file is extremely fast for large code bases, it is quite basic and is not aware of a virtual environment (where most definitions might be located). An excellent alternative is to use the elpy package in Emacs. It can be configured to detect a virtual environment. Jumping to the definition of a syntactic element uses the same M-. command. However, the search is not restricted to the tag file. So, you can even jump to a class definition within the Django source code seamlessly.

Understanding the code base

It is quite rare to find legacy code with good documentation. Even if you do, the documentation might be out of sync with the code in subtle ways that can lead to further issues. Often, the best guide to understanding the application's functionality is the executable test cases and the code itself.

The official Django documentation has been organized by versions at https://docs.djangoproject.com.
On any page, you can quickly switch to the corresponding page in the previous versions of Django with a selector on the bottom right-hand section of the page.

In the same way, documentation for any Django package hosted on readthedocs.org can also be traced back to its previous versions. For example, you can select the documentation of django-braces all the way back to v1.0.0 by clicking on the selector on the bottom left-hand section of the page.

Creating the big picture

Most people find it easier to understand an application if you show them a high-level diagram. While this is ideally created by someone who understands the workings of the application, there are tools that can create a very helpful high-level depiction of a Django application.

A graphical overview of all models in your apps can be generated by the graph_models management command, which is provided by the django-extensions package (formerly named django-command-extensions). As shown in the following diagram, the model classes and their relationships can be understood at a glance:

[Figure: Model classes used in the SuperBook project connected by arrows indicating their relationships]

This visualization is actually created using PyGraphviz. It can get really large for projects of even medium complexity. Hence, it might be easier if the applications are logically grouped and visualized separately.

PyGraphviz Installation and Usage

If you find the installation of PyGraphviz challenging, then don't worry, you are not alone. Recently, I faced numerous issues while installing on Ubuntu, ranging from Python 3 incompatibility to incomplete documentation. To save you time, I have listed the steps that worked for me to reach a working setup.
On Ubuntu, you will need the following packages installed to install PyGraphviz:

    $ sudo apt-get install python3.4-dev graphviz libgraphviz-dev pkg-config

Now activate your virtual environment and run pip to install the development version of PyGraphviz directly from GitHub, which supports Python 3:

    $ pip install git+http://github.com/pygraphviz/pygraphviz.git#egg=pygraphviz

Next, install django-extensions and add it to your INSTALLED_APPS. Now, you are all set. Here is a sample usage to create a GraphViz dot file for just two apps and to convert it to a PNG image for viewing:

    $ python manage.py graph_models app1 app2 > models.dot
    $ dot -Tpng models.dot -o models.png

Incremental change or a full rewrite?

Often, you would be handed legacy code by the application owners in the earnest hope that most of it can be used right away or after a couple of minor tweaks. However, reading and understanding a huge and often outdated code base is not an easy job. Unsurprisingly, most programmers prefer to work on greenfield development.

In the best case, the legacy code ought to be easily testable, well documented, and flexible enough to work in modern environments so that you can start making incremental changes in no time. In the worst case, you might recommend discarding the existing code and going for a full rewrite. Or, as is commonly decided, the short-term approach would be to keep making incremental changes while a parallel long-term effort gets underway for a complete reimplementation.

A general rule of thumb to follow while making such decisions is this: if the cost of rewriting the application and maintaining the rewritten application is lower than the cost of maintaining the old application over time, then it is recommended to go for a rewrite. Care must be taken to account for all the factors, such as the time taken to get new programmers up to speed, the cost of maintaining outdated hardware, and so on.
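The rule of thumb above is simple arithmetic once you settle on a time horizon. Here is a small sketch of it. The `should_rewrite` helper and all of the figures are hypothetical, and a real estimate would also fold in the softer factors just mentioned.

```python
def should_rewrite(rewrite_cost, new_upkeep_per_year, old_upkeep_per_year, years):
    """Compare total cost of rewriting versus keeping the old code.

    All arguments use the same (arbitrary) cost unit. Returns True when
    rewriting plus maintaining the new code is cheaper over the horizon.
    """
    new_total = rewrite_cost + new_upkeep_per_year * years
    old_total = old_upkeep_per_year * years
    return new_total < old_total

# Made-up figures: the rewrite costs 200 units, the new code costs 40
# units a year to maintain, and the old code costs 100 units a year.
print(should_rewrite(200, 40, 100, years=3))  # 320 vs 300
print(should_rewrite(200, 40, 100, years=5))  # 400 vs 500
```

With these numbers the rewrite only pays off on the longer horizon, which is why the time frame you pick matters as much as the cost estimates themselves.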
Sometimes, the complexity of the application domain becomes a huge barrier against a rewrite, since a lot of knowledge learnt in the process of building the older code gets lost. Often, this dependency on the legacy code is a sign of poor design in the application, such as failing to externalize the business rules from the application logic.

The worst form of a rewrite you can probably undertake is a conversion, or a mechanical translation from one language to another without taking any advantage of the existing best practices. In other words, you lose the opportunity to modernize the code base by removing years of cruft.

Code should be seen as a liability, not an asset. As counter-intuitive as it might sound, if you can achieve your business goals with less code, you have dramatically increased your productivity. Having less code to test, debug, and maintain can not only reduce ongoing costs but also make your organization more agile and flexible to change.

Code is a liability not an asset. Less code is more maintainable.

Irrespective of whether you are adding features or trimming your code, you must not touch your working legacy code without tests in place.

Write tests before making any changes

In the book Working Effectively with Legacy Code, Michael Feathers defines legacy code as, simply, code without tests. He elaborates that with tests one can modify the behavior of the code quickly and verifiably. In the absence of tests, it is impossible to gauge whether a change made the code better or worse.

Often, we do not know enough about legacy code to confidently write a test. Michael recommends writing tests that preserve and document the existing behavior, which are called characterization tests.

Unlike the usual approach of writing tests, while writing a characterization test, you will first write a failing test with a dummy output, say X, because you don't know what to expect.
When the test harness fails with an error, such as "Expected output X but got Y", then you will change your test to expect Y. So, now the test will pass, and it becomes a record of the code's existing behavior.

Note that we might record buggy behavior as well. After all, this is unfamiliar code. Nevertheless, writing such tests is necessary before we start changing the code. Later, when we know the specifications and code better, we can fix these bugs and update our tests (not necessarily in that order).

Step-by-step process to writing tests

Writing tests before changing the code is similar to erecting scaffolding before the restoration of an old building. It provides a structural framework that helps you confidently undertake repairs.

You might want to approach this process in a stepwise manner as follows:

1. Identify the area you need to make changes to. Write characterization tests focusing on this area until you have satisfactorily captured its behavior.
2. Look at the changes you need to make and write specific test cases for those. Prefer smaller unit tests to larger and slower integration tests.
3. Introduce incremental changes and test in lockstep. If tests break, then try to analyze whether it was expected. Don't be afraid to break even the characterization tests if that behavior is something that was intended to change.

If you have a good set of tests around your code, then you can quickly find the effect of changing your code.

On the other hand, if you decide to rewrite by discarding your code but not your data, then Django can help you considerably.

Legacy databases

There is an entire section on legacy databases in the Django documentation and rightly so, as you will run into them many times. Data is more important than code, and databases are the repositories of data in most enterprises.

You can modernize a legacy application written in other languages or frameworks by importing its database structure into Django.
As an immediate advantage, you can use the Django admin interface to view and change your legacy data.

Django makes this easy with the inspectdb management command, which looks as follows:

    $ python manage.py inspectdb > models.py

This command, if run while your settings are configured to use the legacy database, can automatically generate the Python code that would go into your models file.

Here are some best practices if you are using this approach to integrate with a legacy database:

- Know the limitations of the Django ORM beforehand. Currently, multicolumn (composite) primary keys and NoSQL databases are not supported.
- Don't forget to manually clean up the generated models, for example, remove the redundant 'ID' fields since Django creates them automatically.
- Foreign Key relationships may have to be manually defined. In some databases, the auto-generated models will have them as integer fields (suffixed with _id).
- Organize your models into separate apps. Later, it will be easier to add the views, forms, and tests in the appropriate folders.
- Remember that running the migrations will create Django's administrative tables (django_* and auth_*) in the legacy database.

In an ideal world, your auto-generated models would immediately start working, but in practice, it takes a lot of trial and error. Sometimes, the data type that Django inferred might not match your expectations. In other cases, you might want to add additional meta information such as unique_together to your model.

Eventually, you should be able to see all the data that was locked inside that aging PHP application in your familiar Django admin interface. I am sure this will bring a smile to your face.

Summary

In this article, we looked at various techniques to understand legacy code. Reading code is often an underrated skill. But rather than reinventing the wheel, we need to judiciously reuse good working code whenever possible.


Packt

30 Mar 2015

8 min read

GUI Components in Qt 5


In this article, Symeon Huang, author of the book Qt 5 Blueprints, explains typical and basic GUI components in Qt 5.

(For more resources related to this topic, see here.)

Design UI in Qt Creator

Qt Creator is the official IDE for Qt application development, and we're going to use it to design the application's UI. First, let's create a new project:

1. Open Qt Creator.
2. Navigate to File | New File or Project.
3. Choose Qt Widgets Application.
4. Enter the project's name and location. In this case, the project's name is layout_demo.

You may wish to follow the wizard and keep the default values. After this creation process, Qt Creator will generate the skeleton of the project based on your choices. UI files are under the Forms directory. When you double-click on a UI file, Qt Creator will redirect you to the integrated Designer; the mode selector should have Design highlighted, and the main window should contain several sub-windows that let you design the user interface. Here we can design the UI by dragging and dropping.

Qt Widgets

Drag three push buttons from the widget box (widget palette) into the frame of MainWindow in the center. The default text displayed on these buttons is PushButton, but you can change the text if you want by double-clicking on the button. In this case, I changed them to Hello, Hola, and Bonjour, respectively. Note that this operation won't affect the objectName property, so to keep things neat and easy to find, we need to change the objectName as well. The right-hand side of the UI contains two windows. The upper-right section includes Object Inspector and the lower-right section includes the Property Editor. Just select a push button, and we can easily change its objectName in the Property Editor. For the sake of convenience, I changed these buttons' objectName properties to helloButton, holaButton, and bonjourButton respectively.
Save the changes and click on Run on the left-hand side panel; it will build the project automatically and then run it, as shown in the following screenshot:

In addition to the push button, Qt provides lots of commonly used widgets for us. There are buttons such as the tool button, radio button, and checkbox, and advanced views such as list, tree, and table. Of course, there are input widgets too: line edit, spin box, font combo box, date and time edit, and so on. Other useful widgets such as progress bar, scroll bar, and slider are also in the list. Besides, you can always subclass QWidget and write your own.

Layouts

A quick way to delete a widget is to select it and press the Delete key. Meanwhile, some widgets, such as the menu bar, status bar, and toolbar, can't be selected, so we have to right-click on them in Object Inspector and delete them. Since they are useless in this example, it's safe to remove them for good. Okay, let's understand what needs to be done after the removal. You may want to keep all these push buttons on the same horizontal axis. To do this, perform the following steps:

1. Select all the push buttons, either by clicking on them one by one while keeping the Ctrl key pressed or by drawing an enclosing rectangle containing all the buttons.
2. Right-click and select Lay out | Lay Out Horizontally. The keyboard shortcut for this is Ctrl + H.
3. Resize the horizontal layout and adjust its layoutSpacing by selecting it and dragging any of the points around the selection box until it fits best.

Hmm…! You may have noticed that the text of the Bonjour button is longer than that of the other two buttons, so it should be wider than the others. How do you do this? You can change the horizontal layout object's layoutStretch property in Property Editor. This value indicates the stretch factors of the widgets inside the horizontal layout. They would be laid out in proportion. Change it to 3,3,4, and there you are.
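To see what "laid out in proportion" means for the 3,3,4 setting above, here is a deliberately simplified sketch in Python. It is not Qt's actual layout algorithm, which also takes size hints, minimum sizes, spacing, and margins into account; the function name and the 500-pixel width are assumptions made only for this illustration.

```python
def share_by_stretch(total_width, stretch_factors):
    """Split total_width among widgets in proportion to their stretch factors.

    Simplified model for illustration only; Qt's real layout engine also
    considers size hints, minimum sizes, spacing, and margins.
    """
    total_stretch = sum(stretch_factors)
    if total_stretch <= 0:
        raise ValueError("at least one stretch factor must be positive")
    return [total_width * factor / total_stretch for factor in stretch_factors]


# Hello, Hola, and Bonjour buttons sharing a hypothetical 500-pixel-wide layout.
print(share_by_stretch(500, [3, 3, 4]))  # [150.0, 150.0, 200.0]
```

With these factors, the Bonjour button receives four tenths of the available width, while the other two receive three tenths each.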
The stretched size definitely won't be smaller than the minimum size hint. This is also how a zero factor behaves alongside nonzero factors: the widget with the zero factor simply keeps its minimum size, rather than causing a division-by-zero error.

Now, drag Plain Text Edit just below, and not inside, the horizontal layout. Obviously, it would be neater if we could extend the plain text edit's width. However, we don't have to do this manually. In fact, we could change the layout of the parent, MainWindow. That's it! Right-click on MainWindow, and then navigate to Lay out | Lay Out Vertically. Wow! All the child widgets are automatically extended to the inner boundary of MainWindow; they are kept in a vertical order. You'll also find Layout settings in the centralWidget property, which work exactly like those of the previous horizontal layout.

The last thing to make this application halfway decent is to change the title of the window. MainWindow is not the title you want, right? Click on MainWindow in the object tree. Then, scroll down its properties to find windowTitle. Name it whatever you want. In this example, I changed it to Greeting. Now, run the application again and you will see it looks like what is shown in the following screenshot:

Qt Quick Components

Since Qt 5, Qt Quick has evolved to version 2.0, which delivers a dynamic and rich experience. The language it uses is called QML, which is basically an extended version of JavaScript using a JSON-like format. To create a simple Qt Quick application based on Qt Quick Controls 1.2, follow these steps:

1. Create a new project named HelloQML.
2. Select Qt Quick Application instead of the Qt Widgets Application that we chose previously.
3. Select Qt Quick Controls 1.2 when the wizard navigates you to Select Qt Quick Components Set.
4. Edit the file main.qml under the root of the Resources file, qml.qrc, that Qt Creator has generated for our new Qt Quick project.

Let's see how the code should look.
    import QtQuick 2.3
    import QtQuick.Controls 1.2

    ApplicationWindow {
        visible: true
        width: 640
        height: 480
        title: qsTr("Hello QML")

        menuBar: MenuBar {
            Menu {
                title: qsTr("File")
                MenuItem {
                    text: qsTr("Exit")
                    shortcut: "Ctrl+Q"
                    onTriggered: Qt.quit()
                }
            }
        }

        Text {
            id: hw
            text: qsTr("Hello World")
            font.capitalization: Font.AllUppercase
            anchors.centerIn: parent
        }

        Label {
            anchors { bottom: hw.top; bottomMargin: 5; horizontalCenter: hw.horizontalCenter }
            text: qsTr("Hello Qt Quick")
        }
    }

If you have ever touched Java or Python, then the first two lines won't be too unfamiliar. They simply import Qt Quick and Qt Quick Controls, and the number after each is the version of the library. The body of this QML source file is really in JSON style, which enables you to understand the hierarchy of the user interface through the code. Here, the root item is ApplicationWindow, which is basically the same thing as QMainWindow in Qt/C++. When you run this application on Windows, you can barely find the difference between the Text item and the Label item. But on some platforms, or when you change the system font and/or its colour, you'll find that Label follows the font and colour scheme of the system while Text doesn't.

Run this application, and you'll see a menu bar, a text, and a label in the application window, exactly as we wrote in the QML file:

You may miss the Design mode for traditional Qt/C++ development. Well, you can still design a Qt Quick application in Design mode! Click on Design in the mode selector when you edit the main.qml file. Qt Creator will redirect you into Design mode, where you can drag and drop UI components with the mouse:

Almost all the widgets you use in a Qt Widgets application can be found here in a Qt Quick application. Moreover, you can use other modern widgets such as the busy indicator in Qt Quick, which have no counterpart in a Qt Widgets application. However, QML is a declarative language whose performance is generally poorer than that of C++.
Therefore, more and more developers choose to write the UI with Qt Quick in order to deliver a better visual style, while keeping core functions in Qt/C++.

Summary

In this article, we briefly looked at various GUI components of Qt 5 and focused on the Design mode in Qt Creator. Two small examples served as Qt-style "Hello World" demonstrations.

Resources for Article:

Further resources on this subject:

- Code interlude – signals and slots [article]
- Program structure, execution flow, and runtime objects [article]
- Configuring Your Operating System [article]


Packt

30 Mar 2015

7 min read

Geocoding Address-based Data


In this article by Kurt Menke, GISP, Dr. Richard Smith Jr., GISP, Dr. Luigi Pirelli, Dr. John Van Hoesen, GISP, authors of the book Mastering QGIS, we'll have a look at how to geocode address-based data using QGIS and MMQGIS.

(For more resources related to this topic, see here.)

Geocoding addresses has many applications, such as mapping the customer base for a store, members of an organization, public health records, or incidence of crime. Once mapped, the points can be used in many ways to generate information. For example, they can be used as inputs to generate density surfaces, linked to parcels of land, and characterized by socio-economic data. They may also be an important component of a cadastral information system.

An address geocoding operation typically involves the tabular address data and a street network dataset. The street network needs to have attribute fields for address ranges on the left- and right-hand side of each road segment. You can geocode within QGIS using a plugin named MMQGIS (http://michaelminn.com/linux/mmqgis/). MMQGIS has many useful tools. For geocoding, we will use the tools found in MMQGIS | Geocode. There are two tools there: Geocode CSV with Google/OpenStreetMap and Geocode from Street Layer, as shown in the following screenshot. The first tool allows you to geocode a table of addresses using either the Google Maps API or the OpenStreetMap Nominatim web service. This tool requires an Internet connection but no local street network data, as the web services provide the street network. The second tool requires a local street network dataset with address range attributes to geocode the address data:

How address geocoding works

The basic mechanics of address geocoding are straightforward. The street network GIS data layer has attribute columns containing the address ranges on both the even and odd sides of every street segment. In the following example, you can see a piece of the attribute table for the Streets.shp sample data.
The columns LEFTLOW, LEFTHIGH, RIGHTLOW, and RIGHTHIGH contain the address ranges for each street segment:

In the following example, we are looking at Easy Street. On the odd side of the street, the addresses range from 101 to 199. On the even side, they range from 102 to 200. If you wanted to map 150 Easy Street, QGIS would assume that the address is located halfway down the even side of that block. Similarly, 175 Easy Street would be on the odd side of the street, three quarters of the way down the block. Address geocoding assumes that the addresses are evenly spaced along the linear network. QGIS should place the address point very close to its actual position, but due to variability in lot sizes, not every address point will be perfectly positioned.

Now that you've learned the basics, let's work through an example. Here we will geocode addresses using web services. The output will be a point shapefile containing all the attribute fields found in the source Addresses.csv file.

An example – geocoding using web services

Here are the steps for geocoding the Addresses.csv sample data using web services:

- Load the Addresses.csv and the Streets.shp sample data into QGIS Desktop.
- Open Addresses.csv and examine the table. These are addresses of municipal facilities. Notice that the street address (for example, 150 Easy Street) is contained in a single field. There are also fields for the city, state, and country. Since both Google and OpenStreetMap are global services, it is wise to include such fields so that the services can narrow down the geography.
- Install and enable the MMQGIS plugin.
- Navigate to MMQGIS | Geocode | Geocode CSV with Google/OpenStreetMap. The Web Service Geocode dialog window will open.
- Select Input CSV File (UTF-8) by clicking on Browse… and locating the delimited text file on your system.
- Select the address fields by clicking on the drop-down menu and identifying the Address Field, City Field, State Field, and Country Field fields.
MMQGIS may identify some or all of these fields by default if they are named with logical names such as Address or State.

- Choose the web service.
- Name the output shapefile by clicking on Browse….
- Name Not Found Output List by clicking on Browse…. Any records that are not matched will be written to this file. This allows you to easily see and troubleshoot any unmapped records.
- Click on OK. The status of the geocoding operation can be seen in the lower-left corner of QGIS. The word Geocoding will be displayed, followed by the number of records that have been processed.

The output will be a point shapefile and a CSV file listing the addresses that were not matched. Two additional attribute columns will be added to the output address point shapefile: addrtype and addrlocat. These fields provide information on how the web geocoding service obtained the location. These may be useful for accuracy assessment.

- Addrtype is the Google <type> element or the OpenStreetMap class attribute. This will indicate what kind of address type this is (highway, locality, museum, neighborhood, park, place, premise, route, train_station, university, and so on).
- Addrlocat is the Google <location_type> element or OpenStreetMap type attribute. This indicates the relationship of the coordinates to the addressed feature (approximate, geometric center, node, relation, rooftop, way interpolation, and so on).

If the web service returns more than one location for an address, the first of the locations will be used as the output feature. Use of this plugin requires an active Internet connection. Google places both rate and volume restrictions on the number of addresses that can be geocoded within various time limits. You should visit the Google Geocoding API website (http://code.google.com/apis/maps/documentation/geocoding/) for more details, current information, and Google's terms of service. Geocoding via these web services can be slow.
If you don't get the desired results with one service, try the other. Geocoding operations rarely have 100% success. Street names in the street shapefile must match the street names in the CSV file exactly. Any discrepancies between the name of a street in the address table and in the street attribute table will lower the geocoding success rate.

The following image shows the results of geocoding addresses via street address ranges. The addresses are shown with the street network used in the geocoding operation:

Geocoding is often an iterative process. After the initial geocoding operation, you can review the Not Found CSV file. If it's empty, then all the records were matched. If it has records in it, compare them with the attributes of the streets layer. This will help you determine why those records were not mapped. It may be due to inconsistencies in the spelling of street names. It may also be due to a street centerline layer that is not as current as the addresses. Once the errors have been identified, they can be corrected by editing the data or by obtaining a different street centerline dataset. The geocoding operation can be re-run on those unmatched addresses. This process can be repeated until all records are matched. Use the Identify tool to inspect the mapped points, and the roads, to ensure that the operation was successful. Never take a GIS operation for granted. Check your results with a critical eye.

Summary

This article introduced you to the process of address geocoding using QGIS and the MMQGIS plugin.

Resources for Article:

Further resources on this subject:

- Editing attributes [article]
- How Vector Features are Displayed [article]
- QGIS Feature Selection Tools [article]
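To make the address-range interpolation from the "How address geocoding works" section concrete, here is a minimal sketch in Python. It is not the code MMQGIS or QGIS actually runs; the function names and the straight, 100-unit-long segment are assumptions for illustration, and a real geocoder would also follow the segment's true geometry and offset the point to the correct side of the street.

```python
def address_fraction(house_number, range_low, range_high):
    """Return how far along the segment (0.0 to 1.0) a house number falls,
    assuming addresses are evenly spaced between range_low and range_high."""
    if (house_number - range_low) % 2 != 0:
        raise ValueError("house number is on the other side of the street")
    if not range_low <= house_number <= range_high:
        raise ValueError("house number is outside this segment's range")
    if range_high == range_low:
        return 0.0
    return (house_number - range_low) / (range_high - range_low)


def locate_on_segment(fraction, start_xy, end_xy):
    """Linearly interpolate a point between the two ends of a straight segment."""
    (x0, y0), (x1, y1) = start_xy, end_xy
    return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))


# Easy Street from the example: even side 102-200, odd side 101-199.
even_fraction = address_fraction(150, 102, 200)
odd_fraction = address_fraction(175, 101, 199)
print(round(even_fraction, 2))  # 0.49, roughly halfway down the block
print(round(odd_fraction, 2))   # 0.76, roughly three quarters of the way

# Hypothetical segment running from (0, 0) to (100, 0) in map units.
print(locate_on_segment(even_fraction, (0.0, 0.0), (100.0, 0.0)))
```

The even-spacing assumption is exactly why results can be slightly off on blocks with irregular lot sizes.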


Packt

30 Mar 2015

12 min read

Getting Started with Intel Galileo


In this article by Onur Dundar, author of the book Home Automation with Intel Galileo, we will see how to develop home automation examples using the Intel Galileo development board along with the existing home automation sensors and devices. In the book, a good review of Intel Galileo will be provided, which will teach you to develop native C/C++ applications for Intel Galileo. (For more resources related to this topic, see here.) After a good introduction to Intel Galileo, we will review home automation's history, concepts, technology, and current trends. When we have an understanding of home automation and the supporting technologies, we will develop some examples on two main concepts of home automation: energy management and security. We will build some examples under energy management using electrical switches, light bulbs and switches, as well as temperature sensors. For security, we will use motion, water leak sensors, and a camera to create some examples. For all the examples, we will develop simple applications with C and C++. Finally, when we are done building good and working examples, we will work on supporting software and technologies to create more user friendly home automation software. In this article, we will take a look at the Intel Galileo development board, which will be the device that we will use to build all our applications; also, we will configure our host PC environment for software development. The following are the prerequisites for this article: A Linux PC for development purposes. All our work has been done on an Ubuntu 12.04 host computer, for this article and others as well. (If you use newer versions of Ubuntu, you might encounter problems with some things in this article.) An Intel Galileo (Gen 2) development board with its power adapter. A USB-to-TTL serial UART converter cable; the suggested cable is TTL-232R-3V3 to connect to the Intel Galileo Gen 2 board and your host system. 
You can see an example of a USB-to-TTL serial UART cable at http://www.amazon.com/GearMo%C2%AE-3-3v-Header-like-TTL-232R-3V3/dp/B004LBXO2A. If you are going to use Intel Galileo Gen 1, you will need a 3.5 mm jack-to-UART cable. You can see the mentioned cable at http://www.amazon.com/Intel-Galileo-Gen-Serial-cable/dp/B00O170JKY/. An Ethernet cable connected to your modem or switch in order to connect Intel Galileo to the local network of your workplace. A microSD card. Intel Galileo supports microSD cards up to 32 GB storage.

Introducing Intel Galileo

The Intel Galileo board is the first in a line of Arduino-certified development boards based on Intel x86 architecture. It is designed to be hardware and software pin-compatible with Arduino shields designed for the Uno R3. Arduino is an open source physical computing platform based on a simple microcontroller board and a development environment for writing software for the board. Arduino can be used to develop interactive objects by taking inputs from a variety of switches or sensors and controlling a variety of lights, motors, and other physical outputs.

The Intel Galileo board is based on the Intel Quark X1000 SoC, a 32-bit Intel Pentium processor-class system on a chip (SoC). In addition to Arduino compatible I/O pins, Intel Galileo inherited mini PCI Express slots, a 10/100 Mbps Ethernet RJ45 port, USB 2.0 host, and client I/O ports from the PC world. The Intel Galileo Gen 1 USB host is a micro USB slot. In order to use a generation 1 USB host with USB 2.0 cables, you will need an OTG (On-the-go) cable. You can see an example cable at http://www.amazon.com/Cable-Matters-2-Pack-Micro-USB-Adapter/dp/B00GM0OZ4O.

Another good feature of the Intel Galileo board is that it has open source hardware designed together with its software. Hardware design schematics and the bill of materials (BOM) are distributed on the Intel website.
Intel Galileo runs on a custom embedded Linux operating system, and its firmware, bootloader, as well as kernel source code can be downloaded from https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=23171. Another helpful URL to identify, locate, and ask questions about the latest changes in the software and hardware is the open source community at https://communities.intel.com/community/makers.

Intel delivered two versions of the Intel Galileo development board called Gen 1 and Gen 2. At the moment, only Gen 2 versions are available. There are some hardware changes in Gen 2, as compared to Gen 1. You can see both versions in the following image:

The first board (on the left-hand side) is the Intel Galileo Gen 1 version and the second one (on the right-hand side) is Intel Galileo Gen 2.

Using Intel Galileo for home automation

As mentioned in the previous section, Intel Galileo supports various sets of I/O peripherals. Arduino sensor shields and USB and mini PCI-E devices can be used to develop and create applications. Intel Galileo can be expanded with the help of I/O peripherals, so we can manage the sensors needed to automate our home.

When we take a look at the existing home automation modules in the market, we can see that preconfigured hubs or gateways manage these modules to automate homes. A hub or a gateway is programmed to send and receive data to/from home automation devices. Similarly, with the help of a Linux operating system running on Intel Galileo and the support of multiple I/O ports on the board, we will be able to manage home automation devices. We will implement new applications or will port existing Linux applications to connect home automation devices. Connecting to the devices will enable us to collect data as well as receive and send commands to these devices. Being able to send and receive commands to and from these devices will make Intel Galileo a gateway or a hub for home automation.
It is also possible to develop simple home automation devices with the help of the existing sensors. The pinout helps us to connect sensors to the board, read and write data to them, and come up with a device. Finally, the power of open source and Linux on Intel Galileo will enable you to reuse the developed libraries for your projects. It can also be used to run existing open source projects on technologies such as Node.js and Python on the board together with our C application. This will help you to add more features and extend the board's capability, for example, serving a web user interface easily from Intel Galileo with Node.js.

Intel Galileo – hardware specifications

The Intel Galileo board is an open source hardware design. The schematics, Cadence Allegro board files, and BOM can be downloaded from the Intel Galileo web page. In this section, we will just take a look at some key hardware features for future reference, to understand the hardware capability of Intel Galileo in order to make better decisions on software design. Intel Galileo is an embedded system with the required RAM and flash storage included on the board to boot it and run without any additional hardware.
The following table shows the features of Intel Galileo:

Processor features
- 1 Core 32-bit Intel Pentium processor-compatible ISA Intel Quark SoC X1000
- 400 MHz
- 16 KB L1 Cache
- 512 KB SRAM
- Integrated real-time clock (RTC)

Storage
- 8 MB NOR Flash for firmware and bootloader
- 256 MB DDR3; 800 MT/s
- SD card, up to 32 GB
- 8 KB EEPROM

Power
- 7 V to 15 V
- Power over Ethernet (PoE) requires you to install the PoE module

Ports and connectors
- USB 2.0 host (standard type A), client (micro USB type B)
- RJ45 Ethernet
- 10-pin JTAG for debugging
- 6-pin UART
- 6-pin ICSP
- 1 mini-PCI Express slot
- 1 SDIO

Arduino compatible headers
- 20 digital I/O pins
- 6 analog inputs
- 6 PWMs with 12-bit resolution
- 1 SPI master
- 2 UARTs (one shared with the console UART)
- 1 I2C master

Intel Galileo – software specifications

Intel delivers prebuilt images and binaries along with its board support package (BSP) to download the source code and build all related software with your development system. The running operating system on Intel Galileo is Linux; sometimes, it is called Yocto Linux because of the Linux filesystem, cross-compiled toolchain, and kernel images created by the Yocto Project's build mechanism. The Yocto Project is an open source collaboration project that provides templates, tools, and methods to help you create custom Linux-based systems for embedded products, regardless of the hardware architecture. The following diagram shows the layers of the Intel Galileo development board:

Intel Galileo is an embedded Linux product; this means you need to compile your software on your development machine with the help of a cross-compiled toolchain or software development kit (SDK). A cross-compiled toolchain/SDK can be created using the Yocto project; we will go over the instructions in the following sections. The toolchain includes the necessary compiler and linker for Intel Galileo to compile and build C/C++ applications for the Intel Galileo board.
The binary created on your host with the Intel Galileo SDK will not work on the host machine since it is created for a different architecture. With the help of the C/C++ APIs and libraries provided with the Intel Galileo SDK, you can build any C/C++ native application for Intel Galileo as well as port any existing native application (without a graphical user interface) to run on Intel Galileo. Intel Galileo doesn't have a graphics processing unit (GPU). You can still use OpenCV-like libraries, but the performance of matrix operations is so poor on the CPU compared to systems with a GPU that it is not wise to perform complex image processing on Intel Galileo.

Connecting and booting Intel Galileo

We can now proceed to power up Intel Galileo and connect it to its terminal. Before going forward with the board connection, you need to install a modem control program on your host system in order to connect to Intel Galileo through its UART interface with minicom. Minicom is a text-based modem control and terminal emulation program for Unix-like operating systems. If you are not comfortable with text-based applications, you can use graphical serial terminals such as CuteCom or GtkTerm. To start with Intel Galileo, perform the following steps:

- Install minicom:

      $ sudo apt-get install minicom

- Attach the USB of your 6-pin TTL cable and start minicom for the first time with the -s option:

      $ sudo minicom -s

- Before going into the setup details, check that the device is connected to your host. In our case, the serial device is /dev/ttyUSB0 on our host system. You can check it from your host's device messages (dmesg) to see the connected USB.
- When you start minicom with the -s option, it will prompt you. From minicom's Configuration menu, select Serial port setup to set the values, as follows:
- After setting up the serial device, select Exit to go to the terminal. This will prompt you with the booting sequence and launch the Linux console when the Intel Galileo serial device is connected and powered up.
Next, complete the connections on Intel Galileo. Connect the TTL-232R cable to your Intel Galileo board's UART pins. The UART pins are just next to the Ethernet port. Make sure that you have connected the cables correctly. The black wire on the TTL cable is the ground connection, and the board marks which of its UART pins is ground.

We are ready to power up Intel Galileo. After you plug the power cable into the board, you will see the Intel Galileo board's boot sequence on the terminal. When the booting process is completed, it will prompt you to log in; log in as the root user, for which no password is needed. The final prompt will be as follows; we are in the Intel Galileo Linux console, where you can use the basic Linux commands that already exist on the board to explore the Intel Galileo filesystem:

    Poky 9.0.2 (Yocto Project 1.4 Reference Distro) 1.4.2 clanton

    clanton login: root
    root@clanton:~#

Your board will now look like the following image:

Connecting to Intel Galileo via Telnet

If you have connected Intel Galileo to a local network with an Ethernet cable, you can use Telnet to connect to it without using a serial connection, after performing some simple steps:

- Run the following commands on the Intel Galileo terminal:

      root@clanton:~# ifup eth0
      root@clanton:~# ifconfig
      root@clanton:~# telnetd

  The ifup command brings the Ethernet interface up, and the telnetd command starts the Telnet daemon. You can check the assigned IP address with the ifconfig command.
- From your host system, run the following command with your Intel Galileo board's IP address to start a Telnet session with Intel Galileo:

      $ telnet 192.168.2.168

Summary

In this article, we learned how to use the Intel Galileo development board, its software, and system development environment. It takes some time to get used to all the tools if you are new to them.
A little practice with Eclipse is very helpful to build applications and make remote connections or to write simple applications on the host console with a terminal and build them. Let's go through all the points we have covered in this article. First, we read some general information about Intel Galileo and why we chose Intel Galileo, with some good reasons being Linux and the existing I/O ports on the board. Then, we saw some more details about Intel Galileo's hardware and software specifications and understood how to work with them. I believe understanding the internal working of Intel Galileo in building a Linux image and a kernel is a good practice, leading us to customize and run more tools on Intel Galileo. Finally, we learned how to develop applications for Intel Galileo. First, we built an SDK and set up the development environment. There were more instructions about how to deploy the applications on Intel Galileo over a local network as well. Then, we finished up by configuring the Eclipse IDE to quicken the development process for future development. In the next article, we will learn about home automation concepts and technologies. Resources for Article: Further resources on this subject: Hardware configuration [article] Our First Project – A Basic Thermometer [article] Pulse width modulator [article]
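As a small companion to the "Connecting to Intel Galileo via Telnet" section, the following Python sketch checks from the host whether the board's Telnet port is reachable before you try to open a session. It is only an illustration using Python's standard socket module; the function name is hypothetical, and 192.168.2.168 is simply the example address used earlier, so substitute the address that ifconfig reports on your board.

```python
import socket


def telnet_port_open(host, port=23, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    board_ip = "192.168.2.168"  # example address from the article; yours will differ
    if telnet_port_open(board_ip):
        print("Telnet daemon is reachable on", board_ip)
    else:
        print("No answer on port 23; check ifup eth0 and telnetd on the board")
```

If the check fails, the serial console remains available for verifying that the Ethernet interface is up and that telnetd is running.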


Packt

30 Mar 2015

33 min read

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout



Packt

30 Mar 2015

28 min read

PostgreSQL – New Features


In this article by Jayadevan Maymala, author of the book PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used, data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of the PostgreSQL 9.4 release. We will cover a couple of useful extensions.

(For more resources related to this topic, see here.)

Interesting data types

We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include:

- The number data types (smallint, integer, bigint, decimal, numeric, real, and double precision)
- The character data types (varchar, char, and text)
- The binary data types
- The date/time data types (including date, timestamp without time zone, and timestamp with time zone)
- BOOLEAN data types

However, this is all standard fare. Let's start off by looking at the RANGE data type.

RANGE

This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases.

Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range for a category of cars might run from $15,000 at the lower end to $40,000 at the upper end.

We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly.

Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees.

These are some use cases where we can consider range types.
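To make these use cases concrete before turning to PostgreSQL's own syntax, here is a small Python sketch of what a range value models: a pair of bounds with containment and overlap tests. This is not PostgreSQL code; the class name, the half-open [lower, upper) convention, and the sample values are assumptions chosen for the illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class HalfOpenRange:
    """A [lower, upper) interval: lower is included, upper is excluded."""
    lower: Any
    upper: Any

    def contains(self, value) -> bool:
        return self.lower <= value < self.upper

    def overlaps(self, other: "HalfOpenRange") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.lower < other.upper and other.lower < self.upper


# Hypothetical price band for one car category, in dollars.
compact_prices = HalfOpenRange(15_000, 40_000)

# Hypothetical meeting-room time slots on the same day.
booked = HalfOpenRange(datetime(2014, 7, 30, 10, 0), datetime(2014, 7, 30, 11, 0))
wanted = HalfOpenRange(datetime(2014, 7, 30, 10, 45), datetime(2014, 7, 30, 11, 15))
back_to_back = HalfOpenRange(datetime(2014, 7, 30, 11, 0), datetime(2014, 7, 30, 12, 0))

print(compact_prices.contains(20_000))  # True
print(compact_prices.contains(40_000))  # False: the upper bound is excluded
print(booked.overlaps(wanted))          # True
print(booked.overlaps(back_to_back))    # False: one ends exactly when the other starts
```

Storing both bounds in one value, rather than in two separate columns, is what lets a database answer questions such as "does this slot clash with an existing booking?" with a single operator.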
Range is a family of data types; for the car price range scenario, we can use int4range, the built-in range type over integers. For the meeting room booking and shift use cases, we can consider tsrange or tstzrange (if we want to capture the time zone as well). It makes sense to explore the possibility of using range data types in most scenarios that involve the following features:

From and to timestamps/dates for room reservations
Lower and upper limits for price/discount ranges
Scheduling jobs
Timesheets

Let's now look at an example. We have three meeting rooms. The rooms can be booked, and the entries for the reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? We will look at this with and without the range data type:

    CREATE TABLE rooms(id serial, descr varchar(50));
    INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));
    CREATE TABLE room_book (id serial, room_id integer, from_time timestamp, to_time timestamp, res tsrange);
    INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');
    INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');
    INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');
    INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)');

PostgreSQL has the OVERLAPS operator.
This can be used to get all the reservations that overlap with the period for which we want to book a room:

    SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00');

If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following to the preceding SQL:

    SELECT id FROM rooms EXCEPT

We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement:

    SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)';

Do look up GiST indexes to improve the performance of queries that use range operators. Another way of achieving the same is to use the following command:

    SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time;

Now, let's look at the finer points of how a range is represented. A range is opened with [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing ] or ) has the same effect on the upper value. When we do not specify anything, [) is assumed, implying include the lower value but exclude the upper value. Note that PostgreSQL displays integer ranges in the canonical [) form, so when we mention 3,5 with both bounds included, the result is shown with a lower bound of 3 and an upper bound of 6, and with both bounds excluded it is shown as [4,5):

    SELECT int4range(3,5,'[)') lowerincl, int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl, int4range(3,5,'[)') upperexcl;
     lowerincl | bothincl | bothexcl | upperexcl
    -----------+----------+----------+-----------
     [3,5)     | [3,6)    | [4,5)    | [3,5)

Using network address types

The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and MAC addresses. Let's look at a few use cases.

When we have a website that is open to the public, a number of users from different parts of the world access it.
We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP address is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on.

The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on.

While data types such as VARCHAR or BIGINT can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. The two that store IP addresses are as follows:

inet: This data type can be used to store an IPv4 or IPv6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask.
cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example:

    CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr);
    CREATE TABLE
    INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32');
    INSERT 0 1
    INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24');
    ERROR: invalid cidr value: "192.168.64.2/24"
    LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...
    ^
    DETAIL: Value has bits set to right of mask.
    INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24');
    INSERT 0 1
    SELECT * FROM nettb;
     id |     intclmn     |    cidrclmn
    ----+-----------------+-----------------
      1 | 192.168.64.2    | 192.168.64.2/32
      2 | 192.168.64.2/24 | 192.168.64.0/24

Let's also look at a couple of useful operators available with network address types. Does an IP address fall in a subnet? This can be figured out using <<=, as shown here:

    SELECT id,intclmn FROM nettb ;
     id |   intclmn
    ----+--------------
      1 | 192.168.64.2
      3 | 192.168.12.2
      4 | 192.168.13.2
      5 | 192.168.12.4

    SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24';
     id |   intclmn
    ----+--------------
      3 | 192.168.12.2
      5 | 192.168.12.4

    SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32';
     id |   intclmn
    ----+--------------
      3 | 192.168.12.2

The operator used in the preceding commands checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, greater than or equal to, bitwise AND, bitwise OR, and other standard operators.

The macaddr data type can be used to store MAC addresses in different formats.

hstore for key-value pairs

A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (such as Cassandra) or to the simpler and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions, along with the ability to write SQL to fetch data with different projections.

If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose.
It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead.

Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern.

The usual relational answer is the entity-attribute-value (EAV) approach: there is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same.

Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes:

    CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore);
    INSERT INTO customer (first_name ,dynamic_attributes) VALUES
    ('Michael','ssn=>"123-465-798" '),
    ('Smith','ssn=>"129-465-798" '),
    ('James','ssn=>"No data" '),
    ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" ');

Now, let's try retrieving all customers with their SSN, as shown here:

    SELECT first_name, dynamic_attributes FROM customer WHERE dynamic_attributes ? 'ssn';
     first_name | dynamic_attributes
    ------------+----------------------
     Michael    | "ssn"=>"123-465-798"
     Smith      | "ssn"=>"129-465-798"
     James      | "ssn"=>"No data"

Also, those with a specific SSN:

    SELECT first_name,dynamic_attributes FROM customer WHERE dynamic_attributes -> 'ssn'= '123-465-798';
     first_name | dynamic_attributes
    ------------+----------------------
     Michael    | "ssn"=>"123-465-798"

If we want to get records that do not contain a specific SSN, just use the following condition:

    WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798'

Replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the customers that have no ssn key at all:

     first_name | dynamic_attributes
    ------------+-----------------------------------------------------
     Ram        | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers"

As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on.

We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type. Note that hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently.

json/jsonb

JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years.

PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. Version 9.4 adds one more data type, jsonb, which is very similar to json.

The jsonb data type stores data in binary format.
The jsonb type also removes white space (which is insignificant) and does not keep duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one we should use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice.

Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes:

    CREATE TABLE items ( item_id serial, details json );

Now, we will add records. All DVDs go into one record, books go into another, and so on:

    INSERT INTO items (details) VALUES (
    '{ "DVDs" :[
      {"Name":"The Making of Thunderstorms", "Types":"Educational", "Age-group":"5-10","Produced By":"National Geographic" },
      {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror", "Certificate":"A", "Director":"Dracula","Actors": [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}] },
      {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense", "Certificate":"A", "Director": "Jonathan Lynn","Actors": [{"Name":"Joe "},{"Name":"Marissa"}] }
    ] }' );

A better approach would be to have one record for each item.
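A one-record-per-item layout could look like the following sketch. The table name items_single and its category column are assumptions made for illustration; the JSON documents reuse two of the DVD entries from the record above.

```sql
-- Hypothetical alternative to the items table: one row per item.
CREATE TABLE items_single (
    item_id  serial,
    category varchar(20),
    details  json
);

INSERT INTO items_single (category, details) VALUES
    ('DVD', '{"Name":"The Making of Thunderstorms", "Types":"Educational",
              "Age-group":"5-10", "Produced By":"National Geographic"}'),
    ('DVD', '{"Name":"My nightmares", "Types":"Movies", "Categories":"Horror",
              "Certificate":"A", "Director":"Dracula",
              "Actors":[{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}]}');

-- Each row is a single item, so an ordinary WHERE clause can filter on its
-- fields. The ->> operator extracts a field of the JSON object as text.
SELECT item_id, details->>'Name' AS name
FROM items_single
WHERE category = 'DVD'
  AND details->>'Types' = 'Movies';
```

With these two rows, only My nightmares is returned. The design choice here is that the item boundary lives in the table rather than inside a JSON array, so there is no need to expand an array before filtering.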
Now, let's take a look at a few JSON functions:

    SELECT details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype FROM items;
    SELECT details->'DVDs' dvds ,pg_typeof(details->'DVDs') datatype FROM items;

Note the difference between ->> and ->. We are using the pg_typeof function to clearly see the data type returned by each operator. Both return the JSON object field. The first operator (->>) returns text and the second (->) returns JSON.

Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement:

    WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items)
    SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a WHERE a->>'Name'='Meena';

This returns the record for My nightmares, the one movie in which Meena acted. We used one more function and a couple of operators. The json_array_elements function expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also used the WITH clause to create a common table expression, a temporary result set that ceases to exist as soon as the query is over. In the next part, we extracted the elements of the Actors array from DVDs. Then, we checked whether the Name element is equal to Meena.

XML

PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became a standard way of exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process.
We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster.

Inserting and verifying XML data

Now, let's take a look at what we can do with the xml data type in PostgreSQL:

    CREATE TABLE tbl_xml(id serial, docmnt xml);
    INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml');
    INSERT INTO tbl_xml (docmnt) SELECT query_to_xml( 'SELECT now()',true,false,'') ;
    SELECT xml_is_well_formed_document(docmnt::text), docmnt FROM tbl_xml;

First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML.

Generating XML files for table definitions and data

We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append _and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema.

PostgreSQL also provides the xpath function to extract data as follows:

    SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml WHERE id = 2;
                   xpath
    ------------------------------------
     {2014-07-29T16:55:00.781533+05:30}

Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB.
We must keep this 1 GB field limit in mind when we use the database to store text/document data.

pgbadger

Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on.

Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, and which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right.

For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows:

    log_destination = 'stderr'
    logging_collector = on
    log_directory = 'pg_log'
    log_filename = 'postgresql-%Y-%m-%d.log'
    log_min_duration_statement = 0
    log_connections = on
    log_disconnections = on
    log_duration = on
    log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
    log_lock_waits = on
    track_activity_query_size = 2048

It might be necessary to restart the cluster for some of these changes to take effect.

We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at the shell prompt:

    pgbench -i pgp

This creates a few tables on the pgp database.
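Before generating load, it may help to confirm from psql that the logging parameters listed above are actually in effect. The following is a small sketch using the pg_settings system view; the selection of parameters to display is ours and can be extended as needed.

```sql
-- Show the current value of some of the logging parameters pgbadger relies on.
-- A context of 'postmaster' means the parameter changes only on a restart;
-- 'sighup' means a configuration reload is enough.
SELECT name, setting, context
FROM pg_settings
WHERE name IN ('logging_collector',
               'log_min_duration_statement',
               'log_line_prefix',
               'log_connections',
               'log_disconnections',
               'log_duration',
               'log_lock_waits')
ORDER BY name;
```

The context column is a quick way to see which of the changes are the ones that need the cluster restart mentioned above; logging_collector is one of them.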
We can log in to psql (database pgp) and check:

    \dt
                  List of relations
     Schema |       Name       | Type  |  Owner
    --------+------------------+-------+----------
     public | pgbench_accounts | table | postgres
     public | pgbench_branches | table | postgres
     public | pgbench_history  | table | postgres
     public | pgbench_tellers  | table | postgres

Now, we can run pgbench to generate load on the database with the following command:

    pgbench -c 5 -T10 pgp

The -T option passes the duration, in seconds, for which pgbench should continue execution, -c passes the number of clients, and pgp is the database.

At the shell prompt, execute:

    wget https://github.com/dalibo/pgbadger/archive/master.zip

Once the file is downloaded, unzip the file using the following command:

    unzip master.zip

Change to the pgbadger-master directory as follows:

    cd pgbadger-master

Execute the following command:

    ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log -o myoutput.html

Replace the log file name in the command with the actual name. It will generate a myoutput.html file.

The HTML file generated will have a wealth of information about what happened in the cluster, with great charts/tables. In fact, it takes quite a bit of time to go through the report. It includes, for example, a chart showing the distribution of queries based on execution time, along with a large number of other performance metrics.

If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the top drop-down list are the right places to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries.

If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore.

The logging configuration used may create huge log files in systems with a lot of activity.
Tweak the parameters appropriately to ensure that this does not happen.

With this, we have covered most of the interesting data types, an interesting extension, and a must-use tool from the PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4.

Features over time

Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database.

Interesting features in 9.4

Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy).

Keeping the buffer ready

As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples.

In a data warehouse, the Extract, Transform, Load (ETL) process, which typically happens once a day, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in slower response times. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve the user experience.

In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously. Ensuring that the fares and availability data for the most frequently searched routes are in the buffer can provide a better user experience.
This also applies to an e-commerce site selling products. If the product/price/inventory data is always available in memory, it can be retrieved very fast.

You must use PostgreSQL 9.4 to try out the code in the following sections.

So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to specify the blocks that should be loaded into the buffer from the table.

We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop the OS buffers, and restart the server. We will see how much time a SELECT count(*) takes. We will then repeat the exercise, but use pg_prewarm before executing SELECT count(*) at psql:

    CREATE EXTENSION pg_prewarm;
    CREATE TABLE myt(id SERIAL, name VARCHAR(40));
    INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name');

Now, stop the server using pg_ctl at the shell prompt:

    pg_ctl stop -m immediate

Clean the OS buffers using the following command at the shell prompt (it needs root privileges; because the redirection is performed by the calling shell, run it from a root shell or through sudo sh -c):

    echo 1 > /proc/sys/vm/drop_caches

The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command:

    SELECT COUNT(*) FROM myt;
    Time: 333.115 ms

We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay.
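The option of loading only specific blocks, mentioned above, can be sketched as follows. The table prewarm_demo is a hypothetical stand-in for myt and the block numbers are arbitrary; the sketch assumes that the pg_prewarm extension is available and that the table spans at least ten blocks, which a table of this size should with the default block size.

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Hypothetical table, populated the same way as myt.
CREATE TABLE prewarm_demo (id serial, name varchar(40));
INSERT INTO prewarm_demo (name)
SELECT concat(generate_series(1,10000), 'name');

-- Load the whole table into shared buffers.
-- The return value is the number of blocks prewarmed.
SELECT pg_prewarm('prewarm_demo');

-- Load only blocks 0 to 9 of the main fork into shared buffers.
SELECT pg_prewarm('prewarm_demo', 'buffer', 'main', 0, 9);

-- The 'read' mode reads the blocks so that they land in the OS cache,
-- without placing them in shared buffers.
SELECT pg_prewarm('prewarm_demo', 'read');
```

The block-range form is useful when only part of a large table is hot, since warming the entire relation could evict other data that is still needed in the buffer.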
With the table prewarmed, the count returns much faster:

    SELECT COUNT(*) FROM myt;
     count
    -------
     10000
    (1 row)
    Time: 7.002 ms

Better recoverability

A new parameter called recovery_min_apply_delay has been added in 9.4. This goes into the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to, say, 5 minutes, and then the standby will replay a transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave.

Easy-to-change parameters

An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql:

    \! more postgresql.auto.conf
    # Do not edit this file manually!
    # It will be overwritten by ALTER SYSTEM command.
    work_mem = '12MB'

We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated.

Logical decoding and consumption of changes

Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode them, and possibly replay them on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes.

Two parameter changes are necessary to set this up.
The two parameters are max_replication_slots (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows:

    SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding');

The first parameter is the name we give to our slot and the second parameter is the plugin to be used. test_decoding is the sample plugin available, which converts WAL entries into text representations. To see it at work, we first make a couple of changes:

    INSERT INTO myt(id) values (4);
    INSERT INTO myt(name) values ('abc');

Now, we will try retrieving the entries:

    SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL);

This function lets us take a look at the changes without consuming them, so that the changes can be accessed again. To consume the changes, we use the following function instead:

    SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL);

This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed.

Summary

In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL Version 9.4, which will be of interest to data architects.

Resources for Article:

Further resources on this subject:
PostgreSQL as an Extensible RDBMS
Getting Started with PostgreSQL
PostgreSQL Cookbook - High Availability and Replication



Packt

27 Mar 2015

14 min read

Understanding and Creating Simple SSRS Reports


Packt

27 Mar 2015

17 min read

Puppet and OS Security Tools


This article by Jason Slagle, author of the book Learning Puppet Security, covers using Puppet to manage SELinux and auditd. We have learned a lot so far about using Puppet to secure your systems, as well as how to use it to make groups of systems more secure. However, in all of that, we've not yet covered some of the basic OS-level functions that are available to secure a system. In this article, we'll review several of those functions.

SELinux is a powerful tool in the security arsenal. Most administrators' experience with it is along the lines of "how can I turn that off?" This is born out of frustration with the poor documentation about the tool, as well as the tedious nature of the configuration. While Puppet cannot help you with the documentation (which is getting better all the time), it can help you with some of the other challenges that SELinux can bring, that is, ensuring that the proper contexts and policies are in place on the systems being managed.

In this article, we'll cover the following topics related to OS-level security tools:

A brief introduction to SELinux and auditd
The built-in Puppet support for SELinux
Community modules for SELinux
Community modules for auditd

At the end of this article, you should have enough skills so that you no longer need to disable SELinux. However, if you still need to do so, it is certainly possible to do via the modules presented here.

Introducing SELinux and auditd

During the course of this article, we'll explore the SELinux framework for Linux and see how to automate it using Puppet. As part of the process, we'll also review auditd, the logging and auditing framework for Linux. Using Puppet, we can automate the configuration of these often-neglected security tools, and even move the configuration of these tools for various services to the modules that configure those services.
The SELinux framework

SELinux is a security system for Linux originally developed by the United States National Security Agency (NSA). It is an in-kernel protection mechanism designed to provide Mandatory Access Controls (MACs) to the Linux kernel.

SELinux isn't the only MAC framework for Linux. AppArmor is an alternative MAC framework included in the Linux kernel since Version 2.6.36. We chose to implement SELinux, since it is the default framework used under Red Hat Linux, which we're using for our examples. More information on AppArmor can be found at http://wiki.apparmor.net/index.php/Main_Page.

These access controls work by confining processes to the minimal amount of file and network access that the processes require to run. By doing this, the controls limit the amount of collateral damage that can be done by a process that becomes compromised.

SELinux was first merged to the Linux mainline kernel for the 2.6.0 release. It was introduced into Red Hat Enterprise Linux with Version 4, and into Ubuntu in Version 8.04. With each successive release of the operating systems, support for SELinux grows, and it becomes easier to use.

SELinux has a couple of core concepts that we need to understand to properly configure it. The first is the pair of concepts of types and contexts. A type in SELinux is a grouping of similar things. Files used by Apache may be httpd_sys_content_t, for instance, which is a type that all content served by HTTP would have. The httpd process itself is of type httpd_t. These types are applied to objects, which represent discrete things, such as files and ports, and become part of the context of that object. The context of an object represents the object's user, role, type, and optionally data on multilevel security. For this discussion, the type is the most important component of the context.
Using a policy, we grant access from the subject, which represents a running process, to various objects that represent files, network ports, memory, and so on. We do that by creating a policy that allows a subject to have access to the types it requires to function.

SELinux has three modes that it can operate in. The first of these modes is disabled. As the name implies, the disabled mode runs without any SELinux enforcement. The second mode is called permissive. In permissive mode, SELinux will log any access violations, but will not act on them. This is a good way to get an idea of where you need to modify your policy or tune Booleans to get proper system operation. The final mode, enforcing, will deny actions that do not have a policy in place. Under Red Hat Linux variants, this is the default SELinux mode: Red Hat 6 runs SELinux with a targeted policy in enforcing mode, which means that SELinux enforces its policy for the targeted daemons.

An example is in order here to explain this well. So far, we've been operating with SELinux disabled on our hosts. The first step in experimenting with SELinux is to turn it on. We'll set it to permissive mode at first, while we gather some information. To do this, after starting our master VM, we'll need to modify the SELinux configuration and reboot. While it's possible to switch between enforcing and permissive modes on a running system, enabling SELinux on a host where it is currently disabled requires a reboot.

Let's edit the /etc/sysconfig/selinux file and set the SELINUX variable to permissive on our puppetmaster. Remember to start the Vagrant machine and SSH in as necessary. Once this is done, the file should contain the line SELINUX=permissive. We then need to reboot. To do so, run the following command:

    sudo shutdown -r now

Wait for the system to come back online. Once the machine is back up and you SSH back into it, run the getenforce command.
It should return Permissive, which means SELinux is running, but not enforcing. Now, we can make sure our master is running and take a look at its context. If it's not running, you can start the service with the sudo service puppetmaster start command.

Now, we'll use the -Z flag on the ps command to examine the SELinux context. Many commands, such as ps and ls, use the -Z flag to show SELinux data. We'll go ahead and run the following command to view the SELinux data for the running puppetmaster:

    ps -efZ | grep puppet

When you do this, you'll see a line of output such as the following:

    unconfined_u:system_r:initrc_t:s0 puppet 1463 1 1 11:41 ? 00:00:29 /usr/bin/ruby /usr/bin/puppet master

If you take a look at the first part of the output line, you'll see that Puppet is running in the unconfined_u:system_r:initrc_t context. This is actually somewhat of a bug and a result of the Puppet policy on CentOS 6 being out of date. We should actually be running under the system_u:system_r:puppetmaster_t:s0 context, but the policy is for a much older version of Puppet, so it runs unconfined.

Let's take a look at the sshd process to see what it looks like as well. To do so, we'll just grep for sshd instead:

    ps -efZ | grep sshd

The output is as follows:

    system_u:system_r:sshd_t:s0-s0:c0.c1023 root 1206 1 0 11:40 ? 00:00:00 /usr/sbin/sshd

This is the more traditional output one would expect. The sshd process is running under the system_u:system_r:sshd_t context. This corresponds to the system user, the system role, and the sshd type. The user and role are SELinux constructs that support role-based access control. SELinux users do not map to system users; instead, they allow us to set a policy based on the SELinux user object. The unconfined user we saw on the Puppet process is a user against which policy is not enforced.

Now, we can take a look at some objects.
Running the ls -lZ /etc/ssh command lists each file along with its context. In that listing, each of the files belongs to a context that includes the system user, as well as the object role. They are split between the etc_t type for configuration files and the sshd_key_t type for keys. The SSH policy allows the sshd process to read both of these file types. Other policies, say, for NTP, would potentially allow the ntpd process to read the etc_t type, but it would not be able to read the sshd_key_t files.

This very fine-grained control is the power of SELinux. However, with great power comes very complex configuration, and that configuration is easy to get wrong. For instance, with Puppet, a file carrying the wrong type can break the service that depends on it if the problem is not dealt with. Fortunately, in permissive mode, SELinux logs data that we can use to assist us with this. This leads us into the second half of the system that we wish to discuss, which is auditd. In the meantime, there is a lot of information on SELinux available on its website at http://selinuxproject.org/page/Main_Page. There's also a very funny, but informative, resource describing SELinux at https://people.redhat.com/duffy/selinux/selinux-coloring-book_A4-Stapled.pdf.

The auditd framework for audit logging

SELinux does a great job of limiting access to system components; however, reporting what enforcement took place was not one of its objectives. Enter auditd. The auditd daemon is part of an auditing framework developed by Red Hat. It is a complete auditing system that uses rules to indicate what to audit. It can be used to log SELinux events, as well as much more.

Under the hood, auditd has hooks into the kernel to watch system calls and other activity. Using the rules, you can configure logging for any of these events. For instance, you can create a rule that monitors writes to the /etc/passwd file. This would allow you to see if any users were added to the system.
We can also add monitoring of files, such as lastlog and wtmp, to monitor login activity. We'll explore this example later when we configure auditd.

To quickly see how a rule works, we'll manually configure a rule that logs whenever the wtmp file is modified. This will add some system logging around users logging in. To do this, let's edit the /etc/audit/audit.rules file and add the following lines at the end of the file:

    -w /var/log/wtmp -p wa -k logins
    -w /etc/passwd -p wa -k password

Let's take a look at what the preceding lines do. Both lines start with a -w clause, which indicates the file that we are monitoring. Second, we have the -p clause, which sets the file operations we monitor. In this case, it is writes (w) and attribute changes (a). Finally, with the -k entries, we're setting a keyword that is logged and can be filtered on. Once this is done, reload auditd with the following command:

    sudo service auditd restart

Once this is complete, go ahead and open another SSH session to the machine, and then simply log back out. Once this is done, take a look at the /var/log/audit/audit.log file. You should see content like the following:

    type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
    type=SYSCALL msg=audit(1416795420.057:485): arch=c000003e syscall=2 success=yes exit=7 a0=7fa983c446aa a1=1 a2=2 a3=8 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"

There are many fields in this output, including the SELinux context, the user ID, and so on.
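Because each audit record is a flat run of name=value pairs, it is easy to pick out individual fields with a short script. The following Python sketch is only an illustration (the parse_audit_record helper is a made-up name, not part of the audit tooling) and assumes records in the same shape as the sample above, where values are either double-quoted or contain no spaces.

```python
import re
from typing import Dict

# A field is name=value; the value is either double-quoted or runs up to
# the next whitespace character.
_FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_audit_record(line: str) -> Dict[str, str]:
    """Return the name=value pairs of one audit.log line as a dict."""
    fields = {}
    for name, value in _FIELD.findall(line):
        fields[name] = value.strip('"')
    return fields


SAMPLE = (
    'type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 '
    'success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 '
    'ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 '
    'sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" '
    'subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"'
)

if __name__ == "__main__":
    record = parse_audit_record(SAMPLE)
    # Show who triggered the event and which rule keyword matched.
    print(record["key"], record["auid"], record["comm"])
```

A script like this could filter the log down to records whose key matches one of the keywords set with -k in the rules file.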
Of interest is the auid, which is the audit user ID. On commands run via the sudo command, this will still contain the user ID of the user who called sudo. This is a great way to log commands performed via sudo.

Auditd also logs SELinux failures. They get logged under the type AVC. These access vector cache logs are placed in the auditd log file when an SELinux violation occurs.

Much like SELinux, auditd is somewhat complicated. Its intricacies are beyond the scope of this book. You can get more information at http://people.redhat.com/sgrubb/audit/.

SELinux and Puppet

Puppet has direct support for several features of SELinux. There are two native Puppet types for SELinux: selboolean and selmodule. These types support setting SELinux Booleans and installing SELinux policy modules.

SELinux Booleans are variables that affect how SELinux behaves. They are set to permit various functions. For instance, you set an SELinux Boolean to true to allow the httpd process to access network ports. SELinux modules are groupings of policies. They allow policies to be loaded in a more granular way. The Puppet selmodule type allows Puppet to load these modules.

The selboolean type

The targeted SELinux policy that most distributions use is based on the SELinux reference policy. One of the features of this policy is the use of Boolean variables that control actions of the policy. There are over 200 of these Booleans on a Red Hat 6-based machine. We can investigate them by installing the policycoreutils-python package on the operating system. You can do this by executing the following command:

    sudo yum install policycoreutils-python

Once installed, we can run the semanage boolean -l command to get a list of the Boolean values, along with their descriptions. The list makes it clear that a very large number of settings can be reconfigured simply by setting the appropriate Boolean value.
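Before managing Booleans with Puppet, it can be handy to capture their current state in a form that is easy to compare. The sketch below is a rough illustration only: it assumes the simpler "name --> on" and "name --> off" lines printed by the getsebool -a command rather than the semanage listing, and parse_getsebool is a hypothetical helper name.

```python
from typing import Dict


def parse_getsebool(output: str) -> Dict[str, bool]:
    """Map Boolean names to True (on) or False (off).

    Assumes one 'name --> state' pair per line; lines without the
    '-->' separator are skipped.
    """
    booleans = {}
    for line in output.splitlines():
        name, separator, state = line.partition("-->")
        words = state.split()
        if not separator or not words:
            continue
        booleans[name.strip()] = words[0] == "on"
    return booleans


if __name__ == "__main__":
    # Sample text for illustration; on a real host this would be the
    # captured output of `getsebool -a`.
    sample = "httpd_can_network_connect --> off\npuppetmaster_use_db --> on\n"
    for name, enabled in sorted(parse_getsebool(sample).items()):
        print(name, "on" if enabled else "off")
```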
The selboolean Puppet type supports managing these Boolean values. The type is fairly simple, accepting the following parameters:

    Parameter    Description
    name         This contains the name of the Boolean to be set. It defaults to the title.
    persistent   This sets whether to write the value to disk for the next boot.
    provider     This is the provider for the type. Usually, the default getsetsebool value is accepted.
    value        This contains the value of the Boolean, true or false.

Usage of this type is rather simple. We'll show an example that sets the puppetmaster_use_db Boolean to true. If we were using the SELinux Puppet policy, this would allow the master to talk to a database. For our use, it's a simple unused variable that we can use for demonstration purposes. As a reminder, the SELinux policy for Puppet on CentOS 6 is outdated, so setting the Boolean does not affect the version of Puppet we're running. It does, however, serve to show how a Boolean is set.

To do this, we'll create a sample role and profile for our puppetmaster. This is something that would likely exist in a production environment to manage the configuration of the master. In this example, we'll simply build a small profile and role for the master.

Let's start with the profile. Copy over the profiles module we've slowly been building up, and let's add a puppetmaster.pp profile. To do so, edit the profiles/manifests/puppetmaster.pp file and make it look as follows:

    class profiles::puppetmaster {
      selboolean { 'puppetmaster_use_db':
        value      => on,
        persistent => true,
      }
    }

Then, we'll move on to the role. Copy the roles module, edit the roles/manifests/puppetmaster.pp file there, and make it look as follows:

    class roles::puppetmaster {
      include profiles::puppetmaster
    }

Once this is done, we can apply it to our host. Edit the /etc/puppet/manifests/site.pp file.
We'll apply the puppetmaster role to the puppetmaster machine, as follows:

    node 'puppet.book.local' {
      include roles::puppetmaster
    }

Now, we'll run Puppet. The run output shows that it set the value to on. Using this method, we can set any of the SELinux Boolean values we need for our system to operate properly. More information on SELinux Booleans, including how to obtain a list of them, can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Booleans.html.

The selmodule type

The other native type inside Puppet is a type to manage SELinux modules. Modules are compiled collections of SELinux policy. They're loaded into the kernel using the semodule command. This Puppet type provides support for this mechanism. The available parameters are as follows:

    Parameter       Description
    name            This contains the name of the module; it defaults to the title.
    ensure          This is the desired state: present or absent.
    provider        This specifies the provider for the type; it should be semodule.
    selmoduledir    This is the directory that contains the module to be installed.
    selmodulepath   This provides the complete path to the module to be installed if it is not present in selmoduledir.
    syncversion     This sets whether to resync the module if a new version is found, similar to ensure => latest.

Using the file type, we can take our compiled module and serve it onto the system with Puppet. We can then use the selmodule type to ensure that it gets installed on the system. This lets us centrally manage the module with Puppet. We'll see an example where a community module compiles a policy and then installs it, so we won't show a specific example here. Instead, we'll move on to talk about the last SELinux-related component in Puppet.

File parameters for SELinux

The final internal support for SELinux comes in the form of the file type.
The file type parameters are as follows:

    Parameter                 Description
    selinux_ignore_defaults   By default, Puppet will use the matchpathcon function to set the context of a file. This overrides that behavior if set to true.
    selrange                  This sets the SELinux range component. We've not really covered this. It was not used in most mainstream distributions at the time this book was written.
    selrole                   This sets the SELinux role on the file.
    seltype                   This sets the SELinux type on the file.
    seluser                   This sets the SELinux user on the file.

Usually, if you place files in the correct location (the expected location for a service) on the filesystem, Puppet will get the SELinux properties correct via its use of the matchpathcon function. This function (which also has a matching utility) applies a default context based on the policy settings. Setting the context manually is used in cases where you're storing data outside the normal location. For instance, you might be storing web data under the /opt directory.

The preceding types and providers provide the basics that allow you to manage SELinux on a system. We'll now take a look at a couple of community modules that build on these types and create a more in-depth solution.

Summary

This article looked at what SELinux and auditd are, and gave a brief example of how they can be used. We looked at what they can do, and how they can be used to secure your systems. After this, we looked at the specific support for SELinux in Puppet. We looked at the two built-in types that support it, as well as the parameters on the file type. Then, we took a look at one of the several community modules for managing SELinux. Using this module, we can store the policies as text instead of compiled blobs.

Packt

27 Mar 2015

31 min read

An Overview of Horizon View Architecture and its Components

Packt

27 Mar 2015

21 min read

System Center Reporting

Packt

27 Mar 2015

10 min read

Storm for Real-time High Velocity Computation

In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics:

- What's possible with data analysis?
- Real-time analytics: why it is becoming the need of the hour
- Why Storm: the power of high-speed distributed computation

We will get you to think about some interesting problems along the lines of Air Traffic Control (ATC), credit card fraud detection, and so on.

First and foremost, you need to understand what big data is. Big data is the buzzword of the software industry, but in reality it is much more than the buzz: it really is a huge amount of data.

What is big data?

Big data is commonly characterized by volume, veracity, variety, and velocity. Three of these are described as follows:

- Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes or even petabytes of information (for example, converting 12 terabytes of tweets created each day into improved product sentiment analysis, or converting 350 billion annual meter readings to better predict power consumption).
- Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinizing 5 million trade events created each day to identify potential fraud, or analyzing 500 million call detail records daily in real time to predict customer churn faster).
- Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitoring hundreds of live video feeds from surveillance cameras to target points of interest, or exploiting the 80 percent data growth in images, videos, and documents to improve customer satisfaction).
Now that I have described big data, let's have a quick look at where this data is generated and how it comes into existence. The following figure gives a quick snapshot of everything that can happen in one second in the world of the Internet and social media. We need the power to process all this data at the same rate at which it is generated to gain some meaningful insight out of it, as shown:

The power of computation comes with the Storm and Cassandra combination. This technological combo lets us cater to the following use cases:

- Credit card fraud detection
- Security breaches
- Bandwidth allocation
- Machine failures
- Supply chain
- Personalized content
- Recommendations

Getting acquainted with a few problems that require a distributed computing solution

Let's do a deep dive and identify some of the problems that require distributed solutions.

Real-time business solution for credit or debit card fraud detection

Let's get acquainted with the problem depicted in the following figure. When we make any transaction using plastic money and swipe our debit or credit card for payment, the bank has less than 5 seconds to validate or reject the transaction. Within those 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; at the issuing bank, the entire fuzzy logic for acceptance or decline of the transaction has to be computed; and the result has to travel back over the secure network:

Challenges such as network latency and delay can be optimized to some extent, but to complete the transaction described above in less than 5 seconds, one has to design an application that is able to churn through a considerable amount of data and generate results in 1 to 2 seconds.

Aircraft Communications Addressing and Reporting System

This is another typical use case that cannot be implemented without having a reliable real-time processing system in place.
These systems use satellite communication (SATCOM), and as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on the same data in real time.

Let's take the example from the figure in the preceding case. A flight encounters some really hazardous weather, say, electrical storms on a route. That information is sent through satellite links and voice or data gateways to the air traffic controller, which detects the hazard in real time and raises alerts to divert all other flights passing through that area.

Healthcare

This is another very important domain where real-time analytics over high-volume, high-velocity data has equipped healthcare professionals with accurate and exact information in real time to take informed life-saving actions.

The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from the historic patient database, the drug database, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid healthcare professionals in real time.

Other applications

There are a variety of other applications where the power of real-time computing can either optimize processes or help people make informed decisions. It has become a great utility and aid in the following industries:

- Manufacturing
- Application performance monitoring
- Customer relationship management
- Transportation industry
- Network optimization

Complexity of existing solutions

Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore what options we have to process the vast amount of data being generated at a very fast pace.
The Hadoop solution

The Hadoop solution is a tried, tested, and proven solution in the industry, in which we use MapReduce jobs in a clustered setup to execute jobs and generate results. MapReduce is a programming paradigm where we process large data sets by using a mapper function that processes a key-value pair and generates intermediate output, again in the form of key-value pairs. Then a reduce function operates on the mapper output, merges the values associated with the same intermediate key, and generates the result.

In the preceding figure, we demonstrate the simple word count MapReduce job where:

- There is a huge big data store, which can go up to petabytes and zettabytes
- Blocks of the input data are split and replicated onto each of the nodes in the Hadoop cluster
- Each mapper job counts the number of words on the data blocks allocated to it
- Once the mapper is done, the words (which are actually the keys) and the counts are sent to the reducers
- Reducers combine the mapper output and the results are generated

Hadoop, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility for real-time use cases.

A custom solution

Here we talk about a solution of the kind Twitter used before the advent of Storm. A simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem with the mechanism shown in the following figure:

Here is the detailed information on how the preceding mechanism works:

- They created a fire hose or queue onto which all the tweets are pushed.
- A set of worker nodes read from the queue, decipher the tweet JSON, and maintain the count of tweets by each user across the different workers. At this first set of workers, the tweets are distributed equally amongst the workers, so they are sharded randomly.
- These workers push their first-level counts into the next set of queues.
- A second level of workers picks up the first-level counts from these queues. Here the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker.
- Then the counts are dumped into the data store.

The drawbacks of the queue-worker solution are as follows:

- Very complex and specific to the use case
- Redeployment and reconfiguration is a huge task
- Scaling is very tedious
- System is not fault tolerant

Paid solution

This is always an option. A lot of big companies have invested in products that let us do this kind of computing, but that comes at a heavy license cost. A few solutions to name are from companies such as:

- IBM
- Oracle
- Vertica
- Gigaspace

Open real-time processing tools

There are a few other technologies that have traits and features similar to Apache Storm, such as S4 from Yahoo, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which can be used for near-real-time processing.

So, after evaluating all these options, we still find Storm to be the best open source candidate to handle these use cases.

Storm persistence

Storm processes streaming data at a very high velocity. Cassandra complements Storm's ability to process by providing support for writing to and reading from NoSQL at a very high rate. There are a variety of APIs available for connecting with Cassandra. In general, the APIs we are talking about are wrappers written over the core Thrift API, which offer various CRUD operations over the Cassandra cluster using programmer-friendly packages.

Thrift protocol: The most basic and core of all APIs for access to Cassandra, this is the RPC protocol, which provides a language-neutral interface and thus offers the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood.
It is simple to use and provides basic functionality out of the box, such as ring discovery and native access. Complex features such as retry, connection pooling, and so on are not supported out of the box. A variety of libraries have extended Thrift and added these much-needed features; we'd like to touch upon a few widely used ones in this article.

Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it essentially can't offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features that are ready to use and available out of the box:

- It has an implementation for connection pooling
- It has a ring discovery feature with the add-on of automatic failover support
- It has a retry for downed hosts in the Cassandra ring

Datastax Java Driver: This one is a recent addition to the stack of client access options for Cassandra and hence gels well with newer versions of Cassandra. Here are the salient features:

- Connection pooling
- Reconnection policies
- Load balancing
- Cursor support

Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than the others.
Let's have a look at its credentials to see where it qualifies:

- It supports all Hector functions and is much easier to use
- It promises better connection pooling than Hector
- It has better failover handling than Hector
- It provides some database-like features out of the box (now that's big news)
- At the API level, it provides functionality that it calls Recipes, which includes parallel all-row query execution, messaging queue functionality, an object store, and pagination
- It has numerous frequently required utilities, such as a JSON writer and a CSV importer

Summary

In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of the existing solutions, and the persistence options available for Storm.
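The word count MapReduce job walked through in the Hadoop section of this article can be sketched in a few lines of Python. This is only a single-process illustration of the map, shuffle, and reduce phases; the function names are made up for this example and it does not use the Hadoop API.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(block: str) -> Iterator[Tuple[str, int]]:
    """Count the words in one data block and emit (word, count) pairs."""
    for word, count in Counter(block.lower().split()).items():
        yield word, count


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate counts by word, which is the key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """Merge all the counts that share the same intermediate key."""
    return word, sum(counts)


def word_count(blocks: Iterable[str]) -> Dict[str, int]:
    # Each block stands in for the data handed to one mapper job.
    pairs = (pair for block in blocks for pair in mapper(block))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["storm storm hadoop", "hadoop storm"]))
```

In a real cluster the blocks live on different nodes and the shuffle moves data across the network, which is one reason the approach suits batch jobs better than real-time use cases.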

Packt

26 Mar 2015

25 min read

Testing with the Android SDK

Packt

26 Mar 2015

6 min read

Subscribing to a report

In this article by Johan Yu, the author of Salesforce Reporting and Dashboards, we get acquainted with the components used when working with reports on the Salesforce platform.

Subscribing to a report is a new feature in Salesforce introduced in the Spring 2015 release. When you subscribe to a report, you will get a notification every weekday, daily, or weekly when the report meets the criteria you defined. You just need to subscribe to the reports that you care about most.

Subscribing to a report is not the same as the report's Schedule Future Run option. Scheduling a report for a future run keeps e-mailing you the report content at the specified frequency, without any conditions. When you subscribe to a report, you receive notifications only when the report output meets the criteria you have defined. A subscription does not send you the report content, just an alert that the report you subscribed to meets the conditions specified.

To subscribe to a report, you do not need any additional permission, as the administrator is able to enable or disable this feature for the entire organization. By default, this feature is turned on for customers using the Salesforce Spring 2015 release. If you are an administrator for the organization, you can check this feature by navigating to Setup | Customize | Reports & Dashboards | Report Notification | Enable report notification subscriptions for all users.

Besides receiving notifications via e-mail, you can also opt for Salesforce1 notifications and posts to Chatter feeds, and execute a custom action.

Report Subscription

To subscribe to a report, you need to define a set of conditions to trigger the notifications. Here is what you need to understand before you subscribe to a report:

- When: Every time conditions are met, or only the first time conditions are met.
- Conditions: An aggregate can be a record count or a summarized field. You then define the operator and the value you want the aggregate to be compared to. A summarized field is a field that you use in that report to summarize its data as average, smallest, largest, or sum. You can add multiple conditions, but at this moment they can only be combined with AND.
- Schedule frequency: Schedule every weekday, daily, or weekly, and set the time the report will be run.
- Actions:
  - E-mail notifications: You will get e-mail alerts when conditions are met.
  - Posts to Chatter feeds: Alerts will be posted to your Chatter feed.
  - Salesforce1 notifications: You will get alerts in your Salesforce1 app.
  - Execute a custom action: This will trigger a call to an Apex class. You will need a developer to write Apex code for this.
- Active: This is a checkbox used to activate or disable the subscription. Disable it when you need to unsubscribe temporarily; deleting the subscription removes all the settings defined.

The following screenshot shows the conditions set in order to subscribe to a report:

Monitoring a report subscription

How can you know whether you have subscribed to a report? When you open the report and see the Subscribe button, it means you are not subscribed to that report:

Once you configure a subscription for the report, the button label will change to Edit Subscription. However, seeing Edit Subscription on a report does not guarantee that you will get alerts when the report meets the criteria, because the subscription may not be active; remember the Active checkbox described earlier.
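The way these settings interact can be summarized in a small Python model. This is purely an illustrative sketch of the rules described above, not Salesforce code or its API: the class names are made up, only the Greater Than or Equal operator label is taken from this article (the other labels are assumptions), and the model assumes that "only the first time" never fires again once it has notified.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Operator labels mapped to comparisons; apart from "Greater Than or Equal",
# the labels are assumptions made for this sketch.
OPERATORS = {
    "Equal": operator.eq,
    "Greater Than": operator.gt,
    "Greater Than or Equal": operator.ge,
    "Less Than": operator.lt,
    "Less Than or Equal": operator.le,
}


@dataclass
class Condition:
    aggregate: str  # for example "Sum of Amount" or "Record Count"
    op: str         # a key of OPERATORS
    value: float


@dataclass
class Subscription:
    conditions: List[Condition]
    only_first_time: bool = True
    active: bool = True
    already_notified: bool = False

    def should_notify(self, aggregates: Dict[str, float]) -> bool:
        """Decide whether one scheduled run should raise an alert."""
        if not self.active:
            return False
        # Multiple conditions can only be combined with AND.
        met = all(
            OPERATORS[c.op](aggregates[c.aggregate], c.value)
            for c in self.conditions
        )
        if not met or (self.only_first_time and self.already_notified):
            return False
        self.already_notified = True
        return True


if __name__ == "__main__":
    sub = Subscription([Condition("Sum of Amount", "Greater Than or Equal", 50000)])
    for total in (42000, 50000, 61000):
        print(total, sub.should_notify({"Sum of Amount": total}))
```

The model also shows why an inactive subscription stays silent even though its conditions are still stored.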
To see all the reports you subscribe to at a glance, as long as you have the View Setup and Configuration permission, navigate to Setup | Jobs | Scheduled Jobs, and look for the Reporting Notification type, as shown in this screenshot:

Hands-on: subscribing to a report

Here is our next use case: you would like to get a notification in your Salesforce1 app, an e-mail notification, and a post on your Chatter feed once the Closed Won opportunities for the month have reached $50,000. Salesforce should check the report daily, but instead of getting this notification daily, you want to get it only once a week or month; otherwise, it will be disturbing.

Creating reports

Make sure you set the report with the correct filter, set Close Date as This Month, and summarize the Amount field, as shown in the following screenshot:

Subscribing

Click on the Subscribe button and fill in the following details:

- Type as Only the first time conditions are met
- Conditions:
  - Aggregate as Sum of Amount
  - Operator as Greater Than or Equal
  - Value as 50000
- Schedule:
  - Frequency as Every Weekday
  - Time as 7AM
- In Actions, select:
  - Send Salesforce1 Notification
  - Post to Chatter Feed
  - Send Email Notification
- In Active, select the checkbox

Testing and saving

The good thing about this feature is the ability to test without waiting until the scheduled date or time. Click on the Save & Run Now button. Here is the result:

Salesforce1 notifications

Open your Salesforce1 mobile app, look for the notification icon, and notice a new alert from the report you subscribed to, as shown in this screenshot:

If you click on the notification, it will take you to the report, as shown in the following screenshot:

Chatter feed

Since you selected the Post to Chatter Feed action, the same alert will go to your Chatter feed as well.
Clicking on the link in the Chatter feed will open the same report in your Salesforce1 mobile app or in the web browser, as shown in this screenshot:

E-mail notification

The last action we selected for this exercise is to send an e-mail notification. The following screenshot shows how the e-mail notification would look:

Limitations

The following limitations apply when subscribing to a report:

You can set up to five conditions per report, and no OR logic between conditions is possible
You can subscribe to up to five reports, so choose them wisely

Summary

In this article, you became familiar with the components involved in working with reports on the Salesforce platform. We saw the different report formats and what makes each format unique. We then discussed adding various types of charts to a report with point-and-click effort and no code; all of this can be done within minutes. We saw how to add filters to customize our reports further, including Filter Logic, Cross Filter, and Row Limit for tabular reports. We walked through managing and customizing custom report types, including how to hide unused report types and how to analyze report type adoption. In the last part of this article, we saw how easy it is to subscribe to a report and define its criteria.

Resources for Article:

Further resources on this subject:
Salesforce CRM – The Definitive Admin Handbook - Third Edition [article]
Salesforce.com Customization Handbook [article]
Developing Applications with Salesforce Chatter [article]


Hacking toys with IFTTT and Spark

Performing hand-written digit recognition with GoLearn

Dealing with Legacy Code

GUI Components in Qt 5

Geocoding Address-based Data

Getting Started with Intel Galileo

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

PostgreSQL – New Features

Understanding and Creating Simple SSRS Reports

Puppet and OS Security Tools

Trending Topics

An Overview of Horizon View Architecture and its Components

System Center Reporting

Storm for Real-time High Velocity Computation

Testing with the Android SDK

Subscribing to a report
