Adding Real-time Functionality Using Socket.io

Packt
22 Sep 2014
18 min read
This article by Amos Q. Haviv, the author of MEAN Web Development, describes how Socket.io enables Node.js developers to support real-time communication using WebSockets in modern browsers and legacy fallback protocols in older browsers. (For more resources related to this topic, see here.)

Introducing WebSockets

Modern web applications such as Facebook, Twitter, or Gmail are incorporating real-time capabilities, which enable the application to continuously present the user with recently updated information. Unlike traditional applications, in real-time applications the common roles of browser and server can be reversed, since the server needs to update the browser with new data regardless of the browser request state. This means that unlike the common HTTP behavior, the server won't wait for the browser's requests. Instead, it will send new data to the browser whenever this data becomes available.

This reverse approach is often called Comet, a term coined by a web developer named Alex Russell back in 2006 (the term was a word play on the term AJAX; both Comet and Ajax are common household cleaners in the US).

In the past, there were several ways to implement Comet functionality using the HTTP protocol. The first and easiest way is XHR polling. In XHR polling, the browser makes periodic requests to the server. The server returns an empty response unless it has new data to send back. Upon a new event, the server will return the new event data to the next polling request. While this works quite well for most browsers, this method has two problems. The most obvious one is that it generates a large number of requests that hit the server for no particular reason, since many requests return empty. The second problem is that the update time depends on the request period. This means that new data will only get pushed to the browser on the next request, causing delays in updating the client state.

To solve these issues, a better approach was introduced: XHR long polling. In XHR long polling, the browser makes an XHR request to the server, but a response is not sent back unless the server has new data. Upon an event, the server responds with the event data and the browser makes a new long polling request. This cycle enables better management of requests, since there is only a single request per session. Furthermore, the server can update the browser immediately with new information, without having to wait for the browser's next request.

Because of its stability and usability, XHR long polling has become the standard approach for real-time applications and was implemented in various ways, including Forever iFrame, multipart XHR, JSONP long polling using script tags (for cross-domain, real-time support), and the common long-living XHR. However, all these approaches were actually hacks that used the HTTP and XHR protocols in a way they were not meant to be used.

With the rapid development of modern browsers and the increased adoption of the new HTML5 specifications, a new protocol emerged for implementing real-time communication: the full-duplex WebSockets protocol. In browsers that support WebSockets, the initial connection between the server and browser is made over HTTP and is called an HTTP handshake. Once the initial connection is made, the browser and server open a single ongoing communication channel over a TCP socket. Once the socket connection is established, it enables bidirectional communication between the browser and server.
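To make the contrast with the request/response model concrete, the following is a minimal sketch (not taken from the book) of the browser's native WebSocket API, which abstraction libraries such as Socket.io build upon; the endpoint URL is an assumed placeholder used only for illustration:

// Minimal sketch of the browser's native WebSocket API (illustrative only).
// The ws://localhost:3000 endpoint is an assumption, not from the book.
var ws = new WebSocket('ws://localhost:3000');

ws.onopen = function() {
  // The HTTP handshake has completed and the TCP channel is now open
  ws.send('hello from the browser');
};

ws.onmessage = function(event) {
  // The server can push data at any time, without waiting for a browser request
  console.log('received:', event.data);
};

ws.onclose = function() {
  console.log('connection closed');
};

Socket.io exposes a similar event-driven interface while falling back to the XHR-based techniques described above when WebSockets are unavailable.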
An established WebSocket connection enables both parties to send and receive messages over a single communication channel. This also helps to lower server load, decrease message latency, and unify PUSH communication over a standalone connection.

However, WebSockets still suffer from two major problems. First and foremost is browser compatibility. The WebSockets specification is fairly new, so older browsers don't support it, and though most modern browsers now implement the protocol, a large group of users are still using these older browsers. The second problem is HTTP proxies, firewalls, and hosting providers. Since WebSockets use a different communication protocol than HTTP, a lot of these intermediaries don't support it yet and block any socket communication. As it has always been with the Web, developers are left with a fragmentation problem, which can only be solved using an abstraction library that optimizes usability by switching between protocols according to the available resources. Fortunately, a popular library called Socket.io was already developed for this purpose, and it is freely available to the Node.js developer community.

Introducing Socket.io

Created in 2010 by JavaScript developer Guillermo Rauch, Socket.io aimed to abstract Node.js real-time application development. Since then, it has evolved dramatically through nine major releases, before being split in its latest version into two different modules: Engine.io and Socket.io.

Previous versions of Socket.io were criticized for being unstable, since they first tried to establish the most advanced connection mechanisms and then fall back to more primitive protocols. This caused serious issues with using Socket.io in production environments and posed a threat to the adoption of Socket.io as a real-time library. To solve this, the Socket.io team redesigned it and wrapped the core functionality in a base module called Engine.io.

The idea behind Engine.io was to create a more stable real-time module, which first opens long-polling XHR communication and then tries to upgrade the connection to a WebSockets channel. The new version of Socket.io uses the Engine.io module and provides the developer with various features such as events, rooms, and automatic connection recovery, which you would otherwise have to implement yourself. In this article's examples, we will use the new Socket.io 1.0, which is the first version to use the Engine.io module. Older versions of Socket.io, prior to version 1.0, do not use the Engine.io module and are therefore much less stable in production environments.

When you include the Socket.io module, it provides you with two objects: a socket server object that is responsible for the server functionality and a socket client object that handles the browser's functionality. We'll begin by examining the server object.

The Socket.io server object

The Socket.io server object is where it all begins. You start by requiring the Socket.io module, and then use it to create a new Socket.io server instance that will interact with socket clients. The server object supports both a standalone implementation and the ability to use it in conjunction with the Express framework. The server instance then exposes a set of methods that allow you to manage the Socket.io server operations. Once the server object is initialized, it will also be responsible for serving the socket client JavaScript file for the browser.
A simple implementation of the standalone Socket.io server looks as follows:

var io = require('socket.io')();
io.on('connection', function(socket){ /* ... */ });
io.listen(3000);

This will start a Socket.io server on port 3000 and serve the socket client file at the URL http://localhost:3000/socket.io/socket.io.js.

Implementing the Socket.io server in conjunction with an Express application is a bit different:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
io.on('connection', function(socket){ /* ... */ });
server.listen(3000);

This time, you first use the http module of Node.js to create a server and wrap the Express application. The server object is then passed to the Socket.io module and serves both the Express application and the Socket.io server. Once the server is running, it will be available for socket clients to connect. A client trying to establish a connection with the Socket.io server will start by initiating the handshaking process.

Socket.io handshaking

When a client wants to connect to the Socket.io server, it will first send a handshake HTTP request. The server will then analyze the request to gather the necessary information for ongoing communication. It will then look for configuration middleware that is registered with the server and execute it before firing the connection event. When the client is successfully connected to the server, the connection event listener is executed, exposing a new socket instance.

Once the handshaking process is over, the client is connected to the server and all communication with it is handled through the socket instance object. For example, handling a client's disconnection event is done as follows:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
io.on('connection', function(socket){
  socket.on('disconnect', function() {
    console.log('user has disconnected');
  });
});
server.listen(3000);

Notice how the socket.on() method adds an event handler to the disconnection event. Although the disconnection event is a predefined event, this approach works the same for custom events as well, as you will see in the following sections.

While the handshake mechanism is fully automatic, Socket.io does provide you with a way to intercept the handshake process using a configuration middleware.

The Socket.io configuration middleware

Although the Socket.io configuration middleware existed in previous versions, in the new version it is even simpler and allows you to manipulate socket communication before the handshake actually occurs. To create a configuration middleware, you will need to use the server's use() method, which is very similar to the Express application's use() method:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
io.use(function(socket, next) {
  /* ... */
  next(null, true);
});
io.on('connection', function(socket){
  socket.on('disconnect', function() {
    console.log('user has disconnected');
  });
});
server.listen(3000);

As you can see, the io.use() method callback accepts two arguments: the socket object and a next callback. The socket object is the same socket object that will be used for the connection and it holds some connection properties. One important property is the socket.request property, which represents the handshake HTTP request. In the following sections, you will use the handshake request to incorporate the Passport session with the Socket.io connection.
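As a simple illustration of what such a middleware can do with the handshake request (a sketch under assumptions, not the book's Passport integration), the following example rejects connections whose handshake request carries no session cookie; the 'connect.sid' cookie name is the express-session default and is used here only as an assumption:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);

io.use(function(socket, next) {
  // socket.request is the handshake HTTP request
  var cookieHeader = socket.request.headers.cookie;

  // 'connect.sid' is assumed to be the session cookie name (express-session default)
  if (cookieHeader && cookieHeader.indexOf('connect.sid') !== -1) {
    // Allow the handshake to proceed
    next(null, true);
  } else {
    // Passing an error aborts the handshake, so no connection event is fired
    next(new Error('No session cookie found'), false);
  }
});

io.on('connection', function(socket) {
  /* ... */
});

server.listen(3000);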
The next argument is a callback method that accepts two arguments: an error object and a Boolean value. The next callback tells Socket.io whether or not to proceed with the handshake process, so if you pass an error object or a false value to the next method, Socket.io will not initiate the socket connection. Now that you have a basic understanding of how handshaking works, it is time to discuss the Socket.io client object.

The Socket.io client object

The Socket.io client object is responsible for implementing the browser's socket communication with the Socket.io server. You start by including the Socket.io client JavaScript file, which is served by the Socket.io server. The Socket.io JavaScript file exposes an io() method that connects to the Socket.io server and creates the client socket object. A simple implementation of the socket client is as follows:

<script src="/socket.io/socket.io.js"></script>
<script>
var socket = io();
socket.on('connect', function() {
  /* ... */
});
</script>

Notice the default URL for the Socket.io client object. Although this can be altered, you can usually leave it like this and just include the file from the default Socket.io path. Another thing you should notice is that the io() method will automatically try to connect to the default base path when executed with no arguments; however, you can also pass a different server URL as an argument.

As you can see, the socket client is much easier to implement, so we can move on to discuss how Socket.io handles real-time communication using events.

Socket.io events

To handle the communication between the client and the server, Socket.io uses a structure that mimics the WebSockets protocol and fires event messages across the server and client objects. There are two types of events: system events, which indicate the socket connection status, and custom events, which you'll use to implement your business logic.

The system events on the socket server are as follows:

io.on('connection', ...): This is emitted when a new socket is connected
socket.on('message', ...): This is emitted when a message is sent using the socket.send() method
socket.on('disconnect', ...): This is emitted when the socket is disconnected

The system events on the client are as follows:

socket.io.on('open', ...): This is emitted when the socket client opens a connection with the server
socket.io.on('connect', ...): This is emitted when the socket client is connected to the server
socket.io.on('connect_timeout', ...): This is emitted when the socket client connection with the server times out
socket.io.on('connect_error', ...): This is emitted when the socket client fails to connect with the server
socket.io.on('reconnect_attempt', ...): This is emitted when the socket client tries to reconnect with the server
socket.io.on('reconnect', ...): This is emitted when the socket client is reconnected to the server
socket.io.on('reconnect_error', ...): This is emitted when a reconnection attempt by the socket client fails
socket.io.on('reconnect_failed', ...): This is emitted when the socket client cannot reconnect with the server after all reconnection attempts fail
socket.io.on('close', ...): This is emitted when the socket client closes the connection with the server

Handling events

While system events help us with connection management, the real magic of Socket.io lies in using custom events. In order to do so, Socket.io exposes two methods, both on the client and server objects.
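Before looking at those two methods, here is a small sketch (not taken from the book) that simply wires up a few of the client system events listed above for logging purposes:

<script src="/socket.io/socket.io.js"></script>
<script>
var socket = io();

// Fired once the socket client is connected to the server
socket.on('connect', function() {
  console.log('connected to the server');
});

// Connection lifecycle events are exposed on the socket.io (manager) object
socket.io.on('reconnect_attempt', function() {
  console.log('trying to reconnect...');
});

socket.io.on('reconnect', function() {
  console.log('reconnected to the server');
});

socket.io.on('close', function() {
  console.log('connection to the server was closed');
});
</script>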
The first method is the on() method, which binds event handlers to events, and the second is the emit() method, which is used to fire events between the server and client objects.

An implementation of the on() method on the socket server is very simple:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
io.on('connection', function(socket){
  socket.on('customEvent', function(customEventData) {
    /* ... */
  });
});
server.listen(3000);

In the preceding code, you bound an event listener to the customEvent event. The event handler is called when the socket client object emits the customEvent event. Notice how the event handler accepts the customEventData argument that is passed to it from the socket client object.

An implementation of the on() method on the socket client is also straightforward:

<script src="/socket.io/socket.io.js"></script>
<script>
var socket = io();
socket.on('customEvent', function(customEventData) {
  /* ... */
});
</script>

This time the event handler is called when the socket server emits the customEvent event that sends customEventData to the socket client event handler. Once you set your event handlers, you can use the emit() method to send events from the socket server to the socket client and vice versa.

Emitting events

On the socket server, the emit() method is used to send events to a single socket client or a group of connected socket clients. The emit() method can be called from the connected socket object, which will send the event to a single socket client, as follows:

io.on('connection', function(socket){
  socket.emit('customEvent', customEventData);
});

The emit() method can also be called from the io object, which will send the event to all connected socket clients, as follows:

io.on('connection', function(socket){
  io.emit('customEvent', customEventData);
});

Another option is to send the event to all connected socket clients except the sender using the broadcast property, as shown in the following lines of code:

io.on('connection', function(socket){
  socket.broadcast.emit('customEvent', customEventData);
});

On the socket client, things are much simpler. Since the socket client is only connected to the socket server, the emit() method will only send the event to the socket server:

var socket = io();
socket.emit('customEvent', customEventData);

Although these methods allow you to switch between personal and global events, they still lack the ability to send events to a group of connected socket clients. Socket.io offers two options to group sockets together: namespaces and rooms.

Socket.io namespaces

In order to easily control socket management, Socket.io allows developers to split socket connections according to their purpose using namespaces. So instead of creating different socket servers for different connections, you can just use the same server to create different connection endpoints. This means that socket communication can be divided into groups, which will then be handled separately.

Socket.io server namespaces

To create a socket server namespace, you will need to use the socket server's of() method, which returns a socket namespace. Once you obtain the socket namespace, you can use it the same way you use the socket server object:

var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
io.of('/someNamespace').on('connection', function(socket){
  socket.on('customEvent', function(customEventData) {
    /* ... */
  });
});
io.of('/someOtherNamespace').on('connection', function(socket){
  socket.on('customEvent', function(customEventData) {
    /* ... */
  });
});
server.listen(3000);

In fact, when you use the io object, Socket.io actually uses a default empty namespace as follows:

io.on('connection', function(socket){
  /* ... */
});

The preceding lines of code are equivalent to this:

io.of('').on('connection', function(socket){
  /* ... */
});

Socket.io client namespaces

On the socket client, the implementation is a little different:

<script src="/socket.io/socket.io.js"></script>
<script>
var someSocket = io('/someNamespace');
someSocket.on('customEvent', function(customEventData) {
  /* ... */
});
var someOtherSocket = io('/someOtherNamespace');
someOtherSocket.on('customEvent', function(customEventData) {
  /* ... */
});
</script>

As you can see, you can use multiple namespaces in the same application without much effort. However, once sockets are connected to different namespaces, you will not be able to send an event to all these namespaces at once. This means that namespaces are not very good for more dynamic grouping logic. For this purpose, Socket.io offers a different feature called rooms.

Socket.io rooms

Socket.io rooms allow you to partition connected sockets into different groups in a dynamic way. Connected sockets can join and leave rooms, and Socket.io provides you with a clean interface to manage rooms and emit events to the subset of sockets in a room. The rooms functionality is handled solely on the socket server but can easily be exposed to the socket client.

Joining and leaving rooms

Joining a room is handled using the socket's join() method, while leaving a room is handled using the leave() method. So, a simple subscription mechanism can be implemented as follows:

io.on('connection', function(socket) {
  socket.on('join', function(roomData) {
    socket.join(roomData.roomName);
  });
  socket.on('leave', function(roomData) {
    socket.leave(roomData.roomName);
  });
});

Notice that the join() and leave() methods both take the room name as the first argument.

Emitting events to rooms

To emit events to all the sockets in a room, you will need to use the in() method. So, emitting an event to all socket clients who joined a room is quite simple and can be achieved with the help of the following code snippet:

io.on('connection', function(socket){
  io.in('someRoom').emit('customEvent', customEventData);
});

Another option is to send the event to all connected socket clients in a room except the sender by using the broadcast property and the to() method:

io.on('connection', function(socket){
  socket.broadcast.to('someRoom').emit('customEvent', customEventData);
});

This pretty much covers the simple yet powerful rooms functionality of Socket.io. In the next section, you will learn how to implement Socket.io in your MEAN application, and more importantly, how to use the Passport session to identify users in the Socket.io session. While we covered most of Socket.io's features, you can learn more about Socket.io by visiting the official project page at https://socket.io.

Summary

In this article, you learned how the Socket.io module works. You went over the key features of Socket.io and learned how the server and client communicate. You configured your Socket.io server and learned how to integrate it with your Express application. You also used the Socket.io handshake configuration to integrate the Passport session.
In the end, you built a fully functional chat example and learned how to wrap the Socket.io client with an AngularJS service. Resources for Article: Further resources on this subject: Creating a RESTful API [article] Angular Zen [article] Digging into the Architecture [article]


Visualization as a Tool to Understand Data

Packt
22 Sep 2014
23 min read
In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding. (For more resources related to this topic, see here.) In the last few decades, the quick growth in the volume of information we produce and the capacity of digital information storage have opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques. Along with the size, the types and categories of data have also evolved. Along with the typical and popular data domain in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science. Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of. For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results would give us our desired outcome. Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Inference and refining the analysis methods based on the inference is an iterative process that needs to be carried out several times until we think that a set of hypotheses is formed, or a clear question is asked for further analysis, or a question is answered with enough evidence. Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain beautiful insights about their datasets. The power of visualization resides not only in the ease of interpretation, but it also reveals visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline—validation, verification, analysis, and inference—to aid the data scientist. How have you visualized your data recently? If you still have not, it is okay, as this book will teach you exactly that. 
However, if you had the opportunity to play with any kind of data already, I want you to take a moment and think about the techniques you used to visualize your data so far. Make a list of them. Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word cloud, parallel coordinates, isosurfaces, and maps somewhere in that list? If yes, then you are already familiar with some modern visualization techniques, but if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code. The aim of this book is to teach a Mathematica beginner the data-analysis and visualization powerhouse built into Mathematica, and at the same time, familiarize the reader with some of the modern visualization techniques that can be easily built with Mathematica. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization. The importance of visualization Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be argued as visualizations as they convey historical data through a visual medium. Map visualizations were commonly used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century were believed to have built the first visualization of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). Therefore, it appears as if many, since ancient times, have recognized the importance of visualization in giving some meaning to data. To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes, the data needs to be changed or transformed to the correct form that allows us to use a particular tool. Visualizing the data points ensures that we do not fall into any trap. The following screenshot shows the visualization of a polynomial as a circle: Figure1.1 Fitting a polynomial In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit if we try to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (essentially a circle) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1. Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet: Figure1.2 Anscombe's quartet, generated using Mathematica Downloading the color images of this book We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. 
The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF.

Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how simple data visualization like plotting can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and gives rise to the same linear model when a regression routine is run on these datasets. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line on the data. The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again.

These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose as the preceding graph plotting examples (to spy on and investigate data to infer valuable insights), but on different types of datasets other than just point clouds.

Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive: the ability to view data from different perspectives (viewing angles) and at different resolutions. These features help the investigator understand both the micro- and macro-level behavior of their dataset.

Types of datasets

There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book.

Tables

The table is one of the most common data structures in Computer Science. You might have already encountered this in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following empty table template as an example, with one column per attribute and one row per item:

          Attribute 1   Attribute 2   ...
Item 1
Item 2
Item 3

When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For three-dimensional vectors or more, we could just increase the number of attributes accordingly.
Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea for us to get introduced to them now. Here, we explain the general concepts. Scalar fields There are many kinds of scientific dataset out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas in their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers. We will eventually learn how to create molecular visualizations and biomedical dataset exploration tools when we feel comfortable manipulating these datasets. In practice, multidimensional data (just like vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room. In order to tackle the problem, they would begin by measuring the geometry and the shape of the room, and put temperature sensors at certain places to measure the temperature. They will note the exact positions of those sensors relative to the room's coordinate system, and then, they will be all set to start measuring the temperature. Thus, the temperature of a room can be represented, in a discrete sense, by using a set of points that represent the temperature sensor locations and the actual temperature at those points. We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D: Figure1.3 In practice, scalar fields are discrete and ordered Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field is a point that has {x, y, z} values, along with a temperature value. A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point values (for 3D) as a row, but this time, an additional attribute for the scalar value will be created in the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique to analyze scalar fields is to find out the isocontours or isosurfaces of interest. However, for now, let's take a look at the kind of application areas such analysis will enable one to pursue. Instead of temperature, one could think of associating regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space. Thus, the chemist would create a scalar field with charge values at specific points. 
For an aerospace engineer, it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica.

Time series

Another widely used data type is the time series. A time series is a sequence of data points that are measured, usually over a uniform interval of time. Time series arise in many fields, but in today's world, they are mostly known for their applications in Economics and Finance. Other than these, they are frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own visualization tool to view time series data.

Time series can be easily represented using tables. Each row of the time series table will represent one point in the series, with one attribute denoting the timestamp (the time at which the data point was recorded) and the other attribute storing the actual data value. If the starting time and the time interval are known, then we can get rid of the time attribute and simply store the data value in each row. The actual timestamp of each value can be calculated using the initial time and time interval.

Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data.

Graphs

Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find beautiful insights about the dataset. Technically, a graph can be stored as a table. However, Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem, and is an active research area in computer science. A proper visualization layout, along with proper color maps and size distribution, can produce very useful outputs.

Text

The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific visualization package for state-of-the-art text visualization methods.

Cartographic data

As mentioned before, map visualization is one of the most ancient forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps provide an excellent way to contrast and compare different countries, cities, or even communities. Cartographic data comes in various forms. A common form of a single data item is one that includes latitude, longitude, location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place. The attributable quantity may not be numerical, but rather something qualitative, like text.
Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us. . Mathematica as a tool for visualization At this point, you might be wondering why Mathematica is suited for visualizing all the kinds of datasets that we have mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user. Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the worksheet. This makes Mathematica's capability similar to many widely used visualization languages. Mathematica provides a plethora of functions to combine these primitives and make them interactive. Speaking of interactivity, Mathematica provides a suite of functions to interactively display any of its process. Not only visualization, but any function or code evaluation can be interactively visualized. This is particularly helpful when managing and visualizing big datasets. Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Other than these, we will learn to use the advanced functions that will let us build our own visualization tools. Another interesting feature is the built-in datasets that this software provides to its users. This feature provides a nice playground for the user to experiment with different datasets and visualization functions. From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited for dealing with petabytes or exabytes of data (and many other popularly used visualization tools are not suited for that either), often, one needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototype such tools because of its efficient and fast data-handling capabilities, along with its loads of convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions. Another feature that Mathematica presents to a data scientist is the ability to keep the workflow within one worksheet. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite of a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features inside a single worksheet. This helps the user focus on the data instead of irrelevant details. By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs. 
If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will be done developing several visualization prototypes, each requiring only few lines of code! Getting started with Mathematica We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, let's open a new notebook by navigating to File|New|Notebook, and do the following experiments. Creating and selecting cells In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on the respective rightmost bracket. To select multiple cells, press Ctrl + right-mouse button in Windows or Linux (or cmd + right-mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell: Figure1.4 Selecting and evaluating cells in Mathematica We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor in between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V in Mac) or using the Edit menu bar. In order to delete cell(s), select the cell(s) and press the Delete key. Evaluating a cell A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter. In this case, the cells will be evaluated one after the other in the sequence in which they appear in the notebook. To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article, which are the polar coordinates fitting and Anscombe's quartet examples, and select each cell (or all of them) and evaluate them. If we evaluate a cell that uses variables declared in a previous cell, and the previous cell was not already evaluated, then we may get errors. It is possible that Mathematica will treat the unevaluated variables as a symbolic expression; in that case, no error will be displayed, but the results will not be numeric anymore. Suppressing output from a cell If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line that we want to leave out from the output. Cell formatting Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert the cell to other formats for convenient typesetting. In order to change the format of cell(s), select the cell(s) and navigate to Format|Style from the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes|Basic Math Assistant. Note that evaluating a text cell will have no effect/output. Commenting We can write any comment in a text cell as it will be ignored during the evaluation of our code. 
However, if we would like to write a comment inside an input cell, we use the (* operator to open a comment and the *) operator to close it, as shown in the following code snippet: (* This is a comment *) The shortcut Ctrl + / (cmd + / in Mac) is used to comment/uncomment a chunk of code too. This operation is also available in the menu bar. Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. Aborting evaluation We can abort the currently running evaluation of a cell by navigating to Evaluation|Abort Evaluation in the menu bar, or simply by pressing Alt + . (period). This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results at the end of the evaluation, or end a process that might use up the available memory and shut down the Mathematica kernel. Further reading The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing very strongly. Michael Friendly, from York University, published a historical development paper that is freely available online, titled Milestones in History of Data Visualization: A Case Study in Statistical Historiography. This is an entertaining compilation of the history of visualization methods. The book The Visual Display of Quantitative Information by Edward R. Tufte published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. This is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader can consider reading this book for deeper insights. Summary In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of dataset that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package was discussed, and we will see the demonstration of these properties throughout the book with beautiful visualizations. Finally, we have taken the first step to writing code in Mathematica. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] Importing Dynamic Data [article] Interacting with Data for Dashboards [article]


Building, Publishing, and Supporting Your Force.com Application

Packt
22 Sep 2014
39 min read
In this article by Andrew Fawcett, the author of Force.com Enterprise Architecture, we will use the declarative aspects of the platform to quickly build an initial version of an application, which will give you an opportunity to get some hands-on experience with some of the packaging and installation features that are needed to release applications to subscribers. We will also take a look at the facilities available to publish your application through Salesforce AppExchange (equivalent to the Apple App Store) and finally provide end user support. (For more resources related to this topic, see here.)

We will then use this application as a basis for incrementally releasing new versions of the application to build our understanding of Enterprise Application Development. The following topics outline what we will achieve in this article:

Required organizations
Introducing the sample application
Package types and benefits
Creating your first managed package
Package dependencies and uploading
Introduction to AppExchange and creating listings
Installing and testing your package
Becoming a Salesforce partner and its benefits
Licensing
Supporting your application
Customer metrics
Trialforce and Test Drive

Required organizations

Several Salesforce organizations are required to develop, package, and test your application. You can sign up for these organizations at https://developer.salesforce.com/, though in due course, as your relationship with Salesforce becomes more formal, you will have the option of accessing their Partner Portal website to create organizations of different types and capabilities. We will discuss more on this later.

It's a good idea to have some kind of naming convention to keep track of the different organizations and logins. Use the following as a guide and create these organizations via https://developer.salesforce.com/. As stated earlier, these organizations will be used only for the purposes of learning and exploring:

myapp@packaging.my.com (Packaging): Though we will perform initial work in this org, it will eventually be reserved solely for assembling and uploading a release.
myapp@testing.my.com (Testing): In this org, we will install the application and test upgrades. You may want to create several of these in practice, via the Partner Portal website described later in this article.
myapp@dev.my.com (Developing): Later, we will shift development of the application into this org, leaving the packaging org to focus only on packaging.

You will have to substitute myapp and my.com (perhaps by reusing your company domain name to avoid naming conflicts) with your own values. Take note of the example username andyapp@packaging.andyinthecloud.com.

The following are other organization types that you will eventually need in order to manage the publication and licensing of your application:

Production / CRM Org: Your organization may already be using this org for managing contacts, leads, opportunities, cases, and other CRM objects. Make sure that you have the complete authority to make changes, if any, to this org since this is where you run your business. If you do not have such an org, you can request one via the Partner Program website described later in this article, by requesting (via a case) a CRM ISV org. Even if you choose to not fully adopt Salesforce for this part of your business, such an org is still required when it comes to utilizing the licensing aspects of the platform.
AppExchange Publishing Org (APO): This org is used to manage your use of AppExchange. We will discuss this a little later in this article. This org is actually the same Salesforce org you designate as your production org, where you conduct your sales and support activities from.
License Management Org (LMO): Within this organization, you can track who installs your application (as leads), the licenses you grant to them, and for how long. It is recommended that this is the same org as the APO described earlier.
Trialforce Management Org (TMO): Trialforce is a way to provide orgs with your preconfigured application data for prospective customers to try out your application before buying. It will be discussed later in this article.
Trialforce Source Org (TSO)

Typically, the LMO and APO can be the same as your primary Salesforce production org, which allows you to track all your leads and future opportunities in the same place. This leads to the rule of APO = LMO = production org, though neither of them should be your actual developer or test orgs. You can work with Salesforce support and your Salesforce account manager to plan and assign these orgs.

Introducing the sample application

For this article, we will use the world of Formula1 motor car racing as the basis for a packaged application that we will build together. Formula1 is for me the motor sport that is equivalent to Enterprise applications software, due to its scale and complexity. It is also a sport that I follow, both of which helped me when building the examples that we will use. We will refer to this application as FormulaForce, though please keep in mind Salesforce's branding policies when naming your own application, as they prevent the use of the word "Force" in company or product titles. This application will focus on the data collection aspects of the races, drivers, and their many statistics, utilizing platform features to structure, visualize, and process this data in both historic and current contexts.

For this article, we will create some initial Custom Objects with the fields detailed in the following list. Do not worry about creating any custom tabs just yet. You can use your preferred approach for creating these initial objects. Ensure that you are logged in to your packaging org.

Season__c: Name (text)
Race__c: Name (text), Season__c (Master-Detail to Season__c)
Driver__c: Name
Contestant__c: Name (Auto Number, CONTESTANT-{00000000}), Race__c (Master-Detail to Race__c), Driver__c (Lookup to Driver__c)

The following screenshot shows the preceding objects within the Schema Builder tool, available under the Setup menu:

Package types and benefits

A package is a container that holds your application components such as Custom Objects, Apex code, Apex triggers, Visualforce pages, and so on. This makes up your application. While there are other ways to move components between Salesforce orgs, a package provides a container that you can use for your entire application or to deliver optional features by leveraging so-called extension packages.

There are two types of packages, managed and unmanaged. Unmanaged packages result in the transfer of components from one org to another; however, the result is as if those components had been originally created in the destination org, meaning that they can be readily modified or even deleted by the administrator of that org. They are also not upgradable and are not particularly ideal from a support perspective.
Moreover, the Apex code that you write is also visible for all to see, so your Intellectual Property is at risk. Unmanaged packages can be used for sharing template components that are intended to be changed by the subscriber. If you are not using GitHub and the GitHub Salesforce Deployment Tool (https://github.com/afawcett/githubsfdeploy), they can also provide a means to share open source libraries to developers. Features and benefits of managed packages Managed packages have the following features that are ideal for distributing your application. The org where your application package is installed is referred to as a subscriber org, since users of this org are subscribing to the services your application provides: Intellectual Property (IP) protection: Users in the subscriber org cannot see your Apex source code, although they can see your Visualforce pages code and static resources. While the Apex code is hidden, JavaScript code is not, so you may want to consider using a minify process to partially obscure such code. The naming scope: Your component names are unique to your package throughout the utilization of a namespace. This means that even if you have object X in your application, and the subscriber has an object of the same name, they remain distinct. You will define a namespace later in this article. The governor scope: Code in your application executes within its own governor limit scope (such as DML and SOQL governors that are subject to passing Salesforce Security Review) and is not affected by other applications or code within the subscriber org. Note that some governors such as the CPU time governor are shared by the whole execution context (discussed in a later article) regardless of the namespace. Upgrades and versioning: Once the subscribers have started using your application, creating data, making configurations, and so on, you will want to provide upgrades and patches with new versions of your application. There are other benefits to managed packages, but these are only accessible after becoming a Salesforce Partner and completing the security review process; these benefits are described later in this article. Salesforce provides ISVForce Guide (otherwise known as the Packaging Guide) in which these topics are discussed in depth; bookmark it now! The following is the URL for ISVForce Guide: http://login.salesforce.com/help/pdfs/en/salesforce_packaging_guide.pdf. Creating your first managed package Packages are created in your packaging org. There can be only one managed package being developed in your packaging org (though additional unmanaged packages are supported, it is not recommended to mix your packaging org with them). You can also install other dependent managed packages and reference their components from your application. The steps to be performed are discussed in the following sections: Setting your package namespace Creating the package and assigning it to the namespace Adding components to the package Setting your package namespace An important decision when creating a managed package is the namespace; this is a prefix applied to all your components (Custom Objects, Visualforce pages, and so on) and is used by developers in subscriber orgs to uniquely qualify between your packaged components and others, even those from other packages. The namespace prefix is an important part of the branding of the application since it is implicitly attached to any Apex code or other components that you include in your package. 
It can be up to 15 characters, though I personally recommend that you keep it shorter than this, as it becomes hard to remember and leads to frustrating typos if you make it too complicated. I would also avoid underscore characters. It is a good idea to have a naming convention if you are likely to create more managed packages in the future (in different packaging orgs). The following is the format of an example naming convention:

[company acronym - 1 to 4 characters][package prefix - 1 to 4 characters]

For example, the ACME Corporation's Road Runner application might be named acmerr. When the namespace has not been set, the Packages page (accessed under the Setup menu, under the Create submenu) indicates that only unmanaged packages can be created. Click on the Edit button to begin a small wizard to enter your desired namespace. This can only be done once, and the namespace must be globally unique (meaning it cannot be set in any other org), much like a website domain name. The following screenshot shows the Packages page.

Once you have set the namespace, the preceding page should look like the following screenshot, with the only difference being the namespace prefix that you have used. You are now ready to create a managed package and assign it to the namespace.

Creating the package and assigning it to the namespace

Click on the New button on the Packages page and give your package a name (it can be changed later). Make sure to tick the Managed checkbox as well. Click on Save and return to the Packages page, which should now look like the following screenshot.

Adding components to the package

In the Packages page, click on the link to your package in order to view its details. From this page, you can manage the contents of your package and upload it. Click on the Add button to add the Custom Objects created earlier in this article. Note that you do not need to add any custom fields; these are added automatically. The following screenshot shows broadly what your Package Details page should look like at this stage.

When you review the components added to the package, you will see that some components can be removed while others cannot. This is because the platform implicitly adds some components for you as they are dependencies. As we progress, adding different component types, you will see this list grow automatically in some cases, while in others we must add components explicitly.

Extension packages

As the name suggests, extension packages extend or add to the functionality delivered by the existing packages they are based on, though they cannot change the base package contents. They can extend one or more base packages, and you can even have several layers of extension packages, though you may want to keep an eye on how extensively you use this feature, as inter-package dependencies can become quite complex to manage, especially during development. Extension packages are created in pretty much the same way as the process you've just completed (including requiring their own packaging org), except that the packaging org must also have the dependent packages installed in it. As code and Visualforce pages contained within extension packages make references to Custom Objects, fields, Apex code, and Visualforce pages present in base packages, the platform tracks these dependencies and the version of the base package present at the time the reference was made.
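As an illustration of how this version tracking surfaces in metadata, an Apex class in an extension package that references a base package carries a packageVersions entry in its accompanying metadata file, roughly as shown below. The basepkg namespace, the version numbers, and the API version are invented values for this sketch.

<?xml version="1.0" encoding="UTF-8"?>
<ApexClass xmlns="http://soap.sforce.com/2006/04/metadata">
    <apiVersion>30.0</apiVersion>
    <!-- Version of the base package this class was saved against -->
    <packageVersions>
        <majorNumber>1</majorNumber>
        <minorNumber>0</minorNumber>
        <namespace>basepkg</namespace>
    </packageVersions>
    <status>Active</status>
</ApexClass>

These are the same settings that the Versions tab manages for you through the user interface.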
When an extension package is installed, this dependency information ensures that the subscriber org must have the correct version (minimum) of the base packages installed before permitting the installation to complete. You can also manage the dependencies between extension packages and base packages yourself through the Versions tab or XML metadata for applicable components. Package dependencies and uploading Packages can have dependencies on platform features and/or other packages. You can review and manage these dependencies through the usage of the Package detail page and the use of dynamic coding conventions as described here. While some features of Salesforce are common, customers can purchase different editions and features according to their needs. Developer Edition organizations have access to most of these features for free. This means that as you develop your application, it is important to understand when and when not to use those features. By default, when referencing a certain Standard Object, field, or component type, you will generate a prerequisite dependency on your package, which your customers will need to have before they can complete the installation. Some Salesforce features, for example Multi-Currency or Chatter, have either a configuration or, in some cases, a cost impact to your users (different org editions). Carefully consider which features your package is dependent on. Most of the feature dependencies, though not all, are visible via the View Dependencies button on the Package details page (this information is also available on the Upload page, allowing you to make a final check). It is a good practice to add this check into your packaging procedures to ensure that no unwanted dependencies have crept in. Clicking on this button, for the package that we have been building in this article so far, confirms that there are no dependencies. Uploading the release and beta packages Once you have checked your dependencies, click on the Upload button. You will be prompted to give a name and version to your package. The version will be managed for you in subsequent releases. Packages are uploaded in one of two modes (beta or release). We will perform a release upload by selecting the Managed - Released option from the Release Type field, so make sure you are happy with the objects created in the earlier section of this article, as they cannot easily be changed after this point. Once you are happy with the information on the screen, click on the Upload button once again to begin the packaging process. Once the upload process completes, you will see a confirmation page as follows: Packages can be uploaded in one of two states as described here: Release packages can be installed into subscriber production orgs and also provide an upgrade path from previous releases. The downside is that you cannot delete the previously released components and change certain things such as a field's type. Changes to the components that are marked global, such as Apex Code and Visualforce components, are also restricted. While Salesforce is gradually enhancing the platform to provide the ability to modify certain released aspects, you need to be certain that your application release is stable before selecting this option. Beta packages cannot be installed into subscriber production orgs; you can install only into Developer Edition (such as your testing org), sandbox, or Partner Portal created orgs. 
Also, Beta packages cannot be upgraded once installed; hence, this is the reason why Salesforce does not permit their installation into production orgs. The key benefit is in the ability to continue to change new components of the release, to address bugs and features relating to user feedback. The ability to delete previously-published components (uploaded within a release package) is in pilot. It can be enabled through raising a support case with Salesforce Support. Once you have understood the full implications, they will enable it. We have simply added some Custom Objects. So, the upload should complete reasonably quickly. Note that what you're actually uploading to is AppExchange, which will be covered in the following sections. If you want to protect your package, you can provide a password (this can be changed afterwards). The user performing the installation will be prompted for it during the installation process. Optional package dependencies It is possible to make some Salesforce features and/or base package component references (Custom Objects and fields) an optional aspect of your application. There are two approaches to this, depending on the type of the feature. Dynamic Apex and Visualforce For example, the Multi-Currency feature adds a CurrencyIsoCode field to the standard and Custom Objects. If you explicitly reference this field, for example in your Apex or Visualforce pages, you will incur a hard dependency on your package. If you want to avoid this and make it a configuration option (for example) in your application, you can utilize dynamic Apex and Visualforce. Extension packages If you wish to package component types that are only available in subscriber orgs of certain editions, you can choose to include these in extension packages. For example, you may wish to support Professional Edition, which does not support record types. In this case, create an Enterprise Edition extension package for your application's functionality, which leverages the functionality from this edition. Note that you will need multiple testing organizations for each combination of features that you utilize in this way, to effectively test the configuration options or installation options that your application requires. Introduction to AppExchange and listings Salesforce provides a website referred to as AppExchange, which lets prospective customers find, try out, and install applications built using Force.com. Applications listed here can also receive ratings and feedback. You can also list your mobile applications on this site as well. In this section, I will be using an AppExchange package that I already own. The package has already gone through the process to help illustrate the steps that are involved. For this reason, you do not need to perform these steps; they can be revisited at a later phase in your development once you're happy to start promoting your application. Once your package is known to AppExchange, each time you click on the Upload button on your released package (as described previously), you effectively create a private listing. Private listings are not visible to the public until you decide to make them so. It gives you the chance to prepare any relevant marketing details and pricing information while final testing is completed. Note that you can still distribute your package to other Salesforce users or even early beta or pilot customers without having to make your listing public. 
In order to start building a listing, you need to log in to AppExchange using the login details you designated to your AppExchange Publishing Org (APO). Go to www.appexchange.com and click on Login in the banner at the top-right corner. This will present you with the usual Salesforce login screen. Once logged in, you should see something like this: Select the Publishing Console option from the menu, then click on the Create New Listing button and complete the steps shown in the wizard to associate the packaging org with AppExchange; once completed, you should see it listed. It's really important that you consistently log in to AppExchange using your APO user credentials. Salesforce will let you log in with other users. To make it easy to confirm, consider changing the user's display name to something like MyCompany Packaging. Though it is not a requirement to complete the listing steps, unless you want to try out the process yourself a little further to see the type of information required, you can delete any private listings that you created after you complete this app. Installing and testing your package When you uploaded your package earlier in this article, you will receive an e-mail with a link to install the package. If not, review the Versions tab on the Package detail page in your packaging org. Ensure that you're logged out and click on the link. When prompted, log in to your testing org. The installation process will start. A reduced screenshot of the initial installation page is shown in the following screenshot; click on the Continue button and follow the default installation prompts to complete the installation: Package installation covers the following aspects (once the user has entered the package password if one was set): Package overview: The platform provides an overview of the components that will be added or updated (if this is an upgrade) to the user. Note that due to the namespace assigned to your package, these will not overwrite existing components in the subscriber org created by the subscriber. Connected App and Remote Access: If the package contains components that represent connections to the services outside of the Salesforce services, the user is prompted to approve these. Approve Package API Access: If the package contains components that make use of the client API (such as JavaScript code), the user is prompted to confirm and/or configure this. Such components will generally not be called much; features such as JavaScript Remoting are preferred, and they leverage the Apex runtime security configured post install. Security configuration: In this step, you can determine the initial visibility of the components being installed (objects, pages, and so on). Selecting admin only or the ability to select Profiles to be updated. This option predates the introduction of permission sets, which permit post installation configuration. If you package profiles in your application, the user will need to remember to map these to the existing profiles in the subscriber org as per step 2. This is a one-time option, as the profiles in the package are not actually installed, only merged. I recommend that you utilize permission sets to provide security configurations for your application. These are installed and are much more granular in nature. When the installation is complete, navigate to the Installed Packages menu option under the Setup menu. 
Here, you can see confirmation of some of your package details such as namespace and version, as well as any licensing details, which will be discussed later in this article. It is also possible to provide a Configure link for your package, which will be displayed next to the package when installed and listed on the Installed Packages page in the subscriber org. Here, you can provide a Visualforce page to access configuration options and processes, for example. If you have enabled seat-based licensing, there will also be a Manage Licenses link to determine which users in the subscriber org have access to your package components such as tabs, objects, and Visualforce pages. Licensing, in general, is discussed in more detail later in this article.

Automating package installation

It is possible to automate some of the processes using the Salesforce Metadata API and associated tools, such as the Salesforce Migration Toolkit (available from the Tools menu under Setup), which can be run from the popular Apache Ant scripting environment. This can be useful if you want to automate the deployment of your packages to customers or test orgs. Options that require a user response, such as the security configuration, are not covered by automation; however, password-protected managed packages are supported. You can find more details on this by looking up the InstalledPackage component in the online help for the Salesforce Metadata API at https://www.salesforce.com/us/developer/docs/api_meta/. As an aid to performing this from Ant, a custom Ant task can be found in the sample code related to this article (see /lib/ant-salesforce.xml). The following /build.xml Ant script uninstalls and then reinstalls the package; note that the installation will also upgrade a package if the package is already installed:

<project name="FormulaForce" basedir=".">

    <!-- Downloaded from Salesforce Tools page under Setup -->
    <typedef
        uri="antlib:com.salesforce"
        resource="com/salesforce/antlib.xml"
        classpath="${basedir}/lib/ant-salesforce.jar"/>

    <!-- Import macros to install/uninstall packages -->
    <import file="${basedir}/lib/ant-salesforce.xml"/>

    <target name="package.installdemo">

        <uninstallPackage
            namespace="yournamespace"
            username="${sf.username}"
            password="${sf.password}"/>

        <installPackage
            namespace="yournamespace"
            version="1.0"
            username="${sf.username}"
            password="${sf.password}"/>

    </target>

</project>

You can try the preceding example with your testing org by replacing the namespace attribute values with the namespace you entered earlier in this article. Enter the following command, all on one line, from the folder that contains the build.xml file:

ant package.installdemo -Dsf.username=testorgusername -Dsf.password=testorgpasswordtestorgtoken

You can also use the Salesforce Metadata API to list packages installed in an org, for example, if you wanted to determine whether a dependent package needs to be installed or upgraded before sending an installation request. Finally, you can also uninstall packages if you wish.

Becoming a Salesforce partner and benefits

The Salesforce Partner Program has many advantages. The first place to visit is http://www.salesforce.com/partners/overview. You will want to focus on the areas of the site relating to being an Independent Software Vendor (ISV) partner. From here, you can click on Join. It is free to join, though you will want to read through the various agreements carefully, of course.
Once you wish to start listing a package and charging users for it, you will need to arrange billing details for Salesforce to take the various fees involved. Pay careful attention to the Standard Objects used in your package, as this will determine the license type required by your users and the overall cost to them in addition to your charges. Obviously, Salesforce would prefer your application to use as many features of the CRM application as possible, which may also be beneficial to you as a feature of your application, since it's an appealing immediate integration not found on other platforms, such as the ability to instantly integrate with accounts and contacts. If you're planning on using Standard Objects and are in doubt about the costs (as they do vary depending on the type), you can request a conversation with Salesforce to discuss this; this is something to keep in mind in the early stages. Once you have completed the signup process, you will gain access to the Partner Portal (your user will end with @partnerforce.com). You must log in to the specific site as opposed to the standard Salesforce login; currently, the URL is https://www.salesforce.com/partners/login. Starting from July 2014, the http://partners.salesforce.com URL provides access to the Partner Community. Logging in to this service using your production org user credentials is recommended. The following screenshot shows what the current Partner Portal home page looks like. Here you can see some of its key features: This is your primary place to communicate with Salesforce and also to access additional materials and announcements relevant to ISVs, so do keep checking often. You can raise cases and provide additional logins to other users in your organization, such as other developers who may wish to report issues or ask questions. There is also the facility to create test or developer orgs; here, you can choose the appropriate edition (Professional, Group, Enterprise, and others) you want to test against. You can also create Partner Developer Edition orgs from this option as well. These carry additional licenses and limits over the public's so-called Single Developer Editions orgs and are thus recommended for use only once you start using the Partner Portal. Note, however, that these orgs do expire, subject to either continued activity over 6 months or renewing the security review process (described in the following section) each year. Once you click on the create a test org button, there is a link on the page displayed that navigates to a table that describes the benefits, processes, and the expiry rules. Security review and benefits The following features require that a completed package release goes through a Salesforce-driven process known as the Security review, which is initiated via your listing when logged into AppExchange. Unless you plan to give your package away for free, there is a charge involved in putting your package through this process. However, the review is optional. There is nothing stopping you from distributing your package installation URL directly. However, you will not be able to benefit from the ability to list your new application on AppExchange for others to see and review. More importantly, you will also not have access to the following features to help you deploy, license, and support your application. 
The following is a list of the benefits you get once your package has passed the security review: Bypass subscriber org setup limits: Limits such as the number of tabs and Custom Objects are bypassed. This means that if the subscriber org has reached its maximum number of Custom Objects, your package will still install. This feature is sometimes referred to as Aloha. Without this, your package installation may fail. You can determine whether Aloha has been enabled via the Subscriber Overview page that comes with the LMA application, which is discussed in the next section. Licensing: You are able to utilize the Salesforce-provided License Management Application in your LMO (License Management Org as described previously). Subscriber support: With this feature, the users in the subscriber org can enable, for a specific period, a means for you to log in to their org (without exchanging passwords), reproduce issues, and enable much more detailed debug information such as Apex stack traces. In this mode, you can also see custom settings that you have declared as protected in your package, which are useful for enabling additional debug or advanced features. Push upgrade: Using this feature, you can automatically apply upgrades to your subscribers without their manual intervention, either directly by you or on a scheduled basis. You may use this for applying either smaller bug fixes that don't affect the Custom Objects or APIs or deploy full upgrades. The latter requires careful coordination and planning with your subscribers to ensure that changes and new features are adopted properly. Salesforce asks you to perform an automated security scan of your software via a web page (http://security.force.com/security/tools/forcecom/scanner). This service can be quite slow depending on how many scans are in the queue. Another option is to obtain the Eclipse plugin from the actual vendor CheckMarx at http://www.checkmarx.com, which runs the same scan but allows you to control it locally. Finally, for the ultimate confidence as you develop your application, Salesforce can provide a license to integrate it into your Continuous Integration (CI) build system. Keep in mind that if you make any callouts to external services, Salesforce will also most likely ask you and/or the service provider to run a BURP scanner, to check for security flaws. Make sure you plan a reasonable amount of time (at least 2–3 weeks, in my experience) to go through the security review process; it is a must to initially list your package, though if it becomes an issue, you have the option of issuing your package install URL directly to initial customers and early adopters. Licensing Once you have completed the security review, you are able to request through raising support cases via the Partner Portal to have access to the LMA. Once this is provided by Salesforce, use the installation URL to install it like any other package into your LMO. If you have requested a CRM for ISV's org (through a case raised within the Partner Portal), you may find the LMA already installed. The following screenshot shows the main tabs of the License Management Application once installed: In this section, I will use a package that I already own and have already taken through the process to help illustrate the steps that are involved. For this reason, you do not need to perform these steps. After completing the installation, return to AppExchange and log in. Then, locate your listing in Publisher Console under Uploaded Packages. 
Next to your package, there will be a Manage Licenses link. The first time after clicking on this link, you will be asked to connect your package to your LMO org. Once this is done, you will be able to define the license requirements for your package. The following example shows the license for a free package, with an immediately active license for all users in the subscriber org: In most cases, for packages that you intend to charge for, you would select a free trial rather than setting the license default to active immediately. For paid packages, select a license length, unless perhaps it's a one-off charge, and then select the license that does not expire. Finally, if you're providing a trial license, you need to consider carefully the default number of seats (users); users may need to be able to assign themselves different roles in your application to get the full experience. While licensing is expressed at a package level currently, it is very likely that more granular licensing around the modules or features in your package will be provided by Salesforce in the future. This will likely be driven by the Permission Sets feature. As such, keep in mind a functional orientation to your Permission Set design. The Manage Licenses link is shown on the Installed Packages page next to your package if you configure a number of seats against the license. The administrator in the subscriber org can use this page to assign applicable users to your package. The following screenshot shows how your installed package looks to the administrator when the package has licensing enabled: Note that you do not need to keep reapplying the license requirements for each version you upload; the last details you defined will be carried forward to new versions of your package until you change them. Either way, these details can also be completely overridden on the License page of the LMA application as well. You may want to apply a site-wide (org-wide) active license to extensions or add-on packages. This allows you to at least track who has installed such packages even though you don't intend to manage any licenses around them, since you are addressing licensing on the main package. The Licenses tab and managing customer licenses The Licenses tab provides a list of individual license records that are automatically generated when the users install your package into their orgs. Salesforce captures this action and creates the relevant details, including Lead information, and also contains contact details of the organization and person who performed the install, as shown in the following screenshot: From each of these records, you can modify the current license details to extend the expiry period or disable the application completely. If you do this, the package will remain installed with all of its data. However, none of the users will be able to access the objects, Apex code, or pages, not even the administrator. You can also re-enable the license at any time. The following screenshot shows the License Edit section: The Subscribers tab The Subscribers tab lists all your customers or subscribers (it shows their Organization Name from the company profile) that have your packages installed (only those linked via AppExchange). This includes their organization ID, edition (Developer, Enterprise, or others), and also the type of instance (sandbox or production). The Subscriber Overview page When you click on Organization Name from the list in this tab, you are taken to the Subscriber Overview page. 
This page is sometimes known as the Partner Black Tab. This page is packed with useful information such as the contact details (also seen via the Leads tab) and the login access that may have been granted (we will discuss this in more detail in the next section), as well as which of your packages they have installed, its current licensed status, and when it was installed. The following is a screenshot of the Subscriber Overview page: How licensing is enforced in the subscriber org Licensing is enforced in one of two ways, depending on the execution context in which your packaged Custom Objects, fields, and Apex code are being accessed from. The first context is where a user is interacting directly with your objects, fields, tabs, and pages via the user interface or via the Salesforce APIs (Partner and Enterprise). If the user or the organization is not licensed for your package, these will simply be hidden from view, and in the case of the API, return an error. Note that administrators can still see packaged components under the Setup menu. The second context is the type of access made from Apex code, such as an Apex trigger or controller, written by the customers themselves or from within another package. This indirect way of accessing your package components is permitted if the license is site (org) wide or there is at least one user in the organization that is allocated a seat. This condition means that even if the current user has not been assigned a seat (via the Manage Licenses link), they are still accessing your application's objects and code, although indirectly, for example, via a customer-specific utility page or Apex trigger, which automates the creation of some records or defaulting of fields in your package. Your application's Apex triggers (for example, the ones you might add to Standard Objects) will always execute even if the user does not have a seat license, as long as there is just one user seat license assigned in the subscriber org to your package. However, if that license expires, the Apex trigger will no longer be executed by the platform, until the license expiry is extended. Providing support Once your package has completed the security review, additional functionality for supporting your customers is enabled. Specifically, this includes the ability to log in securely (without exchanging passwords) to their environments and debug your application. When logged in this way, you can see everything the user sees in addition to extended Debug Logs that contain the same level of details as they would in a developer org. First, your customer enables access via the Grant Account Login page. This time however, your organization (note that this is the Company Name as defined in the packaging org under Company Profile) will be listed as one of those available in addition to Salesforce Support. The following screenshot shows the Grant Account Login page: Next, you log in to your LMO and navigate to the Subscribers tab as described. Open Subscriber Overview for the customer, and you should now see the link to Login as that user. From this point on, you can follow the steps given to you by your customer and utilize the standard Debug Log and Developer Console tools to capture the debug information you need. The following screenshot shows a user who has been granted login access via your package to their org: This mode of access also permits you to see protected custom settings if you have included any of those in your package. 
If you have not encountered these before, it's well worth researching them as they provide an ideal way to enable and disable debug, diagnostic, or advanced configurations that you don't want your customers to normally see. Customer metrics Salesforce has started to expose information relating to the usage of your package components in the subscriber orgs since the Spring '14 release of the platform. This enables you to report what Custom Objects and Visualforce pages your customers are using and more importantly those they are not. This information is provided by Salesforce and cannot be opted out by the customer. At the time of writing, this facility is in pilot and needs to be enabled by Salesforce Support. Once enabled, the MetricsDataFile object is available in your production org and will receive a data file periodically that contains the metrics records. The Usage Metrics Visualization application can be found by searching on AppExchange and can help with visualizing this information. Trialforce and Test Drive Large enterprise applications often require some consultation with customers to tune and customize to their needs after the initial package installation. If you wish to provide trial versions of your application, Salesforce provides a means to take snapshots of the results of this installation and setup process, including sample data. You can then allow prospects that visit your AppExchange listing or your website to sign up to receive a personalized instance of a Salesforce org based on the snapshot you made. The potential customers can then use this to fully explore the application for a limited duration until they sign up to be a paid customer from the trial version. Such orgs will eventually expire when the Salesforce trial period ends for the org created (typically 14 days). Thus, you should keep this in mind when setting the default expiry on your package licensing. The standard approach is to offer a web form for the prospect to complete in order to obtain the trial. Review the Providing a Free Trial on your Website and Providing a Free Trial on AppExchange sections of the ISVForce Guide for more on this. You can also consider utilizing the Signup Request API, which gives you more control over how the process is started and the ability to monitor it, such that you can create the lead records yourself. You can find out more about this in the Creating Signups using the API section in the ISVForce Guide. Alternatively, if the prospect wishes to try your package in their sandbox environment for example, you can permit them to install the package directly either from AppExchange or from your website. In this case, ensure that you have defined a default expiry on your package license as described earlier. In this scenario, you or the prospect will have to perform the setup steps after installation. Finally, there is a third option called Test Drive, which does not create a new org for the prospect on request, but does require you to set up an org with your application, preconfigure it, and then link it to your listing via AppExchange. Instead of the users completing a signup page, they click on the Test Drive button on your AppExchange listing. This logs them into your test drive org as a read-only user. Because this is a shared org, the user experience and features you can offer to users is limited to those that mainly read information. I recommend that you consider Trialforce over this option unless there is some really compelling reason to use it. 
When defining your listing in AppExchange, the Leads tab can be used to configure the creation of lead records for trials, test drives, and other activities on your listing. Enabling this will result in a form being presented to the user before accessing these features on your listing. If you provide access to trials through signup forms on your website for example, lead information will not be captured. Summary This article has given you a practical overview of the initial package creation process through installing it into another Salesforce organization. While some of the features discussed cannot be fully exercised until you're close to your first release phase, you can now head to development with a good understanding of how early decisions such as references to Standard Objects are critical to your licensing and cost decisions. It is also important to keep in mind that while tools such as Trialforce help automate the setup, this does not apply to installing and configuring your customer environments. Thus, when making choices regarding configurations and defaults in your design, keep in mind the costs to the customer during the implementation cycle. Make sure you plan for the security review process in your release cycle (the free online version has a limited bandwidth) and ideally integrate it into your CI build system (a paid facility) as early as possible, since the tool not only monitors security flaws but also helps report breaches in best practices such as lack of test asserts and SOQL or DML statements in loops. As you revisit the tools covered in this article, be sure to reference the excellent ISVForce Guide at http://www.salesforce.com/us/developer/docs/packagingGuide/index.htm for the latest detailed steps and instructions on how to access, configure, and use these features. Resources for Article: Further resources on this subject: Salesforce CRM Functions [Article] Force.com: Data Management [Article] Configuration in Salesforce CRM [Article]

Improving Code Quality

Packt
22 Sep 2014
18 min read
In this article by Alexandru Vlăduţu, author of Mastering Web Application Development with Express, we are going to see how to test Express applications and how to improve the code quality of our code by leveraging existing NPM modules. (For more resources related to this topic, see here.) Creating and testing an Express file-sharing application Now, it's time to see how to develop and test an Express application with what we have learned previously. We will create a file-sharing application that allows users to upload files and password-protect them if they choose to. After uploading the files to the server, we will create a unique ID for that file, store the metadata along with the content (as a separate JSON file), and redirect the user to the file's information page. When trying to access a password-protected file, an HTTP basic authentication pop up will appear, and the user will have to only enter the password (no username in this case). The package.json file, so far, will contain the following code: { "name": "file-uploading-service", "version": "0.0.1", "private": true, "scripts": { "start": "node ./bin/www" }, "dependencies": { "express": "~4.2.0", "static-favicon": "~1.0.0", "morgan": "~1.0.0", "cookie-parser": "~1.0.1", "body-parser": "~1.0.0", "debug": "~0.7.4", "ejs": "~0.8.5", "connect-multiparty": "~1.0.5", "cuid": "~1.2.4", "bcrypt": "~0.7.8", "basic-auth-connect": "~1.0.0", "errto": "~0.2.1", "custom-err": "0.0.2", "lodash": "~2.4.1", "csurf": "~1.2.2", "cookie-session": "~1.0.2", "secure-filters": "~1.0.5", "supertest": "~0.13.0", "async": "~0.9.0" }, "devDependencies": { } } When bootstrapping an Express application using the CLI, a /bin/www file will be automatically created for you. The following is the version we have adopted to extract the name of the application from the package.json file. This way, in case we decide to change it we won't have to alter our debugging code because it will automatically adapt to the new name, as shown in the following code: #!/usr/bin/env node var pkg = require('../package.json'); var debug = require('debug')(pkg.name + ':main'); var app = require('../app'); app.set('port', process.env.PORT || 3000); var server = app.listen(app.get('port'), function() { debug('Express server listening on port ' + server.address().port); }); The application configurations will be stored inside config.json: { "filesDir": "files", "maxSize": 5 } The properties listed in the preceding code refer to the files folder (where the files will be updated), which is relative to the root and the maximum allowed file size. The main file of the application is named app.js and lives in the root. We need the connect-multiparty module to support file uploads, and the csurf and cookie-session modules for CSRF protection. The rest of the dependencies are standard and we have used them before. 
The full code for the app.js file is as follows: var express = require('express'); var path = require('path'); var favicon = require('static-favicon'); var logger = require('morgan'); var cookieParser = require('cookie-parser'); var session = require('cookie-session'); var bodyParser = require('body-parser'); var multiparty = require('connect-multiparty'); var Err = require('custom-err'); var csrf = require('csurf'); var ejs = require('secure-filters').configure(require('ejs')); var csrfHelper = require('./lib/middleware/csrf-helper'); var homeRouter = require('./routes/index'); var filesRouter = require('./routes/files'); var config = require('./config.json'); var app = express(); var ENV = app.get('env'); // view engine setup app.engine('html', ejs.renderFile); app.set('views', path.join(__dirname, 'views')); app.set('view engine', 'html'); app.use(favicon()); app.use(bodyParser.json()); app.use(bodyParser.urlencoded()); // Limit uploads to X Mb app.use(multiparty({ maxFilesSize: 1024 * 1024 * config.maxSize })); app.use(cookieParser()); app.use(session({ keys: ['rQo2#0s!qkE', 'Q.ZpeR49@9!szAe'] })); app.use(csrf()); // add CSRF helper app.use(csrfHelper); app.use('/', homeRouter); app.use('/files', filesRouter); app.use(express.static(path.join(__dirname, 'public'))); /// catch 404 and forward to error handler app.use(function(req, res, next) { next(Err('Not Found', { status: 404 })); }); /// error handlers // development error handler // will print stacktrace if (ENV === 'development') { app.use(function(err, req, res, next) { res.status(err.status || 500); res.render('error', { message: err.message, error: err }); }); } // production error handler // no stacktraces leaked to user app.use(function(err, req, res, next) { res.status(err.status || 500); res.render('error', { message: err.message, error: {} }); }); module.exports = app; Instead of directly binding the application to a port, we are exporting it, which makes our lives easier when testing with supertest. We won't need to care about things such as the default port availability or specifying a different port environment variable when testing. To avoid having to create the whole input when including the CSRF token, we have created a helper for that inside lib/middleware/csrf-helper.js: module.exports = function(req, res, next) { res.locals.csrf = function() { return "<input type='hidden' name='_csrf' value='" + req.csrfToken() + "' />"; } next(); }; For the password–protection functionality, we will use the bcrypt module and create a separate file inside lib/hash.js for the hash generation and password–compare functionality: var bcrypt = require('bcrypt'); var errTo = require('errto'); var Hash = {}; Hash.generate = function(password, cb) { bcrypt.genSalt(10, errTo(cb, function(salt) { bcrypt.hash(password, salt, errTo(cb, function(hash) { cb(null, hash); })); })); }; Hash.compare = function(password, hash, cb) { bcrypt.compare(password, hash, cb); }; module.exports = Hash; The biggest file of our application will be the file model, because that's where most of the functionality will reside. We will use the cuid() module to create unique IDs for files, and the native fs module to interact with the filesystem. 
The following code snippet contains the most important methods for models/file.js: function File(options, id) { this.id = id || cuid(); this.meta = _.pick(options, ['name', 'type', 'size', 'hash', 'uploadedAt']); this.meta.uploadedAt = this.meta.uploadedAt || new Date(); }; File.prototype.save = function(path, password, cb) { var _this = this; this.move(path, errTo(cb, function() { if (!password) { return _this.saveMeta(cb); } hash.generate(password, errTo(cb, function(hashedPassword) { _this.meta.hash = hashedPassword; _this.saveMeta(cb); })); })); }; File.prototype.move = function(path, cb) { fs.rename(path, this.path, cb); }; For the full source code of the file, browse the code bundle. Next, we will create the routes for the file (routes/files.js), which will export an Express router. As mentioned before, the authentication mechanism for password-protected files will be the basic HTTP one, so we will need the basic-auth-connect module. At the beginning of the file, we will include the dependencies and create the router: var express = require('express'); var basicAuth = require('basic-auth-connect'); var errTo = require('errto'); var pkg = require('../package.json'); var File = require('../models/file'); var debug = require('debug')(pkg.name + ':filesRoute'); var router = express.Router(); We will have to create two routes that will include the id parameter in the URL, one for displaying the file information and another one for downloading the file. In both of these cases, we will need to check if the file exists and require user authentication in case it's password-protected. This is an ideal use case for the router.param() function because these actions will be performed each time there is an id parameter in the URL. The code is as follows: router.param('id', function(req, res, next, id) { File.find(id, errTo(next, function(file) { debug('file', file); // populate req.file, will need it later req.file = file; if (file.isPasswordProtected()) { // Password – protected file, check for password using HTTP basic auth basicAuth(function(user, pwd, fn) { if (!pwd) { return fn(); } // ignore user file.authenticate(pwd, errTo(next, function(match) { if (match) { return fn(null, file.id); } fn(); })); })(req, res, next); } else { // Not password – protected, proceed normally next(); } })); }); The rest of the routes are fairly straightforward, using response.download() to send the file to the client, or using response.redirect() after uploading the file: router.get('/', function(req, res, next) { res.render('files/new', { title: 'Upload file' }); }); router.get('/:id.html', function(req, res, next) { res.render('files/show', { id: req.params.id, meta: req.file.meta, isPasswordProtected: req.file.isPasswordProtected(), hash: hash, title: 'Download file ' + req.file.meta.name }); }); router.get('/download/:id', function(req, res, next) { res.download(req.file.path, req.file.meta.name); }); router.post('/', function(req, res, next) { var tempFile = req.files.file; if (!tempFile.size) { return res.redirect('/files'); } var file = new File(tempFile); file.save(tempFile.path, req.body.password, errTo(next, function() { res.redirect('/files/' + file.id + '.html'); })); }); module.exports = router; The view for uploading a file contains a multipart form with a CSRF token inside (views/files/new.html): <%- include ../layout/header.html %> <form action="/files" method="POST" enctype="multipart/form-data"> <div class="form-group"> <label>Choose file:</label> <input type="file" name="file" /> </div> <div 
class="form-group"> <label>Password protect (leave blank otherwise):</label> <input type="password" name="password" /> </div> <div class="form-group"> <%- csrf() %> <input type="submit" /> </div> </form> <%- include ../layout/footer.html %> To display the file's details, we will create another view (views/files/show.html). Besides showing the basic file information, we will display a special message in case the file is password-protected, so that the client is notified that a password should also be shared along with the link: <%- include ../layout/header.html %> <p> <table> <tr> <th>Name</th> <td><%= meta.name %></td> </tr> <th>Type</th> <td><%= meta.type %></td> </tr> <th>Size</th> <td><%= meta.size %> bytes</td> </tr> <th>Uploaded at</th> <td><%= meta.uploadedAt %></td> </tr> </table> </p> <p> <a href="/files/download/<%- id %>">Download file</a> | <a href="/files">Upload new file</a> </p> <p> To share this file with your friends use the <a href="/files/<%- id %>">current link</a>. <% if (isPasswordProtected) { %> <br /> Don't forget to tell them the file password as well! <% } %> </p> <%- include ../layout/footer.html %> Running the application To run the application, we need to install the dependencies and run the start script: $ npm i $ npm start The default port for the application is 3000, so if we visit http://localhost:3000/files, we should see the following page: After uploading the file, we should be redirected to the file's page, where its details will be displayed: Unit tests Unit testing allows us to test individual parts of our code in isolation and verify their correctness. By making our tests focused on these small components, we decrease the complexity of the setup, and most likely, our tests should execute faster. Using the following command, we'll install a few modules to help us in our quest: $ npm i mocha should sinon––save-dev We are going to write unit tests for our file model, but there's nothing stopping us from doing the same thing for our routes or other files from /lib. The dependencies will be listed at the top of the file (test/unit/file-model.js): var should = require('should'); var path = require('path'); var config = require('../../config.json'); var sinon = require('sinon'); We will also need to require the native fs module and the hash module, because these modules will be stubbed later on. 
Apart from these, we will create an empty callback function and reuse it, as shown in the following code: // will be stubbing methods on these modules later on var fs = require('fs'); var hash = require('../../lib/hash'); var noop = function() {}; The tests for the instance methods will be created first: describe('models', function() { describe('File', function() { var File = require('../../models/file'); it('should have default properties', function() { var file = new File(); file.id.should.be.a.String; file.meta.uploadedAt.should.be.a.Date; }); it('should return the path based on the root and the file id', function() { var file = new File({}, '1'); file.path.should.eql(File.dir + '/1'); }); it('should move a file', function() { var stub = sinon.stub(fs, 'rename'); var file = new File({}, '1'); file.move('/from/path', noop); stub.calledOnce.should.be.true; stub.calledWith('/from/path', File.dir + '/1', noop).should.be.true; stub.restore(); }); it('should save the metadata', function() { var stub = sinon.stub(fs, 'writeFile'); var file = new File({}, '1'); file.meta = { a: 1, b: 2 }; file.saveMeta(noop); stub.calledOnce.should.be.true; stub.calledWith(File.dir + '/1.json', JSON.stringify(file.meta), noop).should.be.true; stub.restore(); }); it('should check if file is password protected', function() { var file = new File({}, '1'); file.meta.hash = 'y'; file.isPasswordProtected().should.be.true; file.meta.hash = null; file.isPasswordProtected().should.be.false; }); it('should allow access if matched file password', function() { var stub = sinon.stub(hash, 'compare'); var file = new File({}, '1'); file.meta.hash = 'hashedPwd'; file.authenticate('password', noop); stub.calledOnce.should.be.true; stub.calledWith('password', 'hashedPwd', noop).should.be.true; stub.restore(); }); We are stubbing the functionalities of the fs and hash modules because we want to test our code in isolation. Once we are done with the tests, we restore the original functionality of the methods. Now that we're done testing the instance methods, we will go on to test the static ones (assigned directly onto the File object): describe('.dir', function() { it('should return the root of the files folder', function() { path.resolve(__dirname + '/../../' + config.filesDir).should.eql(File.dir); }); }); describe('.exists', function() { var stub; beforeEach(function() { stub = sinon.stub(fs, 'exists'); }); afterEach(function() { stub.restore(); }); it('should callback with an error when the file does not exist', function(done) { File.exists('unknown', function(err) { err.should.be.an.instanceOf(Error).and.have.property('status', 404); done(); }); // call the function passed as argument[1] with the parameter `false` stub.callArgWith(1, false); }); it('should callback with no arguments when the file exists', function(done) { File.exists('existing-file', function(err) { (typeof err === 'undefined').should.be.true; done(); }); // call the function passed as argument[1] with the parameter `true` stub.callArgWith(1, true); }); }); }); }); To stub asynchronous functions and execute their callback, we use the stub.callArgWith() function provided by sinon, which executes the callback provided by the argument with the index <<number>> of the stub with the subsequent arguments. For more information, check out the official documentation at http://sinonjs.org/docs/#stubs. When running tests, Node developers expect the npm test command to be the command that triggers the test suite, so we need to add that script to our package.json file. 
However, since we are going to have different tests to be run, it would be even better to add a unit-tests script and make npm test run that for now. The scripts property should look like the following code: "scripts": { "start": "node ./bin/www", "unit-tests": "mocha --reporter=spec test/unit", "test": "npm run unit-tests" }, Now, if we run the tests, we should see the following output in the terminal: Functional tests So far, we have tested each method to check whether it works fine on its own, but now, it's time to check whether our application works according to the specifications when wiring all the things together. Besides the existing modules, we will need to install and use the following ones: supertest: This is used to test the routes in an expressive manner cheerio: This is used to extract the CSRF token out of the form and pass it along when uploading the file rimraf: This is used to clean up our files folder once we're done with the testing We will create a new file called test/functional/files-routes.js for the functional tests. As usual, we will list our dependencies first: var fs = require('fs'); var request = require('supertest'); var should = require('should'); var async = require('async'); var cheerio = require('cheerio'); var rimraf = require('rimraf'); var app = require('../../app'); There will be a couple of scenarios to test when uploading a file, such as: Checking whether a file that is uploaded without a password can be publicly accessible Checking that a password-protected file can only be accessed with the correct password We will create a function called uploadFile that we can reuse across different tests. This function will use the same supertest agent when making requests so it can persist the cookies, and will also take care of extracting and sending the CSRF token back to the server when making the post request. In case a password argument is provided, it will send that along with the file. The function will assert that the status code for the upload page is 200 and that the user is redirected to the file page after the upload. 
The full code of the function is listed as follows: function uploadFile(agent, password, done) { agent .get('/files') .expect(200) .end(function(err, res) { (err == null).should.be.true; var $ = cheerio.load(res.text); var csrfToken = $('form input[name=_csrf]').val(); csrfToken.should.not.be.empty; var req = agent .post('/files') .field('_csrf', csrfToken) .attach('file', __filename); if (password) { req = req.field('password', password); } req .expect(302) .expect('Location', /files/(.*).html/) .end(function(err, res) { (err == null).should.be.true; var fileUid = res.headers['location'].match(/files/(.*).html/)[1]; done(null, fileUid); }); }); } Note that we will use rimraf in an after function to clean up the files folder, but it would be best to have a separate path for uploading files while testing (other than the one used for development and production): describe('Files-Routes', function(done) { after(function() { var filesDir = __dirname + '/../../files'; rimraf.sync(filesDir); fs.mkdirSync(filesDir); When testing the file uploads, we want to make sure that without providing the correct password, access will not be granted to the file pages: describe("Uploading a file", function() { it("should upload a file without password protecting it", function(done) { var agent = request.agent(app); uploadFile(agent, null, done); }); it("should upload a file and password protect it", function(done) { var agent = request.agent(app); var pwd = 'sample-password'; uploadFile(agent, pwd, function(err, filename) { async.parallel([ function getWithoutPwd(next) { agent .get('/files/' + filename + '.html') .expect(401) .end(function(err, res) { (err == null).should.be.true; next(); }); }, function getWithPwd(next) { agent .get('/files/' + filename + '.html') .set('Authorization', 'Basic ' + new Buffer(':' + pwd).toString('base64')) .expect(200) .end(function(err, res) { (err == null).should.be.true; next(); }); } ], function(err) { (err == null).should.be.true; done(); }); }); }); }); }); It's time to do the same thing we did for the unit tests: make a script so we can run them with npm by using npm run functional-tests. At the same time, we should update the npm test script to include both our unit tests and our functional tests: "scripts": { "start": "node ./bin/www", "unit-tests": "mocha --reporter=spec test/unit", "functional-tests": "mocha --reporter=spec --timeout=10000 --slow=2000 test/functional", "test": "npm run unit-tests && npm run functional-tests" } If we run the tests, we should see the following output: Running tests before committing in Git It's a good practice to run the test suite before committing to git and only allowing the commit to pass if the tests have been executed successfully. The same applies for other version control systems. To achieve this, we should add the .git/hooks/pre-commit file, which should take care of running the tests and exiting with an error in case they failed. Luckily, this is a repetitive task (which can be applied to all Node applications), so there is an NPM module that creates this hook file for us. All we need to do is install the pre-commit module (https://www.npmjs.org/package/pre-commit) as a development dependency using the following command: $ npm i pre-commit ––save-dev This should automatically create the pre-commit hook file so that all the tests are run before committing (using the npm test command). The pre-commit module also supports running custom scripts specified in the package.json file. 
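For example, assuming the pre-commit key documented by the module, the relevant part of package.json could look like the following sketch, which runs both test suites before each commit. Treat the key name and values as an assumption to verify against the module's documentation rather than part of this article's project:

{
  "scripts": {
    "unit-tests": "mocha --reporter=spec test/unit",
    "functional-tests": "mocha --reporter=spec --timeout=10000 --slow=2000 test/functional"
  },
  "pre-commit": [
    "unit-tests",
    "functional-tests"
  ]
}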
For more details on how to achieve that, read the module documentation at https://www.npmjs.org/package/pre-commit. Summary In this article, we have learned about writing tests for Express applications and in the process, explored a variety of helpful modules. Resources for Article: Further resources on this subject: Web Services Testing and soapUI [article] ExtGWT Rich Internet Application: Crafting UI Real Estate [article] Rendering web pages to PDF using Railo Open Source [article]
Handling SELinux-aware Applications

Packt
19 Sep 2014
5 min read
This article is written by Sven Vermeulen, the author of SELinux Cookbook. In this article, we will cover how to control D-Bus message flows. (For more resources related to this topic, see here.) Controlling D-Bus message flows D-Bus implementation on Linux is an example of an SELinux-aware application, acting as a user space object manager. Applications can register themselves on a bus and can send messages between applications through D-Bus. These messages can be controlled through the SELinux policy as well. Getting ready Before looking at the SELinux access controls related to message flows, it is important to focus on a D-Bus service and see how its authentication is done (and how messages are relayed in D-Bus) as this is reflected in the SELinux integration. Go to /etc/dbus-1/system.d/ (which hosts the configuration files for D-Bus services) and take a look at a configuration file. For instance, the service configuration file for dnsmasq looks like the following: <!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN" "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd"> <busconfig> <policy user="root">    <allow own="uk.org.thekelleys.dnsmasq"/>    <allow send_destination="uk.org.thekelleys.dnsmasq"/> </policy> <policy context="default">    <deny own="uk.org.thekelleys.dnsmasq"/>    <deny send_destination="uk.org.thekelleys.dnsmasq"/> </policy> </busconfig> This configuration tells D-Bus that only the root Linux user is allowed to have a service own the uk.org.thekelleys.dnsmasq service and send messages to this service. Others (as managed through the default policy) are denied these operations. On a system with SELinux enabled, having root as the finest granularity doesn't cut it. So, let's look at how the SELinux policy can offer a fine-grained access control in D-Bus. How to do it… To control D-Bus message flows with SELinux, perform the following steps: Identify the domain of the application that will (or does) own the D-Bus service we are interested in. For the dnsmasq application, this would be dnsmasq_t: ~# ps -eZ | grep dnsmasq | awk '{print $1}' system_u:system_r:dnsmasq_t:s0-s0:c0.c1023 Identify the domain of the application that wants to send messages to the service. For instance, this could be the sysadm_t user domain. Allow the two domains to interact with each other through D-Bus messages as follows: gen_require(` class dbus send_msg; ') allow sysadm_t dnsmasq_t:dbus send_msg; allow dnsmasq_t sysadm_t:dbus send_msg; How it works… When an application connects to D-Bus, the SELinux label of its connection is used as the label to check when sending messages. As there is no transition for such connections, the label of the connection is the context of the process itself (the domain); hence, the selection of dnsmasq_t in the example. When D-Bus receives a request to send a message to a service, D-Bus will check the SELinux policy for the send_msg permission. It does so by passing on the information about the session (source and target context and the permission that is requested) to the SELinux subsystem, which computes whether access should be allowed or not. The access control itself, however, is not enforced by SELinux (it only gives feedback), but by D-Bus itself as governing the message flows is solely D-Bus' responsibility. This is also why, when developing D-Bus-related policies, both the class and permission need to be explicitly mentioned in the policy module. 
Without this, the development environment might error out, claiming that dbus is not a valid class. D-Bus checks the context of the client that is sending a message as well as the context of the connection of the service (which are both domain labels) and see if there is a send_msg permission allowed. As most communication is two-fold (sending a message and then receiving a reply), the permission is checked in both directions. After all, sending a reply is just sending a message (policy-wise) in the reverse direction. It is possible to verify this behavior with dbus-send if the rule is on a user domain. For instance, to look at the objects provided by the service, the D-Bus introspection can be invoked against the service: ~# dbus-send --system --dest=uk.org.thekelleys.dnsmasq --print-reply /uk/org/thekelleys/dnsmasq org.freedesktop.DBus.Introspectable.Introspect When SELinux does not have the proper send_msg allow rules in place, the following error will be logged by D-Bus in its service logs (but no AVC denial will show up as it isn't the SELinux subsystem that denies the access): Error org.freedesktop.DBus.Error.AccessDenied: An SELinux policy prevents this sender from sending this message to this recipient. 0 matched rules; type="method_call", sender=":1.17" (uid=0 pid=6738 comm="") interface="org.freedesktop.DBus.Introspectable" member="Introspect" error name="(unset)" requested_reply="0" destination="uk.org.thekelleys.dnsmasq" (uid=0 pid=6635 comm="") When the policy does allow the send_msg permission, the introspection returns an XML output showing the provided methods and interfaces for this service. There's more... The current D-Bus implementation is a pure user space implementation. Because more applications become dependent on D-Bus, work is being done to create a kernel-based D-Bus implementation called kdbus. The exact implementation details of this project are not finished yet, so it is unknown whether the SELinux access controls that are currently applicable to D-Bus will still be valid on kdbus. Summary In this article, we learned how to control D-Bus message flows. It also covers what happens when the policy has or doesn't have the send_msg permission in place. Resources for Article: Further resources on this subject: An Introduction to the Terminal [Article] Wireless and Mobile Hacks [Article] Baking Bits with Yocto Project [Article]
Jump Right In

Packt
19 Sep 2014
21 min read
In this article by Victor Quinn, J.D., the author of the book Getting Started with tmux, we'll go on a little tour, simulate an everyday use of tmux, and point out some key concepts along the way. tmux is short for Terminal Multiplexer. (For more resources related to this topic, see here.) Running tmux For now, let's jump right in and start playing with it. Open up your favorite terminal application and let's get started. Just run the following command: $ tmux You'll probably see a screen flash, and it'll seem like not much else has happened; it looks like you're right where you were previously, with a command prompt. The word tmux is gone, but not much else appears to have changed. However, you should notice that now there is a bar along the bottom of your terminal window. This can be seen in the following screenshot of the terminal window: Congratulations! You're now running tmux. That bar along the bottom is provided by tmux. We call this bar the status line. The status line gives you information about the session and window you are currently viewing, which other windows are available in this session, and more. Some of what's on that line may look like gibberish now, but we'll learn more about what things mean as we progress through this book. We'll also learn how to customize the status bar to ensure it always shows the most useful items for your workflow. These customizations include things that are a part of tmux (such as the time, date, server you are connected to, and so on) or things that are in third-party libraries (such as the battery level of your laptop, current weather, or number of unread mail messages). Sessions By running tmux with no arguments, you create a brand new session. In tmux, the base unit is called a session. A session can have one or more windows. A window can be broken into one or more panes. We'll have a sneak preview on this topic, what we have here on the current screen is a single pane taking up the whole window in a single session. Imagine that it could be split into two or more different terminals, all running different programs, and each visible split of the terminal is a pane. What is a session in tmux? It may be useful to think of a tmux session as a login on your computer. You can log on to your computer, which initiates a new session. After you log on by entering your username and password, you arrive at an empty desktop. This is similar to a fresh tmux session. You can run one or more programs in this session, where each program has its own window or windows and each window has its own state. In most operating systems, there is a way for you to log out, log back in, and arrive back at the same session, with the windows just as you left them. Often, some of the programs that you had opened will continue to run in the background when you log out, even though their windows are no longer visible. A session in tmux works in much the same way. So, it may be useful to think of tmux as a mini operating system that manages running programs, windows, and more, all within a session. You can have multiple sessions running at the same time. This is convenient if you want to have a session for each task you might be working on. You might have one for an application you are developing by yourself and another that you could use for pair programming. Alternatively, you might have one to develop an application and one to develop another. This way everything can be neat and clean and separate. Naming the session Each session has a name that you can set or change. 
Notice the [0] at the very left of the status bar? This is the name of the session in brackets. Here, since you just started tmux without any arguments, it was given the name 0. However, this is not a very useful name, so let's change it. In the prompt, just run the following command: $ tmux rename-session tutorial This tells tmux that you want to rename the current session and tutorial is the name you'd like it to have. Of course, you can name it anything you'd like. You should see that your status bar has now been updated, so now instead of [0] on the left-hand side, it should now say [tutorial]. Here's a screenshot of my screen: Of course, it's nice that the status bar now has a pretty name we defined rather than 0, but it provides many more utilities than this, as we'll see in a bit! It's worth noting that here we were giving a session a name, but this same command can also be used to rename an existing session. The window string The status bar has a string that represents each window to inform us about the things that are currently running. The following steps will help us to explore this a bit more: Let's fire up a text editor to pretend we're doing some coding: $ nano test Now type some stuff in there to simulate working very hard on some code: First notice how the text blob in our status bar just to the right of our session name ([tutorial]) has changed. It used to be 0:~* and now it's 0:nano*. Depending on the version of tmux and your chosen shell, yours may be slightly different (for example, 0:bash*). Let's decode this string a bit. This little string encodes a lot of information, some of which is provided in the following bullet points: The zero in the front represents the number of the window. As we'll shortly see, each window is given a number that we can use to identify and switch to it. The colon separates the window number from the name of the program running in that window. The symbols ~ or nano in the previous screenshot are loosely names of the running program. We say "loosely" because you'll notice that ~ is not the name of a program, but was the directory we were visiting. tmux is pretty slick about this; it knows some state of the program you're using and changes the default name of the window accordingly. Note that the name given is the default; it's possible to explicitly set one for the window, as we'll see later. The symbol * indicates that this is the currently viewed window. We only have one at the moment, so it's not too exciting; however, once we get more than one, it'll be very helpful. Creating another window OK! Now that we know a bit about a part of the status line, let's create a second window so we can run a terminal command. Just press Ctrl + b, then c, and you will be presented with a new window! A few things to note are as follows: Now there is a new window with the label 1:~*. It is given the number 1 because the last one was 0. The next will be 2, then 3, 4, and so on. The asterisk that denoted the currently active window has been moved to 1 since it is now the active one. The nano application is still running in window 0. The asterisk on window 0 has been replaced by a hyphen (-). The - symbol denotes the previously opened window. This is very helpful when you have a bunch of windows. Let's run a command here just to illustrate how it works. Run the following commands: $ echo "test" > test $ cat test The output of these commands can be seen in the following screenshot: This is just some stuff so we can help identify this window. 
Imagine in the real world though you are moving a file, performing operations with Git, viewing log files, running top, or anything else. Let's jump back to window 0 so we can see nano still running. Simply press Ctrl + b and l to switch back to the previously opened window (the one with the hyphen; l stands for the last). As shown in the following screenshot, you'll see that nano is alive, and well, it looks exactly as we left it: The prefix key There is a special key in tmux called the prefix key that is used to perform most of the keyboard shortcuts. We have even used it already quite a bit! In this section, we will learn more about it and run through some examples of its usage. You will notice that in the preceding exercise, we pressed Ctrl + b before creating a window, then Ctrl + b again before switching back, and Ctrl + b before a number to jump to that window. When using tmux, we'll be pressing this key a lot. It's even got a name! We call it the prefix key. Its default binding in tmux is Ctrl + b, but you can change that if you prefer something else or if it conflicts with a key in a program you often use within tmux. You can send the Ctrl + b key combination through to the program by pressing Ctrl + b twice in a row; however, if it's a keyboard command you use often, you'll most likely want to change it. This key is used before almost every command we'll use in tmux, so we'll be seeing it a lot. From here on, if we need to reference the prefix key, we'll do it like <Prefix>. This way if you rebind it, the text will still make sense. If you don't rebound it or see <Prefix>, just type Ctrl + b. Let's create another window for another task. Just run <Prefix>, c again. Now we've got three windows: 0, 1, and 2. We've got one running nano and two running shells, as shown in the following screenshot: Some more things to note are as follows: Now we have window 2, which is active. See the asterisk? Window 0 now has a hyphen because it was the last window we viewed. This is a clear, blank shell because the one we typed stuff into is over in Window 1. Let's switch back to window 1 to see our test commands above still active. The last time we switched windows, we used <Prefix>, l to jump to the last window, but that will not work to get us to window 1 at this point because the hyphen is on window 0. So, going to the last selected window will not get us to 1. Thankfully, it is very easy to switch to a window directly by its number. Just press <Prefix>, then the window number to jump to that window. So <Prefix>, 1 will jump to window 1 even though it wasn't the last one we opened, as shown in the following screenshot: Sure enough, now window 1 is active and everything is present, just as we left it. Now we typed some silly commands here, but it could just as well have been an active running process here, such as unit tests, code linting, or top. Any such process would run in the background in tmux without an issue. This is one of the most powerful features of tmux. In the traditional world, to have a long-running process in a terminal window and get some stuff done in a terminal, you would need two different terminal windows open; if you accidentally close one, the work done in that window will be gone. tmux allows you to keep just one terminal window open, and this window can have a multitude of different windows within it, closing all the different running processes. 
Closing this terminal window won't terminate the running processes; tmux will continue humming along in the background with all of the programs running behind the scenes. Help on key bindings Now a keen observer may notice that the trick of entering the window number will only work for the first 10 windows. This is because once you get into double digits, tmux won't be able to tell when you're done entering the number. If this trick of using the prefix key plus the number only works for the first 10 windows (windows 0 to 9), how will we select a window beyond 10? Thankfully, tmux gives us many powerful ways to move between windows. One of my favorites is the choose window interface. However, oh gee! This is embarrassing. Your author seems to have entirely forgotten the key combination to access the choose window interface. Don't fear though; tmux has a nice built-in way to access all of the key bindings. So let's use it! Press <Prefix>, ? to see your screen change to show a list with bind-key to the left, the key binding in the middle, and the command it runs to the right. You can use your arrow keys to scroll up and down, but there are a lot of entries there! Thankfully, there is a quicker way to get to the item you want without scrolling forever. Press Ctrl + s and you'll see a prompt appear that says Search Down:, where you can type a string and it will search the help document for that string. Emacs or vi mode tmux tries hard to play nicely with developer defaults, so it actually includes two different modes for many key combinations tailored for the two most popular terminal editors: Emacs and vi. These are referred to in tmux parlance as status-keys and mode-keys that can be either Emacs or vi. The tmux default mode is Emacs for all the key combinations, but it can be changed to vi via configuration. It may also be set to vi automatically based on the global $EDITOR setting in your shell. If you are used to Emacs, Ctrl + s should feel very natural since it's the command Emacs uses to search. So, if you try Ctrl + s and it has no effect, your keys are probably in the vi mode. We'll try to provide guidance when there is a mode-specific key like this by including the vi mode's counterpart in parentheses after the default key. For example, in this case, the command would look like Ctrl + s (/) since the default is Ctrl + s and / is the command in the vi mode. Type in choose-window and hit Enter to search down and find the choose-window key binding. Oh look! There it is; it's w: However, what exactly does that mean? Well, all that means is that we can type our prefix key (<Prefix>), followed by the key in that help document to run the mentioned command. First, let's get out of these help docs. To get out of these or any screens like them, generated by tmux, simply press q for quit and you should be back in the shell prompt for window 2. If you ever forget any key bindings, this should be your first step. A nice feature of this key binding help page is that it is dynamically updated as you change your key bindings. Later, when we get to Configuration, you may want to change bindings or bind new shortcuts. They'll all show up in this interface with the configuration you provide them with. Can't do that with manpages! Now, to open the choose window interface, simply type <Prefix>, w since w was the key shown in the help bound to choose-window and voilà: Notice how it nicely lays out all of the currently open windows in a task-manager-like interface. It's interactive too. 
You can use the arrow keys to move up and down to highlight whichever window you like and then just hit Enter to open it. Let's open the window with nano running. Move up to highlight window 0 and hit Enter. You may notice a few more convenient and intuitive ways to switch between the currently active windows when browsing through the key bindings help. For example, <Prefix>, p will switch to the previous window and <Prefix>, n will switch to the next window. Whether refreshing your recollection on a key binding you've already learnt or seeking to discover a new one, the key bindings help is an excellent resource. Searching for text Now we only have three windows so it's pretty easy to remember what's where, but what if we had 30 or 300? With tmux, that's totally possible. (Though, this is not terribly likely or useful! What would you do with 300 active windows?) One other convenient way to switch between windows is to use the find-window feature. This will prompt us for some text, and it will search all the active windows and open the window that has the text in it. If you've been following along, you should have the window with nano currently open (window 0). Remember we had a shell in window 1 where we had typed some silly commands? Let's try to switch to that one using the find-window feature. Type <Prefix>, f and you'll see a find-window prompt as shown in the following screenshot: Here, type in cat test and hit Enter . You'll see you've switched to window 1 because it had the cat test command in it. However, what if you search for some text that is ambiguous? For example, if you've followed along, you will see the word test appear multiple times on both windows 0 and 1. So, if you try find-window with just the word test, it couldn't magically switch right away because it wouldn't know which window you mean. Thankfully, tmux is smart enough to handle this. It will give you a prompt, similar to the choose-window interface shown earlier, but with only the windows that match the query (in our case, windows 0 and 1; window 2 did not have the word test in it). It also includes the first line in each window (for context) that had the text. Pick window 0 to open it. Detaching and attaching Now press <Prefix>, d . Uh oh! Looks like tmux is gone! The familiar status bar is no more available. The <Prefix> key set does nothing anymore. You may think we the authors have led you astray, causing you to lose your work. What will you do without that detailed document you were writing in nano? Fear not explorer, we are simply demonstrating another very powerful feature of tmux. <Prefix>, d will simply detach the currently active session, but it will keep running happily in the background! Yes, although it looks like it's gone, our session is alive and well. How can we get back to it? First, let's view the active sessions. In your terminal, run the following command: $ tmux list-sessions You should see a nice list that has your session name, number of windows, and date of creation and dimensions. If you had more than one session, you'd see them here too. To re attach the detached session to your session, simply run the following command: $ tmux attach-session –t tutorial This tells tmux to attach a session and the session to attach it to as the target (hence -t). In this case, we want to attach the session named tutorial. Sure enough, you should be back in your tmux session, with the now familiar status bar along the bottom and your nano masterpiece back in view. 
Note that this is the most verbose version of this command. You can actually omit the target if there is only one running session, as is in our scenario. This shortens the command to tmux attach-session. It can be further shortened because attach-session has a shorter alias, attach. So, we could accomplish the same thing with just tmux attach. Throughout this text, we will generally use the more verbose version, as they tend to be more descriptive, and leave shorter analogues as exercises for the reader. Explaining tmux commands Now you may notice that attach-session sounds like a pretty long command. It's the same as list-sessions, and there are many others in the lexicon of tmux commands that seem rather verbose. Tab completion There is less complexity to the long commands than it may seem because most of them can be tab-completed. Try going to your command prompt and typing the following: $ tmux list-se Next, hit the Tab key. You should see it fill out to this: $ tmux list-sessions So thankfully, due to tab completion, there is little need to remember these long commands. Note that tab completion will only work in certain shells with certain configurations, so if the tab completion trick doesn't work, you may want to search the Web and find a way to enable tab completion for tmux. Aliases Most of the commands have an alias, which is a shorter form of each command that can be used. For example, the alias of list-sessions is ls. The alias of new-session is new. You can see them all readily by running the tmux command list-commands (alias lscm), as used in the following code snippet: $ tmux list-commands This will show you a list of all the tmux commands along with their aliases in parenthesis after the full name. Throughout this text, we will always use the full form for clarity, but you could just as easily use the alias (or just tab complete of course). One thing you'll most likely notice is that only the last few lines are visible in your terminal. If you go for your mouse and try to scroll up, that won't work either! How can you view the text that is placed above? We will need to move into something called the Copy mode. Renaming windows Let's say you want to give a more descriptive name to a window. If you had three different windows, each with the nano editor open, seeing nano for each window wouldn't be all that helpful. Thankfully, it's very easy to rename a window. Just switch to the window you'd like to rename. Then <Prefix>, ,will prompt you for a new name. Let's rename the nano window to masterpiece. See how the status line has been updated and now shows window 0 with the masterpiece title as shown in the following screenshot. Thankfully, tmux is not smart enough to check the contents of your window; otherwise, we're not sure whether the masterpiece title would make it through. Killing windows As the last stop on our virtual tour, let's kill a window we no longer need. Switch to window 1 with our find-window trick by entering <Prefix>, f, cat test, Enter or of course we could use the less exciting <Prefix>, l command to move to the last opened window. Now let's say goodbye to this window. Press <Prefix>, & to kill it. You will receive a prompt to which you have to confirm that you want to kill it. This is a destructive process, unlike detaching, so be sure anything you care about has been saved. Once you confirm it, window 1 will be gone. Poor window 1! 
You will see that now there are only window 0 and window 2 left: You will also see that now <Prefix>, f, cat test, Enter no longer loads window 1 but rather says No windows matching: cat test. So, window 1 is really no longer with us. Whenever we create a new window, it will take the lowest available index, which in this case will be 1. So window 1 can rise again, but this time as a new and different window with little memory of its past. We can also renumber windows as we'll see later, so if window 1 being missing is offensive to your sense of aesthetics, fear not, it can be remedied! Summary In this article, we got to jump right in and get a whirlwind tour of some of the coolest features in tmux. Here is a quick summary of the features we covered in this article: Starting tmux Naming and renaming sessions The window string and what each chunk means Creating new windows The prefix key Multiple ways to switch back and forth between windows Accessing the help documents for available key bindings Detaching and attaching sessions Renaming and killing windows Resources for Article:  Further resources on this subject: Getting Started with GnuCash [article] Creating a Budget for your Business with Gnucash [article] Apache CloudStack Architecture [article]
Event-driven BPEL Process

Packt
19 Sep 2014
5 min read
In this article, by Matjaz B. Juric and Denis Weerasiri, authors of the book WS-BPEL 2.0 Beginner's Guide, we will study about the event-driven BPEL process. We will learn to develop a book shelving BPEL process. (For more resources related to this topic, see here.) Developing an event-driven BPEL process Firstly, we will develop an event-driven BPEL process. This is a BPEL process triggered by a business event. We will develop a process for book shelving. As we have already mentioned, such a process can be executed on various occasions, such as when a book arrives to the bookstore for the first time, after a customer has looked at the book, or even during an inventory. In contrast to a BPEL process, which exposes an operation that needs to be invoked explicitly, our book shelving process will react on a business event. We will call it a BookshelfEvent. We can see that in order to develop an event-driven BPEL process, we will need to firstly declare a business event, the BookshelfEvent. Following this, we will need to develop the event-driven book shelving BPEL process. Declaring a business event We will declare the BookshelfEvent business event, which will signal that a book is ready to be book shelved. Each business event contains a data payload, which is defined by the corresponding XML schema type. In our case, we will use the BookData type, the same one that we used in the book warehousing process. Time for action – declaring a business event To declare the BookshelfEvent business event, we will go to the composite view. We will proceed as follows: Right-click on the project in the Application window and select New and then Event Definition: A Create Event Definition dialog box will open. We will specify the EDL filename. This is the file where all the events are defined (similar to WSDL, where the web service operations are defined). We will use the BookEDL for the EDL filename. For the Namespace field, we will use http://packtpub.com/events/edl/BookEDL, as shown in the following screenshot: Next, we need to define the business events. We will use the green plus sign to declare the BookshelfEvent business event. After clicking on the green plus sign, the Create Event dialog box will open. We need to specify the event name, which is BookshelfEvent. We also have to specify the XML Type, which will be used for the event data payload. We will use the BookData from the Book Warehousing BPEL process schema, as shown in the following screenshot: After clicking on the OK button, we should see the following: What just happened? We have successfully declared the BookshelfEvent business event. This has generated the BookEDL.edl file with the source code as shown in the following screenshot: Developing a book shelving BPEL process After declaring the business event, we are ready to develop the event-driven book shelving BPEL process. The process will be triggered by our BookshelfEvent business event. This means that the process will not have a classic WSDL with the operation declaration. Rather it will be triggered by the BookshelfEvent business event. Time for action – developing an event-driven book shelving BPEL process To develop the event-driven book shelving BPEL process, we will go to the composite view. We will carry out the following steps: Drag-and-drop the BPEL Process service component from the right-hand side toolbar to the composite components area. The Create BPEL Process dialog box will open. 
We will select the BPEL 2.0 Specification, type BookShelvingBPEL for the Name of the process, and specify the namespace as http://packtpub.com/Bookstore/BookShelvingBPEL. Then, we will select Subscribe to Events from the drop-down list for the Template: Next, we will need to specify the event to which our BPEL process will be subscribed. We will select the green plus sign and the Event Chooser dialog window will open. Here, we will simply select the BookshelfEvent business event: After clicking on the OK button, we should see the following screenshot: For event-driven BPEL processes, three consistency strategies for delivering events exist. The one and only one option delivers the events in the global transaction. Guaranteed delivers events asynchronously without a global transaction. Immediate delivers events in the same global transaction and the same thread as the publisher, and the publish call does not return until all immediate subscribers have completed processing. After clicking on the OK button, we can find the new BookShelvingBPEL process on the composite diagram. Please note that the arrow icon denotes that the process is triggered by an event: Double-clicking on the BookShelvingBPEL process opens the BPEL editor, where we can see that the BPEL process has a slightly different <receive> activity, which denotes that the process will be triggered by an event. Also, notice that an event-driven process does not return anything to the client, as event-driven processes are one-way and asynchronous: What just happened? We have successfully created the BookShelvingBPEL process. Looking at the source code we can see that the overall structure is the same as with any other BPEL process. The difference is in the initial <receive> activity, which is triggered by the BookshelfEvent business event, as shown in the following screenshot: Have a go hero – implementing the BookShelvingBPEL process Implementing the event-driven BookShelvingBPEL process does not differ from implementing any other BPEL process. Therefore, it's your turn now. You should implement the BookShelvingBPEL process to do something meaningful. It could, for example, call a service which will query a database table. Or, it could include a human task. Summary In this article, we learned how to develop event-driven BPEL processes and how to invoke events from BPEL processes. We also learned to implement the BookShelvingBPEL process. Resources for Article: Further resources on this subject: Securing a BPEL process [Article] Scopes in Advanced BPEL [Article] Advanced Activities in BPEL [Article]
Driving Visual Analyses with Automobile Data (Python)

Packt
19 Sep 2014
19 min read
This article written by Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, and Abhijit Dasgupta, authors of the book Practical Data Science Cookbook, will cover the following topics: Getting started with IPython Exploring IPython Notebook Preparing to analyze automobile fuel efficiencies Exploring and describing the fuel efficiency data with Python (For more resources related to this topic, see here.) The dataset, available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, contains fuel efficiency performance metrics over time for all makes and models of automobiles in the United States of America. This dataset also contains numerous other features and attributes of the automobile models other than fuel economy, providing an opportunity to summarize and group the data so that we can identify interesting trends and relationships. We will perform the entire analysis using Python. However, we will ask the same questions and follow the same sequence of steps as before, again following the data science pipeline. With study, this will allow you to see the similarities and differences between the two languages for a mostly identical analysis. In this article, we will take a very different approach using Python as a scripting language in an interactive fashion that is more similar to R. We will introduce the reader to the unofficial interactive environment of Python, IPython, and the IPython notebook, showing how to produce readable and well-documented analysis scripts. Further, we will leverage the data analysis capabilities of the relatively new but powerful pandas library and the invaluable data frame data type that it offers. pandas often allows us to complete complex tasks with fewer lines of code. The drawback to this approach is that while you don't have to reinvent the wheel for common data manipulation tasks, you do have to learn the API of a completely different package, which is pandas. The goal of this article is not to guide you through an analysis project that you have already completed but to show you how that project can be completed in another language. More importantly, we want to get you, the reader, to become more introspective with your own code and analysis. Think not only about how something is done but why something is done that way in that particular language. How does the language shape the analysis? Getting started with IPython IPython is the interactive computing shell for Python that will change the way you think about interactive shells. It brings to the table a host of very useful functionalities that will most likely become part of your default toolbox, including magic functions, tab completion, easy access to command-line tools, and much more. We will only scratch the surface here and strongly recommend that you keep exploring what can be done with IPython. Getting ready If you have completed the installation, you should be ready to tackle the following recipes. Note that IPython 2.0, which is a major release, was launched in 2014. How to do it… The following steps will get you up and running with the IPython environment: Open up a terminal window on your computer and type ipython. You should be immediately presented with the following text: Python 2.7.5 (default, Mar 9 2014, 22:15:05)Type "copyright", "credits" or "license" for more information. IPython 2.1.0 -- An enhanced Interactive Python.?         -> Introduction and overview of IPython's features.%quickref -> Quick reference.help     -> Python's own help system.object?   
-> Details about 'object', use 'object??' for extra details.
In [1]:

Note that your version might be slightly different than what is shown in the preceding command-line output. Just to show you how great IPython is, type in ls, and you should be greeted with the directory listing! Yes, you have access to common Unix commands straight from your Python prompt inside the Python interpreter. Now, let's try changing directories. Type cd at the prompt, hit space, and now hit Tab. You should be presented with a list of directories available from within the current directory. Start typing the first few letters of the target directory, and then, hit Tab again. If there is only one option that matches, hitting the Tab key will automatically insert that name. Otherwise, the list of possibilities will show only those names that match the letters that you have already typed. Each letter that is entered acts as a filter when you press Tab. Now, type ?, and you will get a quick introduction to and overview of IPython's features. Let's take a look at the magic functions. These are special functions that IPython understands and will always start with the % symbol. The %paste function is one such example and is amazing for copying and pasting Python code into IPython without losing proper indentation. We will try the %timeit magic function that intelligently benchmarks Python code. Enter the following commands:

n = 100000
%timeit range(n)
%timeit xrange(n)

We should get an output like this:

1000 loops, best of 3: 1.22 ms per loop
1000000 loops, best of 3: 258 ns per loop

This shows you how much faster xrange is than range (1.22 milliseconds versus 258 nanoseconds!) and helps show you the utility of generators in Python. You can also easily run system commands by prefacing the command with an exclamation mark. Try the following command:

!ping www.google.com

You should see the following output:

PING google.com (74.125.22.101): 56 data bytes
64 bytes from 74.125.22.101: icmp_seq=0 ttl=38 time=40.733 ms
64 bytes from 74.125.22.101: icmp_seq=1 ttl=38 time=40.183 ms
64 bytes from 74.125.22.101: icmp_seq=2 ttl=38 time=37.635 ms

Finally, IPython provides an excellent command history. Simply press the up arrow key to access the previously entered command. Continue to press the up arrow key to walk backwards through the command list of your session and the down arrow key to come forward. Also, the magic %history command allows you to jump to a particular command number in the session. Type the following command to see the first command that you entered:

%history 1

Now, type exit to drop out of IPython and back to your system command prompt. How it works… There isn't much to explain here and we have just scratched the surface of what IPython can do. Hopefully, we have gotten you interested in diving deeper, especially with the wealth of new features offered by IPython 2.0, including dynamic and user-controllable data visualizations. See also IPython at http://ipython.org/ The IPython Cookbook at https://github.com/ipython/ipython/wiki?path=Cookbook IPython: A System for Interactive Scientific Computing at http://fperez.org/papers/ipython07_pe-gr_cise.pdf Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, available at http://www.packtpub.com/learning-ipython-for-interactive-computing-and-data-visualization/book The future of IPython at http://www.infoworld.com/print/236429 Exploring IPython Notebook IPython Notebook is the perfect complement to IPython. As per the IPython website:
As per the IPython website: "The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document." While this is a bit of a mouthful, it is actually a pretty accurate description. In practice, IPython Notebook allows you to intersperse your code with comments and images and anything else that might be useful. You can use IPython Notebooks for everything from presentations (a great replacement for PowerPoint) to an electronic laboratory notebook or a textbook. Getting ready If you have completed the installation, you should be ready to tackle the following recipes. How to do it… These steps will get you started with exploring the incredibly powerful IPython Notebook environment. We urge you to go beyond this simple set of steps to understand the true power of the tool. Type ipython notebook --pylab=inline in the command prompt. The --pylab=inline option should allow your plots to appear inline in your notebook. You should see some text quickly scroll by in the terminal window, and then, the following screen should load in the default browser (for me, this is Chrome). Note that the URL should be http://127.0.0.1:8888/, indicating that the browser is connected to a server running on the local machine at port 8888. You should not see any notebooks listed in the browser (note that IPython Notebook files have a .ipynb extension) as IPython Notebook searches the directory you launched it from for notebook files. Let's create a notebook now. Click on the New Notebook button in the upper right-hand side of the page. A new browser tab or window should open up, showing you something similar to the following screenshot: From the top down, you can see the text-based menu followed by the toolbar for issuing common commands, and then, your very first cell, which should resemble the command prompt in IPython. Place the mouse cursor in the first cell and type 5+5. Next, either navigate to Cell | Run or press Shift + Enter as a keyboard shortcut to cause the contents of the cell to be interpreted. You should now see something similar to the following screenshot. Basically, we just executed a simple Python statement within the first cell of our first IPython Notebook. Click on the second cell, and then, navigate to Cell | Cell Type | Markdown. Now, you can easily write markdown in the cell for documentation purposes. Close the two browser windows or tabs (the notebook and the notebook browser). Go back to the terminal in which you typed ipython notebook, hit Ctrl + C, then hit Y, and press Enter. This will shut down the IPython Notebook server. How it works… For those of you coming from either more traditional statistical software packages, such as Stata, SPSS, or SAS, or more traditional mathematical software packages, such as MATLAB, Mathematica, or Maple, you are probably used to the very graphical and feature-rich interactive environments provided by the respective companies. From this background, IPython Notebook might seem a bit foreign but hopefully much more user friendly and less intimidating than the traditional Python prompt. Further, IPython Notebook offers an interesting combination of interactivity and sequential workflow that is particularly well suited for data analysis, especially during the prototyping phases. R has a library called Knitr (http://yihui.name/knitr/) that offers the report-generating capabilities of IPython Notebook. 
When you type in ipython notebook, you are launching a server running on your local machine, and IPython Notebook itself is really a web application that uses a server-client architecture. The IPython Notebook server, as per ipython.org, uses a two-process kernel architecture with ZeroMQ (http://zeromq.org/) and Tornado. ZeroMQ is an intelligent socket library for high-performance messaging, helping IPython manage distributed compute clusters among other tasks. Tornado is a Python web framework and asynchronous networking module that serves IPython Notebook's HTTP requests. The project is open source and you can contribute to the source code if you are so inclined. IPython Notebook also allows you to export your notebooks, which are actually just text files filled with JSON, to a large number of alternative formats using the command-line tool called nbconvert (http://ipython.org/ipython-doc/rel-1.0.0/interactive/nbconvert.html). Available export formats include HTML, LaTex, reveal.js HTML slideshows, Markdown, simple Python scripts, and reStructuredText for the Sphinx documentation. Finally, there is IPython Notebook Viewer (nbviewer), which is a free web service where you can both post and go through static, HTML versions of notebook files hosted on remote servers (these servers are currently donated by Rackspace). Thus, if you create an amazing .ipynb file that you want to share, you can upload it to http://nbviewer.ipython.org/ and let the world see your efforts. There's more… We will try not to sing too loudly the praises of Markdown, but if you are unfamiliar with the tool, we strongly suggest that you try it out. Markdown is actually two different things: a syntax for formatting plain text in a way that can be easily converted to a structured document and a software tool that converts said text into HTML and other languages. Basically, Markdown enables the author to use any desired simple text editor (VI, VIM, Emacs, Sublime editor, TextWrangler, Crimson Editor, or Notepad) that can capture plain text yet still describe relatively complex structures such as different levels of headers, ordered and unordered lists, and block quotes as well as some formatting such as bold and italics. Markdown basically offers a very human-readable version of HTML that is similar to JSON and offers a very human-readable data format. See also IPython Notebook at http://ipython.org/notebook.html The IPython Notebook documentation at http://ipython.org/ipython-doc/stable/interactive/notebook.html An interesting IPython Notebook collection at https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks The IPython Notebook development retrospective at http://blog.fperez.org/2012/01/ipython-notebook-historical.html Setting up a remote IPython Notebook server at http://nbviewer.ipython.org/github/Unidata/tds-python-workshop/blob/master/ipython-notebook-server.ipynb The Markdown home page at https://daringfireball.net/projects/markdown/basics Preparing to analyze automobile fuel efficiencies In this recipe, we are going to start our Python-based analysis of the automobile fuel efficiencies data. Getting ready If you completed the first installation successfully, you should be ready to get started. How to do it… The following steps will see you through setting up your working directory and IPython for the analysis for this article: Create a project directory called fuel_efficiency_python. 
Download the automobile fuel efficiency dataset from http://fueleconomy.gov/feg/epadata/vehicles.csv.zip and store it in the preceding directory. Extract the vehicles.csv file from the zip file into the same directory. Open a terminal window and change the current directory (cd) to the fuel_efficiency_python directory. At the terminal, type the following command:

ipython notebook

Once the new page has loaded in your web browser, click on New Notebook. Click on the current name of the notebook, which is untitled0, and enter a new name for this analysis (mine is fuel_efficiency_python). Let's use the top-most cell for import statements. Type in the following commands:

import pandas as pd
import numpy as np
from ggplot import *
%matplotlib inline

Then, hit Shift + Enter to execute the cell. This imports both the pandas and numpy libraries, assigning them local names to save a few characters while typing commands. It also imports the ggplot library. Please note that using the from ggplot import * command line is not a best practice in Python and pours the ggplot package contents into our default namespace. However, we are doing this so that our ggplot syntax most closely resembles the R ggplot2 syntax, which is strongly not Pythonic. Finally, we use a magic command to tell IPython Notebook that we want matplotlib graphs to render in the notebook. In the next cell, let's import the data and look at the first few records:

vehicles = pd.read_csv("vehicles.csv")
vehicles.head

Then, press Shift + Enter. The following text should be shown: However, notice that a red warning message appears as follows:

/Library/Python/2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (22,23,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

This tells us that columns 22, 23, 70, 71, 72, and 73 contain mixed data types. Let's find the corresponding names using the following commands:

column_names = vehicles.columns.values
column_names[[22, 23, 70, 71, 72, 73]]

array([cylinders, displ, fuelType2, rangeA, evMotor, mfrCode], dtype=object)

Mixed data types sound like they could be problematic, so make a mental note of these column names. Remember, data cleaning and wrangling often consume 90 percent of project time.
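As a quick aside, and only as a sketch, the warning itself points at two ways of dealing with these mixed types once we know which columns are affected. Treating the last four flagged columns as plain text is an assumption made here for illustration, not something this recipe establishes:

# Re-read the file, letting pandas scan it fully before choosing column types
vehicles = pd.read_csv("vehicles.csv", low_memory=False)

# Inspect what pandas inferred for the columns that were flagged as mixed
flagged = ["cylinders", "displ", "fuelType2", "rangeA", "evMotor", "mfrCode"]
print(vehicles[flagged].dtypes)

# Alternatively, force the text-like columns to strings up front
vehicles = pd.read_csv(
    "vehicles.csv",
    dtype={"fuelType2": str, "rangeA": str, "evMotor": str, "mfrCode": str},
    low_memory=False,
)

Either variant keeps the DtypeWarning from reappearing on later imports; for the rest of this analysis, simply remembering the column names is enough.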
Thinking in data frames is critical for many data analyses yet also very different from thinking in array or matrix operations (say, if you are coming from MATLAB or C as your primary development languages). With the data frame, each column represents a different variable or characteristic and can be a different data type, such as floats, integers, or strings. Each row of the data frame is a separate observation or instance with its own set of values. For example, if each row represents a person, the columns could be age (an integer) and gender (a category or string). Often, we will want to select the set of observations (rows) that match a particular characteristic (say, all males) and examine this subgroup. The data frame is conceptually very similar to a table in a relational database. See also Data structures in pandas at http://pandas.pydata.org/pandas-docs/stable/dsintro.html Data frames in R at http://www.r-tutor.com/r-introduction/data-frame Exploring and describing the fuel efficiency data with Python Now that we have imported the automobile fuel efficiency dataset into IPython and witnessed the power of pandas, the next step is to replicate the preliminary analysis performed in R, getting your feet wet with some basic pandas functionality. Getting ready We will continue to grow and develop the IPython Notebook that we started in the previous recipe. If you've completed the previous recipe, you should have everything you need to continue. How to do it… First, let's find out how many observations (rows) are in our data using the following command: len(vehicles) 34287 If you switch back and forth between R and Python, remember that in R, the function is length and in Python, it is len. Next, let's find out how many variables (columns) are in our data using the following command: len(vehicles.columns) 74 Let's get a list of the names of the columns using the following command: print(vehicles.columns) Index([u'barrels08', u'barrelsA08', u'charge120', u'charge240', u'city08', u'city08U', u'cityA08', u'cityA08U', u'cityCD', u'cityE', u'cityUF', u'co2', u'co2A', u'co2TailpipeAGpm', u'co2TailpipeGpm', u'comb08', u'comb08U', u'combA08', u'combA08U', u'combE', u'combinedCD', u'combinedUF', u'cylinders', u'displ', u'drive', u'engId', u'eng_dscr', u'feScore', u'fuelCost08', u'fuelCostA08', u'fuelType', u'fuelType1', u'ghgScore', u'ghgScoreA', u'highway08', u'highway08U', u'highwayA08', u'highwayA08U', u'highwayCD', u'highwayE', u'highwayUF', u'hlv', u'hpv', u'id', u'lv2', u'lv4', u'make', u'model', u'mpgData', u'phevBlended', u'pv2', u'pv4', u'range', u'rangeCity', u'rangeCityA', u'rangeHwy', u'rangeHwyA', u'trany', u'UCity', u'UCityA', u'UHighway', u'UHighwayA', u'VClass', u'year', u'youSaveSpend', u'guzzler', u'trans_dscr', u'tCharger', u'sCharger', u'atvType', u'fuelType2', u'rangeA', u'evMotor', u'mfrCode'], dtype=object) The u letter in front of each string indicates that the strings are represented in Unicode (http://docs.python.org/2/howto/unicode.html) Let's find out how many unique years of data are included in this dataset and what the first and last years are using the following command: len(pd.unique(vehicles.year)) 31 min(vehicles.year) 1984 max(vehicles["year"]) 2014 Note that again, we have used two different syntaxes to reference individual columns within the vehicles data frame. Next, let's find out what types of fuel are used as the automobiles' primary fuel types. In R, we have the table function that will return a count of the occurrences of a variable's various values. 
In pandas, we use the following:

pd.value_counts(vehicles.fuelType1)

Regular Gasoline     24587
Premium Gasoline      8521
Diesel                1025
Natural Gas             57
Electricity             56
Midgrade Gasoline       41
dtype: int64

Now if we want to explore what types of transmissions these automobiles have, we immediately try the following command:

pd.value_counts(vehicles.trany)

However, this results in a bit of unexpected and lengthy output: What we really want to know is the number of cars with automatic and manual transmissions. We notice that the trany variable always starts with the letter A when it represents an automatic transmission and M for manual transmission. Thus, we create a new variable, trany2, that contains the first character of the trany variable, which is a string:

vehicles["trany2"] = vehicles.trany.str[0]
pd.value_counts(vehicles.trany2)

The preceding command yields the answer that we wanted, showing roughly twice as many automatics as manuals:

A   22451
M   11825
dtype: int64

How it works… In this recipe, we looked at some basic functionality in Python and pandas. We have used two different syntaxes (vehicles['trany'] and vehicles.trany) to access variables within the data frame. We have also used some of the core pandas functions to explore the data, such as the incredibly useful unique and value_counts functions. There's more... In terms of the data science pipeline, we have touched on two stages in a single recipe: data cleaning and data exploration. Often, when working with smaller datasets where the time to complete a particular action is quite short and can be completed on our laptop, we will very quickly go through multiple stages of the pipeline and then loop back, depending on the results. In general, the data science pipeline is a highly iterative process. The faster we can accomplish steps, the more iterations we can fit into a fixed time, and often, we can create a better final analysis. See also The pandas API overview at http://pandas.pydata.org/pandas-docs/stable/api.html Summary This article took you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time using the powerful programming language, Python. Resources for Article: Further resources on this subject: Importing Dynamic Data [article] MongoDB data modeling [article] Report Data Filtering [article]

Creating a RESTful API

Packt
19 Sep 2014
24 min read
In this article by Jason Krol, the author of Web Development with MongoDB and NodeJS, we will review the following topics: (For more resources related to this topic, see here.)

Introducing RESTful APIs
Installing a few basic tools
Creating a basic API server and sample JSON data
Responding to GET requests
Updating data with POST and PUT
Removing data with DELETE
Consuming external APIs from Node

What is an API? An Application Programming Interface (API) is a set of tools that a computer system makes available so that unrelated systems or software can interact with each other. Typically, a developer uses an API when writing software that will interact with a closed, external software system. The external software system provides an API as a standard set of tools that all developers can use. Many popular social networking sites provide developers with access to APIs to build tools to support those sites. The most obvious examples are Facebook and Twitter. Both have robust APIs that provide developers with the ability to build plugins and work with data directly, without being granted full access, as a general security precaution. As you will see in this article, providing your own API is not only fairly simple, but it also empowers you to provide your users with access to your data. You also have the added peace of mind of knowing that you are in complete control over what level of access you grant, what sets of data you make read-only, as well as what data can be inserted and updated.

What is a RESTful API? Representational State Transfer (REST) is a fancy way of saying CRUD over HTTP. What this means is that when you use a REST API, you have a uniform means to create, read, update, and delete data using simple HTTP URLs and a standard set of HTTP verbs. The most basic form of a REST API will accept one of the HTTP verbs at a URL and return some kind of data as a response. Typically, a REST API GET request will always return some kind of data, such as JSON, XML, HTML, or plain text. A POST or PUT request to a RESTful API URL will accept data to create or update. The URL for a RESTful API is known as an endpoint, and while working with these endpoints, it is typically said that you are consuming them. The standard HTTP verbs used while interfacing with REST APIs include:

GET: This retrieves data
POST: This submits data for a new record
PUT: This submits data to update an existing record
PATCH: This submits data to update only specific parts of an existing record
DELETE: This deletes a specific record

Typically, RESTful API endpoints are defined in a way that mimics the data models and have semantic URLs that are somewhat representative of the data models. What this means is that to request a list of models, for example, you would access an API endpoint of /models. Likewise, to retrieve a specific model by its ID, you would include that in the endpoint URL via /models/:Id.
Some sample RESTful API endpoint URLs are as follows: GET http://myapi.com/v1/accounts: This returns a list of accounts GET http://myapi.com/v1/accounts/1: This returns a single account by Id: 1 POST http://myapi.com/v1/accounts: This creates a new account (data submitted as a part of the request) PUT http://myapi.com/v1/accounts/1: This updates an existing account by Id: 1 (data submitted as part of the request) GET http://myapi.com/v1/accounts/1/orders: This returns a list of orders for account Id: 1 GET http://myapi.com/v1/accounts/1/orders/21345: This returns the details for a single order by Order Id: 21345 for account Id: 1 It's not a requirement that the URL endpoints match this pattern; it's just common convention. Introducing Postman REST Client Before we get started, there are a few tools that will make life much easier when you're working directly with APIs. The first of these tools is called Postman REST Client, and it's a Google Chrome application that can run right in your browser or as a standalone-packaged application. Using this tool, you can easily make any kind of request to any endpoint you want. The tool provides many useful and powerful features that are very easy to use and, best of all, free! Installation instructions Postman REST Client can be installed in two different ways, but both require Google Chrome to be installed and running on your system. The easiest way to install the application is by visiting the Chrome Web Store at https://chrome.google.com/webstore/category/apps. Perform a search for Postman REST Client and multiple results will be returned. There is the regular Postman REST Client that runs as an application built into your browser, and then separate Postman REST Client (packaged app) that runs as a standalone application on your system in its own dedicated window. Go ahead and install your preference. If you install the application as the standalone packaged app, an icon to launch it will be added to your dock or taskbar. If you installed it as a regular browser app, you can launch it by opening a new tab in Google Chrome and going to Apps and finding the Postman REST Client icon. After you've installed and launched the app, you should be presented with an output similar to the following screenshot: A quick tour of Postman REST Client Using Postman REST Client, we're able to submit REST API calls to any endpoint we want as well as modify the type of request. Then, we can have complete access to the data that's returned from the API as well as any errors that might have occurred. To test an API call, enter the URL to your favorite website in the Enter request URL here field and leave the dropdown next to it as GET. This will mimic a standard GET request that your browser performs anytime you visit a website. Click on the blue Send button. The request is made and the response is displayed at the bottom half of the screen. In the following screenshot, I sent a simple GET request to http://kroltech.com and the HTML is returned as follows: If we change this URL to that of the RSS feed URL for my website, you can see the XML returned: The XML view has a few more features as it exposes the sidebar to the right that gives you a handy outline to glimpse the tree structure of the XML data. Not only that, you can now see a history of the requests we've made so far along the left sidebar. This is great when we're doing more advanced POST or PUT requests and don't want to repeat the data setup for each request while testing an endpoint. 
Here is a sample API endpoint I submitted a GET request to that returns the JSON data in its response: A really nice thing about making API calls to endpoints that return JSON using Postman Client is that it parses and displays the JSON in a very nicely formatted way, and each node in the data is expandable and collapsible. The app is very intuitive so make sure you spend some time playing around and experimenting with different types of calls to different URLs. Using the JSONView Chrome extension There is one other tool I want to let you know about (while extremely minor) that is actually a really big deal. The JSONView Chrome extension is a very small plugin that will instantly convert any JSON you view directly via the browser into a more usable JSON tree (exactly like Postman Client). Here is an example of pointing to a URL that returns JSON from Chrome before JSONView is installed: And here is that same URL after JSONView has been installed: You should install the JSONView Google Chrome extension the same way you installed Postman REST Client—access the Chrome Web Store and perform a search for JSONView. Now that you have the tools to be able to easily work with and test API endpoints, let's take a look at writing your own and handling the different request types. Creating a Basic API server Let's create a super basic Node.js server using Express that we'll use to create our own API. Then, we can send tests to the API using Postman REST Client to see how it all works. In a new project workspace, first install the npm modules that we're going to need in order to get our server up and running: $ npm init $ npm install --save express body-parser underscore Now that the package.json file for this project has been initialized and the modules installed, let's create a basic server file to bootstrap up an Express server. Create a file named server.js and insert the following block of code: var express = require('express'),    bodyParser = require('body-parser'),    _ = require('underscore'), json = require('./movies.json'),    app = express();   app.set('port', process.env.PORT || 3500);   app.use(bodyParser.urlencoded()); app.use(bodyParser.json());   var router = new express.Router(); // TO DO: Setup endpoints ... app.use('/', router);   var server = app.listen(app.get('port'), function() {    console.log('Server up: http://localhost:' + app.get('port')); }); Most of this should look familiar to you. In the server.js file, we are requiring the express, body-parser, and underscore modules. We're also requiring a file named movies.json, which we'll create next. After our modules are required, we set up the standard configuration for an Express server with the minimum amount of configuration needed to support an API server. Notice that we didn't set up Handlebars as a view-rendering engine because we aren't going to be rendering any HTML with this server, just pure JSON responses. 
Creating sample JSON data Let's create the sample movies.json file that will act as our temporary data store (even though the API we build for the purposes of demonstration won't actually persist data beyond the app's life cycle): [{    "Id": "1",    "Title": "Aliens",    "Director": "James Cameron",    "Year": "1986",    "Rating": "8.5" }, {    "Id": "2",    "Title": "Big Trouble in Little China",    "Director": "John Carpenter",    "Year": "1986",    "Rating": "7.3" }, {    "Id": "3",    "Title": "Killer Klowns from Outer Space",    "Director": "Stephen Chiodo",    "Year": "1988",    "Rating": "6.0" }, {    "Id": "4",    "Title": "Heat",    "Director": "Michael Mann",    "Year": "1995",    "Rating": "8.3" }, {    "Id": "5",    "Title": "The Raid: Redemption",    "Director": "Gareth Evans",    "Year": "2011",    "Rating": "7.6" }] This is just a really simple JSON list of a few of my favorite movies. Feel free to populate it with whatever you like. Boot up the server to make sure you aren't getting any errors (note we haven't set up any routes yet, so it won't actually do anything if you tried to load it via a browser): $ node server.js Server up: http://localhost:3500 Responding to GET requests Adding a simple GET request support is fairly simple, and you've seen this before already in the app we built. Here is some sample code that responds to a GET request and returns a simple JavaScript object as JSON. Insert the following code in the routes section where we have the // TO DO: Setup endpoints ... waiting comment: router.get('/test', function(req, res) {    var data = {        name: 'Jason Krol',        website: 'http://kroltech.com'    };      res.json(data); }); Let's tweak the function a little bit and change it so that it responds to a GET request against the root URL (that is /) route and returns the JSON data from our movies file. Add this new route after the /test route added previously: router.get('/', function(req, res) {    res.json(json); }); The res (response) object in Express has a few different methods to send data back to the browser. Each of these ultimately falls back on the base send method, which includes header information, statusCodes, and so on. res.json and res.jsonp will automatically format JavaScript objects into JSON and then send using res.send. res.render will render a template view as a string and then send it using res.send as well. With that code in place, if we launch the server.js file, the server will be listening for a GET request to the / URL route and will respond with the JSON data of our movies collection. Let's first test it out using the Postman REST Client tool: GET requests are nice because we could have just as easily pulled that same URL via our browser and received the same result: However, we're going to use Postman for the remainder of our endpoint testing as it's a little more difficult to send POST and PUT requests using a browser. Receiving data – POST and PUT requests When we want to allow our users using our API to insert or update data, we need to accept a request from a different HTTP verb. When inserting new data, the POST verb is the preferred method to accept data and know it's for an insert. Let's take a look at code that accepts a POST request and data along with the request, and inserts a record into our collection and returns the updated JSON. 
Insert the following block of code after the route you added previously for GET: router.post('/', function(req, res) {    // insert the new item into the collection (validate first)    if(req.body.Id && req.body.Title && req.body.Director && req.body.Year && req.body.Rating) {        json.push(req.body);        res.json(json);    } else {        res.json(500, { error: 'There was an error!' });    } }); You can see the first thing we do in the POST function is check to make sure the required fields were submitted along with the actual request. Assuming our data checks out and all the required fields are accounted for (in our case every field), we insert the entire req.body object into the array as is using the array's push function. If any of the required fields aren't submitted with the request, we return a 500 error message instead. Let's submit a POST request this time to the same endpoint using the Postman REST Client. (Don't forget to make sure your API server is running with node server.js.): First, we submitted a POST request with no data, so you can clearly see the 500 error response that was returned. Next, we provided the actual data using the x-www-form-urlencoded option in Postman and provided each of the name/value pairs with some new custom data. You can see from the results that the STATUS was 200, which is a success and the updated JSON data was returned as a result. Reloading the main GET endpoint in a browser yields our original movies collection with the new one added. PUT requests will work in almost exactly the same way except traditionally, the Id property of the data is handled a little differently. In our example, we are going to require the Id attribute as a part of the URL and not accept it as a parameter in the data that's submitted (since it's usually not common for an update function to change the actual Id of the object it's updating). Insert the following code for the PUT route after the existing POST route you added earlier: router.put('/:id', function(req, res) {    // update the item in the collection    if(req.params.id && req.body.Title && req.body.Director && req.body.Year && req.body.Rating) {        _.each(json, function(elem, index) {             // find and update:            if (elem.Id === req.params.id) {                elem.Title = req.body.Title;                elem.Director = req.body.Director;                elem.Year = req.body.Year;                elem.Rating = req.body.Rating;            }        });          res.json(json);    } else {        res.json(500, { error: 'There was an error!' });    } }); This code again validates that the required fields are included with the data that was submitted along with the request. Then, it performs an _.each loop (using the underscore module) to look through the collection of movies and find the one whose Id parameter matches that of the Id included in the URL parameter. Assuming there's a match, the individual fields for that matched object are updated with the new values that were sent with the request. Once the loop is complete, the updated JSON data is sent back as the response. Similarly, in the POST request, if any of the required fields are missing, a simple 500 error message is returned. The following screenshot demonstrates a successful PUT request updating an existing record. 
The response from Postman after including the value 1 in the URL as the Id parameter, which provides the individual fields to update as x-www-form-urlencoded values, and finally sending as PUT shows that the original item in our movies collection is now the original Alien (not Aliens, its sequel as we originally had). Removing data – DELETE The final stop on our whirlwind tour of the different REST API HTTP verbs is DELETE. It should be no surprise that sending a DELETE request should do exactly what it sounds like. Let's add another route that accepts DELETE requests and will delete an item from our movies collection. Here is the code that takes care of DELETE requests that should be placed after the existing block of code from the previous PUT: router.delete('/:id', function(req, res) {    var indexToDel = -1;    _.each(json, function(elem, index) {        if (elem.Id === req.params.id) {            indexToDel = index;        }    });    if (~indexToDel) {        json.splice(indexToDel, 1);    }    res.json(json); }); This code will loop through the collection of movies and find a matching item by comparing the values of Id. If a match is found, the array index for the matched item is held until the loop is finished. Using the array.splice function, we can remove an array item at a specific index. Once the data has been updated by removing the requested item, the JSON data is returned. Notice in the following screenshot that the updated JSON that's returned is in fact no longer displaying the original second item we deleted. Note that ~ in there! That's a little bit of JavaScript black magic! The tilde (~) in JavaScript will bit flip a value. In other words, take a value and return the negative of that value incremented by one, that is ~n === -(n+1). Typically, the tilde is used with functions that return -1 as a false response. By using ~ on -1, you are converting it to a 0. If you were to perform a Boolean check on -1 in JavaScript, it would return true. You will see ~ is used primarily with the indexOf function and jQuery's $.inArray()—both return -1 as a false response. All of the endpoints defined in this article are extremely rudimentary, and most of these should never ever see the light of day in a production environment! Whenever you have an API that accepts anything other than GET requests, you need to be sure to enforce extremely strict validation and authentication rules. After all, you are basically giving your users direct access to your data. Consuming external APIs from Node.js There will undoubtedly be a time when you want to consume an API directly from within your Node.js code. Perhaps, your own API endpoint needs to first fetch data from some other unrelated third-party API before sending a response. Whatever the reason, the act of sending a request to an external API endpoint and receiving a response can be done fairly easily using a popular and well-known npm module called Request. Request was written by Mikeal Rogers and is currently the third most popular and (most relied upon) npm module after async and underscore. Request is basically a super simple HTTP client, so everything you've been doing with Postman REST Client so far is basically what Request can do, only the resulting data is available to you in your node code as well as the response status codes and/or errors, if any. Consuming an API endpoint using Request Let's do a neat trick and actually consume our own endpoint as if it was some third-party external API. 
First, we need to ensure we have Request installed and can include it in our app: $ npm install --save request Next, edit server.js and make sure you include Request as a required module at the start of the file: var express = require('express'),    bodyParser = require('body-parser'),    _ = require('underscore'),    json = require('./movies.json'),    app = express(),    request = require('request'); Now let's add a new endpoint after our existing routes, which will be an endpoint accessible in our server via a GET request to /external-api. This endpoint, however, will actually consume another endpoint on another server, but for the purposes of this example, that other server is actually the same server we're currently running! The Request module accepts an options object with a number of different parameters and settings, but for this particular example, we only care about a few. We're going to pass an object that has a setting for the method (GET, POST, PUT, and so on) and the URL of the endpoint we want to consume. After the request is made and a response is received, we want an inline callback function to execute. Place the following block of code after your existing list of routes in server.js: router.get('/external-api', function(req, res) {    request({            method: 'GET',            uri: 'http://localhost:' + (process.env.PORT || 3500),        }, function(error, response, body) {             if (error) { throw error; }              var movies = [];            _.each(JSON.parse(body), function(elem, index) {                movies.push({                    Title: elem.Title,                    Rating: elem.Rating                });            });            res.json(_.sortBy(movies, 'Rating').reverse());        }); }); The callback function accepts three parameters: error, response, and body. The response object is like any other response that Express handles and has all of the various parameters as such. The third parameter, body, is what we're really interested in. That will contain the actual result of the request to the endpoint that we called. In this case, it is the JSON data from our main GET route we defined earlier that returns our own list of movies. It's important to note that the data returned from the request is returned as a string. We need to use JSON.parse to convert that string to actual usable JSON data. Using the data that came back from the request, we transform it a little bit. That is, we take that data and manipulate it a bit to suit our needs. In this example, we took the master list of movies and just returned a new collection that consists of only the title and rating of each movie and then sorts the results by the top scores. Load this new endpoint by pointing your browser to http://localhost:3500/external-api, and you can see the new transformed JSON output to the screen. Let's take a look at another example that's a little more real world. Let's say that we want to display a list of similar movies for each one in our collection, but we want to look up that data somewhere such as www.imdb.com. Here is the sample code that will send a GET request to IMDB's JSON API, specifically for the word aliens, and returns a list of related movies by the title and year. 
Go ahead and place this block of code after the previous route for external-api:

router.get('/imdb', function(req, res) {
    request({
            method: 'GET',
            uri: 'http://sg.media-imdb.com/suggests/a/aliens.json',
        }, function(err, response, body) {
            var data = body.substring(body.indexOf('(') + 1);
            data = JSON.parse(data.substring(0, data.length - 1));
            var related = [];
            _.each(data.d, function(movie, index) {
                related.push({
                    Title: movie.l,
                    Year: movie.y,
                    Poster: movie.i ? movie.i[0] : ''
                });
            });

            res.json(related);
        });
});

If we take a look at this new endpoint in a browser, we can see that the JSON data returned from our /imdb endpoint is itself retrieved from some other API endpoint. Note that the JSON endpoint I'm using for IMDB isn't actually from their API, but rather what they use on their homepage when you type in the main search box. This would not really be the most appropriate way to use their data, but it's more of a hack to show this example. In reality, to use their API (like most other APIs), you would need to register and get an API key that you would use so that they can properly track how much data you are requesting on a daily or an hourly basis. Most APIs will require you to use a private key for this same reason.

Summary In this article, we took a brief look at how APIs work in general and the RESTful API approach to semantic URL paths and arguments, and we created a bare-bones API. We used Postman REST Client to interact with the API by consuming endpoints and testing the different types of request methods (GET, POST, PUT, and so on). You also learned how to consume an external API endpoint by using the third-party node module Request.

Resources for Article: Further resources on this subject: RESTful Services JAX-RS 2.0 [Article] REST – Where It Begins [Article] RESTful Web Services – Server-Sent Events (SSE) [Article]
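
If you would rather script a quick check of the API from this article than click through Postman, the same Request module can drive the endpoints. The following is only a hedged sketch: it assumes the server built above is running locally on port 3500, and the smoke-test.js filename and the movie values are made up for illustration.

// smoke-test.js -- an illustrative sketch, not part of the book's code
var request = require('request');

var base = 'http://localhost:3500';

// create a new movie with POST (sent as x-www-form-urlencoded, which
// matches the bodyParser.urlencoded() middleware configured earlier)
request({
    method: 'POST',
    uri: base + '/',
    form: {
        Id: '6',
        Title: 'Predator',
        Director: 'John McTiernan',
        Year: '1987',
        Rating: '7.8'
    }
}, function(error, response, body) {
    if (error) { throw error; }
    console.log('Movies after POST: ' + JSON.parse(body).length);

    // then remove the same record again with DELETE
    request({ method: 'DELETE', uri: base + '/6' }, function(err, res, delBody) {
        if (err) { throw err; }
        console.log('Movies after DELETE: ' + JSON.parse(delBody).length);
    });
});

Against a freshly started server, running node smoke-test.js should print 6 and then 5, mirroring the collection changes we observed in Postman.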

Mobility

Packt
19 Sep 2014
11 min read
In this article by Martyn Coupland, author of the book, Microsoft System Center Configuration Manager Advanced Deployments, we will explore some of these options and look at how they can help you manage mobility in your workforce. We'll cover the following topics: Deploying company resource profiles Managing roaming devices Integrating the Microsoft Exchange connector Using Windows Intune (For more resources related to this topic, see here.) Deploying company resource profiles One of the improvements that shipped with the R2 release of Configuration Manager 2012 was the ability to deploy company resources such as Wi-Fi profiles, certificates, and VPN profiles. This functionality really opened up the management story for organizations that already have a big take up of bring your own device or have mobility in their long-term strategy. You do not need Windows Intune to deploy company resource profiles. The company resource profiles are really useful in extending some of the services that you provide to domain-based clients using Group Policy. Some examples of this include deploying VPN and Wi-Fi profiles to domain clients using Group Policy preferences. As you cannot deploy a group policy to non-domain-joined devices, it becomes really useful to manage and deploy these via Configuration Manager. Another great use case for company resource profiles is deploying certificates. Configuration Manager includes the functionality to allow managed clients to have certificates enrolled to them. This can include those resources that rarely or never contact the domain. This scenario is becoming more common, so it is important that we have the capability to deploy these settings to users without relying on the domain. Managing Wi-Fi profiles with Configuration Manager The deployment of Wi-Fi profiles in Configuration Manager is very similar to that of a manual setup. The wizard provides you with the same options that you would expect to see should you configure the network manually within Windows. You can also configure a number of security settings, such as certificates for clients and server authentication. You can configure the following device types with Wi-Fi profiles: Windows 8.1 32-bit Windows 8.1 64-bit Windows RT 8.1 Windows Phone 8.1 iOS 5, iOS 6, and iOS 7 iOS 5, iOS 6, and iOS 7 Android devices that run Version 4 Configuring a Wi-Fi network profile in Configuration Manager is a simple process that is wizard-driven. First, in the Assets and Compliance workspace, expand Compliance Settings and Company Resource Access, and then click on Wi-Fi Profiles. Right-click on the node and select Create Wi-Fi Profile, or select this option from the home tab on the ribbon. On the general page of the wizard, provide a name for the profile. If required, you can add a description here as well. If you have exported the settings from a Windows 8.1 device, you can import them here as well. Click on Next in the Wi-Fi Profile page that you need to provide information about the network you want to connect to. Network Name is what is displayed on the users' device and so should be friendly for them. You also need to enter the SSID of the network. Make sure this is entered correctly as clients will use this to attempt to connect to the network. You can also specify other settings here, like you can in Windows, such as specifying whether we should connect if the network is not broadcasting or while the network is in range. Click on Next to continue to the security configuration page. 
Depending on the security, encryption, and Extensible Authentication Protocol (EAP) settings that you select, some items on this page of the wizard might not be available. As shown in the previous screenshot, the settings you configure here replicate those that you can configure in Windows when manually connecting to the network. On the Advanced Settings page of Create Wi-Fi Profile Wizard, specify any additional settings for the Wi-Fi profile. This can be the authentication mode, single sign-on options, and Federal Information Processing Standards (FIPS) compliance. If you require any proxy settings, you can also configure these on the next page as well as providing information on which platforms should process this profile. When the profile has been created, you can then right-click on the profile to deploy it to a collection. Managing certificates with Configuration Manager Deploying a certificate profile in Configuration Manager is actually a little quicker than creating a Wi-Fi profile. However, before you move on to deploying a certificate, you need some prerequisites in your environment. First, you need to deploy the Network Device Enrollment Service (NDES), which is part of the Certificate Services functionality in Windows Server. You can find guidance on deploying NDES in the Active Directory TechNet library at http://bit.ly/1kjpgxD. You must then install and configure at least one certificate registration point in the Configuration Manager hierarchy, and you can install this site system role in the central administration site or in a primary site. In the preceding screenshot, you can see the configuration screen in the wizard to deploy the certificate enrollment point in Configuration Manager. For the URL, enter the address in the https://<FQDN>/certsrv/mscep/mscep.dll format. For the root certificate, you should browse for the certificate file of your certificate authority. If you are using certificates in Configuration Manager, this will be the same certificate that you imported in the Client Communication tab in Site Settings. When this is configured on the server that runs the NDES, log on as a domain administrator and copy the files listed from the <ConfigMgrInstallationMedia>SMSSETUPPOLICYMODULEX64 folder on the Configuration Manager installation media to a folder on your server: PolicyModule.msi PolicyModuleSetup.exe On the Certificate Registration Point page, specify the URL of the certificate registration point and the virtual application name. The default virtual application name is CMCertificateRegistration. For example, if the site system server has an FQDN of scep1.contoso.com and you used the default virtual application name, specify https://scep1.contoso.com/CMCertificateRegistration. Creating certificate profiles Click on Certificate Profiles in the Assets and Compliance workspace under the Compliance Settings folder. On the General page, provide the name and description of the profile, and then provide information about the type of certificate that you want to deploy. Select the trusted CA certificate profile type if you want to deploy a trusted root certification authority (CA) or intermediate CA certificate, for example, you might want to deploy you own internal CA certificate to your own workgroup devices managed by Configuration Manager. Select the SCEP certificate profile type if you want to request a certificate for a user or device using the Simple Certificate Enrollment Protocol and the Network Device Enrollment Service role service. 
You will be provided with different settings depending on the option that you specify. If you select SCEP, then you will be asked about the number of retries and storage information about TPM. You can find specific information about each of the settings on the TechNet library at http://bit.ly/1n5CtZF. Configuring a trusted CA certificate is much simpler; provide the certificate settings and the destination store, as shown in the following screenshot: When you have finished configuring information on your certificate profile, select the supported platforms for the profile and continue through the wizard to create the profile. When it has been created, you can right-click on the profile to deploy it to a collection. Managing VPN profiles with Configuration Manager At a high level, the process to create VPN profiles is the same as creating Wi-Fi profiles; no prerequisites are required such as deploying certificates. Click on VPN Profiles in the Assets and Compliance workspace under the Compliance Settings folder. Create a new VPN profile, and on the initial screen, provide simple information about the profile. The following table provides an overview of which profiles are supported on which device: Connection type iOS Windows 8.1 Windows RT Windows RT 8.1 Windows Phone 8.1 Cisco AnyConnect Yes No No No No Juniper Pulse Yes Yes No Yes Yes F5 Edge Client Yes Yes No Yes Yes Dell SonicWALL Mobile Connect Yes Yes No Yes Yes Check Point Mobile VPN Yes Yes No Yes Yes Microsoft SSL (SSTP) No Yes Yes Yes No Microsoft Automatic No Yes Yes Yes No IKEv2 No Yes Yes Yes Yes PPTP Yes Yes Yes Yes No L2TP Yes Yes Yes Yes No Specific options will be required, depending on which technology you choose from the drop-down list. Ensure that the settings are specified, and move on to the profile information in the authentication method. If you require proxy settings with your VPN profile, then specify these settings on the Proxy Settings page of the wizard. See the following screenshot for an example of this screen: Continue through the wizard and select the supported profiles for the profile. When the profile is created, you can right-click on the profile and select Deploy. Managing Internet-based devices We have already looked at deploying certain company resources to those clients to whom we have very little connectivity on a regular basis. We can use Configuration Manager to manage these devices just like those domain-based clients over the Internet. This scenario works really well when the clients do not use VPN or DirectAccess, or maybe when we do not deploy a remote access solution for our remote users. This is where we can use Configuration Manager to manage clients using Internet-based client management (IBCM). How Internet-based client management works We have the ability to manage Internet-based clients in Configuration Manager by deploying certain site system roles in DMZ. By doing this, we make the management point, distribution point, and software update point Internet-facing and configure clients to connect to this while on the Internet. With these measures in place, we now have the ability to manage clients that are on the Internet, extending our management capabilities. Functionality in Internet-based client management In general, functionality will not be supported for Internet-based client management when we have to rely on network functionality that is not appropriate on a public network or relies some kind of communication with Active Directory. 
The following is not supported for Internet-based clients: Client push and software-update-based client deployment Automatic site assignment Network access protection Wake-On-LAN Operating system deployment Remote control Out-of-band management Software distribution for users is only supported when the Internet-based management point can authenticate the user in Active Directory using the Windows authentication. Requirements for Internet-based client management In terms of requirements, the list is fairly short but depending on your current setup, this might take a while to set up. The first seems fairly obvious, but any site system server or client must have Internet connectivity. This might mean some firewall configuration, depending on your configuration. A public key infrastructure (PKI) is also required. It must be able to deploy and manage certificates to clients that are on the Internet and site systems that are Internet-based. This does not mean deploying certificates over the public Internet. The following information can help you plan and deploy Internet-based client management in your environment: Planning for Internet-based client management (http://bit.ly/1p1qtsU) Planning for certificates (http://bit.ly/1kj9PFr) PKI certificate requirements (http://bit.ly/1hssMFM) Using Internet-based client management As the administrator, you have no additional concerns and requirements in terms of how you manage your clients when they are based on the Internet and are reporting to an Internet-facing management point. When you are administering clients that are Internet-based, you will see them report to the Internet-facing management point. This is the only thing you will see. You will see that the preceding features we listed are not working. The icon for the client in the list of devices does not change; this is one of the reasons the functionality is powerful, as it gives you many of the management capabilities you already perform on your premise devices. Lots of people will implement DirectAccess to get around the need to set up additional Configuration Manager Infrastructure and provisioning certificates. DirectAccess with the Manage Out functionality is a viable alternative. Summary In this article, we explored a number of ways in which you can manage the growing popularity of bring your own device and also look at how we can manage mobility in your user estate. We explored the deployment of profiles that contain settings for Wi-Fi profiles, VPN profiles for Windows, and other devices as well as deploying certificates via Configuration Manager. Resources for Article: Further resources on this subject: Wireless and Mobile Hacks [Article] Introduction to Mobile Forensics [Article] Getting Started with Microsoft Dynamics CRM 2013 Marketing [Article]

Redis in Autosuggest

Packt
18 Sep 2014
8 min read
In this article by Arun Chinnachamy, the author of Redis Applied Design Patterns, we are going to see how to use Redis to build a basic autocomplete or autosuggest server. Also, we will see how to build a faceting engine using Redis. To build such a system, we will use sorted sets and operations involving ranges and intersections. To summarize, we will focus on the following topics in this article: (For more resources related to this topic, see here.) Autocompletion for words Multiword autosuggestion using a sorted set Faceted search using sets and operations such as union and intersection Autosuggest systems These days autosuggest is seen in virtually all e-commerce stores in addition to a host of others. Almost all websites are utilizing this functionality in one way or another from a basic website search to programming IDEs. The ease of use afforded by autosuggest has led every major website from Google and Amazon to Wikipedia to use this feature to make it easier for users to navigate to where they want to go. The primary metric for any autosuggest system is how fast we can respond with suggestions to a user's query. Usability research studies have found that the response time should be under a second to ensure that a user's attention and flow of thought are preserved. Redis is ideally suited for this task as it is one of the fastest data stores in the market right now. Let's see how to design such a structure and use Redis to build an autosuggest engine. We can tweak Redis to suit individual use case scenarios, ranging from the simple to the complex. For instance, if we want only to autocomplete a word, we can enable this functionality by using a sorted set. Let's see how to perform single word completion and then we will move on to more complex scenarios, such as phrase completion. Word completion in Redis In this section, we want to provide a simple word completion feature through Redis. We will use a sorted set for this exercise. The reason behind using a sorted set is that it always guarantees O(log(N)) operations. While it is commonly known that in a sorted set, elements are arranged based on the score, what is not widely acknowledged is that elements with the same scores are arranged lexicographically. This is going to form the basis for our word completion feature. Let's look at a scenario in which we have the words to autocomplete: jack, smith, scott, jacob, and jackeline. In order to complete a word, we need to use n-gram. Every word needs to be written as a contiguous sequence. n-gram is a contiguous sequence of n items from a given sequence of text or speech. To find out more, check http://en.wikipedia.org/wiki/N-gram. For example, n-gram of jack is as follows: j ja jac jack$ In order to signify the completed word, we can use a delimiter such as * or $. To add the word into a sorted set, we will be using ZADD in the following way: > zadd autocomplete 0 j > zadd autocomplete 0 ja > zadd autocomplete 0 jac > zadd autocomplete 0 jack$ Likewise, we need to add all the words we want to index for autocompletion. Once we are done, our sorted set will look as follows: > zrange autocomplete 0 -1 1) "j" 2) "ja" 3) "jac" 4) "jack$" 5) "jacke" 6) "jackel" 7) "jackeli" 8) "jackelin" 9) "jackeline$" 10) "jaco" 11) "jacob$" 12) "s" 13) "sc" 14) "sco" 15) "scot" 16) "scott$" 17) "sm" 18) "smi" 19) "smit" 20) "smith$" Now, we will use ZRANK and ZRANGE operations over the sorted set to achieve our desired functionality. 
To autocomplete for ja, we have to execute the following commands: > zrank autocomplete jac 2 zrange autocomplete 3 50 1) "jack$" 2) "jacke" 3) "jackel" 4) "jackeli" 5) "jackelin" 6) "jackeline$" 7) "jaco" 8) "jacob$" 9) "s" 10) "sc" 11) "sco" 12) "scot" 13) "scott$" 14) "sm" 15) "smi" 16) "smit" 17) "smith$" Another example on completing smi is as follows: zrank autocomplete smi 17 zrange autocomplete 18 50 1) "smit" 2) "smith$" Now, in our program, we have to do the following tasks: Iterate through the results set. Check if the word starts with the query and only use the words with $ as the last character. Though it looks like a lot of operations are performed, both ZRANGE and ZRANK are O(log(N)) operations. Therefore, there should be virtually no problem in handling a huge list of words. When it comes to memory usage, we will have n+1 elements for every word, where n is the number of characters in the word. For M words, we will have M(avg(n) + 1) records where avg(n) is the average characters in a word. The more the collision of characters in our universe, the less the memory usage. In order to conserve memory, we can use the EXPIRE command to expire unused long tail autocomplete terms. Multiword phrase completion In the previous section, we have seen how to use the autocomplete for a single word. However, in most real world scenarios, we will have to deal with multiword phrases. This is much more difficult to achieve as there are a few inherent challenges involved: Suggesting a phrase for all matching words. For instance, the same manufacturer has a lot of models available. We have to ensure that we list all models if a user decides to search for a manufacturer by name. Order the results based on overall popularity and relevance of the match instead of ordering lexicographically. The following screenshot shows the typical autosuggest box, which you find in popular e-commerce portals. This feature improves the user experience and also reduces the spell errors: For this case, we will use a sorted set along with hashes. We will use a sorted set to store the n-gram of the indexed data followed by getting the complete title from hashes. Instead of storing the n-grams into the same sorted set, we will store them in different sorted sets. Let's look at the following scenario in which we have model names of mobile phones along with their popularity: For this set, we will create multiple sorted sets. Let's take Apple iPhone 5S: ZADD a 9 apple_iphone_5s ZADD ap 9 apple_iphone_5s ZADD app 9 apple_iphone_5s ZADD apple 9 apple_iphone_5s ZADD i 9 apple_iphone_5s ZADD ip 9 apple_iphone_5s ZADD iph 9 apple_iphone_5s ZADD ipho 9 apple_iphone_5s ZADD iphon 9 apple_iphone_5s ZADD iphone 9 apple_iphone_5s ZADD 5 9 apple_iphone_5s ZADD 5s 9 apple_iphone_5s HSET titles apple_iphone_5s "Apple iPhone 5S" In the preceding scenario, we have added every n-gram value as a sorted set and created a hash that holds the original title. Likewise, we have to add all the titles into our index. Searching in the index Now that we have indexed the titles, we are ready to perform a search. Consider a situation where a user is querying with the term apple. We want to show the user the five best suggestions based on the popularity of the product. Here's how we can achieve this: > zrevrange apple 0 4 withscores 1) "apple_iphone_5s" 2) 9.0 3) "apple_iphone_5c" 4) 6.0 As the elements inside the sorted set are ordered by the element score, we get the matches ordered by the popularity which we inserted. 
To get the original title, type the following command:

> hmget titles apple_iphone_5s
1) "Apple iPhone 5S"

In the preceding scenario, the query was a single word. Now, imagine that the user types multiple words, such as Samsung nex, and we have to suggest the autocomplete as Samsung Galaxy Nexus. To achieve this, we will use ZINTERSTORE as follows:

> zinterstore samsung_nex 2 samsung nex aggregate max

ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE SUM|MIN|MAX]

This computes the intersection of the sorted sets given by the specified keys and stores the result in a destination. It is mandatory to provide the number of input keys before passing the input keys and other (optional) arguments. For more information about ZINTERSTORE, visit http://redis.io/commands/ZINTERSTORE. The previous command, zinterstore samsung_nex 2 samsung nex aggregate max, will compute the intersection of the two sorted sets, samsung and nex, and store it in another sorted set, samsung_nex. To see the result, type the following commands:

> zrevrange samsung_nex 0 4 withscores
1) samsung_galaxy_nexus
2) 7

> hmget titles samsung_galaxy_nexus
1) Samsung Galaxy Nexus

If you want to cache the result for multiword queries and remove it automatically, use the EXPIRE command to set an expiry on the temporary keys.

Summary In this article, we have seen how to perform autosuggest and faceted searches using Redis. We have also understood how sorted sets and sets work. We have also seen how Redis can be used as a backend for a simple faceting and autosuggest system and make that system ultrafast.

Further resources on this subject: Using Redis in a hostile environment (Advanced) [Article] Building Applications with Spring Data Redis [Article] Implementing persistence in Redis (Intermediate) [Article]
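
To show how the single-word lookup described earlier might be wired into application code, here is a small, hedged sketch using the redis-py client. The autocomplete key name and the '$' terminator match the examples above, but the function itself, the 50-entry scan window, and the five-suggestion limit are illustrative assumptions:

import redis

r = redis.StrictRedis(host='localhost', port=6379)

def autocomplete_word(prefix, limit=5):
    """Return up to `limit` completed words for `prefix` from the
    single-word index stored in the 'autocomplete' sorted set."""
    start = r.zrank('autocomplete', prefix)
    if start is None:
        return []          # prefix is not an indexed n-gram
    results = []
    for entry in r.zrange('autocomplete', start, start + 50):
        entry = entry.decode('utf-8')
        if not entry.startswith(prefix):
            break                       # past the range of matching n-grams
        if entry.endswith('$'):         # '$' marks a completed word
            results.append(entry.rstrip('$'))
        if len(results) >= limit:
            break
    return results

# Example: autocomplete_word('ja') -> ['jack', 'jackeline', 'jacob']

The same pattern applies to the multiword index: run ZINTERSTORE into a temporary key, call zrevrange on it, and look the winning members up in the titles hash.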

Waiting for AJAX, as always…

Packt
18 Sep 2014
6 min read
In this article by Dima Kovalenko, author of the book Selenium Design Patterns and Best Practices, we will learn how test automation has progressed over time. Test automation was simpler in the good old days, before asynchronous page loading became mainstream. Previously, a test would click on a button, causing the whole page to reload; after the new page loaded, we could check whether any errors were displayed. The act of waiting for the page to load guaranteed that all of the items on the page were already there, and if an expected element was missing, our test could fail with confidence. Now, an element might be missing for several seconds and magically show up after an unspecified delay. The only thing for a test to do is become smarter! (For more resources related to this topic, see here.)

Filling out credit card information is a common test for any online store. Let's take a look at a typical credit card form: our form has some default values for the user to fill out, and a quick JavaScript check that the required information was entered, adding Done next to each filled-out input field. Once all of the fields have been filled out and appear correct, JavaScript makes the Purchase button clickable. Clicking on the button triggers an AJAX request for the purchase, followed by a successful purchase message. Very simple and straightforward; anyone who has made an online purchase has seen some variation of this form. Writing a quick test to fill out the form and make sure the purchase is complete should be a breeze!

Testing AJAX with the sleep method Let's take a look at a simple test written to test this form. Our tests are written in Ruby for this demonstration for ease of readability. However, this technique will work in Java or any other programming language you may choose to use. To follow along with this article, please make sure you have Ruby and the selenium-webdriver gem installed. Installers for both can be found at https://www.ruby-lang.org/en/installation/ and http://rubygems.org/gems/selenium-webdriver. Our test file starts with the dependencies and test fixtures: if this code looks like a foreign language to you, don't worry; we will walk through it until it all makes sense. The first three lines of the test file specify all of the dependencies, such as the selenium-webdriver gem. On line five, we declare our test class as TestAjax, which inherits its behavior from the Test::Unit framework we required on line two. The setup and teardown methods take care of the Selenium instance for us. In the setup, we create a new instance of the Firefox browser and navigate to a page that contains the mentioned form; the teardown method closes the browser after the test is complete.

Now let's look at the test itself: lines 17 to 21 fill out the purchase form with some test data, followed by an assertion that the Purchase complete! text appears in the DIV with the ID of success. Let's run this test to see if it passes. The output of our test run shows a failure: our test fails because it was expecting to see Purchase complete!, but no text was found, because the AJAX request took a much longer time than expected.
While the request runs, the form displays an AJAX request in progress indicator. Since this AJAX request can take anywhere from 15 to 30 seconds to complete, the most logical next step is to add a pause between the click on the Purchase button and the test assertion. However, this obvious solution is really bad for two reasons. First, if the majority of AJAX requests take 15 seconds to run, then our test wastes another 15 seconds waiting instead of continuing. Second, if our test environment is under heavy load, the AJAX request can take as long as 45 seconds to complete, so our test will fail. The better choice is to make our tests smart enough to wait for the AJAX request to complete, instead of using a sleep method.

Using smart AJAX waits To solve the shortcomings of the sleep method, we will create a new method called wait_for_ajax. In this method, we use the Wait class built into WebDriver. The until method in the Wait class allows us to pause the test execution for an arbitrary reason; in this case, to sleep for 1 second, on line 29, and to execute a JavaScript command in the browser with the help of the execute_script method. This method allows us to run a JavaScript snippet in the current browser window on the current page, which gives us access to all of the variables and methods that JavaScript has. The snippet of JavaScript that we are sending to the browser is a query against the jQuery framework. The active counter in jQuery holds the number of currently active AJAX requests; zero means that the page is fully loaded and there are no background HTTP requests happening. On line 30, we ask execute_script to return the current count of active AJAX requests happening on the page, and if the returned value equals 0, we break out of the Wait loop. Once the loop is broken, our tests can continue on their way. Note that the upper limit of the wait_for_ajax method is set to 60 seconds on line 28. This value can be increased or decreased, depending on how slow the test environment is. Let's replace the sleep method call with our newly created method and run our tests one more time to see a passing result.

Now that we have stabilized our test against slow and unpredictable AJAX requests, we need to add a method that will wait for JavaScript animations to finish. These animations can break our tests just as much as the AJAX requests. Also, our tests are incredibly vulnerable to third-party slowness, such as when the Facebook Like button takes a long time to load.

Summary This article introduced a simple method that intelligently waits for all of the AJAX requests to complete, increasing the overall stability of our test and test suite. Furthermore, we have removed a wasteful pause that added unnecessary delay to our test execution. In conclusion, we have improved the test stability while at the same time making our tests run faster!

Resources for Article: Further resources on this subject: Quick Start into Selenium Tests [article] Behavior-driven Development with Selenium WebDriver [article] Exploring Advanced Interactions of WebDriver [article]
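
The wait_for_ajax implementation itself only appears in the book's screenshots, which are not reproduced here. Based on the description above (the Wait class, an upper limit of 60 seconds, a 1-second sleep, and an execute_script call that checks jQuery's active request counter), a minimal sketch could look like the following; the @selenium variable name is an assumption carried over from the setup method:

# A hedged reconstruction of the helper described above, not the book's exact code.
def wait_for_ajax(timeout = 60)
  wait = Selenium::WebDriver::Wait.new(timeout: timeout)  # upper limit of 60 seconds
  wait.until do
    sleep 1   # pause briefly between checks
    # jQuery.active is 0 once no AJAX requests are in flight
    @selenium.execute_script('return jQuery.active') == 0
  end
end

With a helper like this in place, the test simply calls wait_for_ajax after clicking on the Purchase button instead of sleeping for a fixed number of seconds.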

Caches

Packt
18 Sep 2014
18 min read
In this article by Federico Razzoli, author of the book Mastering MariaDB, we will see how, in order to avoid accessing disks, MariaDB and its storage engines have several caches that a DBA should know about. (For more resources related to this topic, see here.)

InnoDB caches Since InnoDB is the recommended engine for most use cases, configuring it is very important. The InnoDB buffer pool is a cache that should speed up most read and write operations. Thus, every DBA should know how it works. The doublewrite buffer is an important mechanism that guarantees that a row is never half-written to a file. For write-heavy workloads, we may want to disable it to obtain more speed.

InnoDB pages Tables, data, and indexes are organized in pages, both in the caches and in the files. A page is a package of data that contains one or two rows and usually some empty space. The ratio between the used space and the total size of the pages is called the fill factor. By changing the page size, the fill factor inevitably changes. InnoDB tries to keep the pages 15/16 full. If a page's fill factor is lower than 1/2, InnoDB merges it with another page. If the rows are written sequentially, the fill factor should be about 15/16. If the rows are written randomly, the fill factor is between 1/2 and 15/16. A low fill factor represents wasted memory. With a very high fill factor, when pages are updated and their content grows, they often need to be reorganized, which negatively affects performance. Columns with a variable-length type (TEXT, BLOB, VARCHAR, or VARBIT) are written into separate data structures called overflow pages. Such columns are called off-page columns. They are better handled by the DYNAMIC row format, which can be used for most tables when backward compatibility is not a concern. A page never changes its size, and the size is the same for all pages. The page size, however, is configurable: it can be 4 KB, 8 KB, or 16 KB. The default size is 16 KB, which is appropriate for many workloads and optimizes full table scans. However, smaller sizes can improve the performance of some OLTP workloads involving many small insertions, because of lower memory allocation, or of storage devices with smaller blocks (old SSD devices). Another reason to change the page size is that it can greatly affect InnoDB compression. The page size can be changed by setting the innodb_page_size variable in the configuration file and restarting the server.

The InnoDB buffer pool On servers that mainly use InnoDB tables (the most common case), the buffer pool is the most important cache to consider. Ideally, it should contain all the InnoDB data and indexes so that MariaDB can execute queries without accessing the disks. Changes to data are written into the buffer pool first. They are flushed to the disks later to reduce the number of I/O operations. Of course, if the data does not fit in the server's memory, only a subset of it can be in the buffer pool. In this case, that subset should be the so-called working set: the most frequently accessed data. The default size of the buffer pool is 128 MB and should always be changed. On production servers, this value is too low. On a developer's computer, there is usually no need to dedicate so much memory to InnoDB; the minimum size, 5 MB, is usually more than enough when developing a simple application.

Old and new pages We can think of the buffer pool as a list of data pages that are sorted with a variation of the classic Least Recently Used (LRU) algorithm.
The list is split into two sublists: the new list contains the most recently used pages, and the old list contains the less used pages. The first page in each sublist is called the head, and the head of the old list is called the midpoint. When a page that is not in the buffer pool is accessed, it is inserted at the midpoint; the other pages in the old list shift by one position, and the last one is evicted. When a page from the old list is accessed, it is moved to the head of the new list. When a page in the new list is accessed, it goes to the head of that list.

The following variables affect this algorithm:

innodb_old_blocks_pct: This variable defines the percentage of the buffer pool reserved for the old list. The allowed range is 5 to 95, and the default is 37 (3/8).

innodb_old_blocks_time: If this value is not 0, it represents the minimum age (in milliseconds) that old pages must reach before they can be moved into the new list. If an old page that has not reached this age is accessed, it goes to the head of the old list instead.

innodb_max_dirty_pages_pct: This variable defines the maximum percentage of dirty pages, that is, pages that have been modified in memory but not yet flushed. This mechanism is discussed in the Dirty pages section later in this article. The value is not a hard limit, but InnoDB tries not to exceed it. The allowed range is 0 to 100, and the default is 75. Increasing this value can reduce the rate of writes, but the shutdown will take longer (because dirty pages need to be written to disk before the server can be stopped cleanly).

innodb_flush_neighbors: If set to 1, when a dirty page is flushed from memory to disk, the contiguous dirty pages are flushed as well. If set to 2, all dirty pages from the same extent (a 1 MB portion of the tablespace) are flushed. If set to 0, no neighbors are flushed: dirty pages are written individually when their number exceeds innodb_max_dirty_pages_pct or when they are evicted from the buffer pool. The default is 1. This optimization is only useful for spinning disks. Write-intensive workloads may need an aggressive flushing strategy; however, if the same pages are written too often, performance degrades.

Buffer pool instances

On MariaDB versions older than 5.5, InnoDB creates only one buffer pool instance. Concurrent threads are blocked by a mutex, and this may become a bottleneck, particularly if the concurrency level is high and the buffer pool is very big. Splitting the buffer pool into multiple instances can solve the problem. Multiple instances are an advantage only if the buffer pool size is at least 2 GB; each instance should be about 1 GB in size. InnoDB will ignore the setting and maintain only one instance if the buffer pool size is less than 1 GB. Furthermore, this feature is more useful on 64-bit systems.

The following variables control the instances and their size (a sample configuration follows this list):

innodb_buffer_pool_size: This variable defines the total size of the buffer pool (not the size of a single instance). Note that the real size will be about 10 percent bigger than this value. A percentage of this memory is dedicated to the change buffer.

innodb_buffer_pool_instances: This variable defines the number of instances. If the value is -1, InnoDB will automatically decide the number of instances. The maximum value is 64. The default value is 8 on Unix and depends on the innodb_buffer_pool_size variable on Windows.
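As an illustration, a dedicated MariaDB server with plenty of RAM might carve up the buffer pool as follows in its configuration file. The figures are only an example chosen to show the relationship between the two variables, not a recommendation for any particular workload:

[mysqld]
# 10 GB buffer pool split into 10 instances of roughly 1 GB each
innodb_buffer_pool_size      = 10G
innodb_buffer_pool_instances = 10
# Keep the usual old/new split, and protect the new list from one-off scans
innodb_old_blocks_pct  = 37
innodb_old_blocks_time = 1000

A server restart is needed for the buffer pool size and the number of instances to take effect, because these variables cannot be changed at runtime in this MariaDB version.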
Dirty pages

When a user executes a statement that modifies data in the buffer pool, InnoDB initially modifies the data in memory only. The pages that have been modified only in the buffer pool are called dirty pages. Pages that have not been modified, or whose changes have already been written to disk, are called clean pages.

Note that changes to data are also written to the redo log. If a crash occurs before those changes are applied to the data files, InnoDB is usually able to recover the data, including the last modifications, by reading the redo log and the doublewrite buffer. The doublewrite buffer is discussed later, in the Explaining the doublewrite buffer section.

At some point, the data needs to be flushed to the InnoDB data files (the .ibd files). In MariaDB 10.0, this is done by a dedicated thread called the page cleaner. In older versions, it was done by the master thread, which executes several InnoDB maintenance operations. The flushing does not only concern the buffer pool, but also the InnoDB redo and undo logs.

The list of dirty pages is frequently updated as transactions write data at the physical level. It has its own mutex, which does not lock the whole buffer pool. The maximum percentage of dirty pages is determined by innodb_max_dirty_pages_pct. When this limit is reached, dirty pages are flushed. The innodb_flush_neighbor_pages value determines how InnoDB selects the pages to flush: if it is set to none, only the selected pages are written; if it is set to area, the neighboring dirty pages are written as well; if it is set to cont, all contiguous blocks of dirty pages are flushed.

On shutdown, a complete page flushing is only done if innodb_fast_shutdown is 0. Normally, this method should be preferred, because it leaves data in a consistent state. However, if many changes have been requested but not yet written to disk, this process can be very slow. It is possible to speed up the shutdown by specifying a higher value for innodb_fast_shutdown; in that case, a crash recovery will be performed on the next restart.

The read ahead optimization

The read ahead feature is designed to reduce the number of read operations from the disks. It tries to guess which data will be needed in the near future and reads it with one operation. Two algorithms are available to choose the pages to read in advance: the linear read ahead and the random read ahead.

The linear read ahead is used by default. It counts the pages in the buffer pool that are read sequentially; if their number is greater than or equal to innodb_read_ahead_threshold, InnoDB reads all the data from the same extent (a portion of data whose size is always 1 MB). The innodb_read_ahead_threshold value must be a number from 0 to 64. The value 0 disables the linear read ahead but does not enable the random read ahead. The default value is 56.

The random read ahead is only used if the innodb_random_read_ahead server variable is set to ON; by default, it is OFF. This algorithm checks whether at least 13 pages from the same extent are in the buffer pool, no matter whether they were read sequentially. If so, the full extent is read. The 13-page threshold is not configurable.

If innodb_read_ahead_threshold is set to 0 and innodb_random_read_ahead is set to OFF, the read ahead optimization is completely turned off.
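Both variables are dynamic, so the read ahead behavior can be adjusted at runtime while observing its effect on the counters described in the next section. A minimal sketch (the values are arbitrary examples, not recommendations):

SET GLOBAL innodb_read_ahead_threshold = 32;  -- trigger the linear read ahead earlier
SET GLOBAL innodb_random_read_ahead = ON;     -- also enable the random algorithm
-- To switch the optimization off completely:
-- SET GLOBAL innodb_read_ahead_threshold = 0;
-- SET GLOBAL innodb_random_read_ahead = OFF;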
Diagnosing the buffer pool performance

MariaDB provides some tools to monitor the activities of the buffer pool and the InnoDB main thread. By inspecting these activities, a DBA can tune the relevant server variables to improve performance. In this section, we will discuss the SHOW ENGINE INNODB STATUS SQL statement and the INNODB_BUFFER_POOL_STATS table in the information_schema database. While the latter provides more information about the buffer pool, the SHOW ENGINE INNODB STATUS output is easier to read.

The INNODB_BUFFER_POOL_STATS table contains the following columns:

POOL_ID: Each buffer pool instance has a different ID.
POOL_SIZE: Size (in pages) of the instance.
FREE_BUFFERS: Number of free pages.
DATABASE_PAGES: Total number of data pages.
OLD_DATABASE_PAGES: Pages in the old list.
MODIFIED_DATABASE_PAGES: Dirty pages.
PENDING_DECOMPRESS: Number of pages that need to be decompressed.
PENDING_READS: Pending read operations.
PENDING_FLUSH_LRU: Pages in the old or new lists that need to be flushed.
PENDING_FLUSH_LIST: Pages in the flush list that need to be flushed.
PAGES_MADE_YOUNG: Number of pages moved into the new list.
PAGES_NOT_MADE_YOUNG: Old pages that did not become young.
PAGES_MADE_YOUNG_RATE: Pages made young per second. This value is reset each time it is shown.
PAGES_MADE_NOT_YOUNG_RATE: Pages read but not made young (because they did not reach the minimum age) per second. This value is reset each time it is shown.
NUMBER_PAGES_READ: Number of pages read from disk.
NUMBER_PAGES_CREATED: Number of pages created in the buffer pool.
NUMBER_PAGES_WRITTEN: Number of pages written to disk.
PAGES_READ_RATE: Pages read from disk per second.
PAGES_CREATE_RATE: Pages created in the buffer pool per second.
PAGES_WRITTEN_RATE: Pages written to disk per second.
NUMBER_PAGES_GET: Number of logical read requests for pages.
HIT_RATE: Rate of page hits.
YOUNG_MAKE_PER_THOUSAND_GETS: Pages made young per thousand page gets.
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: Pages that remain in the old list per thousand page gets.
NUMBER_PAGES_READ_AHEAD: Number of pages read with a read ahead operation.
NUMBER_READ_AHEAD_EVICTED: Number of pages read with a read ahead operation that were never used and were then evicted.
READ_AHEAD_RATE: Similar to NUMBER_PAGES_READ_AHEAD, but as a per-second rate.
READ_AHEAD_EVICTED_RATE: Similar to NUMBER_READ_AHEAD_EVICTED, but as a per-second rate.
LRU_IO_TOTAL: Total number of pages read or written to disk.
LRU_IO_CURRENT: Pages read or written to disk within the last second.
UNCOMPRESS_TOTAL: Pages that have been uncompressed.
UNCOMPRESS_CURRENT: Pages that have been uncompressed within the last second.

The per-second values are reset after they are shown. The PAGES_MADE_YOUNG_RATE and PAGES_MADE_NOT_YOUNG_RATE values show us, respectively, how often old pages become new and how many old pages are never accessed within a reasonable amount of time. If the former value is too high, the old list is probably not big enough, and vice versa.

Comparing READ_AHEAD_RATE and READ_AHEAD_EVICTED_RATE is useful to tune the read ahead feature. The READ_AHEAD_EVICTED_RATE value should be low, because it indicates how many pages read by the read ahead operations were not useful. If that ratio is good but READ_AHEAD_RATE is low, the read ahead should probably be used more often: if the linear read ahead is in use, we can try to increase or decrease innodb_read_ahead_threshold, or switch between the linear and random algorithms.

The columns whose names end with _RATE best describe the current server activity. They should be examined several times a day, and over a whole week or month, perhaps with the help of one or more monitoring tools.
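For a quick spot check without an external tool, the table can also be queried directly. A minimal example that pulls the most interesting counters for every buffer pool instance:

SELECT POOL_ID, POOL_SIZE, FREE_BUFFERS, MODIFIED_DATABASE_PAGES,
       HIT_RATE, PAGES_MADE_YOUNG_RATE, PAGES_MADE_NOT_YOUNG_RATE,
       READ_AHEAD_RATE, READ_AHEAD_EVICTED_RATE
FROM information_schema.INNODB_BUFFER_POOL_STATS;

Remember that the _RATE columns are reset every time they are read, so two consecutive executions will not return the same values.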
Good, free software monitoring tools include Cacti and Nagios. The Percona Monitoring Tools package includes MariaDB (and MySQL) plugins that provide an interface to these tools.

Dumping and loading the buffer pool

In some cases, we may want to save the current contents of the buffer pool and reload them later. The most common case is when the server is stopped. Normally, on startup, the buffer pool is empty and InnoDB needs to fill it with useful data. This process is called warm-up, and until it is complete, InnoDB performance is lower than usual.

Two variables help avoid the warm-up phase: innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. If their value is ON, InnoDB automatically saves the buffer pool into a file at shutdown and restores it at startup. Their default value is OFF. Turning them ON can be very useful, but remember the caveats:

The startup and shutdown times might be longer. In some cases, we might prefer MariaDB to start more quickly, even if it is slower during warm-up.
We need enough disk space to store the buffer pool dump file.

The user may also want to dump the buffer pool at any moment and restore it later without restarting the server. This is advisable when the buffer pool contents are optimal and some statements are about to heavily change them. A common example is when a big InnoDB table is fully scanned, for instance during a logical backup: a full table scan fills the old list with data that is not accessed frequently. A good way to work around the problem is to dump the buffer pool before the table scan and reload it afterwards (a sketch of this procedure follows).

This operation is performed by setting two special variables: innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. Reading these variables always returns OFF. Setting the first one to ON forces InnoDB to immediately dump the buffer pool into a file; setting the second one to ON forces InnoDB to load the buffer pool from that file. In both cases, the progress of the dump or load operation is reported by the Innodb_buffer_pool_dump_status and Innodb_buffer_pool_load_status status variables. If loading the buffer pool takes too long, it can be stopped by setting innodb_buffer_pool_load_abort to ON.

The name and path of the dump file are specified in the innodb_buffer_pool_filename server variable. Of course, we should make sure that the chosen directory can contain the file, but the file is much smaller than the memory used by the buffer pool.
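As a sketch of the manual procedure described above, wrapped around a logical backup or another large scan (only the statement order matters here; the backup itself is whatever tool you normally use):

-- Before the big scan: snapshot the current buffer pool contents
SET GLOBAL innodb_buffer_pool_dump_now = ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';

-- ... run the logical backup / full table scan here ...

-- Afterwards: reload the snapshot to undo the pollution of the buffer pool
SET GLOBAL innodb_buffer_pool_load_now = ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';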
InnoDB change buffer

The change buffer is a cache that is part of the buffer pool. It contains dirty pages related to secondary indexes (not primary keys) that are not stored in the main part of the buffer pool. If the modified data is read later, it is merged into the buffer pool. In older versions, this buffer was called the insert buffer; it has been renamed because it can now handle deletions too. The change buffer speeds up the following write operations:

Insertions: When new rows are written.
Deletions: When existing rows are marked for deletion but not yet physically erased, for performance reasons.
Purges: The physical elimination of previously marked rows and obsolete index values. This is periodically done by a dedicated thread.

In some cases, we may want to disable the change buffer. For example, we may have a working set that only fits in memory if the change buffer is discarded; even after disabling it, we will still have all the frequently accessed secondary indexes in the buffer pool. Also, DML statements may be rare for our database, or we may have just a few secondary indexes: in these cases, the change buffer does not help.

The change buffer can be configured using the following variables:

innodb_change_buffer_max_size: The maximum size of the change buffer, expressed as a percentage of the buffer pool. The allowed range is 0 to 50, and the default value is 25.
innodb_change_buffering: This determines which types of operations are cached by the change buffer. The allowed values are none (to disable the buffer), all, inserts, deletes, purges, and changes (to cache inserts and deletes, but not purges). The default value is all.

Explaining the doublewrite buffer

When InnoDB writes a page to disk, at least two events can interrupt the operation after it has started: a hardware failure or an OS failure. In the case of an OS failure, a half-written page should not be possible if the pages are not bigger than the blocks written by the system. The InnoDB redo and undo logs are not sufficient to recover a half-written page, because they only contain page IDs, not the page data (storing only the IDs keeps the logs small and fast).

To avoid half-written pages, InnoDB uses the doublewrite buffer. This mechanism involves writing every page twice; a page is only considered valid after the second write is complete. When the server restarts, if a recovery occurs, half-written pages are discarded. The doublewrite buffer has a small impact on performance, because the writes are sequential and are flushed to disk together.

It is still possible to disable the doublewrite buffer by setting the innodb_doublewrite variable to OFF in the configuration file or by starting the server with the --skip-innodb-doublewrite parameter. This can be done if data correctness is not important. If performance is very important and we use a fast storage device, we may notice the overhead caused by the additional disk writes; but if data correctness is important to us, we do not want to simply disable the mechanism. MariaDB provides an alternative called atomic writes. These writes behave like a transaction: they completely succeed or completely fail, so half-written data is not possible. However, MariaDB does not implement this mechanism directly, so it can only be used on FusionIO storage devices using the DirectFS filesystem. FusionIO flash memories are very fast flash memories that can be used as block storage or DRAM memory. To enable this alternative mechanism, we can set innodb_use_atomic_writes to ON; this automatically disables the doublewrite buffer.

Summary

In this article, we discussed the main MariaDB buffers. The most important ones are the caches used by the storage engines. We dedicated much space to the InnoDB buffer pool, because it is more complex and, usually, InnoDB is the most used storage engine.

Resources for Article:

Further resources on this subject:
Building a Web Application with PHP and MariaDB – Introduction to caching [article]
Installing MariaDB on Windows and Mac OS X [article]
Using SHOW EXPLAIN with running queries [article]
Index, Item Sharding, and Projection in DynamoDB

Packt
17 Sep 2014
13 min read
Understanding the secondary index and projections should go hand in hand, because a secondary index cannot be used efficiently without specifying a projection. In this article by Uchit Vyas and Prabhakaran Kuppusamy, authors of DynamoDB Applied Design Patterns, we will take a look at local and global secondary indexes, and at projection and its usage with indexes. (For more resources related to this topic, see here.)

The use of projection in DynamoDB is pretty much similar to that of traditional databases. However, here are a few things to watch out for:

Whenever a DynamoDB table is created, it is mandatory to create a primary key, which can be of a simple type (hash type) or of a complex type (hash and range key).
For the specified primary key, an index will be created (we call this index the primary index).
Along with this primary index, the user is allowed to create up to five secondary indexes per table.
There are two kinds of secondary index: the local secondary index (in which the hash key of the index must be the same as that of the table) and the global secondary index (in which the hash key can be any field).
In both of these secondary index types, the range key can be any field that the user needs to create an index for.

Secondary indexes

A quick question: why does a query that includes the primary key field (especially in the where condition) return results much faster than one that does not? This is because, in most databases, an index is created automatically for the primary key field. This is the case with DynamoDB too, and this index is called the primary index of the table. No customization is possible using the primary index, so it is seldom discussed.

In order to make retrieval faster, the frequently retrieved attributes need to be part of an index. However, a DynamoDB table can have only one primary index, and that index can have a maximum of two attributes (hash and range key). So, for faster retrieval, the user should be given the privilege to create user-defined indexes. An index created by the user is called a secondary index.

Similar to the table key schema, the secondary index also has a key schema. Based on the key schema attributes, the secondary index can be either a local or a global secondary index. Whenever a secondary index exists, the items in the index are rearranged on every item insertion into the table, provided the item contains both the index's hash and range key attributes.

Projection

Once we have an understanding of the secondary index, we are all set to learn about projection. While creating the secondary index, it is mandatory to specify the hash and range attributes based on which the index is created. Apart from these two attributes, if the query needs one or more other attributes (assuming that none of these attributes is projected into the index), then DynamoDB will scan the entire table. This consumes a lot of throughput capacity and has comparatively higher latency.
The following is the table (with some data) that is used to store book information. Here are a few more details about the table:

The BookTitle attribute is the hash key of the table and of the local secondary index
The Edition attribute is the range key of the table
The PubDate attribute is the range key of the index (let's call this index Idx_PubDate)

Local secondary index

While creating the secondary index, the hash and range keys of the table and index will be inserted into the index; optionally, the user can specify what other attributes need to be added. There are three kinds of projection possible in DynamoDB:

KEYS_ONLY: Using this, the index consists of the hash and range key values of the table and index
INCLUDE: Using this, the index consists of the attributes in KEYS_ONLY plus other non-key attributes that we specify
ALL: Using this, the index consists of all of the attributes from the source table

The following code shows the creation of a local secondary index named Idx_PubDate with BookTitle as the hash key (which is a must in the case of a local secondary index), PubDate as the range key, and the KEYS_ONLY projection:

private static LocalSecondaryIndex getLocalSecondaryIndex() {
  ArrayList<KeySchemaElement> indexKeySchema = new ArrayList<KeySchemaElement>();
  indexKeySchema.add(new KeySchemaElement()
    .withAttributeName("BookTitle")
    .withKeyType(KeyType.HASH));
  indexKeySchema.add(new KeySchemaElement()
    .withAttributeName("PubDate")
    .withKeyType(KeyType.RANGE));
  LocalSecondaryIndex lsi = new LocalSecondaryIndex()
    .withIndexName("Idx_PubDate")
    .withKeySchema(indexKeySchema)
    .withProjection(new Projection()
      .withProjectionType("KEYS_ONLY"));
  return lsi;
}

The KEYS_ONLY projection type creates the smallest possible index, and ALL creates the biggest possible one. We will discuss the trade-offs between these index types a little later. Going back to our example, let us assume that we are using the KEYS_ONLY projection, so none of the attributes (other than the previous three key attributes) is projected into the index. The index will then look as follows.

You may notice that the row order of the index is almost the same as the table order (except for the second and third rows). Here, you can observe one point: the table records are grouped primarily on the hash key, and the records that share the same hash key are then ordered on the range key of the index. In the index, even though the table's range key is one of the index attributes, it plays no role in the ordering (only the index's hash and range keys take part in the ordering).

There is a downside to this approach. If the user writes a query using this index to fetch BookTitle and Publisher with PubDate as 28-Dec-2008, then what happens? Will DynamoDB complain that the Publisher attribute is not projected into the index? The answer is no. Even though Publisher is not projected into the index, we can still retrieve it using the secondary index. However, retrieving a nonprojected attribute will scan the entire table. So, if we are sure that certain attributes need to be fetched frequently, we must project them into the index; otherwise, the query will consume a large number of capacity units and retrieval will be much slower as well.
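As an illustration of such a query, here is a hedged sketch using the AWS SDK for Java low-level client. The table name, the client setup, and the sample BookTitle value are placeholders invented for this example; only the index name and the attribute names come from the table described above. Because Publisher is not projected into Idx_PubDate, DynamoDB has to fetch it from the base table:

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.*;

public class QueryLsiExample {
  public static void main(String[] args) {
    // Uses the default credential provider chain
    AmazonDynamoDBClient client = new AmazonDynamoDBClient();

    Map<String, Condition> keyConditions = new HashMap<String, Condition>();
    // The LSI shares the table's hash key, so BookTitle must be specified
    keyConditions.put("BookTitle", new Condition()
      .withComparisonOperator(ComparisonOperator.EQ)
      .withAttributeValueList(new AttributeValue().withS("SCJP"))); // placeholder title
    keyConditions.put("PubDate", new Condition()
      .withComparisonOperator(ComparisonOperator.EQ)
      .withAttributeValueList(new AttributeValue().withS("28-Dec-2008")));

    QueryRequest request = new QueryRequest()
      .withTableName("Book")            // placeholder table name
      .withIndexName("Idx_PubDate")
      .withKeyConditions(keyConditions)
      // Publisher is not projected, so DynamoDB reads it from the table
      .withAttributesToGet("BookTitle", "Publisher");

    QueryResult result = client.query(request);
    System.out.println(result.getItems());
  }
}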
One more question: if the user writes a query using the local secondary index to fetch BookTitle and Publisher with PubDate as 28-Dec-2008, then what happens? Will DynamoDB complain that the PubDate attribute is not part of the primary key and hence queries are not allowed on nonprimary key attributes? The answer is no. As a rule of thumb, we can write queries on the secondary index attributes: it is possible to include nonprimary key attributes as part of the query, but these attributes must at least be key attributes of the index.

The following code shows how to add non-key attributes to the secondary index's projection:

private static Projection getProjectionWithNonKeyAttr() {
  Projection projection = new Projection()
    .withProjectionType(ProjectionType.INCLUDE);
  ArrayList<String> nonKeyAttributes = new ArrayList<String>();
  nonKeyAttributes.add("Language");
  nonKeyAttributes.add("Author2");
  projection.setNonKeyAttributes(nonKeyAttributes);
  return projection;
}

There is a slight limitation with the local secondary index. If we write a query on an attribute that is a non-key for both the table and the index, then internally DynamoDB might need to scan the entire table, which is inefficient. For example, consider a situation in which we need to retrieve the number of editions of the books in each and every language. Since both of these attributes are non-key, even if we create a local secondary index with either of them (Edition and Language), the query will still result in a scan operation on the entire table.

Global secondary index

A question arises here: is there any way to create a secondary index in which both index keys are different from the table's primary keys? The answer is the global secondary index. The following code shows how to create the global secondary index for this scenario:

private static GlobalSecondaryIndex getGlobalSecondaryIndex() {
  GlobalSecondaryIndex gsi = new GlobalSecondaryIndex()
    .withIndexName("Idx_Pub_Edtn")
    .withProvisionedThroughput(new ProvisionedThroughput()
      .withReadCapacityUnits((long) 1)
      .withWriteCapacityUnits((long) 1))
    .withProjection(new Projection().withProjectionType("KEYS_ONLY"));

  ArrayList<KeySchemaElement> indexKeySchema1 = new ArrayList<KeySchemaElement>();
  indexKeySchema1.add(new KeySchemaElement()
    .withAttributeName("Language")
    .withKeyType(KeyType.HASH));
  indexKeySchema1.add(new KeySchemaElement()
    .withAttributeName("Edition")
    .withKeyType(KeyType.RANGE));

  gsi.setKeySchema(indexKeySchema1);
  return gsi;
}

While deciding which attributes to project into a global secondary index, there are trade-offs to consider between provisioned throughput and storage costs. A few of these are listed as follows (a sketch showing how the two indexes defined above are attached at table-creation time follows this list):

If our application doesn't need to query the table very often but performs frequent writes or updates against the data in the table, we should consider projecting KEYS_ONLY. The global secondary index will be of minimum size, but it will still be available when required for query activity. The smaller the index, the cheaper it is to store, and our write costs will be cheaper too.
If we need to access only a few attributes with the lowest possible latency, we should project only those (fewer) attributes into the global secondary index.
If we need to access almost all of the non-key attributes of the DynamoDB table on a frequent basis, we can project these attributes (even the entire table) into the global secondary index. This gives us maximum flexibility, with the trade-off that our storage cost increases, or even doubles if we project the entire table's attributes into the index. The additional storage cost of the global secondary index might equal the cost of performing frequent table scans.
If our application will frequently retrieve some non-key attributes, we should consider projecting those non-key attributes into the global secondary index.
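To show how the two helper methods above would actually be used, here is a hedged sketch of the table-creation call with the AWS SDK for Java. The table name, the attribute types, and the provisioned throughput values are assumptions made for this example; note that every attribute used in a key schema (of the table or of an index) must also appear in the attribute definitions:

CreateTableRequest createTable = new CreateTableRequest()
  .withTableName("Book")                                    // placeholder table name
  .withAttributeDefinitions(
    new AttributeDefinition("BookTitle", ScalarAttributeType.S),
    new AttributeDefinition("Edition", ScalarAttributeType.N),
    new AttributeDefinition("PubDate", ScalarAttributeType.S),
    new AttributeDefinition("Language", ScalarAttributeType.S))
  .withKeySchema(
    new KeySchemaElement("BookTitle", KeyType.HASH),
    new KeySchemaElement("Edition", KeyType.RANGE))
  .withProvisionedThroughput(new ProvisionedThroughput(1L, 1L))
  .withLocalSecondaryIndexes(getLocalSecondaryIndex())       // Idx_PubDate
  .withGlobalSecondaryIndexes(getGlobalSecondaryIndex());    // Idx_Pub_Edtn

client.createTable(createTable);

Here, client is the same AmazonDynamoDBClient instance used in the earlier query sketch. Keep in mind that local secondary indexes can only be defined at table-creation time.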
Item sharding

Sharding, also called horizontal partitioning, is a technique in which rows are distributed among the database servers so that queries execute faster. With sharding, a hash operation is performed on the table rows (mostly on one of the columns) and, based on the output of the hash, the rows are grouped and sent to the appropriate database server. Take a look at the following diagram.

As shown in the previous diagram, if all the table data (only four rows and one column are shown for illustration purposes) is stored on a single database server, read and write operations become slower, and the server holding the frequently accessed table data works harder than the server storing table data that is not accessed frequently. The following diagram shows the advantage of sharding over a multitable, multiserver database environment.

In the previous diagram, two tables (Tbl_Places and Tbl_Sports) are shown on the left-hand side with four sample rows of data (Austria.. means that only the first column of the first item is illustrated, and all other fields are represented by ..). We are going to perform a hash operation on the first column only. In DynamoDB, this hashing is performed automatically. Once the hashing is done, rows with similar hashes are saved automatically on different servers (if necessary) to satisfy the specified provisioned throughput capacity.

Have you ever wondered about the importance of the hash type key while creating a table (which is mandatory)? Of course, we all know the importance of the range key and what it does: it simply sorts items based on the range key value. So far, we might have been thinking that the range key is more important than the hash key. If you think that way, you may be correct, provided we neither need our table to be provisioned faster nor need to create any partitions for our table. As long as the table data is small, the importance of the hash key is realized only while writing a query. However, once the table grows, in order to satisfy the same provisioned throughput, DynamoDB needs to partition the table data based on this hash key (as shown in the previous diagram). This partitioning of table items based on the hash key attribute is called sharding: the partitions are created by splitting items, not attributes. This is the reason why a query that includes the hash key (of the table or index) retrieves items much faster.

Since the number of partitions is managed automatically by DynamoDB, we cannot just hope for things to work fine; we also need to keep certain things in mind. For example, the hash key attribute should have many distinct values. To simplify, it is not advisable to put binary values (such as Yes or No, or Present, Past, or Future, and so on) into the hash key attribute, because this restricts the number of partitions. If our hash key attribute has either Yes or No values in all the items, then DynamoDB can create a maximum of only two partitions; therefore, the specified provisioned throughput cannot be achieved. Just consider that we have created a table called Tbl_Sports with a provisioned throughput capacity of 10, and then we put 10 items into the table.
Assuming that only a single partition is created, we are able to retrieve 10 items per second. After some time, we put 10 more items into the table. DynamoDB will create another partition (by hashing on the hash key), thereby satisfying the provisioned throughput capacity. There is a formula taken from the AWS site:

Total provisioned throughput / number of partitions = throughput per partition

or, equivalently:

Number of partitions = total provisioned throughput / throughput per partition

In order to satisfy the throughput capacity, the other parameters are managed automatically by DynamoDB.

Summary

In this article, we saw what the local and global secondary indexes are. We also walked through projection and its usage with indexes.

Resources for Article:

Further resources on this subject:
Comparative Study of NoSQL Products [Article]
Ruby with MongoDB for Web Development [Article]
Amazon DynamoDB - Modelling relationships, Error handling [Article]

What is REST?

Packt
17 Sep 2014
12 min read
This article by Bhakti Mehta, the author of RESTful Java Patterns and Best Practices, starts with the basic concepts of REST, how to design RESTful services, and best practices around designing REST resources. It also covers the architectural aspects of REST. (For more resources related to this topic, see here.)

Where REST has come from

The confluence of social networking, cloud computing, and the era of mobile applications creates a generation of emerging technologies that allow different networked devices to communicate with each other over the Internet. In the past, there were traditional and proprietary approaches for building solutions encompassing different devices and components communicating with each other over an unreliable network or through the Internet. Some of these approaches, such as RPC, CORBA, and SOAP-based web services, which evolved as different implementations of Service Oriented Architecture (SOA), required tighter coupling between components along with greater complexity in integration.

As the technology landscape evolves, today's applications are built on the notion of producing and consuming APIs instead of using web frameworks that invoke services and produce web pages. This requirement enforces the need for easier exchange of information between distributed services, along with predictable, robust, well-defined interfaces. API-based architecture enables agile development, easier adoption and prevalence, scale, and integration with applications within and outside the enterprise.

HTTP 1.1 is defined in RFC 2616 and is ubiquitously used as the standard protocol for distributed, collaborative, hypermedia information systems. Representational State Transfer (REST) is inspired by HTTP and can be used wherever HTTP is used. The widespread adoption of REST and JSON opens up the possibility of applications incorporating and leveraging functionality from other applications as needed. The popularity of REST is mainly because it enables building lightweight, simple, cost-effective modular interfaces, which can be consumed by a variety of clients.

This article covers the following topics:

Introduction to REST
Safety and Idempotence
HTTP verbs and REST
Best practices when designing RESTful services
REST architectural components

Introduction to REST

REST is an architectural style that conforms to web standards such as using HTTP verbs and URIs. It is bound by the following principles:

All resources are identified by URIs.
All resources can have multiple representations.
All resources can be accessed/modified/created/deleted by standard HTTP methods.
There is no state on the server.

REST is extensible due to the use of URIs for identifying resources. For example, a URI to represent a collection of book resources could look like this:

http://foo.api.com/v1/library/books

A URI to represent a single book identified by its ISBN could be as follows:

http://foo.api.com/v1/library/books/isbn/12345678

A URI to represent a coffee order resource could be as follows:

http://bar.api.com/v1/coffees/orders/1234

A user in a system can be represented like this:

http://some.api.com/v1/user

A URI to represent all the book orders for a user could be:

http://bar.api.com/v1/user/5034/book/orders

All the preceding samples show a clear, readable pattern that can be interpreted by the client. All these resources could have multiple representations. The resource examples shown here can be represented by JSON or XML and can be manipulated by the HTTP methods GET, PUT, POST, and DELETE.
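For instance, a client that prefers JSON can ask for it explicitly with an Accept header. The request below uses one of the URIs above; the response body is only an illustration of what such a representation might look like, not output from a real service:

curl -H "Accept: application/json" http://foo.api.com/v1/library/books/isbn/12345678

{
  "isbn": "12345678",
  "title": "RESTful Java Patterns and Best Practices",
  "links": [
    { "rel": "self", "href": "/v1/library/books/isbn/12345678" }
  ]
}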
The following table summarizes the HTTP methods and the actions they take on a resource, using a simple example of a collection of books in a library:

GET /library/books: Gets the list of books
GET /library/books/isbn/12345678: Gets the book identified by ISBN "12345678"
POST /library/books: Creates a new book order
DELETE /library/books/isbn/12345678: Deletes the book identified by ISBN "12345678"
PUT /library/books/isbn/12345678: Updates the specific book identified by ISBN "12345678"
PATCH /library/books/isbn/12345678: Can be used to do a partial update of the book identified by ISBN "12345678"

REST and statelessness

REST is bound by the principle of statelessness. Each request from the client to the server must carry all the details needed to understand the request. This helps improve the visibility, reliability, and scalability of requests:

Visibility is improved, as the system monitoring the requests does not have to look beyond a single request to get its details.
Reliability is improved, as there is no checkpointing/resuming to be done in case of partial failures.
Scalability is improved, as the number of requests that can be processed increases because the server is not responsible for storing any state.

Roy Fielding's dissertation on the REST architectural style provides details on the statelessness of REST; see http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

With this initial introduction to the basics of REST, we shall cover the different maturity levels and where REST falls in them in the following section.

Richardson Maturity Model

The Richardson Maturity Model, developed by Leonard Richardson, talks about the basics of REST in terms of resources, verbs, and hypermedia controls. The starting point for the maturity model is the use of HTTP as the transport layer.

Level 0 – Remote Procedure Invocation

This level contains SOAP or XML-RPC sending data as POX (Plain Old XML). Only POST methods are used. This is the most primitive way of building SOA applications, with a single POST method and XML used to communicate between services.

Level 1 – REST resources

This level uses POST methods and, instead of using a function and passing arguments, uses REST URIs. So it still uses only one HTTP method. It is better than Level 0 in that it breaks a complex functionality into multiple resources while using one method.

Level 2 – more HTTP verbs

This level uses other HTTP verbs such as GET, HEAD, DELETE, and PUT along with POST. Level 2 is the real use case of REST, which advocates using different verbs based on the HTTP request method, while the system can have multiple resources.

Level 3 – HATEOAS

Hypermedia as the Engine of Application State (HATEOAS) is the most mature level of Richardson's model. The responses to client requests contain hypermedia controls, which help the client decide what action it can take next. Level 3 encourages easy discoverability and makes responses self-explanatory.

Safety and Idempotence

This section discusses in detail what safe and idempotent methods are.

Safe methods

Safe methods are methods that do not change state on the server. GET and HEAD are safe methods; for example, GET /v1/coffees/orders/1234 is a safe request, and safe methods can be cached. The PUT method is not safe, as it will create or modify a resource on the server. The POST method is not safe for the same reason. The DELETE method is not safe, as it deletes a resource on the server.
Idempotent methods

An idempotent method is a method that produces the same result no matter how many times it is called:

The GET method is idempotent, as multiple calls to the same GET resource will always return the same response.
The PUT method is idempotent, as calling it multiple times will update the same resource and not change the outcome.
POST is not idempotent; calling POST multiple times can produce different results and will create new resources.
DELETE is idempotent, because once the resource is deleted it is gone, and calling the method multiple times will not change the outcome.

HTTP verbs and REST

HTTP verbs inform the server what to do with the data sent as part of the URL.

GET

GET is the simplest HTTP verb, which enables access to a resource. Whenever the client clicks a URL in the browser, it sends a GET request to the address specified by the URL. GET is safe and idempotent. GET requests are cached, and query parameters can be used in GET requests. For example, a simple GET request is as follows:

curl http://api.foo.com/v1/user/12345

POST

POST is used to create a resource. POST requests are neither idempotent nor safe: multiple invocations of a POST request can create multiple resources. POST requests should invalidate a cache entry if one exists, and query parameters with POST requests are not encouraged. For example, a POST request to create a user can be:

curl -X POST -d '{"name":"John Doe","username":"jdoe","phone":"412-344-5644"}' http://api.foo.com/v1/user

PUT

PUT is used to update a resource. PUT is idempotent but not safe: multiple invocations of a PUT request should produce the same result by updating the resource. PUT requests should invalidate the cache entry if one exists. For example, a PUT request to update a user can be:

curl -X PUT -d '{"phone":"413-344-5644"}' http://api.foo.com/v1/user

DELETE

DELETE is used to delete a resource. DELETE is idempotent but not safe. DELETE is idempotent because, based on RFC 2616, "the side effects of N > 0 requests is the same as for a single request". This means that once the resource is deleted, calling DELETE multiple times will get the same response. For example, a request to delete a user is as follows:

curl -X DELETE http://foo.api.com/v1/user/1234

HEAD

HEAD is similar to a GET request. The difference is that only the HTTP headers are returned and no content. HEAD is idempotent and safe. For example, a HEAD request with curl is as follows:

curl -I http://foo.api.com/v1/user

It can be useful to send a HEAD request to see whether the resource has changed before trying to get a large representation with a GET request.

PUT vs POST

According to the RFC, the difference between PUT and POST is in the Request URI. The URI identified by POST defines the entity that will handle the POST request, while the URI in a PUT request includes the entity itself. So POST /v1/coffees/orders means to create a new resource and return an identifier describing it; in contrast, PUT /v1/coffees/orders/1234 means to update the resource identified by "1234" if it exists, or else create a new order and use the URI orders/1234 to identify it.

Best practices when designing resources

This section highlights some of the best practices when designing RESTful resources:

The API developer should use nouns to identify and navigate through resources, and HTTP verbs for the actions on them. For example, the URI /user/1234/books is better than /user/1234/getBook.
Use associations in the URIs to identify sub-resources. For example, to get the authors of book 5678 for user 1234, use the URI /user/1234/books/5678/authors.
For specific variations, use query parameters. For example, to get all the books with 10 reviews, use /user/1234/books?reviews_counts=10.
Allow partial responses as part of query parameters if possible. For example, to get only the name and age of a user, the client can specify a ?fields query parameter with the list of fields that the server should send in the response, using the URI /users/1234?fields=name,age.
Have a default output format for the response in case the client does not specify which format it is interested in. Most API developers choose to send JSON as the default response MIME type.
Use camelCase or underscores for attribute names.
Support a standard API for counts, for example users/1234/books/count, in the case of collections, so the client can get an idea of how many objects to expect in the response. This also helps the client with pagination queries.
Support a pretty printing option, for example users/1234?pretty_print. It is also good practice not to cache queries that carry the pretty print query parameter.
Avoid chattiness by being as verbose as possible in the response. If the server does not provide enough details in the response, the client needs to make more calls to get the additional details, which wastes network resources and counts against the client's rate limits.

REST architecture components

This section covers the various components that must be considered when building RESTful APIs. As seen in the preceding screenshot, REST services can be consumed from a variety of clients and applications running on different platforms and devices, such as mobile devices and web browsers. These requests are sent through a proxy server. The HTTP requests are sent to the resources, and based on the CRUD operation, the right HTTP method is selected. On the response side, there can be pagination to ensure the server sends a subset of the results, and the server can do asynchronous processing, thus improving responsiveness and scale. There can also be links in the response, which relates to HATEOAS.

Here is a summary of the various REST architectural components (a small example combining a few of them follows this list):

HTTP requests use the REST API with HTTP verbs for the uniform interface constraint.
Content negotiation allows selecting a representation for a response when multiple representations are available.
Logging helps provide traceability to analyze and debug issues.
Exception handling allows sending application-specific exceptions with HTTP codes.
Authentication and authorization with OAuth 2.0 give other applications access control to take actions without the user having to send their credentials.
Validation provides support for sending back detailed messages with error codes to the client, as well as validations of the inputs received in the request.
Rate limiting ensures the server is not burdened with too many requests from a single client.
Caching helps improve application responsiveness.
Asynchronous processing enables the server to send back responses to the client asynchronously.
Microservices comprise breaking up a monolithic service into fine-grained services.
HATEOAS improves usability, understandability, and navigability by returning a list of links in the response.
Pagination allows clients to specify the items in a dataset that they are interested in.
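As a small illustration of several of these components working together (content negotiation, partial responses, and pagination), consider a request like the following. The host, field names, and paging parameters are made up for this example and are not from any specific API:

curl -H "Accept: application/json" \
     "http://api.foo.com/v1/users/1234/books?fields=name,author&offset=0&limit=10"

The Accept header drives content negotiation, the fields query parameter asks the server for a partial response, and offset/limit let the client page through the collection.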
The REST architectural components shown in the image can be chained one after the other, as shown earlier. For example, there can be a filter chain consisting of filters for authentication, rate limiting, caching, and logging. This chain takes care of authenticating the user and checking whether the client's requests are within rate limits; a caching filter can then check whether the request can be served from the cache. This can be followed by a logging filter, which logs the details of the request. For more details, check RESTful Java Patterns and Best Practices.