Testing and Debugging Distributed Applications

Packt
01 Apr 2016
21 min read
In this article by Francesco Pierfederici, author of the book Distributed Computing with Python, we look at why, in the author's words, "distributed systems, both large and small, can be extremely challenging to test and debug, as they are spread over a network, run on computers that can be quite different from each other, and might even be physically located in different continents altogether". Moreover, the computers we use could have different user accounts, different disks with different software packages, different hardware resources, and very uneven performance. Some can even be in a different time zone. Developers of distributed systems need to consider all these pieces of information when trying to foresee failure conditions. Operators have to work around all of these challenges when debugging errors. (For more resources related to this topic, see here.)

The big picture

Testing and debugging monolithic applications is not simple, as every developer knows. However, there are a number of tools that make the task dramatically easier, including the pdb debugger, various profilers (notable mentions include cProfile and line_profiler), linters, static code analysis tools, and a host of test frameworks, a number of which have been included in the standard library of Python 3.3 and higher. The challenge with distributed applications is that most of the tools and packages we can use for single-process applications lose much of their power when dealing with multiple processes, especially when these processes run on different computers. Debugging and profiling distributed applications written in C, C++, and Fortran can be done with tools such as Intel VTune, Allinea MAP, and DDT. Unfortunately, Python developers are left with very few or no options for the time being. Writing small- or medium-sized distributed systems is not terribly hard, as we saw in the pages so far. The main difference between writing monolithic programs and distributed applications is the large number of interdependent components running on remote hardware. This is what makes monitoring and debugging distributed code harder and less convenient. Fortunately, we can still use all the familiar debuggers and code analysis tools on our Python distributed applications. Unfortunately, these tools only go so far, and we will often have to rely on old-fashioned logging and print statements to get the full picture of what went wrong.

Common problems – clocks and time

Time is a handy quantity to work with. For instance, using timestamps is very natural when we want to join different streams of data, sort database records, and in general, reconstruct the timeline for a series of events, which we often observe out of order. In addition, some tools (for example, GNU make) rely solely on file modification time and are easily confused by a clock skew between machines. For these reasons, clock synchronization among all the computers and systems we use is very important. If our computers are in different time zones, we might want to not only synchronize their clocks but also set them to Coordinated Universal Time (UTC) for simplicity. In cases where changing the clocks to UTC is not possible, good advice is to always process time in UTC within our code and convert it to local time only for display purposes. In general, clock synchronization in distributed systems is a fascinating and complex topic, and it is outside the scope of this article.
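To make the "process time in UTC, convert only for display" advice concrete, here is a minimal sketch using only the standard library (Python 3.3+); the print statements are purely illustrative and not from the original article:

from datetime import datetime, timezone

# Record and process timestamps as timezone-aware UTC values...
event_time = datetime.now(timezone.utc)
print('Stored (UTC):     ', event_time.isoformat())

# ...and convert to the local time zone only when displaying them to a user.
print('Displayed (local):', event_time.astimezone().isoformat())

Comparisons and arithmetic on these aware UTC values remain unambiguous regardless of where the code happens to run.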
Most readers, luckily, are likely to be well served by the Network Time Protocol (NTP), which is a perfectly fine synchronization solution for most situations. Most modern operating systems, including Windows, Mac OS X, and Linux, have great support for NTP. Another thing to consider when talking about time is the timing of periodic actions, such as polling loops or cronjobs. Many applications need to spawn processes or perform actions (for example, sending a confirmation e-mail or checking whether new data is available) at regular intervals. A common pattern is to set up timers (either in our code or via the tools provided by the OS) and have all these timers go off at some time, usually at a specific hour and at regular intervals after that. The risk of this approach is that we can overload the system the very moment all these processes wake up and start their work. A surprisingly common example would be starting a significant number of processes that all need to read some configuration or data file from a shared disk. In these cases, everything goes fine until the number of processes becomes so large that the shared disk cannot handle the data transfer, thus slowing our application to a crawl. A common solution is the staggering of these timers in order to spread them out over a longer time interval. In general, since we do not always control all code that we use, it is good practice to start our timers at some random number of minutes past the hour, just to be safe. Another example of this situation would be an image-processing service that periodically polls a set of directories looking for new data. When new images are found, they are copied to a staging area, renamed, scaled, and potentially converted to a common format before being archived for later use. If we're not careful, it would be easy to overload the system if many images were to be uploaded at once. A better approach would be to throttle our application (maybe using a queue-based architecture) so that it would only start an appropriately small number of image processors so as to not flood the system. Common problems – software environments Another common challenge is making sure that the software installed on all the various machines we are ever going to use is consistent and consistently upgraded. Unfortunately, it is frustratingly common to spend hours debugging a distributed application only to discover that for some unknown and seemingly impossible reason, some computers had an old version of the code and/or its dependencies. Sometimes, we might even find the code to have disappeared completely. The reasons for these discrepancies can be many: from a mount point that failed to a bug in our deployment procedures to a simple human mistake. A common approach, especially in the HPC world, is to always create a self-contained environment for our code before launching the application itself. Some projects go as far as preferring static linking of all dependencies to avoid having the runtime pick up the wrong version of a dynamic library. This approach works well if the application runtime is long compared to the time it takes to build its full environment, all of its software dependencies, and the application itself. It is not that practical otherwise. Python, fortunately, has the ability to create self-contained virtual environments. There are two related tools that we can use: pyvenv (available as part of the Python 3.5 standard library) and virtualenv (available in PyPI). 
Additionally, pip, the Python package management system, allows us to specify the exact version of each package we want to install in a requirements file. These tools, when used together, permit reasonable control over the execution environment. However, the devil, as it is often said, is in the details, and different computer nodes might use the exact same Python virtual environment but incompatible versions of some external library. In this respect, container technologies such as Docker (https://www.docker.com) and, in general, version-controlled virtual machines are promising ways out of the software runtime environment maelstrom in those environments where they can be used. In all other cases (HPC clusters come to mind), the best approach will probably be to not rely on the system software and to manage our own environments and the full software stack.

Common problems – permissions and environments

Different computers might run our code under different user accounts, and our application might expect to be able to read a file or write data into a specific directory, only to hit an unexpected permission error. Even in cases where the user accounts used by our code are all the same (down to the same user ID and group ID), their environments may differ from host to host. Therefore, an environment variable we assumed to be defined might not be or, even worse, might be set to an incompatible value. These problems are common when our code runs as a special, unprivileged user such as nobody. Defensive coding, especially when accessing the environment, and making sure to always fall back to sensible defaults when variables are undefined (that is, value = os.environ.get('SOME_VAR', fallback_value) instead of simply value = os.environ['SOME_VAR']) is often necessary. A common approach, when this is possible, is to only run our applications under a specific user account that we control and to specify the full set of environment variables our code needs in the deployment and application startup scripts (which will have to be version controlled as well). Some systems, however, not only execute jobs under extremely limited user accounts, but they also restrict code execution to temporary sandboxes. In many cases, access to the outside network is also blocked. In these situations, one might have no other choice but to set up the full environment locally and copy it to a shared disk partition. Other data can be served from custom-built servers running as ancillary jobs just for this purpose. In general, permission problems and user environment mismatches are very similar to problems with the software environment and should be tackled in concert. Often, developers find themselves wanting to isolate their code from the system as much as possible, creating a small but self-contained environment with all the code and all the environment variables they need.

Common problems – the availability of hardware resources

The hardware resources that our application needs might or might not be available at any given point in time. Moreover, even if some resources were to be available at some point in time, nothing guarantees that they will stay available for much longer. One problem we can face in this respect is network glitches, which are quite common in many environments (especially for mobile apps) and which, for most practical purposes, are indistinguishable from machine or application crashes.
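Returning to the defensive environment-variable access pattern described above, a minimal sketch might look like this (the variable names and defaults are illustrative, not from the original article):

import os

# Read configuration from the environment, falling back to safe defaults
# instead of raising KeyError when a variable is missing.
DATA_DIR = os.environ.get('APP_DATA_DIR', '/tmp/app-data')
LOG_LEVEL = os.environ.get('APP_LOG_LEVEL', 'INFO')

# For values that are truly required, fail early with a clear message
# rather than crashing later in some remote process.
try:
    API_TOKEN = os.environ['APP_API_TOKEN']
except KeyError:
    raise SystemExit('APP_API_TOKEN must be set in the environment')

Failing early with an explicit message is usually preferable to a bare KeyError surfacing much later in a remote process.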
Applications using a distributed computing framework or job scheduler can often rely on the framework itself to handle at least some common failure scenarios. Some job schedulers will even resubmit our jobs in case of errors or sudden machine unavailability. Complex applications, however, might need to implement their own strategies to deal with hardware failures. In some cases, the best strategy is to simply restart the application when the necessary resources are available again. Other times, restarting from scratch would be cost prohibitive. In these cases, a common approach is to implement application checkpointing. What this means is that the application both writes its state to disk periodically and is able to bootstrap itself from a previously saved state. In implementing a checkpointing strategy, you need to balance the convenience of being able to restart an application midway with the performance hit of writing state to disk. Another consideration is the increase in code complexity, especially when many processes or threads are involved in reading and writing state information. A good rule of thumb is that data or results that can be recreated easily and quickly do not warrant application checkpointing. If, on the other hand, some processing requires a significant amount of time and one cannot afford to waste it, then application checkpointing might be in order. For example, climate simulations can easily run for several weeks or months at a time. In these cases, it is important to checkpoint them every hour or so, as restarting from the beginning after a crash would be expensive. On the other hand, a process that takes an uploaded image and creates a thumbnail for, say, a web gallery runs quickly and is not normally worth checkpointing. To be safe, state should always be written and updated atomically (for example, by writing to a temporary file and replacing the original only after the write completes successfully). The last thing we want is to restart from a corrupted state! Very familiar to HPC users as well as to users of AWS spot instances is the situation where a fraction, or the entirety, of the processes of our application are evicted from the machines they are running on. When this happens, a warning is typically sent to our processes (usually, a SIGQUIT signal) and, after a few seconds, they are unceremoniously killed (via a SIGKILL signal). For AWS spot instances, the time of termination is available through a web service in the instance metadata. In either case, our applications are given some time to save their state and quit in an orderly fashion. Python has powerful facilities to catch and handle signals (refer to the signal module). For example, the following simple script shows how we can implement a bare-bones checkpointing strategy in our application:

#!/usr/bin/env python3.5
"""
Simple example showing how to catch signals in Python
"""
import json
import os
import signal
import sys

# Path to the file we use to store state. Note that we assume
# $HOME to be defined, which is far from being an obvious
# assumption!
STATE_FILE = os.path.join(os.environ['HOME'], '.checkpoint.json')


class Checkpointer:
    def __init__(self, state_path=STATE_FILE):
        """
        Read the state file, if present, and initialize from that.
        """
        self.state = {}
        self.state_path = state_path
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                self.state.update(json.load(f))
        return

    def save(self):
        # Note: a more robust version would write to a temporary file and
        # atomically replace the original, as discussed above.
        print('Saving state: {}'.format(self.state))
        with open(self.state_path, 'w') as f:
            json.dump(self.state, f)
        return

    def eviction_handler(self, signum, frame):
        """
        This is the function that gets called when a signal is trapped.
        """
        self.save()

        # Of course, using sys.exit is a bit brutal. We can do better.
        print('Quitting')
        sys.exit(0)
        return


if __name__ == '__main__':
    import time

    print('This is process {}'.format(os.getpid()))

    ckp = Checkpointer()
    print('Initial state: {}'.format(ckp.state))

    # Catch SIGQUIT
    signal.signal(signal.SIGQUIT, ckp.eviction_handler)

    # Get a value from the state.
    i = ckp.state.get('i', 0)
    try:
        while True:
            i += 1
            ckp.state['i'] = i
            print('Updated in-memory state: {}'.format(ckp.state))
            time.sleep(1)
    except KeyboardInterrupt:
        ckp.save()

If we run the preceding script in one terminal window and then, from another terminal window, send it a SIGQUIT signal (for example, via kill -s SIGQUIT <process id>), we see the checkpointing in action: the handler saves the in-memory state to the state file before the process quits. A common situation in distributed applications is that of being forced to run code in potentially heterogeneous environments: machines (real or virtual) of differing performance, with different hardware resources (for example, with or without GPUs), and potentially different software environments (as we mentioned already). Even in the presence of a job scheduler that helps us choose the right software and hardware environment, we should always log the full environment as well as the performance of each execution machine. In advanced architectures, these performance metrics can be used to improve the efficiency of job scheduling. PBS Pro, for instance, takes into consideration the historical performance figures of each job being submitted to decide where to execute it next. HTCondor continuously benchmarks each machine and makes those figures available for node selection and ranking. Perhaps the most frustrating cases are those where, either due to the network itself or due to servers being overloaded, network requests take so long that our code hits its internal timeouts. This might lead us to believe that the counterpart service is not available. These bugs, especially when transient, can be quite hard to debug.

Challenges – the development environment

Another common challenge in distributed systems is the setup of a representative development and testing environment, especially for individuals or small teams. Ideally, in fact, the development environment should be identical to the worst-case scenario deployment environment. It should allow developers to test common failure scenarios, such as a disk filling up, varying network latencies, intermittent network connections, hardware and software failures, and so on—all things that are bound to happen in real life, sooner or later. Large teams have the resources to set up development and test clusters, and they almost always have dedicated software quality teams stress testing our code.
Small teams, unfortunately, often find themselves forced to write code on their laptops and use a very simplified (and best-case scenario!) environment made up by two or three virtual machines running on the laptops themselves to emulate the real system. This pragmatic solution works and is definitely better than nothing. However, we should remember that virtual machines running on the same host exhibit unrealistically high-availability and low-network latencies. In addition, nobody will accidentally upgrade them without us knowing or image them with the wrong operating system. The environment is simply too controlled and stable to be realistic. A step closer to a realistic setup would be to create a small development cluster on, say, AWS using the same VM images, with the same software stack and user accounts that we are going to use in production. All things said, there is simply no replacement for the real thing. For cloud-based applications, it is worth our while to at least test our code on a smaller version of the deployment setup. For HPC applications, we should be using either a test cluster, a partition of the operational cluster, or a test queue for development and testing. Ideally, we would develop on an exact clone of the operational system. Cost consideration and ease of development will constantly push us to the multiple-VMs-on-a-laptop solution; it is simple, essentially free, and it works without an Internet connection, which is an important point. We should, however, keep in mind that distributed applications are not impossibly hard to write; they just have more failure modes than their monolithic counterparts do. Some of these failure modes (especially those related to data access patterns) typically require a careful choice of architecture. Correcting architectural choices dictated by false assumptions later on in the development stage can be costly. Convincing managers to give us the hardware resources that we need early on is usually difficult. In the end, this is a delicate balancing act. A useful strategy – logging everything Often times, logging is like taking backups or eating vegetables—we all know we should do it, but most of us forget. In distributed applications, we simply have no other choice—logging is essential. Not only that, logging everything is essential. With many different processes running on potentially ephemeral remote resources at difficult-to-predict times, the only way to understand what happens is to have logging information and have it readily available and in an easily searchable format/system. At the bare minimum, we should log process startup and exit time, exit code and exceptions (if any), all input arguments, all outputs, the full execution environment, the name and IP of the execution host, the current working directory, the user account as well as the full application configuration, and all software versions. The idea is that if something goes wrong, we should be able to use this information to log onto the same machine (if still available), go to the same directory, and reproduce exactly what our code was doing. Of course, being able to exactly reproduce the execution environment might simply not be possible (often times, because it requires administrator privileges). However, we should always aim to be able to recreate a good approximation of that environment. This is where job schedulers really shine; they allow us to choose a specific machine and specify the full job environment, which makes replicating failures easier. 
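To make the preceding advice concrete, here is a minimal sketch of recording this startup context with the standard logging module (the logger name and the exact set of fields are illustrative, not prescribed by the article):

import getpass
import logging
import os
import platform
import socket
import sys

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s')
log = logging.getLogger('myapp')

# Record the execution context at startup so that failures can be
# reproduced on the same host, in the same directory, with the same input.
log.info('host=%s pid=%d user=%s', socket.gethostname(), os.getpid(),
         getpass.getuser())
log.info('cwd=%s argv=%s', os.getcwd(), sys.argv)
log.info('python=%s platform=%s', sys.version.split()[0], platform.platform())
log.info('environment=%s', dict(os.environ))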
Logging software versions (not only the version of the Python interpreter, but also the versions of all the packages used) helps diagnose outdated software stacks on remote machines. The Python package manager, pip, makes getting the list of installed packages easy: import pip; pip.main(['list']). Similarly, import sys; print(sys.executable, sys.version_info) displays the location and version of the interpreter. It is also useful to create a system whereby all our classes and function calls emit logging messages with the same level of detail and at the same points in the object life cycle. Common approaches involve the use of decorators and, maybe a bit too esoteric for some, metaclasses. This is exactly what the autologging Python module (available on PyPI) does for us. Once logging is in place, we face the question of where to store all these log messages, whose volume could be substantial at high verbosity levels in large applications. Simple installations will probably want to write log messages to text files on disk. More complex applications might want to store these messages in a database (which can be done by creating a custom handler for the Python logging module) or in specialized log aggregators such as Sentry (https://getsentry.com). Closely related to logging is the issue of monitoring. Distributed applications can have many moving parts, and it is often essential to know which machines are up, which are busy, as well as which processes or jobs are currently running, waiting, or in an error state. Knowing which processes are taking longer than usual to complete their work is often an important warning sign that something might be wrong. Several monitoring solutions for Python (oftentimes integrated with our logging system) exist. The Celery project, for instance, recommends flower (http://flower.readthedocs.org) as a monitoring and control web application. HPC job schedulers, on the other hand, tend to lack common, general-purpose monitoring solutions that go beyond simple command-line clients. Monitoring comes in handy in discovering potential problems before they become serious. It is, in fact, useful to monitor resources such as available disk space and trigger actions or even simple warning e-mails when they fall below a given threshold. Many centers monitor hardware performance and hard drive SMART data to detect early signs of potential problems. These issues are more likely to be of interest to operations personnel rather than developers, but they are useful to keep in mind. They can also be integrated into our applications to implement strategies for handling performance degradation gracefully.

A useful strategy – simulating components

A good, although possibly expensive in terms of time and effort, test strategy is to simulate some or all of the components of our system. The reasons are multiple; on the one hand, simulating or mocking software components allows us to test our interfaces to them more directly. In this respect, mock testing libraries, such as unittest.mock (part of the Python 3.5 standard library), are truly useful. Another reason to simulate software components is to make them fail or misbehave on demand and see how our application responds. For instance, we could increase the response time of services such as REST APIs or databases to worst-case scenario levels and see what happens. Sometimes, we might exceed timeout values in some network calls, leading our application to incorrectly assume that the server has crashed.
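As a small sketch of this kind of fault injection with unittest.mock (the fetch_status function and the simulated service are made up for illustration, not taken from the original article):

import socket
from unittest import mock

def fetch_status(conn):
    """Toy client code under test: asks a remote service for its status."""
    try:
        return conn.get('/status', timeout=2.0)
    except socket.timeout:
        return 'unreachable'

# Simulate a server that is so overloaded that every request times out.
slow_server = mock.Mock()
slow_server.get.side_effect = socket.timeout('simulated slow response')

assert fetch_status(slow_server) == 'unreachable'
print('client handled the simulated timeout gracefully')

The same pattern can inject delays instead of exceptions (for example, with a side_effect function that sleeps) to exercise the timeout handling paths directly.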
Especially early on in the design and development of a complex distributed application, one can make overly optimistic assumptions about things such as network availability and performance or response time of services such as databases or web servers. For this reason, having the ability to either completely bring a service offline or, more subtly, modify its behavior can tell us a lot about which of the assumptions in our code might be overly optimistic. The Netflix Chaos Monkey (https://github.com/Netflix/SimianArmy) approach of disabling random components of our system to see how our application copes with failures can be quite useful. Summary Writing and running small- or medium-sized distributed applications in Python is not hard. There are many high-quality frameworks that we can leverage among others, for example, Celery, Pyro, various job schedulers, Twisted, MPI bindings, or the multiprocessing module in the standard library. The real difficulty, however, lies in monitoring and debugging our applications, especially because a large fraction of our code runs concurrently on many different, often remote, computers. The most insidious bugs are those that end up producing incorrect results (for example, because of data becoming corrupted along the way) rather than raising an exception, which most frameworks are able to catch and bubble up. The monitoring and debugging tools that we can use with Python code are, sadly, not as sophisticated as the frameworks and libraries we use to develop that same code. The consequence is that large teams end up developing their own, often times, very specialized distributed debugging systems from scratch and small teams mostly rely on log messages and print statements. More work is needed in the area of debuggers for distributed applications in general and for dynamic languages such as Python in particular. Resources for Article: Further resources on this subject: Python Data Structures [article] Python LDAP applications - extra LDAP operations and the LDAP URL library [article] Machine Learning Tasks [article]


Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article written by David Julian, author of the book Designing Machine Learning Systems with Python, we first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification. (For more resources related to this topic, see here.) There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These types of problems are regression problems. Both classification and regression require a training set of correctly labelled data; they are supervised learning problems. Originating from these basic machine learning tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to make a prediction, perhaps in the hope of establishing a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain. We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even the search for extraterrestrial life. An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship but rather a group of instances that are different in ways that are important in the domain. For instance, establishing the subgroups smoker = true and family history = true for a target variable of heart disease = true. Finally, we consider control-type tasks. These act to optimize control settings to maximize a payoff under different conditions. This can be achieved in several ways. We can clone expert behavior: the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal actions. Clustering, on the other hand, is the task of grouping items without any prior information about the groups; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems and the "customers who bought this also bought..." suggestions on the checkout pages of online stores.
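As a small, concrete illustration of the supervised versus unsupervised distinction described above, here is a sketch using scikit-learn and toy data (the library choice and the data are illustrative and not taken from the article):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Supervised learning: classification needs correctly labelled training data.
X_train = np.array([[0.1], [0.4], [0.6], [0.9]])
y_train = np.array([0, 0, 1, 1])
clf = LogisticRegression()
clf.fit(X_train, y_train)
print('predicted class:', clf.predict([[0.8]]))

# Unsupervised learning: clustering groups instances by similarity alone,
# with no labels provided.
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10).fit(X)
print('cluster assignments:', km.labels_)

The classifier needs labelled examples to fit; KMeans only ever sees the feature vectors.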
Data for machine learning When considering raw data for machine learning applications, there are three separate aspects: The volume of the data The velocity of the data The variety of the data Data volume The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force, and throwing as much hardware at a problem as you can. With Moore's law, which predicts the trend of computer power doubling every two years and reaching its limit, it is clear that scalability is not, by its self, going to be able to keep pace with the ever increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches from harnessing capabilities of multi core processors, to large scale distributed computing on many different platforms. Probably, the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries, and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop. Data velocity The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of streaming processing. When input data is at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence, deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking at the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly used in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load. 
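Returning to the earlier point about parallelism, that is, running the same algorithm many times with different parameters, here is a minimal single-machine sketch using only the standard library (the train_and_score function is a hypothetical stand-in for real training code):

from concurrent.futures import ProcessPoolExecutor

def train_and_score(learning_rate):
    """Stand-in for training a model with one parameter setting."""
    # In a real application this would fit a model and return a validation score.
    return learning_rate, 1.0 - abs(0.1 - learning_rate)

if __name__ == '__main__':
    params = [0.01, 0.05, 0.1, 0.5, 1.0]
    # Each parameter setting runs in its own process; on a cluster, the same
    # map-style pattern is applied across many machines instead of many
    # local processes, which is essentially what MapReduce-style frameworks automate.
    with ProcessPoolExecutor() as pool:
        for rate, score in pool.map(train_and_score, params):
            print('learning_rate={:<5} score={:.2f}'.format(rate, score))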
Data variety Collecting data from different sources invariably means dealing with misaligned data structures, and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a pretty different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application than the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units. Models The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output, and in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as degree of similarity. In both cases, the process is iterative. It repeats a well-defined set of tasks, that moves the model closer to a correct hypothesis. There are many models and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem solving powers. However, it can also be a bit daunting for the designer to decide what is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project. There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data. The data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are. 
The distinction between grouping and grading is not absolute, and many models contain elements of both. Geometric models One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyper space. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them. Probabilistic models Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the down side, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners. Linear models A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variation in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making it less sensitive to nuanced data. This is described by the terms variance (for over fitting models) and bias (for under fitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective. 
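Returning to the earlier point about using Euclidean distance to measure similarity between instances, a short sketch (the feature vectors are made up for illustration):

import numpy as np

def euclidean(a, b):
    """Distance between two instances represented as feature vectors."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# Three instances described by two numerical features each.
instance_a = [1.0, 2.0]
instance_b = [1.5, 2.5]
instance_c = [8.0, 9.0]

# Smaller distance means more similar; this is the basis for clustering alike
# instances and for drawing decision boundaries between groups of them.
print(euclidean(instance_a, instance_b))   # close together
print(euclidean(instance_a, instance_c))   # far apart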
We know we can easily plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space. Model ensembles Ensemble techniques can be divided broadly into two types. The Averaging Method: With this method, several estimators are run independently, and their predictions are averaged. This includes the random forests and bagging methods. The Boosting Methods: With this method, weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space. Neural nets When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard wired into different parts of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks. 
Like in real brains, using a singular algorithm to describe how each neuron communicates with the other neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved to possessing an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding bicycle. Resources for Article: Further resources on this subject: Python Data Structures [article] Exception Handling in MySQL for Python [article] Python Data Analysis Utilities [article]


Launching a Spark Cluster

Packt
31 Mar 2016
7 min read
 In this article by Omar Khedher, author of OpenStack Sahara Essentials we will use Sahara to create and launch a Spark cluster. Sahara provides several plugins to provision Hadoop clusters on top of OpenStack. We will be using Spark plugins to provision Apache Spark clusters using Horizon. (For more resources related to this topic, see here.) General settings The following diagram illustrates our Spark cluster topology, which includes: One Spark master node: This runs the Spark Master and the HDFS NameNode Three Spark slave nodes: These run a Spark Slave and an HDFS DataNode each Preparing the Spark image The following link provides several Sahara images available for download for different plugins: http://sahara-files.mirantis.com/images/upstream/liberty. Note that the upstream Sahara image files are destined for the OpenStack Liberty release. From Horizon, click on Compute and select Images, click on Create Image and add the new image, as shown here: We will need to upload the downloaded image to Glance so that it can be registered in the Sahara image registry catalog. Make sure that the new image is active. Click on the Data Processing tab and select Image Registry. Click on Register Image to register the new uploaded Glance image to Sahara, as shown here: Click on Done and the new Spark image is ready to start launching the Spark cluster. Creating the Spark master group template Node group templates in Sahara facilitate the configuration of a set of instances that have same properties, such as RAM and CPU. We will start by creating the first node group template for the Spark master. From the Data Processing tab, select Node Group Templates and click on Create Template. Our first node group template will be based on Apache Spark with Version 1.3.1, as shown here: The next wizard will guide to specifying the name of the template, the instance flavor, the storage location, and which floating IP pool will be assigned to the cluster instance: The next tab in same wizard will guide you to selecting which kind of process the nodes in the cluster will run. In our case, the Spark master node group template will include Spark master and HDFS namenode processes, as shown here: The next tab in the wizard exposes more choices regarding the security groups that will be applied for the template cluster nodes: Auto security group: This will automatically create a set of security groups that will be directly applied to the instances of the node group template Default security group: Any existing security groups in the OpenStack environment configured as default will be applied the instances of the node group template The last tab in the wizard exposes more specific HDFS configuration that depend on the available resources of the cluster, such as disk space, CPU and memory: dfs.datanode.handler.count: How many server threads there are for the datanode dfs.datanode.du.reserved: How much of the available disk space will not be taken into account for HDFS use dfs.namenode.handler.count: How many server threads there are for the namenode dfs.datanode.failed.volumes.tolerated: How many volumes are allowed to fail before a datanode instance stops dfs.datanode.max.xcievers: What is the maximum number of threads to be used in order to transfer data to/from the DataNode instance. 
Name Node Heap Size: How much memory will be assigned to the heap size to handle workload per NameNode instance Data Node Heap Size: How much memory will be assigned to the heap size to handle workload per DataNode instance Creating the Spark slave group template Creating the Spark slave group template will be performed in the same way as the Spark master group template except the assignment of the node processes. The Spark slave nodes will be running Spark slave and HDFS datanode processes, as shown here: Security groups and HDFS parameters can be configured the same as the Spark master node group template. Creating the Spark cluster template Now that we have defined the basic templates for the Spark cluster, we will need to compile both entities into one cluster template. In the Sahara dashboard, select Cluster Templates and click on Create Template. Select Apache Spark as the Plugin name, with version 1.3.1, as follows: Give the cluster template a name and small description. It is also possible to mention which process in the Spark cluster will run in a different compute node for high-availability purposes. This is only valid when you have more than one compute node in the OpenStack environment. The next tab in the same wizard allows you to add the necessary number of Spark instances based on the node group templates created previously. In our case, we will use one master Spark instance and three slave Spark instances, as shown here: The next tab, General Parameters, provides more advanced cluster configuration, including the following: Timeout for disk preparing: The cluster will fail when the duration of formatting and mounting the disk per node exceeds the timeout value. Enable NTP service: This option will enable all the instances of the cluster to synchronized time. An NTP file can be found under /tmp when cluster nodes are active. URL of NTP server: If mentioned, the Spark cluster will use the URL of the NTP server for time synchronization. Heat Wait Condition timeout: Heat will throw an error message to Sahara and the cluster will fail when a node is not able to boot up after a certain amount of time. This will prevent Sahara spawning instances indefinitely. Enable XFS: Allows XFS disk formatting. Decommissioning Timeout: This will throw an error when scaling data nodes in the Spark cluster takes more than the time mentioned. Enable Swift: Allows using Swift object storage to pull and push data during job execution. The Spark Parameters tab allows you to specify the following: Master webui port: Which port will access the Spark master web user interface. Work webui port: Which port will access the Spark slave web user interface. Worker memory: How much memory will be reserved for Spark applications. By default, if all is selected, Spark will use all the available RAM is the instance minus 1 GB. Spark will not run properly when using a flavor having RAM less than 1 GB. Launching the Spark cluster Based on the cluster template, the last step will require you to only push the button Launch Cluster from the Clusters tab in the Sahara dashboard. You will need only to select the plugin name, Apache Spark, with version 1.3.1. Next, you will need to name the new cluster, select the right cluster template created previously, and the base image registered in Sahara. Additionally, if you intend to access the cluster instances via SSH, select an existing SSH keypair. 
It is also possible to select from which network segment you will be able to manage the cluster instances; in our case, an existing private network, Private_Net10, will be used for this purpose. Launch the cluster; it will take a while to finish spawning the four instances forming the Spark cluster. The Spark cluster instances can be listed in the Compute Instances tab, as shown here:

Summary

In this article, we created a Spark cluster using Sahara in OpenStack by means of the Apache Spark plugin. The provisioned cluster includes one Spark master node and three Spark slave nodes. When the cluster status changes to the active state, it is possible to start executing jobs. Resources for Article: Further resources on this subject: Introducing OpenStack Trove [article] OpenStack Performance, Availability [article] Monitoring Physical Network Bandwidth Using OpenStack Ceilometer [article]


Why Mesos?

Packt
31 Mar 2016
8 min read
In this article by Dipa Dubhasi and Akhil Das authors of the book Mastering Mesos, delves into understanding the importance of Mesos. Apache Mesos is an open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS". It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects. (For more resources related to this topic, see here.) Introduction to the datacenter OS and architecture of Mesos Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyper-scale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy. Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced. Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources. Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support. The idea behind a datacenter OS is that of an intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. 
In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure. The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multitenancy, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources. From a developer's perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, consider the case of data science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how the resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change. Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues. In a world where distributed computing is becoming the norm, the datacenter OS is a boon. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications, as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability, even for noninfrastructure professionals. A prominent example of an organization administering its hyperscale datacenters via a datacenter OS is Google, with its Borg (and next-generation Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and the flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications. It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way, with a number of leading companies, such as Apple, Twitter, Netflix, and Airbnb among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core.

The architecture of Mesos

Mesos is an open-source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making them available as a single reservoir of highly available resources for all the different frameworks to utilize. For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM per node) and another 5-node cluster (4 CPUs and 16 GB RAM per node), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run.
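To make the arithmetic behind this pooling concrete, the following few lines of Python reproduce the aggregate capacity quoted above. This is purely an illustration of the calculation, using the cluster sizes from the example; it is not Mesos code:

# Illustrative only: total capacity of two heterogeneous clusters pooled into one reservoir
clusters = [
    {"nodes": 10, "cpus_per_node": 16, "ram_gb_per_node": 64},
    {"nodes": 5, "cpus_per_node": 4, "ram_gb_per_node": 16},
]

total_cpus = sum(c["nodes"] * c["cpus_per_node"] for c in clusters)
total_ram_gb = sum(c["nodes"] * c["ram_gb_per_node"] for c in clusters)

print(f"Pooled capacity: {total_cpus} CPUs, {total_ram_gb} GB RAM")
# Pooled capacity: 180 CPUs, 720 GB RAM

It is this pooled view, rather than the individual machines, that Mesos presents to the frameworks running on top of it.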
Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per framework. Some of the important features of Mesos are:

Scalability: It can elastically scale to over 50,000 nodes
Resource isolation: This is achieved through Linux/Docker containers
Efficiency: This is achieved through CPU- and memory-aware resource scheduling across multiple frameworks
High availability: This is achieved through Apache ZooKeeper
Interface: A web UI for monitoring the cluster state

Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos' design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are:

Different frameworks can independently devise methods to address their data locality, fault-tolerance, and other such needs
It simplifies the Mesos codebase and allows it to remain scalable, flexible, robust, and agile

Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve a good degree of data locality for each framework sharing the same cluster. An alternative architecture would implement a global scheduler that takes framework requirements, organizational priorities, and resource availability as inputs and provides a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources, with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators.

Summary

In this article, we introduced Mesos and then dived deep into its architecture to understand its importance.

Resources for Article:

Further resources on this subject:
Understanding Mesos Internals [article]
Leveraging Python in the World of Big Data [article]
Self-service Business Intelligence, Creating Value from Data [article]

ALM – Developers and QA

Packt
30 Mar 2016
15 min read
This article by Can Bilgin, the author of Mastering Cross-Platform Development with Xamarin, provides an introduction to Application Lifecycle Management (ALM) and continuous integration methodologies on Xamarin cross-platform applications. As the part of the ALM process that is most relevant for developers, unit test strategies will be discussed and demonstrated, as well as automated UI testing. This article is divided into the following sections: Development pipeline Troubleshooting Unit testing UI testing (For more resources related to this topic, see here.) Development pipeline The development pipeline can be described as the virtual production line that steers a project from a mere bundle of business requirements to the consumers. Stakeholders that are part of this pipeline include, but are not limited to, business proxies, developers, the QA team, the release and configuration team, and finally the consumers themselves. Each stakeholder in this production line assumes different responsibilities, and they should all function in harmony. Hence, having an efficient, healthy, and preferably automated pipeline that is going to provide the communication and transfer of deliverables between units is vital for the success of a project. In the Agile project management framework, the development pipeline is cyclical rather than a linear delivery queue. In the application life cycle, requirements are inserted continuously into a backlog. The backlog leads to a planning and development phase, which is followed by testing and QA. Once the production-ready application is released, consumers can be made part of this cycle using live application telemetry instrumentation. Figure 1: Application life cycle management In Xamarin cross-platform application projects, development teams are blessed with various tools and frameworks that can ease the execution of ALM strategies. From sketching and mock-up tools available for early prototyping and design to source control and project management tools that make up the backbone of ALM, Xamarin projects can utilize various tools to automate and systematically analyze project timeline. The following sections of this article concentrate mainly on the lines of defense that protect the health and stability of a Xamarin cross-platform project in the timeline between the assignment of tasks to developers to the point at which the task or bug is completed/resolved and checked into a source control repository. Troubleshooting and diagnostics SDKs associated with Xamarin target platforms and development IDEs are equipped with comprehensive analytic tools. Utilizing these tools, developers can identify issues causing app freezes, crashes, slow response time, and other resource-related problems (for example, excessive battery usage). Xamarin.iOS applications are analyzed using the XCode Instruments toolset. In this toolset, there are a number of profiling templates, each used to analyze a certain perspective of application execution. Instrument templates can be executed on an application running on the iOS simulator or on an actual device. Figure 2: XCode Instruments Similarly, Android applications can be analyzed using the device monitor provided by the Android SDK. Using Android Monitor, memory profile, CPU/GPU utilization, and network usage can also be analyzed, and application-provided diagnostic information can be gathered. Android Debug Bridge (ADB) is a command-line tool that allows various manual or automated device-related operations. 
For Windows Phone applications, Visual Studio provides a number of analysis tools for profiling CPU usage, energy consumption, memory usage, and XAML UI responsiveness. XAML diagnostic sessions in particular can provide valuable information on problematic sections of view implementation and pinpoint possible visual and performance issues: Figure 3: Visual Studio XAML analyses Finally, Xamarin Profiler, as a maturing application (currently in preview release), can help analyze memory allocations and execution time. Xamarin Profiler can be used with iOS and Android applications. Unit testing The test-driven development (TDD) pattern dictates that the business requirements and the granular use-cases defined by these requirements should be initially reflected on unit test fixtures. This allows a mobile application to grow/evolve within the defined borders of these assertive unit test models. Whether following a TDD strategy or implementing tests to ensure the stability of the development pipeline, unit tests are fundamental components of a development project. Figure 4: Unit test project templates Xamarin Studio and Visual Studio both provide a number of test project templates targeting different areas of a cross-platform project. In Xamarin cross-platform projects, unit tests can be categorized into two groups: platform-agnostic and platform-specific testing. Platform-agnostic unit tests Platform-agnostic components, such as portable class libraries containing shared logic for Xamarin applications, can be tested using the common unit test projects targeting the .NET framework. Visual Studio Test Tools or the NUnit test framework can be used according to the development environment of choice. It is also important to note that shared projects used to create shared logic containers for Xamarin projects cannot be tested with .NET unit test fixtures. For shared projects and the referencing platform-specific projects, platform-specific unit test fixtures should be prepared. When following an MVVM pattern, view models are the focus of unit test fixtures since, as previously explained, view models can be perceived as a finite state machine where the bindable properties are used to create a certain state on which the commands are executed, simulating a specific use-case to be tested. This approach is the most convenient way to test the UI behavior of a Xamarin application without having to implement and configure automated UI tests. While implementing unit tests for such projects, a mocking framework is generally used to replace the platform-dependent sections of the business logic. Loosely coupling these dependent components makes it easier for developers to inject mocked interface implementations and increases the testability of these modules. The most popular mocking frameworks for unit testing are Moq and RhinoMocks. Both Moq and RhinoMocks utilize reflection and, more specifically, the Reflection.Emit namespace, which is used to generate types, methods, events, and other artifacts in the runtime. Aforementioned iOS restrictions on code generation make these libraries inapplicable for platform-specific testing, but they can still be included in unit test fixtures targeting the .NET framework. For platform-specific implementation, the True Fakes library provides compile time code generation and mocking features. 
Depending on the implementation specifics (such as namespaces used, network communication, multithreading, and so on), in some scenarios it is imperative to test the common logic implementation on specific platforms as well. For instance, some multithreading and parallel task implementations give different results on Windows Runtime, Xamarin.Android, and Xamarin.iOS. These variations generally occur because of the underlying platform's mechanism or slight differences between the .NET and Mono implementation logic. In order to ensure the integrity of these components, common unit test fixtures can be added as linked/referenced files to platform-specific test projects and executed on the test harness. Platform-specific unit tests In a Xamarin project, platform-dependent features cannot be unit tested using the conventional unit test runners available in Visual Studio Test Suite and NUnit frameworks. Platform-dependent tests are executed on empty platform-specific projects that serve as a harness for unit tests for that specific platform. Windows Runtime application projects can be tested using the Visual Studio Test Suite. However, for Android and iOS, the NUnit testing framework should be used, since Visual Studio Test Tools are not available for the Xamarin.Android and Xamarin.iOS platforms.                              Figure 5: Test harnesses The unit test runner for Windows Phone (Silverlight) and Windows Phone 8.1 applications uses a test harness integrated with the Visual Studio test explorer. The unit tests can be executed and debugged from within Visual Studio. Xamarin.Android and Xamarin.iOS test project templates use NUnitLite implementation for the respective platforms. In order to run these tests, the test application should be deployed on the simulator (or the testing device) and the application has to be manually executed. It is possible to automate the unit tests on Android and iOS platforms through instrumentation. In each Xamarin target platform, the initial application lifetime event is used to add the necessary unit tests: [Activity(Label = "Xamarin.Master.Fibonacci.Android.Tests", MainLauncher = true, Icon = "@drawable/icon")] public class MainActivity : TestSuiteActivity { protected override void OnCreate(Bundle bundle) { // tests can be inside the main assembly //AddTest(Assembly.GetExecutingAssembly()); // or in any reference assemblies AddTest(typeof(Fibonacci.Android.Tests.TestsSample).Assembly); // Once you called base.OnCreate(), you cannot add more assemblies. base.OnCreate(bundle); } } In the Xamarin.Android implementation, the MainActivity class derives from the TestSuiteActivity, which implements the necessary infrastructure to run the unit tests and the UI elements to visualize the test results. On the Xamarin.iOS platform, the test application uses the default UIApplicationDelegate, and generally, the FinishedLaunching event delegate is used to create the ViewController for the unit test run fixture: public override bool FinishedLaunching(UIApplication application, NSDictionary launchOptions) { // Override point for customization after application launch. 
// If not required for your application you can safely delete this method var window = new UIWindow(UIScreen.MainScreen.Bounds); var touchRunner = new TouchRunner(window); touchRunner.Add(System.Reflection.Assembly.GetExecutingAssembly()); window.RootViewController = new UINavigationController(touchRunner.GetViewController()); window.MakeKeyAndVisible(); return true; } The main shortcoming of executing unit tests this way is the fact that it is not easy to generate a code coverage report and archive the test results. Neither of these testing methods provide the ability to test the UI layer. They are simply used to test platform-dependent implementations. In order to test the interactive layer, platform-specific or cross-platform (Xamarin.Forms) coded UI tests need to be implemented. UI testing In general terms, the code coverage of the unit tests directly correlates with the amount of shared code which amounts to, at the very least, 70-80 percent of the code base in a mundane Xamarin project. One of the main driving factors of architectural patterns was to decrease the amount of logic and code in the view layer so that the testability of the project utilizing conventional unit tests reaches a satisfactory level. Coded UI (or automated UI acceptance) tests are used to test the uppermost layer of the cross-platform solution: the views. Xamarin.UITests and Xamarin Test Cloud The main UI testing framework used for Xamarin projects is the Xamarin.UITests testing framework. This testing component can be used on various platform-specific projects, varying from native mobile applications to Xamarin.Forms implementations, except for the Windows Phone platform and applications. Xamarin.UITests is an implementation based on the Calabash framework, which is an automated UI acceptance testing framework targeting mobile applications. Xamarin.UITests is introduced to the Xamarin.iOS or Xamarin.Android applications using the publicly available NuGet packages. The included framework components are used to provide an entry point to the native applications. The entry point is the Xamarin Test Cloud Agent, which is embedded into the native application during the compilation. The cloud agent is similar to a local server that allows either the Xamarin Test Cloud or the test runner to communicate with the app infrastructure and simulate user interaction with the application. Xamarin Test Cloud is a subscription-based service allowing Xamarin applications to be tested on real mobile devices using UI tests implemented via Xamarin.UITests. Xamarin Test Cloud not only provides a powerful testing infrastructure for Xamarin.iOS and Xamarin.Android applications with an abundant amount of mobile devices but can also be integrated into Continuous Integration workflows. After installing the appropriate NuGet package, the UI tests can be initialized for a specific application on a specific device. In order to initialize the interaction adapter for the application, the app package and the device should be configured. 
On Android, the APK package path and the device serial can be used for the initialization:

IApp app = ConfigureApp.Android.ApkFile("<APK Path>/MyApplication.apk")
    .DeviceSerial("<DeviceID>")
    .StartApp();

For an iOS application, the procedure is similar:

IApp app = ConfigureApp.iOS.AppBundle("<App Bundle Path>/MyApplication.app")
    .DeviceIdentifier("<DeviceID of Simulator>")
    .StartApp();

Once the App handle has been created, each test written using NUnit should first create the pre-conditions for the tests, simulate the interaction, and finally test the outcome. The IApp interface provides a set of methods to select elements on the visual tree and simulate certain interactions, such as text entry and tapping. On top of the main testing functionality, screenshots can be taken to document test steps and possible bugs. Both Visual Studio and Xamarin Studio provide project templates for Xamarin.UITests.

Xamarin Test Recorder

Xamarin Test Recorder is an application that can ease the creation of automated UI tests. It is currently in its preview version and is only available for the Mac OS platform.

Figure 6: Xamarin Test Recorder

Using this application, developers can select the application in need of testing and the device/simulator that is going to run the application. Once the recording session starts, each interaction on the screen is recorded as execution steps on a separate screen, and these steps can be used to generate the preparation or testing steps for the Xamarin.UITests implementation.

Coded UI tests (Windows Phone)

Coded UI tests are used for automated UI testing on the Windows Phone platform. Coded UI tests for Windows Phone and Windows Store applications are not any different from their counterparts for other .NET platforms such as Windows Forms, WPF, or ASP.NET. It is also important to note that only XAML applications support Coded UI tests. Coded UI tests are generated on a simulator and written on an Automation ID premise. The Automation ID property is an automatically generated or manually configured identifier for Windows Phone applications (only in XAML) and the UI controls used in the application. Coded UI tests depend on the UIMap created for each control on a specific screen using the Automation IDs. While creating the UIMap, a crosshair tool can be used to select the application and the controls on the simulator screen to define the interactive elements:

Figure 7: Generating coded UI accessors and tests

Once the UIMap has been created and the designer files have been generated, gestures and the generated XAML accessors can be used to create testing pre-conditions and assertions. For Coded UI tests, multiple scenario-specific input values can be used and tested on a single assertion. Using the DataRow attribute, unit tests can be expanded to test multiple data-driven scenarios. The code snippet below uses multiple input values to test different incorrect input values (note that the method signature declares a parameter for each value supplied by the DataRow attributes):

[DataRow(0, "Zero Value")]
[DataRow(-2, "Negative Value")]
[TestMethod]
public void FibonacciCalculateTest_IncorrectOrdinal(int ordinalInput, string description)
{
    // TODO: Check if bad values are handled correctly
}

Automated tests can run on available simulators and/or a real device. They can also be included in CI build workflows and made part of the automated development pipeline.

Calabash

Calabash is an automated UI acceptance testing framework used to execute Cucumber tests. Cucumber tests provide an assertion strategy similar to coded UI tests, only broader and behavior oriented.
The Cucumber test framework supports tests written in the Gherkin language (a human-readable programming grammar description for behavior definitions). Calabash makes up the necessary infrastructure to execute these tests on various platforms and application runtimes. A simple declaration of the feature and the scenario that is previously tested on Coded UI using the data-driven model would look similar to the excerpt below. Only two of the possible test scenarios are declared in this feature for demonstration; the feature can be extended: Feature: Calculate Single Fibonacci number. Ordinal entry should greater than 0. Scenario: Ordinal is lower than 0. Given I use the native keyboard to enter "-2" into text field Ordinal And I touch the "Calculate" button Then I see the text "Ordinal cannot be a negative number." Scenario: Ordinal is 0. Given I use the native keyboard to enter "0" into text field Ordinal And I touch the "Calculate" button Then I see the text "Cannot calculate the number for the 0th ordinal." Calabash test execution is possible on Xamarin target platforms since the Ruby API exposed by the Calabash framework has a bidirectional communication line with the Xamarin Test Cloud Agent embedded in Xamarin applications with NuGet packages. Calabash/Cucumber tests can be executed on Xamarin Test Cloud on real devices since the communication between the application runtime and Calabash framework is maintained by Xamarin Test Cloud Agent, the same as Xamarin.UI tests. Summary Xamarin projects can benefit from a properly established development pipeline and the use of ALM principles. This type of approach makes it easier for teams to share responsibilities and work out business requirements in an iterative manner. In the ALM timeline, the development phase is the main domain in which most of the concrete implementation takes place. In order for the development team to provide quality code that can survive the ALM cycle, it is highly advised to analyze and test native applications using the available tooling in Xamarin development IDEs. While the common codebase for a target platform in a Xamarin project can be treated and tested as a .NET implementation using the conventional unit tests, platform-specific implementations require more particular handling. Platform-specific parts of the application need to be tested on empty shell applications, called test harnesses, on the respective platform simulators or devices. To test views, available frameworks such as Coded UI tests (for Windows Phone) and Xamarin.UITests (for Xamarin.Android and Xamarin.iOS) can be utilized to increase the test code coverage and create a stable foundation for the delivery pipeline. Most tests and analysis tools discussed in this article can be integrated into automated continuous integration processes. Resources for Article:   Further resources on this subject: A cross-platform solution with Xamarin.Forms and MVVM architecture [article] Working with Xamarin.Android [article] Application Development Workflow [article]

Golang Decorators: Logging & Time Profiling

Nicholas Maccharoli
30 Mar 2016
6 min read
Golang's imperative world

Golang is not, by any means, a functional language; its design remains true to its jingle, which says that it is "C for the 21st Century". One task I tried to do early on in learning the language was to search for the map, filter, and reduce functions in the standard library, but to no avail. Next, I tried rolling my own versions, but I felt as though I hit a bit of a road block when I discovered that there is no support for generics in the language at the time of writing this. There is, however, support for Higher Order Functions or, more simply put, functions that take other functions as arguments and return functions. If you have spent some time in Python, you may have come to love a design pattern called "Decorator". In fact, decorators make life in Python so great that support for applying them is built right into the language with a nifty @ operator! Python frameworks such as Flask extensively use decorators. If you have little or no experience in Python, fear not, for the concept is a design pattern independent of any language.

Decorators

An alternative name for the decorator pattern is "wrapper", which pretty much sums it all up in one word! A decorator's job is only to wrap a function so that additional code can be executed when the original function is called. This is accomplished by writing a function that takes a function as its argument and returns a function of the same type (Higher Order Functions in action!). While this still calls the original function and passes through its return value, it does something extra along the way.

Decorators for logging

We can easily log which specific method is passed with a little help from our decorator friends. Say, we wanted to log which user liked a blog post and what the ID of the post was, all without touching any code in the original likePost function. Here is our original function:

func likePost(userId int, postId int) bool {
    fmt.Printf("Update Complete!\n")
    return true
}

Our decorator might look something similar to this:

type LikeFunc func(int, int) bool

func decoratedLike(f LikeFunc) LikeFunc {
    return func(userId int, postId int) bool {
        fmt.Printf("likePost Log: User %v liked post# %v\n", userId, postId)
        return f(userId, postId)
    }
}

Note the use of the type definition here. I encourage you to use it for the sake of readability when defining functions with long signatures, such as those of decorators, as you need to type the function signature twice. Now, we can apply the decorator and allow the logging to begin:

r := decoratedLike(likePost)
r(1414, 324)
r(5454, 324)
r(4322, 250)

This produces the following output:

likePost Log: User 1414 liked post# 324
Update Complete!
likePost Log: User 5454 liked post# 324
Update Complete!
likePost Log: User 4322 liked post# 250
Update Complete!

Our original likePost function still gets called and runs as expected, but now we get an additional log detailing the user and post IDs that were passed to the function each time it was called. Hopefully, this will help speed up debugging our likePost function if and when we encounter strange behavior!
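As an aside for readers coming from Python (mentioned at the start of this post), the same logging wrapper takes only a few lines there, and the @ syntax applies it for us. The names below (log_like, like_post) are illustrative and are not part of this post's Go code:

import functools

def log_like(func):
    @functools.wraps(func)          # keep the wrapped function's name and docstring
    def wrapper(user_id, post_id):
        print(f"likePost Log: User {user_id} liked post# {post_id}")
        return func(user_id, post_id)
    return wrapper

@log_like                           # Python applies the decorator at definition time
def like_post(user_id, post_id):
    print("Update Complete!")
    return True

like_post(1414, 324)

The behavior mirrors the Go version above: each call is logged first and then forwarded to the original function.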
Decorators for performance!

Say, we run a "Top 10" site and, previously, our main sorting routine to find the top 10 cat photos of this week on the Internet was written with Golang's func Sort(data Interface) function from the sort package of the Golang standard library. Everything is fine until we are informed that Fluffy the cat is infuriated that she is coming in at number six on the list and not number five. The cats at ranks five and six on the list both had 5000 likes each, but Fluffy reached 5000 likes a day earlier than Bozo the cat, who is currently higher ranked. We like to give credit where it's due, so we apologize to Fluffy and go on to use the stable version of the sort, func Stable(data Interface), which preserves the order of elements equal in value during the sort. We can improve our code and tests so that this does not happen again (we promised Fluffy!). The tests pass, everything looks great, and we deploy gracefully... or so we think. Over the course of the day, other developers also deploy their changes, and then, after checking our performance reports, we notice a slowdown somewhere. Is it from our switch to the stable sort? Well, let's use decorators to measure the performance of both sort functions and check whether there is a noticeable dip in performance. Here's our testing function:

type SortFunc func(sort.Interface)

func timedSortFunc(f SortFunc) SortFunc {
    return func(data sort.Interface) {
        defer func(t time.Time) {
            fmt.Printf("--- Time Elapsed: %v ---\n", time.Since(t))
        }(time.Now())
        f(data)
    }
}

In case you are unfamiliar with defer, all it does is call the function it is passed right after its calling function returns. The arguments passed to defer are evaluated right away, so the value we get from time.Now() is really the start time of the function! Let's go ahead and give this test a go:

unStable := timedSortFunc(sort.Sort)
stable := timedSortFunc(sort.Stable)

// 10000 Elements with values ranging
// between 0 and 5000
randomCatList1 := randomCatScoreSlice(10000, 5000)
randomCatList2 := randomCatScoreSlice(10000, 5000)

fmt.Printf("Unstable Sorting Function:\n")
unStable(randomCatList1)
fmt.Printf("Stable Sorting Function:\n")
stable(randomCatList2)

The following output is yielded:

Unstable Sorting Function:
--- Time Elapsed: 282.889µs ---
Stable Sorting Function:
--- Time Elapsed: 93.947µs ---

Wow! Fluffy's complaint not only made our top 10 list more accurate, but now the list sorts about three times as fast with the stable version of sort as well! (However, we still need to be careful; sort.Stable most likely uses way more memory than the standard sort.Sort function.)

Final thoughts

Figuring out when and where to apply the decorator pattern is really up to you and your team. There are no hard rules, and you can completely live without it. However, when it comes to things like extra logging or profiling a pesky area of your code, this technique may prove to be a valuable tool.

Where is the rest of the code?

In order to get this example up and running, there is some setup code that was not shown here in order to keep the post from becoming too bloated. I encourage you to take a look at this code here if you are interested!

About the author

Nick Maccharoli is an iOS/backend developer and open source enthusiast working at a start-up in Tokyo and enjoying the current development scene. You can see what he is up to at @din0sr or github.com/nirma.

Building a Product Recommendation System

Packt
29 Mar 2016
25 min read
In this article by Raghav Bali and Dipanjan Sarkar, the authors of the book R Machine Learning By Example, we will discuss collaborative filtering, a simple yet very effective approach for predicting and recommending items to users. If we look closely, the algorithms work on input data, which is nothing but a matrix representation of the user ratings for different products. Bringing a mathematical perspective into the picture, matrix factorization is a technique to manipulate matrices and identify latent or hidden features from the data represented in the matrix. Building upon the same concept, let us use matrix factorization as the basis for predicting ratings for items which the user has not yet rated. (For more resources related to this topic, see here.)

Matrix factorization

Matrix factorization refers to the identification of two or more matrices such that when these matrices are multiplied we get the original matrix. Matrix factorization, as mentioned earlier, can be used to discover latent features between two different kinds of entities. We will understand and use the concepts of matrix factorization as we go along preparing our recommender engine for our e-commerce platform. As our aim for the current project is to personalize the shopping experience and recommend product ratings for an e-commerce platform, our input data contains user ratings for various products on the website. We process the input data and transform it into a matrix representation for analyzing it using matrix factorization. The input data looks like this:

User ratings matrix

As you can see, the input data is a matrix with each row representing a particular user's ratings for the different items represented in the columns. For the current case, the columns representing items are different mobile phones such as iPhone 4, iPhone 5s, Nexus 5, and so on. Each row contains ratings for each of these mobile phones as given by eight different users. The ratings range from 1 to 5, with 1 being the lowest and 5 being the highest. A rating of 0 represents unrated items or a missing rating. The task of our recommender engine will be to predict the correct rating for the missing ones in the input matrix. We could then use the predicted ratings to recommend items most desired by the user. The premise here is that two users would rate a product similarly if they like similar features of the product or item. Since our current data is related to user ratings for different mobile phones, people might rate the phones based on their hardware configuration, price, OS, and so on. Hence, matrix factorization tries to identify these latent features to predict ratings for a certain user and a certain product. While trying to identify these latent features, we proceed with the basic assumption that the number of such features is less than the total number of items in consideration. This assumption makes sense because, if this were not the case, then each user would have a specific feature associated with him/her (and similarly for each product). This would in turn make recommendations futile, as none of the users would be interested in items rated by the other users (which is not the case usually). Now let us get into the mathematical details of matrix factorization and our recommender engine. Since we are dealing with user ratings for different products, let us assume U to be a matrix representing user preferences and, similarly, a matrix P representing the products for which we have the ratings.
Then the ratings matrix R will be defined as R = U x P^T (we take the transpose of P, written P^T, for matrix multiplication), where |R| = |U| x |P|. Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R:

X = |U| x K matrix
Y = |P| x K matrix

Here, X is a user-related matrix which represents the associations between the users and the latent features. Y, on the other hand, is the product-related matrix which represents the associations between the products and the latent features. The task of predicting the rating of a product pj by a user ui is done by calculating the dot product of the vectors corresponding to ui (the row of X for that user) and pj (the row of Y for that product). Now, to find the matrices X and Y, we utilize a technique called gradient descent. Gradient descent, in simple terms, tries to find the local minimum of a function; it is an optimization technique. We use gradient descent in the current context to iteratively minimize the difference between the predicted ratings and the actual ratings. To begin with, we randomly initialize the matrices X and Y and then calculate how different their product is from the actual ratings matrix R. The difference between the predicted and the actual values is what is termed as the error. For our problem, we will consider the squared error, which is calculated as:

eij^2 = (rij - r̂ij)^2 = (rij - ∑(k=1 to K) xikykj)^2

Here, rij is the actual rating by user i for product j and r̂ij = ∑(k=1 to K) xikykj is the predicted value of the same. To minimize the error, we need to find the correct direction or gradient to change our values to. To obtain the gradient for each of the variables x and y, we differentiate them separately as:

∂eij^2/∂xik = -2 eij ykj
∂eij^2/∂ykj = -2 eij xik

Hence, the equations to find xik and ykj can be given as:

x'ik = xik + α (2 eij ykj)
y'kj = ykj + α (2 eij xik)

Here α is the constant to denote the rate of descent or the rate of approaching the minima (also known as the learning rate). The value of α defines the size of the steps we take in either direction to reach the minima. Large values may lead to oscillations as we may overshoot the minima every time. Usual practice is to select very small values for α, of the order of 10^-4. x'ik and y'kj are the updated values of xik and ykj after each iteration of gradient descent. To avoid overfitting, along with controlling extreme or large values in the matrices X and Y, we introduce the concept of regularization. Formally, regularization refers to the process of introducing additional information in order to prevent overfitting. Regularization penalizes models with extreme values. To prevent overfitting in our case, we introduce the regularization constant called β. With the introduction of β, the equations are updated as follows:

x'ik = xik + α (2 eij ykj - β xik)

Also,

y'kj = ykj + α (2 eij xik - β ykj)

As we already have the ratings matrix R and we use it to determine how far our predicted values are from the actual ones, matrix factorization turns into a supervised learning problem. We use some of the rows as our training samples. Let S be our training set, with elements being tuples of the form (ui, pj, rij). Thus, our task is to minimize the error (eij) for every tuple (ui, pj, rij) ϵ training set S. The overall error (say E) can be calculated as:

E = ∑(ui, pj, rij) ϵS eij^2 = ∑(ui, pj, rij) ϵS (rij - ∑(k=1 to K) xikykj)^2

Implementation

Now that we have looked into the mathematics of matrix factorization, let us convert the algorithm into code and prepare a recommender engine for the mobile phone ratings input data set discussed earlier.
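Before walking through the book's R implementation in the next section, here is a compact NumPy sketch of the update rules derived above. It is only an illustration (the tiny ratings matrix and the function name are made up for this sketch, and 0 marks a missing rating); the R code that follows is what the rest of the article builds on:

import numpy as np

def factorize(R, K=2, alpha=0.0002, beta=0.02, epochs=5000):
    rows, cols = R.shape
    X = np.random.rand(rows, K)      # user-feature matrix
    Y = np.random.rand(K, cols)      # feature-product matrix (already transposed)
    for _ in range(epochs):
        for i in range(rows):
            for j in range(cols):
                if R[i, j] > 0:      # update only on observed ratings
                    eij = R[i, j] - X[i, :] @ Y[:, j]
                    X[i, :] += alpha * (2 * eij * Y[:, j] - beta * X[i, :])
                    Y[:, j] += alpha * (2 * eij * X[i, :] - beta * Y[:, j])
    return X, Y

# Made-up example: 3 users x 4 products, 0 = unrated
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]], dtype=float)
X, Y = factorize(R)
print(np.round(X @ Y, 2))            # dense matrix of predicted ratings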
As shown in the Matrix factorization section, the input dataset is a matrix with each row representing a user's rating for the products mentioned as columns. The ratings range from 1 to 5, with 0 representing the missing values. To transform our algorithm into working code, we need to complete the following tasks:

Load the input data and transform it into a ratings matrix representation
Prepare a matrix factorization based recommendation model
Predict and recommend products to the users
Interpret and evaluate the model

Loading and transforming input data into a matrix representation is simple. As seen earlier, R provides us with easy-to-use utility functions for the same.

# load raw ratings from csv
raw_ratings <- read.csv("<file_name>")

# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)

Now that we have our data loaded into an R matrix, we proceed to prepare the user-latent features matrix X and the item-latent features matrix Y. We initialize both from uniform distributions using the runif function.

# number of rows in ratings
rows <- nrow(ratings_matrix)

# number of columns in ratings matrix
columns <- ncol(ratings_matrix)

# latent features
K <- 2

# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)

# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

The major component is the matrix factorization function itself. Let us split the task into two: calculation of the gradient and, subsequently, the overall error. The calculation of the gradient involves the ratings matrix R and the two factor matrices X and Y, along with the constants α and β. Since we are dealing with matrix manipulations (specifically, multiplication), we transpose Y before we begin with any further calculations. The following lines of code (these pieces together make up the mf_based_ucf() function that we call later) convert the algorithm discussed previously into R syntax. All variables follow a naming convention similar to the algorithm's for ease of understanding.

for (i in seq(nrow(ratings_matrix))){
    for (j in seq(length(ratings_matrix[i, ]))){
      if (ratings_matrix[i, j] > 0){
        # error
        eij = ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])
        # gradient calculation
        for (k in seq(K)){
          X[i, k] = X[i, k] + alpha * (2 * eij * Y[k, j] - beta * X[i, k])
          Y[k, j] = Y[k, j] + alpha * (2 * eij * X[i, k] - beta * Y[k, j])
        }
      }
    }
}

The next part of the algorithm is to calculate the overall error; we again use similar variable names for consistency:

# Overall Squared Error Calculation
e = 0
for (i in seq(nrow(ratings_matrix))){
    for (j in seq(length(ratings_matrix[i, ]))){
      if (ratings_matrix[i, j] > 0){
        e = e + (ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j]))^2
        for (k in seq(K)){
          e = e + (beta/2) * (X[i, k]^2 + Y[k, j]^2)
        }
      }
    }
}

As a final piece, we iterate over these calculations multiple times to mitigate the risks of cold start and sparsity. We term the variable controlling multiple starts as epoch. We also terminate the calculations once the overall error drops below a certain threshold. Moreover, as we had initialized X and Y from uniform distributions, the predicted values would be real numbers. We round the final output before returning the predicted matrix. Note that this is a very simplistic implementation and a lot of complexity has been kept out for ease of understanding.
Hence, this may result in the predicted matrix to contain values greater than 5. For the current scenario, it is safe to assume the values above the max scale of 5 as equivalent to 5 (and similarly for values lesser than 0). We encourage the reader to fine tune the code to handle such cases. Setting α to 0.0002, β to 0.02, K (that is, latent features) to 2, and epoch to 1000, let us see a sample run of our code with overall error threshold set to 0.001: # load raw ratings from csv raw_ratings <- read.csv("product_ratings.csv")   # convert columnar data to sparse ratings matrix ratings_matrix <- data.matrix(raw_ratings)     # number of rows in ratings rows <- nrow(ratings_matrix)   # number of columns in ratings matrix columns <- ncol(ratings_matrix)   # latent features K <- 2   # User-Feature Matrix X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)   # Item-Feature Matrix Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)   # iterations epoch <- 10000   # rate of descent alpha <- 0.0002   # regularization constant beta <- 0.02     pred.matrix <- mf_based_ucf(ratings_matrix, X, Y, K, epoch = epoch)   # setting column names colnames(pred.matrix)<- c("iPhone.4","iPhone.5s","Nexus.5","Moto.X","Moto.G","Nexus.6",/ "One.Plus.One") The preceding lines of code utilize the functions explained earlier to prepare the recommendation model. The predicted ratings or the output matrix looks like the following: Predicted ratings matrix Result interpretation Let us do a quick visual inspection to see how good or bad our predictions have been. Consider users 1 and 3 as our training samples. From the input dataset, we can clearly see that user 1 has given high ratings to iPhones while user 3 has done the same for Android based phones. The following side by side comparison shows that our algorithm has predicted values close enough to the actual values: Ratings by user 1 Let us see the ratings of user 3 in the following screenshot: Ratings by user 3 Now that we have our ratings matrix with updated values, we are ready to recommend products to users. It is common sense to show only the products which the user hasn't rated yet. The right set of recommendations will also enable the seller to pitch the products which have high probability of being purchased by the user. The usual practice is to return a list of top N items from the unrated list of products for each user. The user in consideration is usually termed as the active-user. Let us consider user 6 as our active-user. This user has only rated Nexus 6, One Plus One, Nexus 5, and iPhone4 in that order of rating, that is Nexus 6 was highly rated and iPhone4 was rated the least. Getting a list of Top 2 recommended phones for such a customer using our algorithm would result in Moto X and Moto G (very rightly indeed, do you see why?). Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android Fanboy and saved the world from yet another catastrophe! Data to the rescue! This simple implementation of recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real world action using recommender engines. Production ready recommender engines In this article so far, we have learnt about recommender engines in detail and even developed one from scratch (using matrix factorization). Through all this, it is clearly evident how widespread is the application of such systems. 
E-commerce websites (or for that fact, any popular technology platform) out there today have tonnes of content to offer. Not only that, but the number of users is also huge. In such a scenario, where thousands of users are browsing/buying stuff simultaneously across the globe, providing recommendations to them is a task in itself. To complicate things even further, a good user experience (response times, for example) can create a big difference between two competitors. These are live examples of production systems handling millions of customers day in and day out. Fun Fact Amazon.com is one of the biggest names in the e-commerce space with 244 million active customers. Imagine the amount of data being processed to provide recommendations to such a huge customer base browsing through millions of products! Source: http://www.amazon.com/b?ie=UTF8&node=8445211011 In order to provide a seamless capability for use in such platforms, we need highly optimized libraries and hardware. For a recommender engine to handle thousands of users simultaneously every second, R has a robust and reliable framework called the recommenderlab. Recommenderlab is a widely used R extension designed to provide a robust foundation for recommender engines. The focus of this library is to provide efficient handling of data, availability of standard algorithms and evaluation capabilities. In this section, we will be using recommenderlab to handle a considerably large data set for recommending items to users. We will also use the evaluation functions from recommenderlab to see how good or bad our recommendation system is. These capabilities will help us build a production ready recommender system similar (or at least closer) to what many online applications such as Amazon or Netflix use. The dataset used in this section contains ratings for 100 items as rated by 5000 users. The data has been anonymised and the product names have been replaced by product IDs. The rating scale used is 0 to 5 with 1 being the worst, 5 being the best, and 0 representing unrated items or missing ratings. To build a recommender engine using recommenderlab for a production ready system, the following steps are to be performed: Extract, transform, and analyze the data. Prepare a recommendation model and generate recommendations. Evaluate the recommendation model. We will look at all these steps in the following subsections. Extract, transform, and analyze As in case of any data intensive (particularly machine learning) application, the first and foremost step is to get the data, understand/explore it, and then transform it into the format required by the algorithm deemed fit for the current application. For our recommender engine using recommenderlab package, we will first load the data from a csv file described in the previous section and then explore it using various R functions. # Load recommenderlab library library("recommenderlab")   # Read dataset from csv file raw_data <- read.csv("product_ratings_data.csv")   # Create rating matrix from data ratings_matrix<- as(raw_data, "realRatingMatrix")   #view transformed data image(ratings_matrix[1:6,1:10]) The preceding section of code loads the recommenderlab package and then uses the standard utility function to read the product_ratings_data.csv file. For exploratory as well as further steps, we need the data to be transformed into user-item ratings matrix format (as described in the Core concepts and definitions section). 
The as(<data>,<type>) utility converts csv into the required ratings matrix format. The csv file contains data in the format shown in the following screenshot. Each row contains a user's rating for a specific product. The column headers are self explanatory. Product ratings data The realRatingMatrix conversion transforms the data into a matrix as shown in the following image. The users are depicted as rows while the columns represent the products. Ratings are represented using a gradient scale where white represents missing/unrated rating while black denotes a rating of 5/best. Ratings matrix representation of our data Now that we have the data in our environment, let us explore some of its characteristics and see if we can decipher some key patterns. First of all, we extract a representative sample from our main data set (refer to the screenshot Product ratings data) and analyse it for: Average rating score for our user population Spread/distribution of item ratings across the user population Number of items rated per user The following lines of code help us explore our data set sample and analyse the points mentioned previously: # Extract a sample from ratings matrix sample_ratings <-sample(ratings_matrix,1000)   # Get the mean product ratings as given by first user rowMeans(sample_ratings[1,])     # Get distribution of item ratings hist(getRatings(sample_ratings), breaks=100,/      xlab = "Product Ratings",main = " Histogram of Product Ratings")   # Get distribution of normalized item ratings hist(getRatings(normalize(sample_ratings)),breaks=100,/             xlab = "Normalized Product Ratings",main = /                 " Histogram of Normalized Product Ratings")   # Number of items rated per user hist(rowCounts(sample_ratings),breaks=50,/      xlab = "Number of Products",main =/      " Histogram of Product Count Distribution") We extract a sample of 1,000 users from our dataset for exploration purposes. The mean of product ratings as given by the first row in our user-rating sample is 2.055. This tells us that this user either hasn't seen/rated many products or he usually rates the products pretty low. To get a better idea of how the users rate products, we generate a histogram of item rating distribution. This distribution peaks around the middle, that is, 3. The histogram is shown next: Histogram for ratings distribution The histogram shows that the ratings are normally distributed around the mean with low counts for products with very high or very low ratings. Finally, we check the spread of the number of products rated by the users. We prepare a histogram which shows this spread: Histogram of number of rated products The preceding histogram shows that there are many users who have rated 70 or more products, as well as there are many users who have rated all the 100 products. The exploration step helps us get an idea of how our data is. We also get an idea about the way the users generally rate the products and how many products are being rated. Model preparation and prediction We have the data in our R environment which has been transformed into the ratings matrix format. In this section, we are interested in preparing a recommender engine based upon user-based collaborative filtering. We will be using similar terminology as described in the previous sections. Recommenderlab provides straight forward utilities to learn and prepare a model for building recommender engines. We prepare our model based upon a sample of just 1,000 users. 
This way, we can use this model to predict the missing ratings for rest of the users in our ratings matrix. The following lines of code utilize the first thousand rows for learning the model: # Create 'User Based collaborative filtering' model ubcf_recommender <- Recommender(ratings_matrix[1:1000],"UBCF") "UBCF" in the preceding code signifies user-based collaborative filtering. Recommenderlab also provides other algorithms, such as IBCF or Item-Based Collaborative Filtering, PCA or Principal Component Analysis, and others as well. After preparing the model, we use it to predict the ratings for our 1,010th and 1,011th users in the system. Recommenderlab also requires us to mention the number of items to be recommended to the users (in the order of preference of course). For the current case, we mention 5 as the number of items to be recommended. # Predict list of product which can be recommended to given users recommendations <- predict(ubcf_recommender,/                   ratings_matrix[1010:1011], n=5)   # show recommendation in form of the list as(recommendations, "list") The preceding lines of code generate two lists, one for each of the users. Each element in these lists is a product for recommendation. The model predicted that for user 1,010, product prod_93 should be recommended as the top most product followed by prod_79, and so on. # output generated by the model [[1]] [1] "prod_93" "prod_79" "prod_80" "prod_83" "prod_89"   [[2]] [1] "prod_80" "prod_85" "prod_87" "prod_75" "prod_79" Recommenderlab is a robust platform which is optimized to handle large datasets. With a few lines of code, we were able to load the data, learn a model, and even recommend products to the users in virtually no time. Compare this with the basic recommender engine we developed using matrix factorization which involved a lot many lines of code (when compared to recommenderlab) apart from the obvious difference in performance. Model evaluation We have successfully prepared a model and used it for predicting and recommending products to the users in our system. But what do we know about the accuracy of our model? To evaluate the prepared model, recommenderlab has handy and easy to use utilities. Since we need to evaluate our model, we need to split it into training and test data sets. Also, recommenderlab requires us to mention the number of items to be used for testing (it uses the rest for computing the error). For the current case, we will use 500 users to prepare an evaluation model. The model will be based upon 90-10 training-testing dataset split with 15 items used for test sets. # Evaluation scheme eval_scheme <- evaluationScheme(ratings_matrix[1:500],/                       method="split",train=0.9,given=15)   # View the evaluation scheme eval_scheme   # Training model training_recommender <- Recommender(getData(eval_scheme,/                        "train"), "UBCF")   # Preditions on the test dataset test_rating <- predict(training_recommender,/                getData(eval_scheme, "known"), type="ratings")   #Error error <- calcPredictionAccuracy(test_rating,/                    getData(eval_scheme, "unknown"))   error We use the evaluation scheme to train our model based upon UBCF algorithm. The prepared model from the training dataset is used to predict ratings for the given items. We finally use the method calcPredictionAccuracy to calculate the error in predicting the ratings between known and unknown components of the test set. 
For our case, we get an output as follows:

The generated output mentions the values for RMSE or root mean squared error, MSE or mean squared error, and MAE or mean absolute error. For RMSE in particular, the predicted ratings deviate from the true values by 1.162 on average (note that the values might deviate slightly across runs due to various factors such as sampling, iterations, and so on). This evaluation will make more sense when the outcomes are compared across different CF algorithms. For evaluating UBCF, we use IBCF as a comparator. The following few lines of code help us prepare an IBCF-based model and test the ratings, which can then be compared using the calcPredictionAccuracy utility:

# Training model using IBCF
training_recommender_2 <- Recommender(getData(eval_scheme, "train"), "IBCF")

# Predictions on the test dataset
test_rating_2 <- predict(training_recommender_2,
                         getData(eval_scheme, "known"),
                         type = "ratings")

error_compare <- rbind(calcPredictionAccuracy(test_rating,
                                              getData(eval_scheme, "unknown")),
                       calcPredictionAccuracy(test_rating_2,
                                              getData(eval_scheme, "unknown")))

rownames(error_compare) <- c("User Based CF", "Item Based CF")

The comparative output shows that UBCF outperforms IBCF with lower values of RMSE, MSE, and MAE. Similarly, we can use the other algorithms available in recommenderlab to test/evaluate our models. We encourage the reader to try out a few more and see which algorithm has the least error in predicted ratings.

Summary

In this article, we continued our pursuit of using machine learning in the field of e-commerce to enhance sales and the overall user experience. We accounted for the human factor and looked into recommendation engines based upon user behavior. We started off by understanding what recommendation systems are and their classification into user-based, content-based, and hybrid recommender systems. We touched upon the problems associated with recommender engines in general. Then we dived deep into the specifics of collaborative filters and discussed the math around prediction and similarity measures. After getting our basics straight, we moved on to building a recommender engine of our own from scratch. We utilized matrix factorization to build a recommender engine step by step using a small dummy dataset. We then moved on to building a production-ready recommender engine using R's popular library called recommenderlab. We used user-based CF as our core algorithm to build a recommendation model upon a bigger dataset containing ratings for 100 products by 5,000 users. We closed our discussion by evaluating our recommendation model using recommenderlab's utility methods.

Resources for Article:

Further resources on this subject:

Machine Learning with R [article]
Introduction to Machine Learning with R [article]
Training and Visualizing a neural network with R [article]

Boosting up the Performance of a Database

Packt
29 Mar 2016
10 min read
 In this article by Altaf Hussain, author of the book Learning PHP 7 High Performance we will see how databases play a key role in dynamic websites. All incoming and outgoing data is stored in databases. So if the database for a PHP application is not well-designed and optimized, then it will affect the application performance tremendously. In this article, we will be looking into the ways to optimize our PHP application database. (For more resources related to this topic, see here.) MySQL MySQL is the most used Relational Database Management System (RDMS) for the web. It is open source and has a free community version. It provides all those features, which can be provided by an enterprise-level database. The default settings provided with the MySQL installation may not be so good for performance, and there are always ways to fine-tune settings to get an increased performance. Also, remember that your database design also plays a role in performance. A poorly designed database will have an effect on overall performance. In this article, we will discuss how to improve the MySQL database performance. We will be modifying the MySQL configuration my.cnf file. This file is located in different places in different OSes. Also, if you are using XAMPP, WAMP, and so on, on Windows, this file will be located in those respective folders. Whenever my.cnf is mentioned, it is assumed that the file is open no matter which OS is used. Query Caching Query Caching is an important performance feature of MySQL. It caches SELECT queries along with the resulting dataset. When an identical SELECT query occurs, MySQL fetches the data from memory; hence, the query is executed faster. Thus, this reduces the load on the database. To check whether query cache is enabled on a MySQL server or not, issue the following command in your MySQL command line: SHOW VARIABLES LIKE 'have_query_cache'; This command will display an output, as follows: This result set shows that query cache is enabled. If query cache is disabled, the value will be NO. To enable query caching, open up the my.cnf file and add the following lines. If these lines are present, just uncomment them if they are commented: query_cache_type = 1 query_cache_size = 128MB query_cache_limit = 1MB Save the my.cnf file and restart the MySQL server. Let's discuss what these three configurations mean. query_cache_size The query_cache_size parameter means how much memory will be allocated. Some will think that the more memory used, the better this is; but this is just a misunderstanding. It all depends on the size of the database, the types of queries, and ratios between read and writes, hardware and database traffic, and so on. A good value for query_cache_size is in between 100 MB and 200 MB. Then, monitor the performance and the other previously mentioned variables on which the query cache depends, and adjust the size. We have used 128 MB for a medium range traffic magento website, and it is working perfectly. Set this value to 0 to disable the query cache. query_cache_limit This defines the maximum size of a query dataset to be cached. If the size of a query dataset is larger than this value, it won't be cached. The value of this configuration can be guessed by finding out the largest select query and the size of its returned dataset. query_cache_type The query_cache_type parameter plays a weird role. 
If query_cache_type is set to 1, then the following may occur: If query_cache_size is 0, then no memory is allocated and query cache is disabled If query_cache_size is greater than 0, then query cache is enabled, memory is allocated, and all queries that do not exceed query_cache_limit and use the SQL_NO_CACHE option will be cached If query_cache_type value is 0, then the following occurs: If query_cache_size is 0, then no memory is allocated and the cache is disabled If query_cache_size is greater than 0, then the memory is allocated, but nothing is cached, that is, the cache is disabled Storage Engines Storage Engines (or Table Types) are a part of core MySQL and are responsible for handling operations on tables. MySQL provides several storage engines, and the two most widely-used are MyISAM and InnoDB. Both storage engines have their own pros and cons, but InnoDB is always prioritized. MySQL started to use InnoDB as its default storage engine starting from version 5.5. MySQL provides some other storage engines, which have their own purposes. During the database design process, which table should use which storage engine can be decided. A complete list of storage engines for MySQL 5.6 can be found at http://dev.mysql.com/doc/refman/5.6/en/storage-engines.html. Storage engine can be set at database level, which will be then used as default storage engine for each newly created table. Note that the storage engine is table-based and different tables can have different storage engines in a single database. What if we have a table already created and we want to change its storage engine? This is easy. Let's say our table name is pkt_users and its storage engine is MyISAM and we want to change it to InnoDB, then we will use the following MySQL command: ALTER TABLE pkt_users ENGINE=INNODB; This will change the storage engine of the table to InnoDB. Now, let's discuss the difference between the two most widely-used storage engines MyISAM and InnoDB. MyISAM A brief list of features that are or are not supported by MyISAM is as follows: MyISAM is designed for speed, which plays best with SELECT statement. If a table is more static, that is, the data in that table is less frequently updated or deleted and mostly the data is only fetched, then MyISAM is best for this table. MyISAM supports table-level locking. If a specific operation needs to be performed on data in a table, then the complete table can be locked. During this lock, no operation can be performed on this table. This can cause performance degradation if the table is more dynamic, that is, the data is frequently changing in the table. MyISAM does not have support for Foreign Keys (FK). MyISAM supports fulltext search. MyISAM does not support transactions. So, there is no support for commit and rollback. If a query on a table is executed, it is executed and there is no coming back. Data compression, Replication, Query Cache, and Data encryption is supported. Cluster database is not supported. InnoDB A brief list of features that are or are not supported by InnoDB is as follows: InnoDB is designed for high reliability and high performance when processing a high volume of data. InnoDB supports row-level locking. It is a good feature and is great for performance. Instead of locking the complete table like MyISAM, it locks only the specific rows for SELECT, DELETE, or UPDATE operations; and during these operations, other data in this table can be manipulated. InnoDB supports Foreign Keys and support forcing Foreign Keys Constraints. 
Transactions are supported. Commits and rollbacks are possible; hence, data can be recovered from a specific transaction. Data Compression, Replication, Query Cache, and Data encryption is supported. InnoDB can be used in a cluster environment, but it does not have full support. However, the InnoDB tables can be converted to an NDB storage engine, which is used in a MySQL cluster by changing the table engine to NDB. In the following sections, we will discuss some more performance features that are related to InnoDB. Values for the following configuration are set in the my.cnf file. InnoDB_buffer_pool_size This setting defines how much memory should be used for InnoDB data and indexes loaded into memory. For a dedicated MySQL server, the recommended value is 50-80% of the installed memory on the sever. If this value is set to a high value, then there will be no memory left for the operating system and other subsystems of MySQL, such as transaction logs. So, let's open our my.cnf file, search for innodb_buffer_pool_size, and set the value in between the recommended value (50-80%) of our RAM. Innoddb_buffer_pool_instances This feature is not that widely-used. This feature enables multiple buffer pool instances to work together to reduce the chances of memory contentions on 64 bits' system and with a large value for innodb_buffer_pool_size. There are different choices on which the value for innodb_buffer_pool_instances should be calculated. One way is to use one instance per GB of innodb_buffer_pool_size. So, if the value of innodb_bufer_pool_size is 16 GB, we will set innodb_buffer_pool_instances to 16. InnoDB_log_file_size Inno_db_log_file_size is the the size of the log file that stores every query information that has been executed. For a dedicated server, a value up to 4 GB is safe, but the time of crash recovery may increase if the log file size is too big. So, in best practices, it should be kept in between 1 GB to 4 GB. Percona server According to Percona website, "Percona server is a free, fully compatible, enhanced, open source drop-in replacement for MySQL that provides superior performance, scalability, and instrumentation." Percona is a fork of MySQL with enhanced features for performance. All the features available in MySQL are available in Percona. Percona uses an enhanced storage engine, which is called XtraDB. According to the Percona website: "Percona XtraDB is an enhanced version of the InnoDB storage engine for MySQL, which has more features, faster performance, and better scalability on modern hardware. Percona XtraDB uses memory more efficiently in high-load environments." As mentioned previously, XtraDB is a fork of InnoDB, so all features available with InnoDB are available in XtraDB. Installation Percona is only available for Linux systems. It is not available for Windows as of now. In this book, we will install the Percona server on Debian 8. The process is the same for both Ubuntu and Debian. To install the Percona server on other Linux flavors, check out the Percona Installation manual at https://www.percona.com/doc/percona-server/5.5/installation.html. As of now, they provide instructions for Debian, Ubuntu, CentOS, and RHEL. They also provide instructions to install the Percona server from sources and Git. Now, let's install Percona server using the following steps: Open your sources list file using the following command in your terminal: sudo nano /etc/apt/sources.list If prompted for a password, enter your Debian password. The file will be opened. 
Now, place the following repository information at the end of the sources.list file: deb http://repo.percona.com/apt jessie main deb-src http://repo.percona.com/apt jessie main Save the file by clicking on CTRL + O and close the file by clicking on CTRL + X. Update your system using the following command in terminal: sudo apt-get update Start the installation by issuing the following command in terminal: sudo apt-get install percona-server-server-5.5 The installation will start. The process is the same as the MySQL server installation. During installation, the root password for the Percona server will be asked. You just need to enter it. When the installation is completed, you are ready to use the Percona server in the same way as you would use MySQL. Configure the Percona server and optimize it as discussed in the previous sections. Summary In this article, we studied the MySQL and Percona servers with Query Caching and other MySQL configuration options for performance. We also compared different storage engines and Percona XtraDB. We saw MySQL Workbench Performance monitoring tools as well. Resources for Article: Further resources on this subject: Building a Web Application with PHP and MariaDB – Introduction to caching [article] PHP Magic Features [article] Understanding PHP basics [article]

Making an App with React and Material Design

Soham Kamani
21 Mar 2016
7 min read
There has been much progression in the hybrid app development space, and also in React.js. Currently, almost all hybrid apps use cordova to build and run web applications on their platform of choice. Although learning React can be a bit of a steep curve, the benefit you get is that you are forced to make your code more modular, and this leads to huge long-term gains. This is great for developing applications for the browser, but when it comes to developing mobile apps, most web apps fall short because they fail to create the "native" experience that so many users know and love. Implementing these features on your own (through playing around with CSS and JavaScript) may work, but it's a huge pain for even something as simple as a material-design-oriented button. Fortunately, there is a library of react components to help us out with getting the look and feel of material design in our web application, which can then be ported to a mobile to get a native look and feel. This post will take you through all the steps required to build a mobile app with react and then port it to your phone using cordova. Prerequisites and dependencies Globally, you will require cordova, which can be installed by executing this line: npm install -g cordova Now that this is done, you should make a new directory for your project and set up a build environment to use es6 and jsx. Currently, webpack is the most popular build system for react, but if that's not according to your taste, there are many more build systems out there. Once you have your project folder set up, install react as well as all the other libraries you would be needing: npm init npm install --save react react-dom material-ui react-tap-event-plugin Making your app Once we're done, the app should look something like this:   If you just want to get your hands dirty, you can find the source files here. Like all web applications, your app will start with an index.html file: <html> <head> <title>My Mobile App</title> </head> <body> <div id="app-node"> </div> <script src="bundle.js" ></script> </body> </html> Yup, that's it. If you are using webpack, your CSS will be included in the bundle.js file itself, so there's no need to put "style" tags either. This is the only HTML you will need for your application. Next, let's take a look at index.js, the entry point to the application code: //index.js import React from 'react'; import ReactDOM from 'react-dom'; import App from './app.jsx'; const node = document.getElementById('app-node'); ReactDOM.render( <App/>, node ); What this does is grab the main App component and attach it to the app-node DOM node. Drilling down further, let's look at the app.jsx file: //app.jsx'use strict';import React from 'react';import AppBar from 'material-ui/lib/app-bar';import MyTabs from './my-tabs.jsx';let App = React.createClass({ render : function(){ return ( <div> <AppBar title="My App" /> <MyTabs /> </div> ); }});module.exports = App; Following react's philosophy of structuring our code, we can roughly break our app down into two parts: The title bar The tabs below The title bar is more straightforward and directly fetched from the material-ui library. All we have to do is supply a "title" property to the AppBar component. 
MyTabs is another component that we have made, put in a different file because of the complexity: 'use strict';import React from 'react';import Tabs from 'material-ui/lib/tabs/tabs';import Tab from 'material-ui/lib/tabs/tab';import Slider from 'material-ui/lib/slider';import Checkbox from 'material-ui/lib/checkbox';import DatePicker from 'material-ui/lib/date-picker/date-picker';import injectTapEventPlugin from 'react-tap-event-plugin';injectTapEventPlugin();const styles = { headline: { fontSize: 24, paddingTop: 16, marginBottom: 12, fontWeight: 400 }};const TabsSimple = React.createClass({ render: () => ( <Tabs> <Tab label="Item One"> <div> <h2 style={styles.headline}>Tab One Template Example</h2> <p> This is the first tab. </p> <p> This is to demonstrate how easy it is to build mobile apps with react </p> <Slider name="slider0" defaultValue={0.5}/> </div> </Tab> <Tab label="Item 2"> <div> <h2 style={styles.headline}>Tab Two Template Example</h2> <p> This is the second tab </p> <Checkbox name="checkboxName1" value="checkboxValue1" label="Installed Cordova"/> <Checkbox name="checkboxName2" value="checkboxValue2" label="Installed React"/> <Checkbox name="checkboxName3" value="checkboxValue3" label="Built the app"/> </div> </Tab> <Tab label="Item 3"> <div> <h2 style={styles.headline}>Tab Three Template Example</h2> <p> Choose a Date:</p> <DatePicker hintText="Select date"/> </div> </Tab> </Tabs> )});module.exports = TabsSimple; This file has quite a lot going on, so let’s break it down step by step: We import all the components that we're going to use in our app. This includes tabs, sliders, checkboxes, and datepickers. injectTapEventPlugin is a plugin that we need in order to get tab switching to work. We decide the style used for our tabs. Next, we make our Tabs react component, which consists of three tabs: The first tab has some text along with a slider. The second tab has a group of checkboxes. The third tab has a pop-up datepicker. Each component has a few keys, which are specific to it (such as the initial value of the slider, the value reference of the checkbox, or the placeholder for the datepicker). There are a lot more properties you can assign, which are specific to each component. Building your App For building on Android, you will first need to install the Android SDK. Now that we have all the code in place, all that is left is building the app. For this, make a new directory, start a new cordova project, and add the Android platform, by running the following on your terminal: mkdir my-cordova-project cd my-cordova-project cordova create . cordova platform add android Once the installation is complete, build the code we just wrote previously. If you are using the same build system as the source code, you will have only two files, that is, index.html and bundle.min.js. Delete all the files that are currently present in the www folder of your cordova project and copy those two files there instead. You can check whether your app is working on your computer by running cordova serve and going to the appropriate address on your browser. If all is well, you can build and deploy your app: cordova build android cordova run android This will build and install the app on your Android device (provided it is in debug mode and connected to your computer). Similarly, you can build and install the same app for iOS or windows (you may need additional tools such as XCode or .NET for iOS or Windows). You can also use any other framework to build your mobile app. 
The angular framework also comes with its own set of material design components. About the Author Soham Kamani is a full-stack web developer and electronics hobbyist.  He is especially interested in JavaScript, Python, and IoT.

Delegate Pattern Limitations in Swift

Anthony Miller
18 Mar 2016
5 min read
If you've ever built anything using UIKit, then you are probably familiar with the delegate pattern. The delegate pattern is used frequently throughout Apple's frameworks and many open source libraries you may come in contact with. But many times, it is treated as a one-size-fits-all solution for problems that it is just not suited for. This post will describe the major shortcomings of the delegate pattern. Note: This article assumes that you have a working knowledge of the delegate pattern. If you would like to learn more about the delegate pattern, see The Swift Programming Language - Delegation. 1. Too Many Lines! Implementation of the delegate pattern can be cumbersome. Most experienced developers will tell you that less code is better code, and the delegate pattern does not really allow for this. To demonstrate, let's try implementing a new view controller that has a delegate using the least amount of lines possible. First, we have to create a view controller and give it a property for its delegate: class MyViewController: UIViewController { var delegate: MyViewControllerDelegate? } Then, we define the delegate protocol. protocol MyViewControllerDelegate { func foo() } Now we have to implement the delegate. Let's make another view controller that presents a MyViewController: class DelegateViewController: UIViewController { func presentMyViewController() { let myViewController = MyViewController() presentViewController(myViewController, animated: false, completion: nil) } } Next, our DelegateViewController needs to conform to the delegate protocol: class DelegateViewController: UIViewController, MyViewControllerDelegate { func presentMyViewController() { let myViewController = MyViewController() presentViewController(myViewController, animated: false, completion: nil) } func foo() { /// Respond to the delegate method. } } Finally, we can make our DelegateViewController the delegate of MyViewController: class DelegateViewController: UIViewController, MyViewControllerDelegate { func presentMyViewController() { let myViewController = MyViewController() myViewController.delegate = self presentViewController(myViewController, animated: false, completion: nil) } func foo() { /// Respond to the delegate method. } } That's a lot of boilerplate code that is repeated every time you want to create a new delegate. This opens you up to a lot of room for errors. In fact, the above code has a pretty big error already that we are going to fix now. 2. No Non-Class Type Delegates Whenever you create a delegate property on an object, you should use the weak keyword. Otherwise, you are likely to create a retain cycle. Retain cycles are one of the most common ways to create memory leaks and can be difficult to track down. Let's fix this by making our delegate weak: class MyViewController: UIViewController { weak var delegate: MyViewControllerDelegate? } This causes another problem though. Now we are getting a build error from Xcode! 'weak' cannot be applied to non-class type 'MyViewControllerDelegate'; consider adding a class bound. This is because you can't make a weak reference to a value type, such as a struct or an enum, so in order to use the weak keyword here, we have to guarantee that our delegate is going to be a class. Let's take Xcode's advice here and add a class bound to our protocol: protocol MyViewControllerDelegate: class { func foo() } Well, now everything builds just fine, but we have another issue. Now your delegate must be an object (sorry structs and enums!). 
You are now creating more constraints on what can conform to your delegate. The whole point of the delegate pattern is to allow an unknown "something" to respond to the delegate events. We should be putting as few constraints as possible on our delegate object, which brings us to the next issue with the delegate pattern. 3. Optional Delegate Methods In pure Swift, protocols don't have optional functions. This means, your delegate must implement every method in the delegate protocol, even if it is irrelevant in your case. For example, you may not always need to be notified when a user taps a cell in a UITableView. There are ways to get around this though. In Swift 2.0+, you can make a protocol extension on your delegate protocol that contains a default implementation for protocol methods that you want to make optional. Let's make a new optional method on our delegate protocol using this method: protocol MyViewControllerDelegate: class { func foo() func optionalFunction() } extension MyViewControllerDelegate { func optionalFunction() { } } This adds even more unnecessary code. It isn't really clear what the intention of this extension is unless you understand what's going on already, and there is no way to explicitly show that this method is optional. Alternatively, if you mark your protocol as @objc, you can use the optional keyword in your function declaration. The problem here is that now your delegate must be an Objective-C object. Just like our last example, this is creating additional constraints on your delegate, and this time they are even more restrictive. 4. There Can Be Only One The delegate pattern only allows for one delegate to respond to events. This may be just fine for some situations, but if you need multiple objects to be notified of an event, the delegate pattern may not work for you. Another common scenario you may come across is when you need different objects to be notified of different delegate events. The delegate pattern can be a very useful tool, which is why it is so widely used, but recognizing the limitations that it creates is important when you are deciding whether it is the right solution for any given problem. About the author Anthony Miller is the lead iOS developer at App-Order in Las Vegas, Nevada, USA. He has written and released numerous apps on the App Store and is an avid open source contributor. When he's not developing, Anthony loves board games, line-dancing, and frequent trips to Disneyland.

Neutron API Basics

Packt
18 Mar 2016
13 min read
In this article by James Denton, the author of the book OpenStack Networking Essentials, you can see that Neutron is a virtual networking service that allows users to define network connectivity and IP addressing for instances and other cloud resources using an application programmable interface (API). The Neutron API is made up of core elements that define basic network architectures and extensions that extend base functionality. Neutron accomplishes this by virtue of its data model that consists of networks, subnets, and ports. These objects help define characteristics of the network in an easily storable format. (For more resources related to this topic, see here.) These core elements are used to build a logical network data model using information that corresponds to layers 1 through 3 of the OSI model, shown in the following screenshot: For more information on the OSI model, check out the Wikipedia article at https://en.wikipedia.org/wiki/OSI_model. Neutron uses plugins and drivers to identify network features and construct the virtual network infrastructure based on information stored in the database. A core plugin, such as the Modular Layer 2 (ML2) plugin included with Neutron, implements the core Neutron API and is responsible for adapting the logical network described by networks, ports, and subnets into something that can be implemented by the L2 agent and IP address management system running on the hosts. The extension API, provided by service plugins, allows users to manage the following resources, among others: Security groups Quotas Routers Firewalls Load balancers Virtual private networks Neutron's extensibility means that new features can be implemented in the form of extensions and plugins that extend the API without requiring major changes. This allows vendors to introduce features and functionality that would otherwise not be available with the base API. The following diagram demonstrates at a high level how the Neutron API server interacts with the various plugins and agents responsible for constructing the virtual and physical network across the cloud: The previous diagram demonstrates the interaction between the Neutron API service, Neutron plugins and drivers, and services such as the L2 and L3 agents. As network actions are performed by users via the API, the Neutron server publishes messages to the message queue that are consumed by agents. L2 agents build and maintain the virtual network infrastructure, while L3 agents are responsible for building and maintaining Neutron routers and associated functionality. The Neutron API specifications can be found on the OpenStack wiki at https://wiki.openstack.org/wiki/Neutron/APIv2-specification. In the next few sections, we will look at some of the core elements of the API and the data models used to represent those elements. Networks A network is the central object of the Neutron v2.0 API data model and describes an isolated L2 segment. In a traditional infrastructure, machines are connected to switch ports that are often grouped together into virtual local area networks (VLANs) identified by unique IDs. Machines in the same network or VLAN can communicate with one another but cannot communicate with other networks in other VLANs without the use of a router. The following diagram demonstrates how networks are isolated from one another in a traditional infrastructure: Neutron network objects have attributes that describe the network type and the physical interface used for traffic. 
The attributes also describe the segmentation ID used to differentiate traffic between other networks connected to virtual switches on the underlying host. The following diagram shows how a Neutron network describes various Layer 1 and Layer 2 attributes: Traffic between instances on different hosts requires underlying connectivity between the hosts. This means that the hosts must reside on the same physical switching infrastructure so that VLAN-tagged traffic can pass between them. Traffic between hosts can also be encapsulated using L2-in-L3 technologies such as GRE or VXLAN. Neutron supports multiple L2 methods of segmenting traffic, including using 802.1q VLANs, VXLANs, GRE, and more, depending on the plugin and configured drivers and agents. Devices in the same network are in the same broadcast domain, even though they may reside on different hosts and attach to different virtual switches. Neutron network attributes are very important in defining how traffic between virtual machine instances should be forwarded between hosts. Network attributes The following table describes base attributes associated with network objects, and more details can be found at the Neutron API specifications wiki referenced earlier in this article: Attribute Type Required Default Notes id uuid-str N/A Auto generated The UUID for the network name string no None The human-readable name for the network admin_state_up boolean no True The administrative state of the network status string N/A Null Indicates whether the network is currently operational subnets list no Empty list The subnets associated with the network shared boolean no False Specifies whether the network can be accessed by any tenant tenant_id uuid-str no N/A The owner of the network Networks are typically associated with tenants or projects and are usable by any user that is a member of the same tenant or project. Networks can also be shared with all other projects or a subnet of projects using Neutron's role-based access control (RBAC) functionality. Neutron RBAC first became available in the Liberty release of OpenStack. For more information on using the RBAC features, check out my blog at the following URL: https://developer.rackspace.com/blog/A-First-Look-at-RBAC-in-the-Liberty-Release-of-Neutron/. Provider attributes One of the earliest extensions to the Neutron API is known as the provider extension. The provider network extension maps virtual networks to physical networks by adding additional network attributes that describe the network type, segmentation ID, and physical interface. The following table shows various provider attributes and their associated values: Attribute Type Required Options Default Notes provider:network_type string yes vlan,flat,local, vxlan,gre Based on the configuration   provider:segmentation_id int optional Depends on the network type Based on the configuration The segmentation ID range varies among L2 technologies provider:physical_network string optional Provider label Based on the configuration This specifies the physical interface used for traffic (flat or VLAN-only) All networks have provider attributes. However, because provider attributes specify particular network configuration settings and mappings, only users with the admin role can specify them when creating networks. Users without the admin role can still create networks, but the Neutron server, not the user, will determine the type of network created and any corresponding interface or segmentation ID. 
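To make the provider attributes more concrete, the following is a minimal sketch of how an administrator might create a VLAN provider network with the neutron command-line client. The network name, the physnet1 physical network label, and the segmentation ID 100 are illustrative values that must match the ML2 and L2 agent configuration of your environment, and the exact flag syntax may vary between client releases:

# Run as a user with the admin role; provider attributes are admin-only
neutron net-create example-vlan-net \
    --provider:network_type vlan \
    --provider:physical_network physnet1 \
    --provider:segmentation_id 100 \
    --shared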
Additional attributes The external-net extension adds an attribute to networks that is used to determine whether or not the network can be used as the external, or gateway, network for a Neutron router. When set to true, the network becomes eligible for use as a floating IP pool when attached to routers. Using the Neutron router-gateway-set command, routers can be attached to external networks. The following table shows the external network attribute and its associated values: Attribute Type Required Default Notes router:external Boolean no false When true, the network is eligible for use as a floating IP pool when attached to a router Subnets In the Neutron data model, a subnet is an IPv4 or IPv6 address block from which IP addresses can be assigned to virtual machine instances and other network resources. Each subnet must have a subnet mask represented by a classless inter-domain routing (CIDR) address and must be associated with a network, as shown here: In the preceding diagram, three isolated VLAN networks each have a corresponding subnet. Instances and other devices cannot be attached to networks without an associated subnet. Instances connected to a network can communicate among one another but are unable to connect to other networks or subnets without the use of a router. The following diagram shows how a Neutron subnet describes various Layer 3 attributes in the OSI model: When creating subnets, users can specify IP allocation pools that limit which addresses in the subnet are available for allocation. Users can also define a custom gateway address, a list of DNS servers, and individual host routes that can be pushed to virtual machine instances using DHCP. The following table describes attributes associated with subnet objects: Attribute Type Required Default Notes id uuid-str n/a Auto Generated The UUID for the subnet network_id uuid-str Yes N/A The UUID of the associated network name string no None The human-readable name for the subnet ip_version int Yes 4 IP version 4 or 6 cidr string Yes N/A The CIDR address representing the IP address range for the subnet gateway_ip string or null no First address in CIDR The default gateway used by devices in the subnet dns_nameservers list(str) no None The DNS name servers used by hosts in the subnet allocation_pools list(dict) no Every address in the CIDR (excluding the gateway) The subranges of the CIDR available for dynamic allocation. tenant_id uuid-str no N/A The owner of the subnet enable_dhcp boolean no True This indicates whether or not DHCP is enabled for the subnet host_routes list(dict) no N/A Additional static routes Ports In the Neutron data model, a port represents a switch port on a logical switch that spans the entire cloud and contains information about the connected device. Virtual machine interfaces (VMIFs) and other network objects, such as router and DHCP server interfaces, are mapped to Neutron ports. The ports define both the MAC address and the IP address to be assigned to the device associated with them. Each port must be associated with a Neutron network. 
The following diagram shows how a port describes various Layer 2 attributes in the OSI model: The following table describes attributes associated with port objects: Attribute Type Required Default Notes id uuid-str n/a Auto generated The UUID for the subnet network_id uuid-str Yes N/A The UUID of the associated network name string no None The human-readable name for the subnet admin_state_up Boolean no True The administrative state of the port status string N/A N/A The current status of the port (for example, ACTIVE, BUILD, or DOWN) mac_address string no Auto generated The MAC address of the port fixed_ips list(dict) no Auto allocated The IP address(es) associated with the port device_id string no None The instance ID or other resource associated with the port device_owner string no None   tenant_id uuid-str no ID of tenant adding resource The owner of the port When Neutron is first installed, no ports exist in the database. As networks and subnets are created, ports may be created for each of the DHCP servers reflected by the logical switch model, seen here: As instances are created, a single port is created for each network interface attached to the instance, as shown here: A port can only be associated with a single network. Therefore, if an instance is connected to multiple networks, it will be associated with multiple ports. As instances and other cloud resources are created, the logical switch may scale to hundreds or thousands of ports over time, as shown in the following diagram: There is no limit to the number of ports that can be created in Neutron. However, quotas exist that limit tenants to a small number of ports that can be created. As the number of Neutron ports scale out, the performance of the Neutron API server and the implementation of networking across the cloud may degrade over time. It's a good idea to keep quotas in place to ensure a high-performing cloud, but the defaults and subsequent quota increases should be kept reasonable. The Neutron workflow In the standard Neutron workflow, networks must be created first, followed by subnets and then ports. The following sections describe the workflows involved with booting and deleting instances. Booting an instance Before an instance can be created, it must be associated with a network that has a corresponding subnet or a precreated port that is associated with a network. The following process documents the steps involved in booting an instance and attaching it to a network: The user creates a network. The user creates a subnet and associates it with the network. The user boots a virtual machine instance and specifies the network. Nova interfaces with Neutron to create a port on the network. Neutron assigns a MAC address and IP address to the newly created port using attributes defined by the subnet. Nova builds the instance's libvirt XML file containing local network bridge and MAC address information and starts the instance. The instance sends a DHCP request during boot, at which point the DHCP server responds with the IP address corresponding to the MAC address of the instance. If multiple network interfaces are attached to an instance, each network interface will be associated with a unique Neutron port and may send out DHCP requests to retrieve their respective network information. 
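The booting workflow described above can be exercised end to end with a few client commands. The following is a rough sketch only; the network name, CIDR, image, flavor, and the UUID placeholders are example values, and the exact syntax depends on the client versions deployed in your cloud:

# 1. The user creates a network and a subnet
neutron net-create demo-net
neutron subnet-create demo-net 192.168.10.0/24 --name demo-subnet

# 2. The user boots an instance attached to that network
nova boot demo-instance --image cirros --flavor m1.tiny \
    --nic net-id=<demo-net-uuid>

# 3. Nova asks Neutron for a port; its MAC and fixed IP are drawn from
#    demo-subnet, and can be inspected once the instance is active
neutron port-list --device-id <instance-uuid>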
How the logical model is implemented Neutron agents are services that run on network and compute nodes and are responsible for taking information described by networks, subnets, and ports and using it to implement the virtual and physical network infrastructure. In the Neutron database, the relationship between networks, subnets, and ports can be seen in the following diagram: This information is then implemented on the compute node by way of virtual network interfaces, virtual switches or bridges, and IP addresses, as shown in the following diagram: In the preceding example, the instance was connected to a network bridge on a compute node that provides connectivity from the instance to the physical network. For now, it's only necessary to know how the data model is implemented into something that is usable. Deleting an instance The following process documents the steps involved in deleting an instance: The user destroys virtual machine instance. Nova interfaces with Neutron to destroy the ports associated with the instances. Nova deletes local instance data. The allocated IP and MAC addresses are returned to the pool. When instances are deleted, Neutron removes all virtual network connections from the respective compute node and removes corresponding port information from the database. Summary In this article, we looked at the basics of the Neutron API and its data model made up of networks, subnets, and ports. These objects were used to describe in a logical way how the virtual network is architected and implemented across the cloud. Resources for Article: Further resources on this subject: Introducing OpenStack Trove[article] Concepts for OpenStack[article] Monitoring OpenStack Networks[article]

Get your Apps Ready for Android N

Packt
18 Mar 2016
9 min read
It seems likely that Android N will get its first proper outing in May, at this year's Google I/O conference, but there's no need to wait until then to start developing for the next major release of the Android platform. Thanks to Google's decision to release preview versions early you can start getting your apps ready for Android N today. In this article by Jessica Thornsby, author of the book Android UI Design, going to look at the major new UI features that you can start experimenting with right now. And since you'll need something to develop your Android N-ready apps in, we're also going to look at Android Studio 2.1, which is currently the recommended development environment for Android N. (For more resources related to this topic, see here.) Multi-window mode Beginning with Android N, the Android operating system will give users the option to display more than one app at a time, in a split-screen environment known as multi-window mode. Multi-window paves the way for some serious multi-app multi-tasking, allowing users to perform tasks such as replying to an email without abandoning the video they were halfway through watching on YouTube, and reading articles in one half of the screen while jotting down notes in Google Keep on the other. When two activities are sharing the screen, users can even drag data from one activity and drop it into another activity directly, for example dragging a restaurant's address from a website and dropping it into Google Maps. Android N users can switch to multi-window mode either by: Making sure one of the apps they want to view in multi-window mode is visible onscreen, then tapping their device's Recent Apps softkey (that's the square softkey). The screen will split in half, with one side displaying the current activity and the other displaying the Recent Apps carousel. The user can then select the secondary app they want to view, and it'll fill the remaining half of the screen. Navigating to the home screen, and then pressing the Recent Apps softkey to open the Recent Apps carousel. The user can then drag one of these apps to the edge of the screen, and it'll open in multi-window mode. The user can then repeat this process for the second activity. If your app targets Android N or higher, the Android operating system assumes that your app supports multi-window mode unless you explicitly state otherwise. To prevent users from displaying your app in multi-window mode, you'll need to add android:resizeableActivity="false" to the <activity> or <application> section of your project's Manifest file. If your app does support multi-window mode, you may want to prevent users from shrinking your app's UI beyond a specified size, using the android:minimalSize attribute. If the user attempts to resize your app so it's smaller than the android:minimalSize value, the system will crop your UI instead of shrinking it. Direct reply notifications Google are adding a few new features to notifications in Android N, including an inline reply action button that allows users to reply to notifications directly from the notification UI.   This is particularly useful for messaging apps, as it means users can reply to messages without even having to launch the messaging application. You may have already encountered direct reply notifications in Google Hangouts. To create a notification that supports direct reply, you need to create an instance of RemoteInput.Builder and then add it to your notification action. 
The following code adds a RemoteInput to a Notification.Action, and creates a Quick Reply key. When the user triggers the action, the notification prompts the user to input their response: private static final String KEY_QUICK_REPLY = "key_quick_reply"; String replyLabel = getResources().getString(R.string.reply_label); RemoteInput remoteInput = new RemoteInput.Builder(KEY_QUICK_REPLY) .setLabel(replyLabel) .build(); To retrieve the user's input from the notification interface, you need to call: getResultsFromIntent(Intent) and pass the notification action's intent as the input parameter: Bundle remoteInput = RemoteInput.getResultsFromIntent(intent); //This method returns a Bundle that contains the text response// if (remoteInput != null) { return remoteInput.getCharSequence(KEY_QUICK_REPLY); //Query the bundle using the result key, which is provided to the RemoteInput.Builder constructor// Bundled notifications Don't you just hate it when you connect to the World Wide Web first thing in the morning, and Gmail bombards you with multiple new message notifications, but doesn't give you anymore information about the individual emails? Not particularly helpful! When you receive a notification that consists of multiple items, the only thing you can really do is launch the app in question and take a closer look at the events that make up this grouped notification. Android N overcomes this drawback, by letting you group multiple notifications from the same app into a single, bundled notification via a new notification style: bundled notifications. A bundled notification consists of a parent notification that displays summary information for that group, plus individual notification items. If the user wants to see more information about one or more individual items, they can unfurl the bundled notification into separate notifications by swiping down with two fingers. The user can then act on each mini-notification individually, for example they might choose to dismiss the first three notifications about spam emails, but open the forth e-mail. To group notifications, you need to call setGroup() for each notification you want to add to the same notification stack, and then assign these notifications the same key. final static String GROUP_KEY_MESSAGES = "group_key_messages"; Notification notif = new NotificationCompat.Builder(mContext) .setContentTitle("New SMS from " + sender1) .setContentText(subject1) .setSmallIcon(R.drawable.new_message) .setGroup(GROUP_KEY_MESSAGES) .build(); Then when you create another notification that belongs to this stack, you just need to assign it the same group key. Notification notif2 = new NotificationCompat.Builder(mContext) .setContentTitle("New SMS from " + sender1) .setContentText(subject2) .setGroup(GROUP_KEY_MESSAGES) .build(); The second Android N developer preview introduced an Android-specific implementation of the Vulkan API. Vulkan is a cross-platform, 3D rendering API for providing high-quality, real-time 3D graphics. For draw-call heavy applications, Vulkan also promises to deliver a significant performance boost, thanks to a threading-friendly design and a reduction of CPU overhead. You can try Vulkan for yourself on devices running Developer Preview 2, or learn more about Vulkan at the official Android docs (https://developer.android.com/ndk/guides/graphics/index.html?utm_campaign=android_launch_npreview2_041316&utm_source=anddev&utm_medium=blog). 
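Before moving on to tooling, one more note on the bundled notifications example above: on devices running releases older than Android N, the system generally shows only the group's summary notification, so it is worth posting one alongside the stacked notifications. The following is a minimal sketch using the support library's NotificationCompat; the SUMMARY_ID constant and the notification text are illustrative values, not part of the original example:

// Summary notification for the same group key; setGroupSummary(true)
// marks it as the collapsed representation of the whole stack
Notification summary = new NotificationCompat.Builder(mContext)
    .setContentTitle("2 new messages")
    .setSmallIcon(R.drawable.new_message)
    .setStyle(new NotificationCompat.InboxStyle()
        .addLine(sender1 + " " + subject1)
        .addLine(sender1 + " " + subject2)
        .setSummaryText("2 new SMS messages"))
    .setGroup(GROUP_KEY_MESSAGES)
    .setGroupSummary(true)
    .build();

// Post the summary with its own notification ID
NotificationManagerCompat.from(mContext).notify(SUMMARY_ID, summary);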
Android N Support in Android Studio 2.1 The two Developer Previews aren't the only important releases for developers who want to get their apps ready for Android N. Google also recently released a stable version of Android Studio 2.1, which is the recommended IDE for developing Android N apps. Crucially, with the release of Android Studio 2.1 the emulator can now run the N Developer Preview Emulator System Images, so you can start testing your apps against Android N. Particularly with features like multi-window mode, it's important to test your apps across multiple screen sizes and configurations, and creating various Android N Android Virtual Devices (AVDs) is the quickest and easiest ways to do this. Android 2.1 also adds the ability to use the new Jack compiler (Java Android Compiler Kit), which compiles Java source code into Android dex bytecode. Jack is particularly important as it opens the door to using Java 8 language features in your Android N projects, without having to resort to additional tools or resources. Although not Android N-specific, Android 2.1 makes some improvements to the Instant Run feature, which should result in faster editing and deploy builds for all your Android projects. Previously, one small change in the Java code would cause all Java sources in the module to be recompiled. Instant Run aims to reduce compilation time by analyzing the changes you've made and determining how it can deploy them in the fastest way possible. This is instead of Android Studio automatically going through the lengthy process of recompiling the code, converting it to dex format, generating an APK and installing it on the connected device or emulator every time you make even a small change to your project. To start using Instant Run, select Android Studio from the toolbar followed by Preferences…. In the window that appears, select Build, Execution, Deployment from the side-menu and select Instant Run. Uncheck the box next to Restart activity on code changes. Instant Run is supported only when you deploy a debug build for Android 4.0 or higher. You'll also need to be using Android Plugin for Gradle version 2.0 or higher. Instant Run isn't currently compatible with the Jack toolchain. To use Instant Run, deploy your app as normal. Then if you make some changes to your project you'll notice that a yellow thunderbolt icon appears within the Run icon, indicating that Android Studio will push updates via Instant Run when you click this button. You can update to the latest version of Android Studio by launching the IDE and then selecting Android Studio from the toolbar, followed by Check for Updates…. Summary In this article, we looked at the major new UI features currently available in the Android N Developer Preview. We also looked at the Android Studio 2.1 features that are particularly useful for developing and testing apps that target the upcoming Android N release. Although we should expect some pretty dramatic changes between these early previews and the final release of Android N, taking the time to explore these features now means you'll be in a better position to update your apps when Android N is finally released. Resources for Article: Further resources on this subject: Drawing and Drawables In Android Canvas [article] Behavior-Driven Development With Selenium Webdriver [article] Development of Iphone Applications [article]

Support Vector Machines as a Classification Engine

Packt
17 Mar 2016
9 min read
In this article by Tomasz Drabas, author of the book Practical Data Analysis Cookbook, we will discuss how Support Vector Machine models can be used as a classification engine.

(For more resources related to this topic, see here.)

Support Vector Machines

Support Vector Machines (SVMs) are a family of extremely powerful models that can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. While many classifiers exist that can classify linearly separable data (for example, logistic regression), SVMs can handle highly non-linear problems using a kernel trick that implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly separable.

The mechanics of the machine

Given a set of n points of the form (x_1, y_1), ..., (x_n, y_n), where x_i is a z-dimensional input vector and y_i is a class label, the SVM aims at finding the maximum margin hyperplane that separates the data points:

In a two-dimensional dataset with linearly separable data points (as shown in the preceding figure), the maximum margin hyperplane would be a line that maximizes the distance between the two classes. The hyperplane can be expressed in terms of a dot product between the input vectors x and a vector w normal to the hyperplane: w · x = b, where b is the offset from the origin of the coordinate system. To find the hyperplane, we solve the following optimization problem: minimize (1/2)||w||^2 over w and b, subject to y_i (w · x_i - b) >= 1 for every point i. The constraint of our optimization problem effectively states that no point can cross the hyperplane if it does not belong to the class on that side of the hyperplane.

Linear SVM

Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM but here, we decided to use MLPY (http://mlpy.sourceforge.net):

import pandas as pd
import numpy as np
import mlpy as ml

First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (see the https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the data):

# the file name of the dataset
r_filename = 'Data/Chapter03/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)

The dataset that we use was described in S. Moro, P. Cortez, and P. Rita, A Data-Driven Approach to Predict the Success of Bank Telemarketing, Decision Support Systems, Elsevier, 62:22-31, June 2014, and can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not. Once the file is loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_data(...) method:
The constraint of this optimization problem effectively states that no point can cross the hyperplane: every observation has to stay on the side of the hyperplane that corresponds to its class.

Linear SVM

Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM, but here we decided to use MLPY (http://mlpy.sourceforge.net):

    import pandas as pd
    import numpy as np
    import mlpy as ml

First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (see the https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the dataset):

    # the file name of the dataset
    r_filename = 'Data/Chapter03/bank_contacts.csv'

    # read the data
    csv_read = pd.read_csv(r_filename)

The dataset that we use was described in S. Moro, P. Cortez, and P. Rita, A Data-Driven Approach to Predict the Success of Bank Telemarketing, Decision Support Systems, Elsevier, 62:22-31, June 2014, and can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not.

Once the file is loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_data(...) method:

    def split_data(data, y, x = 'All', test_size = 0.33):
        '''
            Method to split the data into training and testing
        '''
        import sys

        # dependent variable
        variables = {'y': y}

        # and all the independent
        if x == 'All':
            allColumns = list(data.columns)
            allColumns.remove(y)
            variables['x'] = allColumns
        else:
            if type(x) != list:
                print('The x parameter has to be a list...')
                sys.exit(1)
            else:
                variables['x'] = x

        # create a variable to flag the training sample
        data['train'] = np.random.rand(len(data)) < (1 - test_size)

        # split the data into training and testing
        train_x = data[data.train][variables['x']]
        train_y = data[data.train][variables['y']]
        test_x = data[~data.train][variables['x']]
        test_y = data[~data.train][variables['y']]

        return train_x, train_y, test_x, test_y, variables['x']

We randomly set 1/3 of the dataset aside for testing purposes and use the remaining 2/3 for training the model (in the book's accompanying code, split_data(...) lives in a helper module imported as hlp, hence the prefix in the call):

    # split the data into training and testing
    train_x, train_y, test_x, test_y, labels = hlp.split_data(
        csv_read,
        y = 'credit_application'
    )

Once we have read the data and split it into training and testing datasets, we can estimate the model:

    # create the classifier object
    svm = ml.LibSvm(svm_type='c_svc', kernel_type='linear', C=100.0)

    # fit the data
    svm.learn(train_x, train_y)

The svm_type parameter of the .LibSvm(...) method controls which algorithm is used to estimate the SVM. Here, we use c_svc, a C-Support Vector Classifier. The C parameter specifies how much you want to avoid misclassifying observations: larger values of C shrink the margin of the hyperplane so that more of the training observations are classified correctly. You can also specify nu_svc with a nu parameter that controls how much of your sample (at most) can be misclassified and how many of your observations (at least) become support vectors. Here, we estimate an SVM with a linear kernel, so let's talk about kernels.

Kernels

A kernel function K is effectively a function that computes a dot product between two n-dimensional vectors, K: R^n x R^n -> R. In other words, the kernel function takes two vectors and produces a scalar. The linear kernel does not transform the data into a higher-dimensional space. This is not true for polynomial or Radial Basis Function (RBF) kernels, which transform the input feature space into higher dimensions. In the case of a polynomial kernel of degree d, the obtained feature space has (n+d choose d) dimensions for an n-dimensional input feature space. As you can see, the number of additional dimensions can grow very quickly, and this would pose significant problems in estimating the model if we explicitly transformed the data into the higher-dimensional space. Thankfully, we do not have to, as this is where the kernel trick comes into play: SVMs do not have to work explicitly in higher dimensions, but can implicitly map the data to higher dimensions using pairwise inner products (instead of an explicit transformation) and then use them to find the maximum margin hyperplane. You can find a really good explanation of the kernel trick at http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.
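As a small illustration (not taken from the book) of the idea that a kernel simply maps two vectors to a single scalar, here is a minimal sketch of the linear, polynomial, and RBF kernels written directly with NumPy:

    import numpy as np

    def linear_kernel(x, y):
        # a plain dot product: no implicit transformation of the feature space
        return np.dot(x, y)

    def polynomial_kernel(x, y, degree=2, coef0=1.0):
        # equivalent to a dot product in a much higher-dimensional space,
        # computed without ever building that space explicitly
        return (np.dot(x, y) + coef0) ** degree

    def rbf_kernel(x, y, sigma=1.0):
        # equals 1 for identical vectors and decays towards 0 with distance
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    x = np.array([1.0, 2.0])
    y = np.array([0.5, -1.0])
    print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))

Each function returns a single number, which is all the SVM needs in order to compare a pair of observations.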
Back to our example

The .learn(...) method of the .LibSvm(...) object estimates the model. Once the model is estimated, we can test how well it performs. First, we use the estimated model to predict the classes for the observations in the testing dataset:

    predicted_l = svm.pred(test_x)

Next, we will use some of the scikit-learn methods to print the basic statistics for our model:

    def printModelSummary(actual, predicted):
        '''
            Method to print out model summaries
        '''
        import sklearn.metrics as mt

        print('Overall accuracy of the model is {0:.2f} percent'
            .format((actual == predicted).sum() / len(actual) * 100))
        print('Classification report: \n',
            mt.classification_report(actual, predicted))
        print('Confusion matrix: \n',
            mt.confusion_matrix(actual, predicted))
        print('ROC: ', mt.roc_auc_score(actual, predicted))

First, we calculate the overall accuracy of the model, expressed as the ratio of properly classified observations to the total number of observations in the testing sample. Next, we print the classification report:

The precision is the model's ability to avoid classifying an observation as positive when it is not. It is the ratio of true positives to the overall number of positively classified records. The overall precision score is a weighted average of the individual precision scores, where the weight is the support; the support is the total number of actual observations in each class. The total precision for our model is not too bad: 89 out of 100. However, when we look at the precision for the positive class, the situation is not as good: only 63 out of 100 were properly classified.

Recall can be viewed as the model's capacity to find all the positive samples. It is the ratio of true positives to the sum of true positives and false negatives. The recall for class 0.0 is almost perfect, but for class 1.0 it looks really bad. This might be caused by the fact that our sample is not balanced, but it is more likely that the features we use to classify the data do not really capture the differences between the two groups.

The f1-score is effectively a weighted amalgam of precision and recall: it is the ratio of twice the product of precision and recall to their sum. In one measure, it shows whether the model performs well or not. At the general level, the model does not perform badly, but when we look at its ability to classify the true signal, it fails gravely. It is a perfect example of why judging a model only at the general level can be misleading when dealing with heavily unbalanced samples.

RBF kernel SVM

Given that the linear kernel performed poorly, our dataset might not be linearly separable. Thus, let's try the RBF kernel. The RBF kernel is given as K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), where ||x - y||^2 is the squared Euclidean distance between the two vectors x and y, and sigma is a free parameter. The value of the RBF kernel equals 1 when x = y and gradually falls towards 0 as the distance approaches infinity. To fit an RBF version of our model, we specify our svm object as follows:

    svm = ml.LibSvm(svm_type='c_svc', kernel_type='rbf',
        gamma=0.1, C=1.0)

The gamma parameter here specifies how far the influence of a single support vector reaches. Visually, you can investigate the relationship between the gamma and C parameters at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html. The rest of the code for the model estimation follows in a similar fashion as with the linear kernel.
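As a sketch of those remaining steps (reusing train_x, train_y, test_x, test_y, the ml import, and the printModelSummary(...) function defined above), the RBF model can be fitted and evaluated like this:

    # fit the RBF-kernel SVM on the same training data
    # (named svm_rbf here so it does not overwrite the linear model above)
    svm_rbf = ml.LibSvm(svm_type='c_svc', kernel_type='rbf', gamma=0.1, C=1.0)
    svm_rbf.learn(train_x, train_y)

    # predict the classes for the testing dataset and print the summary
    predicted_rbf = svm_rbf.pred(test_x)
    printModelSummary(test_y, predicted_rbf)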
Running the same evaluation on these predictions shows that the results are even worse than with the linear kernel: precision and recall drop across the board. The SVM with the RBF kernel performed worse both for the calls that resulted in a credit card application and for those that did not.

Summary

In this article, we saw that the problem is not with the model; rather, the dataset that we use does not explain the variance sufficiently. This requires going back to the drawing board and selecting other features.

Resources for Article:

Further resources on this subject:
Push your data to the Web [article]
Transferring Data from MS Access 2003 to SQL Server 2008 [article]
Exporting data from MS Access 2003 to MySQL [article]

Building an iPhone App Using Swift: Part 1

Ryan Loomba
17 Mar 2016
6 min read
In this post, I'll be showing you how to create an iPhone app using Apple's new Swift programming language. Swift is a new programming language that Apple released in June at their special WWDC event in San Francisco, CA. You can find more information about Swift on the official page. Apple has released a book on Swift, The Swift Programming Language, which is available on the iBook Store or can be viewed online here. OK, let's get started!

The first thing you need in order to write an iPhone app using Swift is to download a copy of Xcode 6. Currently, the only way to get a copy of Xcode 6 is to sign up for Apple's developer program. The cost to enroll is $99 USD/year, so enroll here. Once enrolled, click on the iOS 8 GM Seed link and scroll down to the link that says Xcode 6 GM Seed.

Once Xcode is installed, go to File -> New -> New Project. We will click on Application within the iOS section and choose a Single View Application:

Click on the play button in the top left of the project to build the project. You should see the iPhone simulator open with a blank white screen. Next, click on the top-left blue Sample Swift App project file and navigate to the General tab. In the Deployment Info section, select portrait for the device orientation. This will force the app to only be viewed in portrait mode.

First View Controller

If we navigate on the left to Main.storyboard, we see a single View Controller with a single View. First, make sure that Use Size Classes is unchecked in the Interface Builder Document section. Let's add a text view to the top of our view. In the bottom-right text box, search for Text View. Drag the Text View and position it at the top of the View. Click on the Attributes inspector on the right toolbar to adjust the font and alignment. If we click the play button to build the project, we should see the same white screen, but now with our Swift Sample App text.

View a web page

Let's add our first feature: a button that will open up a web page. First, embed our controller in a navigation controller so we can easily navigate back and forth between views. Select the view controller in the storyboard, then go to Editor -> Embed in -> Navigation controller. Note that you might need to resize the text view you added in the previous step. Now, let's add a button that will open up a web view. Back in our view, in the bottom right, let's search for a button, drag it somewhere in the view, and label it Web View. The final product should look like this:

If we build the project and click on the button, nothing will happen. We need to create a destination controller that will contain the web view. Go to File -> New and create a new Cocoa Touch Class. Let's name our new controller WebViewController and make it a subclass of UIViewController. Make sure you choose Swift as the language. Click Create to save the controller file.

Back in our storyboard, search for a View Controller in the bottom-right search box and drag it to the storyboard. In the Attributes inspector toolbar on the right side of the screen, let's give this controller the title WebViewController. In the Identity inspector, let's give this view controller a custom class of WebViewController.

Let's wire up our two controllers. Ctrl + click on the Web View button we created earlier and hold. Drag your cursor over to your newly created WebViewController. Upon release, choose push. On our storyboard, let's search for a web view in the lower-right search box and drag it into our newly created WebViewController.
Resize the web view so that it takes up the entire screen, except for the top nav bar area. If we hit the large play button at the top left to build our app, clicking on the Web View link will take us to a blank screen. We'll also have a back button that takes us back to the first screen.

Writing some Swift code

Let's have the web view load up a predetermined website. Time to get our hands dirty writing some Swift! The first thing we need to do is link the web view in our controller to the WebViewController.swift file. In the storyboard, click on the Assistant editor button at the top right of the screen. You should see the storyboard view of WebViewController and WebViewController.swift next to each other. Ctrl + click on the web view in the storyboard and drag it over to the line right before the WebViewController class definition. Name the variable webView.

In the viewDidLoad function, we are going to add some initialization to load up our web page. After super.viewDidLoad(), let's first declare the URL we want to use. This can be any URL; for this example, I'm going to use my own homepage. It will look something like this:

    let requestURL = NSURL(string: "http://ryanloomba.com")

In Swift, the keyword let is used to designate constants, that is, variables that will not change. Next, we will convert this URL into an NSURLRequest object. Finally, we will tell our web view to make this request and pass in the request object:

    import UIKit

    class WebViewController: UIViewController {
        @IBOutlet var webView: UIWebView!

        override func viewDidLoad() {
            super.viewDidLoad()
            let requestURL = NSURL(string: "http://ryanloomba.com")
            let request = NSURLRequest(URL: requestURL)
            webView.loadRequest(request)
            // Do any additional setup after loading the view.
        }

        override func didReceiveMemoryWarning() {
            super.didReceiveMemoryWarning()
            // Dispose of any resources that can be recreated.
        }

        /*
        // MARK: - Navigation

        // In a storyboard-based application, you will often want to do a little preparation before navigation
        override func prepareForSegue(segue: UIStoryboardSegue!, sender: AnyObject!) {
            // Get the new view controller using segue.destinationViewController.
            // Pass the selected object to the new view controller.
        }
        */
    }

Try changing the URL to see different websites. Here's an example of what it should look like:

About the author

Ryan is a software engineer and electronic dance music producer currently residing in San Francisco, CA. Ryan started out as a biomedical engineer but fell in love with web/mobile programming after building his first Android app. You can find him on GitHub @rloomba.


Microservices – Brave New World

Packt
17 Mar 2016
9 min read
In this article by David Gonzalez, author of the book Developing Microservices with Node.js, we will cover the need for microservices, explain the monolithic approach, and study how to build and deploy microservices. (For more resources related to this topic, see here.)

Need for microservices

The world of software development has evolved quickly over the past 40 years. One of the key points of this evolution has been the size of these systems. From the days of MS-DOS, we have taken a hundred-fold leap into our present systems. This growth in size creates a need for better ways of organizing code and software components. Usually, when a company grows due to business needs, which is known as organic growth, the software gets organized into a monolithic architecture, as it is the easiest and quickest way of building software. After a few years (or even months), adding new features becomes harder due to the coupled nature of the resulting software.

Monolithic software

There are a few companies that have already started building their software using microservices, which is the ideal scenario. The problem is that not all companies can plan their software upfront. Instead of planning, these companies build the software based on the organic growth they experience: a few software components that group business flows by affinity. It is not rare to see companies with two big software components: the user-facing website and the internal administration tools. This is usually known as a monolithic software architecture.

Some of these companies face big problems when trying to scale their engineering teams. It is hard to coordinate teams that build, deploy, and maintain a single software component. Clashes on releases and the reintroduction of bugs are common problems that drain a big chunk of energy from the teams. One of the most interesting solutions to this problem (and it has other benefits too) is to split the monolithic software into microservices, so that the teams can specialize in a few smaller, autonomous, and isolated software components that can be versioned, updated, and deployed without interfering with the rest of the company's systems. This enables the engineering team to create isolated and autonomous units of work that are highly specialized in a given task (such as sending e-mails, processing card payments, and so on).

Microservices in the real world

Microservices are small software components that specialize in one task and work together to achieve a higher-level task. Forget about software for a second and think about how a company works. When someone applies for a job in a company, he applies for a given position: software engineer, systems administrator, or office manager. The reason for it can be summarized in one word: specialization. If you are used to working as a software engineer, you will get better with experience and add more value to the company. The fact that you don't know how to deal with a customer won't affect your performance, as it is not your area of expertise and would hardly add any value to your day-to-day work.

A microservice is an autonomous unit of work that can execute one task without interfering with other parts of the system, similar to what a job position is to a company. This has a number of benefits that can be used in favor of the engineering team in order to help scale the systems of a company.
Nowadays, hundreds of systems are built using a microservices-oriented architecture, such as the following:

Netflix: They are one of the most popular streaming services and have built an entire ecosystem of applications that collaborate in order to provide a reliable and scalable streaming system used across the globe.

Spotify: They are one of the leading music streaming services in the world and have built this application using microservices. Every single widget of the application (which is a website exposed as a desktop app using the Chromium Embedded Framework (CEF)) is a different microservice that can be updated individually.

First, there was the monolith

A huge percentage (my estimate is around 90%) of modern enterprise software is built following a monolithic approach: huge software components that run in a single container and have a well-defined development life cycle, which goes completely against the agile principles of deliver early and deliver often (https://en.wikipedia.org/wiki/Release_early,_release_often):

Deliver early: The sooner you fail, the easier it is to recover. If you work for two years on a software component before it is released, there is a huge risk of deviation from the original requirements, which are usually wrong and change every few days.

Deliver often: The software is delivered frequently to all the stakeholders so that they can give their input and see the changes reflected in the software. Errors can be fixed in a few days and improvements are identified easily.

Companies build big software components instead of smaller ones that work together because it is the natural thing to do, as follows: the developer has a new requirement; he builds a new method on an existing class in the service layer; the method is exposed on the API via HTTP, SOAP, or any other protocol. Now, multiply this by the number of developers in your company, and you will obtain something called organic growth. Organic growth is the uncontrolled and unplanned growth of software systems under business pressure, without adequate long-term planning, and it is bad.

How to tackle organic growth?

The first thing needed to tackle organic growth is to make sure that business and IT are aligned in the company. Usually, in big companies, IT is not seen as a core part of the business. Organizations outsource their IT systems, keeping the cost in mind but not the quality, so the partners building these software components are focused on one thing: delivering on time and according to the specification, even if it is incorrect. This produces a less-than-ideal ecosystem for responding to business needs with a working solution for an existing problem. IT is led by people who barely understand how the systems are built and usually overlook the complexity of software development. Fortunately, this is a changing tendency, as IT systems have become the drivers of 99% of the businesses around the world, but we need to be smarter about how we build them.

The first measure to tackle organic growth is to align IT and business stakeholders so that they work together; educating the non-technical stakeholders is the key to success. If we go back to the example from the previous section (few releases with quite big changes), can we do it better? Of course we can: divide the work into manageable software artifacts that model a single, well-defined business activity, and give each one an identity of its own.
It does not need to be a microservice at this stage, but keeping the logic inside a separate, well-defined, easily testable, and decoupled module will give us a huge advantage for future changes in the application.

Building microservices – The fallback strategy

When we design a system, we usually think about the replaceability of the existing components. For example, when using a persistence technology in Java, we tend to lean towards the standards (Java Persistence API (JPA)) so that we can replace the underlying implementation without too much effort. Microservices take the same approach, but they isolate the problem instead of working towards easy replaceability. Also, e-mailing is something that, although it seems simple, always ends up giving problems. Consider that we want to replace Mandrill with a plain SMTP server, such as Gmail. We don't need to do anything special; we just change the implementation and roll out the new version of our microservice, as follows:

    var nodemailer = require('nodemailer');
    var seneca = require("seneca")();

    var transporter = nodemailer.createTransport({
        service: 'Gmail',
        auth: {
            user: 'info@micromerce.com',
            pass: 'verysecurepassword'
        }
    });

    /**
     * Sends an email including the content.
     */
    seneca.add({area: "email", action: "send"}, function(args, done) {
        var mailOptions = {
            from: 'Micromerce Info ✔ <info@micromerce.com>',
            to: args.to,
            subject: args.subject,
            html: args.body
        };
        transporter.sendMail(mailOptions, function(error, info) {
            if (error) {
                return done({code: error}, null);
            }
            done(null, {status: "sent"});
        });
    });

To the outer world, our simplest version of the e-mail sender is now, to all appearances, using SMTP through Gmail to deliver our e-mails. We could even roll out one server with this version and send some traffic to it in order to validate our implementation without affecting all the customers (in other words, contain the failure).

Deploying microservices

Deployment is usually the ugly friend of the software development life cycle party. There is a missing contact point between development and system administration, which DevOps is going to solve in the next few years (or has already solved and no one told me). The cost of fixing software bugs grows steeply across the phases of development, which is why, from continuous integration up to continuous delivery, the process should be automated as much as possible, where as much as possible means 100%. Remember, humans are imperfect: if we rely on humans carrying out a manual, repetitive process to get bug-free software, we are walking the wrong path. Remember that a machine will always be error free (as long as the algorithm that is executed is error free), so why not let a machine control our infrastructure?

Summary

In this article, we saw why microservices are needed in complex software systems, how the monolithic approach arises, and how to build and deploy microservices.

Resources for Article:

Further resources on this subject:
Making a Web Server in Node.js [article]
Node.js Fundamentals and Asynchronous JavaScript [article]
An Introduction to Node.js Design Patterns [article]