Getting Started with Predictive Analytics

Packt
04 Jul 2017
32 min read
In this article by Ralph Winters, the author of the book Practical Predictive Analytics, we will explore how to get started with predictive analytics.

"In God we trust; all others must bring data." – Deming

I enjoy explaining predictive analytics to people because it is based upon a simple concept: predicting the probability of future events based upon historical data. Its history may date back to at least 650 BC. Some early examples include the Babylonians, who tried to predict short-term weather changes based on cloud appearances and haloes. Medicine also has a long history of needing to classify diseases. The Babylonian king Adad-apla-iddina decreed that medical records be collected to form the "Diagnostic Handbook". Some "predictions" in this corpus list treatments based on the number of days the patient had been sick, and their pulse rate: one of the first instances of bioinformatics! In later times, specialized predictive analytics were developed at the onset of the insurance underwriting industry, as a way to predict the risk associated with insuring marine vessels. At about the same time, life insurance companies began predicting the age to which a person would live in order to set the most appropriate premium rates. Although the idea of prediction was always rooted in humans' desire to understand and classify, it was not until the 20th century, and the advent of modern computing, that it really took hold. In addition to aiding the US government in the 1940s with codebreaking, Alan Turing also worked on the initial computer chess algorithms which pitted man vs. machine. Monte Carlo simulation methods originated as part of the Manhattan Project, where mainframe computers crunched numbers for days in order to estimate the behavior of nuclear chain reactions.
In the 1950s, Operations Research theory developed, in which one could optimize the shortest distance between two points. To this day, these techniques are used in logistics by companies such as UPS and Amazon. Non-mathematicians have also gotten into the act. In the 1970s, cardiologist Lee Goldman (who worked aboard a submarine) spent years developing a decision tree which diagnosed chest pain efficiently. This helped the staff determine whether or not the submarine needed to resurface in order to help the chest pain sufferer!

What many of these examples had in common was that history was used to predict the future. Along with prediction came an understanding of cause and effect, and of how the various parts of the problem were interrelated. Discovery and insight came about through methodology and adherence to the scientific method. Most importantly, the solutions came about in order to solve important, and often practical, problems of the times. That is what made them unique.

Predictive analytics adopted by many different industries

We have come a long way since then, and practical analytics solutions have furthered growth in many different industries. The internet has had a profound effect on this; it has enabled every click to be stored and analyzed. More data is being collected and stored, some with very little effort. That in itself has enabled more industries to enter predictive analytics. Marketing has always been concerned with customer acquisition and retention, and has developed predictive models involving various promotional offers and customer touch points, all geared to keeping customers and acquiring new ones. This is very pronounced in certain industries, such as wireless and online shopping carts, in which customers are always searching for the best deal. Specifically, advanced analytics can help answer questions like "If I offer a customer 10% off with free shipping, will that yield more revenue than 15% off with no free shipping?"
The 360-degree view of the customer has expanded the number of ways one can engage with the customer, which has enabled marketing mix and attribution modeling to become increasingly important. Location-based devices have enabled predictive marketing applications to incorporate real-time data and issue recommendations to the customer while in the store.

Predictive analytics in healthcare has its roots in clinical trials, which use carefully selected samples to test the efficacy of drugs and treatments. However, healthcare has been going beyond this. With the advent of sensors, data can be incorporated into predictive analytics to monitor patients with critical illness, and to send alerts to a patient when he is at risk. Healthcare companies can now predict which individual patients will comply with courses of treatment advocated by health providers. This sends early warning signs to all parties, which will prevent future complications, as well as lower the total costs of treatment.

Other examples can be found in just about every other industry. Here are just a few:

Finance: Fraud detection is a huge area. Financial institutions can monitor clients' internal and external transactions for fraud through pattern recognition, and then alert a customer concerning suspicious activity. Another is Wall Street program trading: trading algorithms will predict intraday highs and lows, and will decide when to buy and sell securities.

Sports management: Sports managers are able to predict which sports events will yield the greatest attendance and institute variable ticket pricing based upon audience interest. In baseball, a pitcher's entire game can be recorded and then digitally analyzed. Sensors can also be attached to his arm, to alert when a future injury might occur.

Higher education: Colleges can predict how many, and which kind of, students are likely to attend the next semester, and can plan resources accordingly.
Time-based assessments of online modules can enable professors to identify students' potential problem areas, and tailor individual instruction.

Government: Federal and state governments have embraced the open data concept, and have made more data available to the public. This has empowered "Citizen Data Scientists" to help solve critical social and government problems. The potential of using this data for emergency services, traffic safety, and healthcare is overwhelmingly positive.

Although these industries can be quite different, the goals of predictive analytics are typically the same: to increase revenue, decrease costs, or alter outcomes for the better.

Skills and roles which are important in Predictive Analytics

So what skills do you need to be successful in predictive analytics? I believe that there are 3 basic skills that are needed:

Algorithmic/statistical/programming skills - These are the actual technical skills needed to implement a technical solution to a problem. I bundle these together since they are typically used in tandem. Will it be a purely statistical solution, or will there need to be a bit of programming thrown in to customize an algorithm and clean the data? There are always multiple ways of doing the same task, and it will be up to you, the predictive modeler, to determine how it is to be done.

Business skills - These are the skills needed for communicating thoughts and ideas among all of the interested parties. Business and data analysts who have worked in certain industries for long periods of time, and know their business very well, are increasingly being called upon to participate in predictive analytics projects. Data science is becoming a team sport, and most projects involve working with others in the organization, so summarizing findings and having good presentation and documentation skills are important.
You will often hear the term 'domain knowledge' associated with this, since it is always valuable to know the inner workings of the industry you are working in. If you do not have the time or inclination to learn all about the inner workings of the problem at hand yourself, partner with someone who does.

Data storage / ETL skills - This can refer to specialized knowledge regarding extracting data and storing it in a relational, or non-relational NoSQL, data store. Historically, these tasks were handled exclusively within a data warehouse. But now that the age of Big Data is upon us, specialists have emerged who understand the intricacies of data storage and the best ways to organize it.

Related job skills and terms

Along with the term predictive analytics, here are some terms which are very much related:

Predictive modeling: This specifically means using a mathematical/statistical model to predict the likelihood of a dependent or target variable.

Artificial intelligence: A broader term for how machines are able to rationalize and solve problems. AI's early days were rooted in neural networks.

Machine learning: A subset of artificial intelligence. It specifically deals with how a machine learns automatically from data, usually to try to replicate human decision making or to best it. At this point, everyone knows about Watson, which beat two human opponents in "Jeopardy!"

Data science: Data science encompasses predictive analytics, but also adds algorithmic development via coding, and good presentation skills via visualization.

Data engineering: Data engineering concentrates on data extraction and data preparation processes, which allow raw data to be transformed into a form suitable for analytics. A knowledge of system architecture is important.
The data engineer will typically produce the data to be used by the predictive analysts (or data scientists).

Data analyst/business analyst/domain expert: An umbrella term for someone who is well versed in the way the business at hand works, and who is an invaluable person to learn from in terms of what may have meaning, and what may not.

Statistics: The classical form of inference, typically done via hypothesis testing.

Predictive Analytics Software

Originally, predictive analytics was performed by hand, by statisticians on mainframe computers, using a progression of various languages such as FORTRAN. Some of these languages are still very much in use today. FORTRAN, for example, is still one of the fastest-performing languages around, and operates with very little memory. Nowadays, there are so many choices of software, and many loyalists remain true to their chosen package. The reality is that, for solving a specific type of predictive analytics problem, there exists a certain amount of overlap, and certainly the goal is the same. Once you get the hang of the methodologies used for predictive analytics in one software package, it should be fairly easy to translate your skills to another.

Open Source Software

Open source emphasizes agile development and community sharing. Of course, open source software is free, but "free" must also be balanced in the context of TCO (Total Cost of Ownership).

R

The R language is derived from the "S" language, which was developed in the 1970s. However, R has grown beyond the original core packages to become an extremely viable environment for predictive analytics. Although R was developed by statisticians for statisticians, it has come a long way from its early days. The strength of R comes from its 'package' system, which allows specialized or enhanced functionality to be developed and 'linked' to the core system.
Although the original R system was sufficient for statistics and data mining, an important goal of R was to have its system enhanced via user-written contributed packages. As of this writing, the R system contains more than 8,000 packages. Some are of excellent quality, and some are of dubious quality; the goal, therefore, is to find the truly useful packages that add the most value. Most, if not all, of the R packages in use address the common predictive analytics tasks that you will encounter. If you come across a task that does not fit into any category, chances are good that someone in the R community has done something similar. And of course, there is always a chance that someone is developing a package to do exactly what you want it to do. That person could eventually be you!

Closed Source Software

Closed source software such as SAS and SPSS was at the forefront of predictive analytics, and has continued to this day to extend its reach beyond the traditional realm of statistics and machine learning. Closed source software emphasizes stability, better support, and security, with better memory management, which are important factors for some companies. There is much debate nowadays regarding which one is 'better'. My prediction is that they both will coexist peacefully, with one not replacing the other. Data sharing and common APIs will become more common. Each has its place within whatever data architecture and ecosystem is deemed correct for a company. Each company will emphasize certain factors, and both open and closed software systems are constantly improving themselves.

Other helpful tools

Man does not live by bread alone, so it would behoove you to learn additional tools beyond R, so as to advance your analytic skills.

SQL - SQL is a valuable tool to know, regardless of which language/package/environment you choose to work in.
Virtually every analytics tool will have a SQL interface, and a knowledge of how to optimize SQL queries will definitely speed up your productivity, especially if you are doing a lot of data extraction directly from a SQL database. Today's common thought is to do as much preprocessing as possible within the database, so if you will be doing a lot of extracting from databases such as MySQL, PostgreSQL, Oracle, or Teradata, it will be a good thing to learn how queries are optimized within their native frameworks. In the R language, there are several SQL packages that are useful for interfacing with various external databases. We will be using sqldf, which is a popular R package for running SQL queries against R dataframes. There are other packages which are specifically tailored to the particular database you will be working with.

Web extraction tools - Not every data source will originate from a data warehouse. Knowledge of APIs which extract data from the internet will be valuable. Some popular tools include curl and jsonlite.

Spreadsheets - Despite their problems, spreadsheets are often the fastest way to do quick data analysis and, more importantly, to share your results with others! R offers several interfaces to spreadsheets but, again, learning standalone spreadsheet skills like PivotTables and VBA will give you an advantage if you work for corporations in which these skills are heavily used.

Data visualization tools - Data visualization tools are great for adding impact to an analysis, and for concisely encapsulating complex information. Native R visualization tools are great, but not every company will be using R. Learn some third-party visualization tools such as D3.js, Google Charts, QlikView, or Tableau.

Big data (Spark, Hadoop, NoSQL databases) - It is becoming increasingly important to know a little bit about these technologies, at least from the viewpoint of having to extract and analyze data which resides within these frameworks.
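As a quick taste of the sqldf package mentioned above, the following sketch (my own illustration, not from the book, and assuming the sqldf package has been installed) runs an ordinary SQL query against a built-in R dataframe:

```r
# Install once with install.packages("sqldf") if needed
library(sqldf)

# Load a built-in dataframe and query it with SQL
data(women)
tall <- sqldf("SELECT height, weight FROM women WHERE height >= 65")
head(tall)
```

The same filter could be written in base R as women[women$height >= 65, ], but the SQL form is handy if you already think in queries.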
Many software packages have APIs which talk directly to Hadoop and can run predictive analytics directly within the native environment, or extract data and perform the analytics locally.

After you are past the basics

Given that the predictive analytics space is so huge, once you are past the basics, ask yourself what area of predictive analytics really interests you, and what you would like to specialize in. Learning all you can about everything concerning predictive analytics is good at the beginning, but ultimately you will be called upon because you are an expert in certain industries or techniques. This could be research, algorithmic development, or even managing analytics teams. As general guidance, if you are involved in, or are oriented towards, the data analytics or research portion of data science, I would suggest that you concentrate on data mining methodologies and the specific data modeling techniques which are heavily prevalent in the industries that interest you. For example, logistic regression is heavily used in the insurance industry, but social network analysis is not. Economic research is geared towards time-series analysis, but not so much cluster analysis. If you are involved more on the data engineering side, concentrate on data cleaning, being able to integrate various data sources, and the tools needed to accomplish this. If you are a manager, concentrate on model development, testing and control, metadata, and presenting results to upper management in order to demonstrate value. Of course, predictive analytics is becoming more of a team sport than a solo endeavor, and the data science team is very much alive. A lot has been written about the components of a data science team, much of which can be reduced to the 3 basic skills that I outlined earlier.
Two ways to look at predictive analytics

Depending upon how you intend to approach a particular problem, consider how two different analytical mindsets can affect the predictive analytics process.

Minimize prediction error goal: This is a very common use case within machine learning. The initial goal is to predict, using the appropriate algorithms, in order to minimize the prediction error. If done incorrectly, an algorithm will ultimately fail, and it will need to be continually optimized to come up with the "new" best algorithm. If this is performed mechanically, without regard to understanding the model, it will certainly result in failed outcomes. Certain models, especially over-optimized ones with many variables, can have a very high prediction rate but be unstable in a variety of ways. If one does not have an understanding of the model, it can be difficult to react to changes in the data inputs.

Understanding model goal: This came out of the scientific method and is tied closely to the concept of hypothesis testing. It can be done with certain kinds of models, such as regression and decision trees, and is more difficult with other kinds, such as SVMs and neural networks. In the understanding-model paradigm, understanding causation or impact becomes more important than optimizing correlations. Typically, "understanding" models have a lower prediction rate, but have the advantage of knowing more about the causations of the individual parts of the model, and how they are related. For example, industries which rely on understanding human behavior emphasize model understanding goals. A limitation of this orientation is that we might tend to discard results that are not immediately understood.

Of course, the above examples illustrate two disparate approaches. Combination models, which use the best of both worlds, are the ones we should strive for: a model which has an acceptable prediction error, is stable over time, and is simple enough to understand.
You will learn later that this is related to the bias/variance tradeoff.

R Installation

R installation is typically done by downloading the software directly from the CRAN site:

Navigate to https://cran.r-project.org/
Install the version of R appropriate for your operating system

Alternate ways of exploring R

Although installing R directly from the CRAN site is the way most people will proceed, I wanted to mention some alternative R installation methods. These methods are often good in instances when you are not always at your own computer.

Virtual environments - Here are a few methods for running R in a virtual environment:
VirtualBox or VMware - Virtual environments are good for setting up protected environments and loading preinstalled operating systems and packages. Some advantages are that they are good for isolating testing areas, and for when you do not wish to take up additional space on your own machine.
Docker - Docker resembles a virtual machine, but is a bit more lightweight, since it does not emulate an entire operating system, only the needed processes (see Rocker, the R Docker container).

Cloud based - AWS and Azure are cloud-based environments. The advantages are similar to those of a virtual machine, with the additional capability to run with very large datasets and more memory. They are not free, but both AWS and Azure offer free tiers.

Web based - Interested in running R on the web? These sites are good for trying out quick analyses: R-Fiddle is a good choice, and R-Web, ideone.com, Jupyter, DataJoy, tutorialspoint, and Anaconda Cloud are a few other examples.

Command line - If you spend most of your time in a text editor, try ESS (Emacs Speaks Statistics).

How is a predictive analytics project organized?
After you install R on your own machine, give some thought to how you want to organize your data, code, documentation, and so on. There will probably be many different kinds of projects that you will need to set up, ranging from exploratory analyses to full production-grade implementations. However, most projects will be somewhere in the middle, i.e. projects which ask a specific question or a series of related questions. Whatever its purpose, each project you work on will deserve its own project folder or directory.

Set up your project and subfolders

We will start by creating folders for our environment. Create a directory named "PracticalPredictiveAnalytics" somewhere on your computer. We will be referring to it by this name throughout this book. Projects often start with 3 subfolders, which roughly correspond to 1) data sources, 2) code-generated outputs, and 3) the code itself (in this case R). Create 3 subdirectories under this project: Data, Outputs, and R. The R directory will hold all of our data prep code, algorithms, and so on. The Data directory will contain our raw data sources, and the Outputs directory will contain anything generated by the code. This can be done natively within your own environment, e.g. you can use Windows Explorer to create these folders.

Some important points to remember about constructing projects

It is never a good idea to 'boil the ocean', or try to answer too many questions at once. Remember, predictive analytics is an iterative process. Another trap that people fall into is not making their projects reproducible. Nothing is worse than developing some analytics on a set of data, then backtracking and, oops! Different results. When organizing code, try to write it as building blocks which can be reused. For R, write code liberally as functions. Assume that anything concerning requirements, data, and outputs will change, and be prepared. Also consider the dynamic nature of the R language.
Changes in versions and packages could alter your analysis in various ways, so it is important to keep code and data in sync, by using separate folders for the different levels of code and data, or by using a version management tool such as Subversion, Git, or CVS.

GUIs

R, like many languages and knowledge discovery systems, started from the command line (one reason to learn Linux), which is still used by many. However, predictive analysts tend to prefer graphical user interfaces, and there are many choices available for each of the 3 major operating systems. Each of them has its strengths and weaknesses, and of course there is always a matter of preference. Memory is always a consideration with R, and if that is of critical concern to you, you might want to go with a simpler GUI, like the one built into R. If you want full control, and you want to add some productivity tools, you could choose RStudio, which is a full-blown GUI that allows you to implement version control repositories, and has nice features like code completion. The unique feature of RCmdr and Rattle is that they offer menus which allow guided point-and-click commands for common statistical and data mining tasks. They are also both code generators: this is good for learning, since you can learn by looking at the way code is generated. Both RCmdr and RStudio offer GUIs which are compatible with Windows, Apple, and Linux operating systems, so those are the ones I will use to demonstrate examples in this book. But bear in mind that they are only user interfaces, and not R proper, so it should be easy enough to paste code examples into other GUIs and decide for yourself which ones you like.

Getting started with RStudio

After R installation has completed, download and install the RStudio executable appropriate for your operating system. Click the RStudio icon to bring up the program. The program initially starts with 3 tiled window panes.
Before we begin to do any actual coding, we will want to set up a new project. Create a new project by following these steps:

Identify the menu bar, above the icons at the top left of the screen.
Click "File" and then "New Project".
Select "Create project from Directory".
Select "Empty Project".
Name the directory "PracticalPredictiveAnalytics", then click the Browse button to select your preferred location. This will be the directory that you created earlier.
Click "Create Project" to complete.

The R Console

Now that we have created a project, let's take a look at the R Console window. Click on the window marked "Console" and perform the following steps:

Enter getwd() and press Enter - that should echo back the current working directory.
Enter dir() - that will give you a list of everything in the current working directory.

The getwd() command is very important, since it will always tell you which directory you are in. Sometimes you will need to switch directories within the same project, or even to another project. The command you will use for that is setwd(), supplying the directory that you want to switch to within the parentheses. This is a situation we will come across later; we will not change anything right now. The point is that you should always be aware of what your current working directory is.

The Script Window

The script window is where all of the R code is written. You can have several script windows open at once. Press Ctrl+Shift+N to create a new R script. Alternatively, you can go through the menu system by selecting File/New File/R Script. A new blank script window will appear with the name "Untitled1".

Our First Predictive Model

Now that all of the preliminary things are out of the way, we will code our first, extremely simple, predictive model. Our first R script is a simple two-variable regression model which predicts women's height based upon weight.
The data set we will use is already built into the R package system, and it is not necessary to load it externally. For quick illustration of techniques, I will sometimes use sample data contained within specific R packages. Paste the following code into the "Untitled1" script that was just created:

require(graphics)
data(women)
head(women)
utils::View(women)
plot(women$height,women$weight)

Press Ctrl+Shift+Enter to run the entire code. The display should change to something similar to what is displayed below.

Code Description

What you have actually done is:

Load the women data object. The data() function will load the specified data object into memory. In our case, the data(women) statement loads the women dataframe into memory.

Display the raw data in three different ways:

utils::View(women) - This will visually display the dataframe. Although this is part of the actual R script, viewing a dataframe is a very common task, and is often issued directly as a command via the R Console. As you can see, the women data frame has 15 rows and 2 columns, named height and weight.

plot(women$height,women$weight) - This uses the native R plot function, which plots the values of the two variables against each other. It is usually the first step one takes to begin to understand the relationship between two variables. As you can see, the relationship is very linear.

head(women) - This displays the first N rows of the women data frame in the console. If you want no more than a certain number of rows, add that as a 2nd argument of the function, e.g. head(women,99) will display up to 99 rows in the console. The tail() function works similarly, but displays the last rows of data.

The very first statement in the code, require(), is just a way of saying that R needs a specific package to run. In this case, require(graphics) specifies that the graphics package is needed for the analysis, and it will be loaded into memory.
If it is not available, you will get an error message. However, graphics is a base package and should be available.

To save this script, press Ctrl+S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_DataSource.

Your 2nd script

Create another R script by pressing Ctrl+Shift+N. A new blank script window will appear with the name "Untitled2". Paste the following into the new script window:

lm_output <- lm(women$height ~ women$weight)
summary(lm_output)
prediction <- predict(lm_output)
error <- women$height-prediction
plot(women$height,error)

Press Ctrl+Shift+Enter to run the entire code. The display should change to something similar to what is displayed below.

Code Description

Here are some notes and explanations for the script code that you have just run:

lm() function: The script runs a simple linear regression using the lm() function, which predicts a woman's height based upon the value of her weight. In statistical parlance, you will be 'regressing' height on weight. The line of code which accomplishes this is:

lm_output <- lm(women$height ~ women$weight)

There are two operators that you will become very familiar with when running predictive models in R:

The ~ operator (also called the tilde) is a shorthand way of separating what you want to predict from what you are using to predict. This is expressed in formula syntax. What you are predicting (the dependent or target variable) is usually on the left side of the formula, and the predictors (independent variables, or features) are on the right side. The dependent and independent variables here are height and weight and, to improve readability, I have specified them explicitly by using the data frame name together with the column name, i.e. women$height and women$weight.

The <- operator (also called assignment) says: assign whatever is produced on the right side to the object on the left side.
This will always create or replace the object, which you can then further display or manipulate. In this case, we are creating a new object called lm_output, using the function lm(), which creates a linear model based on the formula contained within the parentheses. Note that the execution of this line does not produce any displayed output. You can see whether the line was executed by checking the console. If there is any problem with running the line (or any line, for that matter), you will see an error message in the console.

summary(lm_output): This statement displays some important summary information about the object lm_output, and writes the output to the R Console:

summary(lm_output)

The results will appear in the Console window. Look at the lines marked (Intercept) and women$weight, which appear under the Coefficients line in the console. The Estimate column shows the formula needed to derive height from weight. Like any linear regression formula, it includes coefficients for each independent variable (in our case, only one variable), as well as an intercept. For our example, the English rule would be "multiply weight by 0.2872 and add 25.7235 to obtain height".

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.723456   1.043746   24.64 2.68e-12 ***
women$weight  0.287249   0.007588   37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.44 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14

We have already assigned the output of the lm() function to the lm_output object. Let's apply another function to lm_output as well. The predict() function "reads" the output of the lm() function and predicts (or scores) the value, based upon the linear regression equation.
In the code we have assigned the output of this function to a new object named "prediction". Switch over to the console area, and type "prediction" to see the predicted values for the 15 women. The following should appear in the console:

> prediction
       1        2        3        4        5        6        7
58.75712 59.33162 60.19336 61.05511 61.91686 62.77861 63.64035
       8        9       10       11       12       13       14
64.50210 65.65110 66.51285 67.66184 68.81084 69.95984 71.39608
      15
72.83233

There are 15 predictions. Just to verify that we have one for each of our original observations, we will use the nrow() function to count the number of rows. At the command prompt in the console area, enter the command:

nrow(women)

The following should appear:

> nrow(women)
[1] 15

The error object is a vector that was computed by taking the difference between the actual height and the predicted height. These are also known as the residual errors, or just residuals. Since the error object is a vector, you cannot use the nrow() function to get its size, but you can use the length() function:

> length(error)
[1] 15

In all of the above cases, the counts all compute as 15, so all is good.

plot(women$height,error): This plots height against the residuals. It shows how much each prediction was 'off' from the original value. You can see that the errors show a non-random pattern. This is not good. In an ideal regression model, you expect to see the prediction errors randomly scattered around the 0 point on the y axis.

Some important points to be made regarding this first example:

The R-squared for this model is artificially high. Regression is often used in an exploratory fashion to explore the relationship between height and weight. This does not mean a causal one. As we all know, weight is caused by many other factors, and it is expected that taller people will be heavier.
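Before moving on, the "English rule" from the summary output can be sanity-checked by hand. The short Python snippet below is an illustrative aside (the chapter itself works entirely in R); it assumes that the first weight in R's built-in women dataset is 115 lb, and reproduces the first predicted value shown above (58.75712):

```python
# Hand-check of the fitted line: height = 0.287249 * weight + 25.723456
# Assumption (not stated in the text): the first weight in R's "women"
# dataset is 115 lb.
intercept = 25.723456
slope = 0.287249

def predict_height(weight):
    """Apply the regression 'English rule' from the R summary output."""
    return slope * weight + intercept

first_prediction = predict_height(115)
print(round(first_prediction, 5))  # very close to 58.75712, the first value of `prediction`
```

If the assumed weight is right, the hand computation agrees with R's predict() output to four decimal places, which is a quick way to convince yourself of what lm() actually fitted.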
A predictive modeler who is examining the relationship between height and weight would probably want to introduce additional variables into the model, at the expense of a lower R-squared. R-squareds can be deceiving, especially when they are artificially high.

After you are done, press Ctrl-S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_LinearRegression.

Installing a package

Sometimes the amount of information output by statistics packages can be overwhelming. Sometimes we want to reduce the amount of output and reformat it so it is easier on the eyes. Fortunately, there is an R package which reformats and simplifies some of the more important statistics. One package I will be using is named "stargazer".

Create another R script by pressing Ctrl + Shift + N. Enter the following lines and then press Ctrl+Shift+Enter to run the entire script:

install.packages("stargazer")
library(stargazer)
stargazer(lm_output, title="Lm Regression on Height", type="text")

After the script has been run, the following should appear in the Console:

Code Description

install.packages("stargazer"): This line will install the package to the default package directory on your machine. Make sure you choose a CRAN mirror before you download.

library(stargazer): This line loads the stargazer package.

stargazer(lm_output, title="Lm Regression on Height", type="text"): The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read.

After you are done, press Ctrl-S (File Save), navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, and name it Chapter1_LinearRegressionOutput.

Installing other packages

The rest of the book will concentrate on what I think are the core packages used for predictive modeling. There are always new packages coming out.
I tend to favor packages which have been on CRAN for a long time and have a large user base. When installing something new, I will try to reference the results against other packages which do similar things. Speed is another reason to consider adopting a new package.

Summary

In this article we have learned a little about what predictive analytics is and how it can be used in various industries. We learned some things about data and how it can be organized into projects. Finally, we installed RStudio, ran a simple linear regression, and installed and used our first package. We learned that it is always good practice to examine data after it has been brought into memory, and that a lot can be learned from simply displaying and plotting the data.

Resources for Article:

Further resources on this subject:
Stata as Data Analytics Software [article]
Metric Analytics with Metricbeat [article]
Big Data Analytics [article]
Writing Your First Cucumber Appium Test

Packt
27 Jun 2017
12 min read
In this article, by Nishant Verma, author of the book Mobile Test Automation with Appium, you will learn about creating a new Cucumber-Appium Java project in IntelliJ. Next, you will learn to write a sample feature and automate it, thereby learning how to start an appium server session with an app using the appium app, find locators using the appium inspector, and write Java classes for each step implementation in the feature file. We will also discuss how to write a test for the mobile web and use the Chrome Developer Tools to find the locators. Let's get started! In this article, we will discuss the following topics:

Create a sample Java project (using gradle)
Introduction to Cucumber
Writing first appium test
Starting appium server session and finding locators
Write a test for mobile web

(For more resources related to this topic, see here.)

Create a sample Java project (using gradle)

Let's create a sample appium Java project in IntelliJ. The below steps will help you do the same:

Launch IntelliJ and click Create New Project on the Welcome screen.
On the New Project screen, select Gradle from the left pane. Project SDK should get populated with the Java version. Click on Next, enter the GroupId as com.test and ArtifactId as HelloAppium. Version would already be populated. Click on Next.
Check the option Use Auto-Import and make sure Gradle JVM is populated. Click on Next.
The Project name field would be auto-populated with what you gave as ArtifactId. Choose a Project location and click on Finish.
IntelliJ would be running the background task (Gradle build), which can be seen in the status bar. We should have a project created with the default structure.
Open the build.gradle file. You would see a message as shown below; click on "Ok, apply suggestion!"
Enter the below two lines in build.gradle. This would add appium and cucumber-jvm under dependencies.
compile group: 'info.cukes', name: 'cucumber-java', version: '1.2.5'
compile group: 'io.appium', name: 'java-client', version: '5.0.0-BETA6'

Below is how the gradle file should look:

group 'com.test'
version '1.0-SNAPSHOT'

apply plugin: 'java'

sourceCompatibility = 1.5

repositories {
    mavenCentral()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile group: 'info.cukes', name: 'cucumber-java', version: '1.2.5'
    compile group: 'io.appium', name: 'java-client', version: '5.0.0-BETA6'
}

Once done, navigate to View -> Tools Window -> Gradle and click on the Refresh all gradle projects icon. This would pull all the dependencies into External Libraries.

Navigate to Preferences -> Plugins, search for "Cucumber for Java" and click on Install (if it's not previously installed). Repeat the above step for Gherkin and install the same. Once done, restart IntelliJ if it prompts.

We are now ready to write our first sample feature file, but before that let's try to understand a brief about Cucumber.

Introduction to Cucumber

Cucumber is a test framework which supports behaviour driven development (or BDD). The core idea behind BDD is a domain specific language (known as DSL), where the tests are written in normal English, expressing how the application or system has to behave. A DSL is an executable test, which starts with a known state, performs some action, and verifies the expected state. For example:

Feature: Registration with Facebook/Google

  Scenario: Registration Flow Validation via App
    As a user I should be able to see the Facebook/Google button when I try to register myself in Quikr.
    Given I launch the app
    When I click on Register
    Then I should see register with Facebook and Google

Dan North (creator of BDD) defined behaviour-driven development in 2009 as:

BDD is a second-generation, outside-in, pull-based, multiple-stakeholder, multiple-scale, high-automation, agile methodology.
It describes a cycle of interactions with well-defined outputs, resulting in the delivery of working, tested software that matters.

Cucumber feature files serve as living documentation which can be implemented in many languages. It was first implemented in Ruby and later extended to Java. Some of the basic features of Cucumber are:

The core of cucumber is text files called features, which contain scenarios. These scenarios express the system or application behaviour.
Scenario files comprise steps which are written following the syntax of Gherkin.
Each step has a step implementation, which is the code behind that interacts with the application.

So in the above example, Feature, Scenario, Given, When, and Then are keywords.

Feature: Cucumber tests are grouped into features. We use this name because we want engineers to describe the features that a user will be able to use.

Scenario: A scenario expresses the behaviour we want; each feature contains several scenarios. Each scenario is an example of how the system should behave in a particular situation. The expected behaviour of the feature is the sum of its scenarios; for a feature to pass, all scenarios must pass.

Test Runner: There are different ways to run the feature file; however, we will be using the JUnit runner initially and then move on to the gradle command for command line execution.

So I am hoping we now have a brief idea of what cucumber is. Further details can be read on their site (https://cucumber.io/). In the coming section, we will create a feature file, write a scenario, implement the code behind, and execute it.

Writing first appium test

Till now we have created a sample Java project and added the appium dependency. Next we need to add a cucumber feature file and implement the code behind. Let's start:

Under the project folder, create the directory structure src/test/java/features.
Right click on the features folder, select New -> File and enter the name Sample.feature.

In the Sample.feature file, let's write a scenario as shown below, which is about logging in using Google:

Feature: Hello World

  Scenario: Registration Flow Validation via App
    As a user I should be able to see my google account when I try to register myself in Quikr.
    When I launch Quikr app
    And I choose to log in using Google
    Then I see account picker screen with my email address "testemail@gmail.com"

Right click on the java folder in IntelliJ, select New -> Package and enter the name steps.

The next step is to implement the cucumber steps. Click on the first line in the Sample.feature file, "When I launch Quikr app", press Alt+Enter, then select the option Create step definition. It will present you with a pop up to enter the file name, file location and file type. We need to enter the below values:

File name: HomePageSteps
File location: browse to the steps folder created above
File type: Java

The idea is that the steps belong to a page, and each page would typically have its own step implementation class. Once you click on OK, it will create a sample template in the HomePageSteps class file. Now we need to implement these methods and write the code behind to launch the Quikr app on the emulator.

Starting appium server session and finding locators

The first thing we need to do is download a sample app (the Quikr apk in this case):

Download the Quikr app (version 9.16).
Create a folder named app under the HelloAppium project and copy the downloaded apk under that folder.
Launch the appium GUI app.
Launch the emulator or connect your device (assuming you have Developer Options enabled).
On the appium GUI app, click on the android icon and select the below options:
App Path - browse to the .apk location under the app folder.
Platform Name - Android.
Automation Name - Appium.
Platform Version - select the version which matches the emulator from the dropdown; it allows you to edit the value.
Device Name - enter any string, e.g. Nexus6.

Once the above settings are done, click on the General Settings icon and choose the below mentioned settings. Once the setup is done, click on the icon again to close the pop up:

Select Prelaunch Application
Select Strict Capabilities
Select Override Existing Sessions
Select Kill Processes Using Server Port Before Launch
Select New Command Timeout and enter the value 7200

Click on Launch. This would start the appium server session. Once you click on Appium Inspector, it will install the app on the emulator and launch it. If you click on the Record button, it will generate the boilerplate code which has the Desired Capabilities respective to the run environment and app location.

We can copy the above lines and put them into the code template generated for the step "When I launch Quikr app". This is how the code should look after copying it into the method:

@When("^I launch Quikr app$")
public void iLaunchQuikrApp() throws Throwable {
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability("appium-version", "1.0");
    capabilities.setCapability("platformName", "Android");
    capabilities.setCapability("platformVersion", "5.1");
    capabilities.setCapability("deviceName", "Nexus6");
    capabilities.setCapability("app", "/Users/nishant/Development/HelloAppium/app/quikr.apk");
    AppiumDriver wd = new AppiumDriver(new URL("http://0.0.0.0:4723/wd/hub"), capabilities);
    wd.manage().timeouts().implicitlyWait(60, TimeUnit.SECONDS);
}

Now the above code only sets the Desired Capabilities; the appium server is yet to be started. For now, we can start it from outside, for example from the terminal (or command prompt), by running the command appium. We can close the appium inspector and stop the appium server by clicking on Stop in the appium GUI app.

To run the above test, we need to do the following:

Start the appium server via the command line (command: appium --session-override).
In IntelliJ, right click on the feature file and choose the option "Run...".

Since the scope of AppiumDriver is local to the method, we can refactor and extract appiumDriver as a field. To continue automating the other steps, we can use the appium inspector to find the element handles. We can launch the appium inspector using the above mentioned steps, then click on the element whose locator we want to find, as shown in the below mentioned screen.

Once we have the locator, we can use the appium API (as shown below) to click it:

appiumDriver.findElement(By.id("sign_in_button")).click();

This way we can implement the remaining steps.

Write a small test for mobile web

To automate a mobile web app, we don't need to install the app on the device. We need a browser and the app URL, which is sufficient to start the test automation. We can tweak the above written code by adding a desired capability, browserName. We can write a similar scenario and make it mobile-web specific:

Scenario: Registration Flow Validation via web
  As a user I want to verify that I get the option of choosing Facebook when I choose to register.
  When I launch Quikr mobile web
  And I choose to register
  Then I should see an option to register using Facebook

So the method for mobile web would look like:

@When("^I launch Quikr mobile web$")
public void iLaunchQuikrMobileWeb() throws Throwable {
    DesiredCapabilities desiredCapabilities = new DesiredCapabilities();
    desiredCapabilities.setCapability("platformName", "Android");
    desiredCapabilities.setCapability("deviceName", "Nexus");
    desiredCapabilities.setCapability("browserName", "Browser");
    URL url = new URL("http://127.0.0.1:4723/wd/hub");
    appiumDriver = new AppiumDriver(url, desiredCapabilities);
    appiumDriver.get("http://m.quikr.com");
}

So in the above code, we don't need platformVersion, and we need a valid value for the browserName parameter.
Possible values for browserName are:

Chrome - for the Chrome browser on Android
Safari - for the Safari browser on iOS
Browser - for the stock browser on Android

We can follow the same steps as above to run the test.

Finding locators in a mobile web app

To implement the remaining steps of the above mentioned feature, we need to find locators for the elements we want to interact with. Once the locators are found, we need to perform the desired operation, which could be click, send keys, etc. Below mentioned are the steps which will help us find the locators for a mobile web app:

Launch the Chrome browser on your machine and navigate to the mobile site (in our case: http://m.quikr.com).
Select More Tools -> Developer Tools from the Chrome menu.
In the Developer Tools toolbar, click on the Toggle device toolbar icon. Once done, the page will be displayed in a mobile layout format.
In order to find the locator of any UI element, click on the first icon of the dev tool bar and then click on the desired element. The HTML in the dev tool layout will change to highlight the selected element. Refer to the below screenshot which shows the same.

In the highlighted panel on the right side, we can see the properties name=query and id=query. We can choose to use the id and implement the step as:

appiumDriver.findElement(By.id("query")).click();

Using the above approach, we can find the locators of the various elements we need to interact with and proceed with our test automation.

Summary

So in this article, we briefly described how we would go about writing tests for a native app as well as the mobile web. We discussed how to create a project in IntelliJ and write a sample feature file. We also learned how to start the appium inspector and look for locators. We learned about the Chrome dev tools and how we can use them to find locators for the mobile web.
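As a closing aside, the step-binding mechanic that cucumber relies on throughout this article — matching plain-English step text such as "When I launch Quikr app" to code via patterns — can be sketched in a few lines. The following is a toy Python illustration of that idea only (it is not cucumber-jvm, and all function names here are made up for the sketch):

```python
import re

# A toy model of cucumber's step registry: step definitions register a
# regex pattern, and plain-English steps from a feature file are
# dispatched to the matching function (cucumber-jvm does the same with
# @When/@Then annotations).
STEPS = []

def step(pattern):
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

@step(r"^I launch (\w+) app$")
def launch_app(name):
    return f"launching {name}"

@step(r'^I see account picker screen with my email address "([^"]+)"$')
def see_account_picker(email):
    return f"asserting picker shows {email}"

def run_step(text):
    # Find the first registered pattern that matches, pass captured
    # groups as arguments — this is how step parameters like the email
    # address above reach the Java method.
    for pattern, fn in STEPS:
        match = pattern.match(text)
        if match:
            return fn(*match.groups())
    raise LookupError(f"no step definition matches: {text!r}")

print(run_step("I launch Quikr app"))  # -> launching Quikr
```

An unmatched step raises an error, which mirrors cucumber reporting a missing step definition.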
Resources for Article:

Further resources on this subject:
Appium Essentials [article]
Ensuring Five-star Rating in the MarketPlace [article]
Testing in Agile Development and the State of Agile Adoption [article]
Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book Python High Performance - Second Edition, we will see that Python is a mature and widely used language, and there is large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects--Numba and PyPy--that approach compilation in slightly different ways. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically.

(For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly, that can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution.

Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.
Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i]**2
    return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

import numba as nb

@nb.jit
def sum_sq(a):
    ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function.

To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute. The timings for the two functions are as follows:

import numpy as np
x = np.random.rand(10000)

# Original
%timeit sum_sq.py_func(x)
100 loops, best of 3: 6.11 ms per loop

# Numba
%timeit sum_sq(x)
100000 loops, best of 3: 11.7 µs per loop

You can see how the Numba version is orders of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators:

%timeit (x**2).sum()
10000 loops, best of 3: 14.8 µs per loop

In this case, the Numba compiled function is marginally faster than the NumPy vectorized operation.
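Independent of Numba, it is worth convincing ourselves that the pure-Python sum_sq does what its name says — it is the loop equivalent of the (x**2).sum() NumPy expression used for comparison. A quick stdlib-only check (an aside of ours, not from the original text):

```python
def sum_sq(a):
    # Pure-Python sum of squares, same logic as the chapter's example
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

values = [1.0, 2.0, 3.0, 4.0]
print(sum_sq(values))  # 1 + 4 + 9 + 16 = 30.0
```

The same result is returned for any sequence supporting len() and indexing, which is why the jitted version can be applied to both NumPy arrays and plain lists.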
The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum, in comparison with the in-place operations performed by our sum_sq function.

As we didn't use array-specific methods in sum_sq, we can also try to apply the same function on a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension:

x_list = x.tolist()
%timeit sum_sq(x_list)
1000 loops, best of 3: 199 µs per loop

%timeit sum([x**2 for x in x_list])
1000 loops, best of 3: 1.28 ms per loop

Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic. In the following sections, we will dig deeper and understand how Numba works, and evaluate the benefits and limitations of the Numba compiler.

Type specializations

As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example.

Numba exposes the specialized types using the signatures attribute. Right after the sum_sq definition, we can inspect the available specializations by accessing sum_sq.signatures, as follows:

sum_sq.signatures
# Output:
# []

If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly.
If we also apply the function on an array of float32, we can see how a new entry is added to the sum_sq.signatures list:

x = np.random.rand(1000).astype('float64')
sum_sq(x)
sum_sq.signatures
# Result:
# [(array(float64, 1d, C),)]

x = np.random.rand(1000).astype('float32')
sum_sq(x)
sum_sq.signatures
# Result:
# [(array(float64, 1d, C),), (array(float32, 1d, C),)]

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the types we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument:

@nb.jit((nb.float64[:],))
def sum_sq(a):

Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass an array, x, as float32, Numba will raise a TypeError:

sum_sq(x.astype('float32'))
# TypeError: No matching definition for argument type(s) array(float32, 1d, C)

Another way to declare signatures is through type strings. For example, a function that takes a float64 as input and returns a float64 as output can be declared with the "float64(float64)" string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows:

@nb.jit("float64(float64[:])")
def sum_sq(a):

You can also pass multiple signatures by passing a list:

@nb.jit(["float64(float64[:])", "float64(float32[:])"])
def sum_sq(a):

Object mode versus native mode

So far, we have shown how Numba behaves when handling a fairly simple function.
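To build intuition for what the signatures list represents, here is a toy pure-Python sketch of per-type specialization — a cache keyed by argument types, loosely analogous to (but vastly simpler than) what nb.jit does. Everything in this snippet is our own illustration, not Numba code:

```python
def specialize(fn):
    """Toy JIT-style dispatcher: keep one 'specialization' per argument-type tuple."""
    cache = {}

    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            # A real JIT would compile a type-specific machine-code version
            # here; we just record the signature and reuse the original.
            cache[key] = fn
        return cache[key](*args)

    wrapper.signatures = cache  # mimic Numba's .signatures attribute
    return wrapper

@specialize
def add(a, b):
    return a + b

add(1, 2)        # first call with (int, int): new "specialization"
add(1.0, 2.0)    # first call with (float, float): another one
print(list(add.signatures))  # one entry per encountered type signature
```

As with Numba, the cache starts empty and grows only when a call arrives with a type combination that has not been seen before.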
In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists. The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython.

When Numba cannot infer variable types, it will still try to compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the interpreter-free scenario, called native mode.

Numba provides a function, called inspect_types, that helps us understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function:

sum_sq.inspect_types()

When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and the types associated with them. For example, we can examine the N = len(a) line:

# --- LINE 4 ---
# a = arg(0, name=a) :: array(float64, 1d, A)
# $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>)
# $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64
# N = $0.3 :: int64

N = len(a)

For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return type of the len function are also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type.
Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode.

As a counter-example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings, and compile it as follows:

@nb.jit
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

Now, we can invoke this function with a list of strings and inspect the types:

concatenate(['hello', 'world'])
concatenate.signatures
# Output: [(reflected list(str),)]

concatenate.inspect_types()

Numba will return the output of the function for the reflected list(str) type. We can, for instance, examine how line 3 gets inferred. The output of concatenate.inspect_types() is reproduced here:

# --- LINE 3 ---
# strings = arg(0, name=strings) :: pyobject
# $const0.1 = const(str, ) :: pyobject
# result = $const0.1 :: pyobject
# jump 6
# label 6

result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than the pure Python counterpart:

x = ['hello'] * 1000
%timeit concatenate.py_func(x)
10000 loops, best of 3: 111 µs per loop

%timeit concatenate(x)
1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaint even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for other parts of the code.
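Independently of Numba, the behaviour of concatenate can be verified in plain Python; this quick check (an aside of ours) also recalls that ''.join is the interpreter's own idiomatic fast path for the same operation:

```python
def concatenate(strings):
    # Same logic as the chapter's example, pure Python only
    result = ''
    for s in strings:
        result += s
    return result

words = ['hello', 'world']
print(concatenate(words))                      # helloworld
print(''.join(words) == concatenate(words))    # True
```

When string performance matters in interpreted code, ''.join(strings) is generally preferred over repeated += precisely because it avoids building intermediate strings.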
This compilation strategy is called object mode.

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on the first invocation:

@nb.jit(nopython=True)
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

concatenate(x)
# Exception:
# TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs.

We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations.
The Cantor pairing function can be written as follows:

import numpy as np

def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

@np.vectorize
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

cantor(np.array([1, 2]), 2)
# Result:
# array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead. For this reason, ufunc implementation is usually done in C or Cython, but Numba beats all these methods by its convenience. All that is needed to perform the conversion is using the equivalent decorator, nb.vectorize. We can compare the speed of the standard np.vectorize version (which, in the following code, is called cantor_py) and the same function implemented using standard NumPy operations:

# Pure Python
%timeit cantor_py(x1, x2)
100 loops, best of 3: 6.06 ms per loop

# Numba
%timeit cantor(x1, x2)
100000 loops, best of 3: 15 µs per loop

# NumPy
%timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well because the function is simple and type inference is possible.

An additional advantage of universal functions is that, since they depend on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="parallel" or target="cuda" keyword argument to the nb.vectorize decorator.

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication.
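Returning to the Cantor pairing function for a moment: the pairing property itself — that (a, b) can be recovered from the single encoded number — is easy to check in pure Python. The inverse below follows the standard Cantor inversion formula and is an addition of ours, not shown in the original text:

```python
import math

def cantor(a, b):
    # Scalar Cantor pairing, as in the chapter
    return int(0.5 * (a + b) * (a + b + 1) + b)

def cantor_inverse(z):
    # Standard inversion: recover (a, b) from z.
    # w = a + b is found from the triangular-number part of z.
    w = (math.isqrt(8 * z + 1) - 1) // 2
    t = w * (w + 1) // 2
    b = z - t
    a = w - b
    return a, b

print(cantor(1, 2), cantor(2, 2))  # 8 12 (matching the array([ 8, 12]) result above)
print(cantor_inverse(8))           # (1, 2)
```

Round-tripping every pair in a small grid through cantor and cantor_inverse is a convenient way to verify the bijection before trusting the vectorized version.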
In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

a = np.random.rand(3, 3)
b = np.random.rand(3, 3)
c = np.matmul(a, b)
c.shape
# Result:
# (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over an array of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

a = np.random.rand(10, 3, 3)
b = np.random.rand(10, 3, 3)
c = np.matmul(a, b)
c.shape
# Output
# (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices with a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to obtain a size of (10, 3, 3):

a = np.random.rand(10, 3, 3)
b = np.random.rand(3, 3)  # Broadcasted to shape (10, 3, 3)
c = np.matmul(a, b)
c.shape
# Result:
# (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
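Before moving on, the batched matmul broadcasting described above is easy to sanity-check by comparing it against an explicit Python loop (a quick sketch, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((10, 3, 3))
b = rng.random((3, 3))   # broadcast against all 10 matrices in a

batched = np.matmul(a, b)                                   # shape (10, 3, 3)
looped = np.stack([np.matmul(a[i], b) for i in range(10)])  # one product at a time

print(batched.shape)                 # (10, 3, 3)
print(np.allclose(batched, looped))  # True
```

The two results agree element-wise, confirming that the single (3, 3) matrix is indeed multiplied against every matrix in the stack.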
The nb.guvectorize decorator requires two arguments:

The types of the input and output: two 1D arrays as input and a scalar as output
The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

@nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()')
def euclidean(a, b, out):
    N = a.shape[0]
    out[0] = 0.0
    for i in range(N):
        out[0] += (a[i] - b[i])**2

There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:]. Similarly, the layout string indicates that we have two arrays of size (n) and the output is a scalar, denoted by empty brackets: (). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written in the out array. The letter n in the layout string is completely arbitrary; you may choose to use k or other letters of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n),(m). Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

a = np.random.rand(2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (1,)

a = np.random.rand(10, 2)
b = np.random.rand(10, 2)
c = euclidean(a, b)  # Shape: (10,)

a = np.random.rand(10, 2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (10,)

How does the speed of euclidean compare to standard NumPy?
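Note that the gufunc above accumulates squared differences without taking a square root, so it actually returns the squared Euclidean distance. The NumPy equivalent that is benchmarked next can be wrapped up as a small reference function (my sketch, useful for checking results), with the same broadcasting behaviour over the last axis:

```python
import numpy as np

def euclidean_ref(a, b):
    # Reference for the gufunc in the text; like the gufunc, this
    # returns the *squared* Euclidean distance (no square root).
    return ((a - b) ** 2).sum(axis=-1)

a = np.array([[3.0, 4.0], [0.0, 0.0]])
b = np.array([0.0, 0.0])   # broadcast against both rows of a

print(euclidean_ref(a, b))  # [25.  0.]
```

Using known inputs like these makes it easy to verify that a compiled gufunc and its pure-NumPy counterpart produce identical results before comparing their speed.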
In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

a = np.random.rand(10000, 2)
b = np.random.rand(10000, 2)

%timeit ((a - b)**2).sum(axis=1)
1000 loops, best of 3: 288 µs per loop

%timeit euclidean(a, b)
10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.

Resources for Article:

Further resources on this subject:

Getting Started with Python Packages [article]
Python for Driving Hardware [article]
Python Data Science Up and Running [article]

Packt
23 Jun 2017
19 min read

Setting up your Raspberry Pi

In this article by Pradeeka Seneviratne and John Sirach, the authors of the book Raspberry Pi 3 Projects for Java Programmers, we will cover the following topics:

Getting started with the Raspberry Pi
Installing Raspbian

(For more resources related to this topic, see here.)

Getting started with the Raspberry Pi

With the release of the Raspberry Pi 3, the Raspberry Pi Foundation has made a very big step in the history of the Raspberry Pi. The current hardware architecture is now based on a 1.2 GHz 64-bit ARMv8 processor. This latest release of the Raspberry Pi also includes support for wireless networking and has an onboard Bluetooth 4.1 chip available. To get started with the Raspberry Pi, you will need the following components:

Keyboard and mouse: Having both a keyboard and mouse present will greatly help with the installation of the Raspbian distribution. Almost any keyboard or mouse will work.

Display: You can attach any compatible HDMI display, which can be a computer display or a television. The Raspberry Pi also has composite output shared with the audio connector. You will need an A/V cable if you want to use this output.

Power adapter: Because of all the enhancements made, the Raspberry Pi Foundation recommends a 5V adapter capable of delivering 2.5 A. You would be able to use a lower-rated one, but I strongly advise against this if you are planning to use all the available USB ports. The device is powered through a micro USB connector.

MicroSD card: The Raspberry Pi 3 uses a microSD card. I would advise using at least an 8 GB class 10 card. This will allow you to use the additional space to install applications, and as our projects will log data, you won't be running out of space soon.

The Raspberry Pi 3: Last but not least, a Raspberry Pi 3. Some of our projects will use the onboard Bluetooth chip, and this is also the version this article focuses on.

Our first step will be preparing an SD card for use with the Raspberry Pi.
You will need a MicroSD card, as the Raspberry Pi 3 only supports this format. The preparation of the SD card is done on a normal PC, so it is wise to purchase one with an adapter fitting a full-size SD card slot. There are webshops selling pre-formatted SD cards with the NOOBS installer already present on the card. If you have bought one of these pre-formatted cards, you can skip to the Installing Raspbian section.

Get a compatible SD card

There is a large number of SD cards available. The Raspberry Pi Foundation advises an 8 GB card, which leaves space to install different kinds of applications and supplies enough space for us to write any log data. When you buy an SD card, it is wise to keep your eyes open for the quality of these cards. Buying them from well-known and established manufacturers often provides better quality than the counterfeit ones. SD cards are sold with different class definitions. These classes state the minimum sustained write speeds: class 6 should provide at least 6 MB (megabytes) per second, and class 10 cards should provide at least 10 MB/s. There is a good online resource available which provides tested results of SD cards used with the Raspberry Pi. If you need any resource to check for compatible SD cards, I would advise you to go to the embedded Linux page at http://elinux.org/RPi_SD_cards.

Preparing and formatting the SD card

To be able to use the SD card, it first needs to be formatted. Most cards are already formatted with the FAT32 file system, which the Raspberry Pi NOOBS installer requires; if you have bought a large SD card, however, it may be formatted with the exFAT file system. Such cards should also be formatted as FAT32. To format the SD card, we will use the SD Association's SDFormatter utility, which you can download from the SD Association's website, as the default formatters supplied with operating systems do not always provide optimal results.
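Since a class rating is a minimum sustained write speed (class N guarantees at least N MB/s), you can put a rough upper bound on how long writing a full card image should take. A back-of-the-envelope sketch (the 8 GB figure matches the card size recommended above):

```python
# Worst-case time to write an 8 GB image at the minimum speed
# guaranteed by each SD speed class (class N = at least N MB/s).
CARD_MB = 8 * 1000  # 8 GB, counting 1 GB as 1000 MB as card vendors do

for speed_class in (4, 6, 10):
    seconds = CARD_MB / speed_class
    print("class %2d: at most %.0f minutes" % (speed_class, seconds / 60))
```

Real cards usually write faster than their minimum rating, so treat these numbers as a ceiling, not an estimate.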
In the below screenshot, the SDFormatter for the Mac is shown. This utility is also available for Windows and has the same options. If you are using Linux, you can use GParted. Make sure when using GParted that you use FAT32 as the formatting option. As in the screenshot, select the Overwrite format option and give the SD card a label. The example shows RPI3JAVA, but this can be a personal label of your choice to quickly recognize the card when inserted:

Press the Format button to start formatting the SD card. Depending on the size of the SD card, this can take some time, enabling you to get a cup of coffee. The utility will show a done message in the form of Card Format complete when the formatting is done. You will now have a usable SD card. To be able to use the NOOBS installer, you will need to follow these steps:

Download the NOOBS installer from https://www.raspberrypi.org/downloads/.
Unzip the file with your favorite unzip utility. Most operating systems already have one installed.
Copy the contents of the unzipped file into the SD card's root directory.

When selecting the NOOBS download, only select the lite version if you do not mind installing Raspbian over the Raspberry Pi's network connection. Now that we have copied the required files onto the SD card, we can start installing the Raspbian Operating System.

Installing Raspbian

To install Raspbian, we need to get the Raspberry Pi ready for use. As the Raspberry Pi has no power button, powering the Raspberry Pi will be done as the last step:

At the bottom of the Raspberry Pi, on the side, you will see a slot for your MicroSD card. Insert the SD card with the connectors pointing to the board. Next, connect the HDMI or the composite connector and your keyboard and mouse. You won't need a network cable, as we will be using the wireless functionality built into the Raspberry Pi.
We will now connect the Raspberry Pi to the micro USB power supply. When the Raspberry Pi boots up, you will be presented with the installation options of the Operating Systems available to be installed. Depending on the NOOBS download you chose, you will be able to see whether the Raspbian Operating System is already available on the SD card or whether it will be installed by downloading it. This is visualized by showing an SD card image or a network image behind the Operating System name. In the below screenshot, you see the NOOBS installer with the Raspbian image available on the SD card. At the bottom of the installation screen, you will find the Language and Keyboard drop-down menus. Make sure you select the appropriate language and keyboard, otherwise it will become quite difficult to enter correct characters on the command line and in other tools requiring text input. Select the Raspbian [RECOMMENDED] option and click the Install (i) button to start installing the Operating System:

You will be prompted with a popup confirming the installation, as it will overwrite any existing installed Operating Systems. As we are using a clean SD card, we will not be overwriting any, so it is safe to press Yes to start the installation. The installation will take a couple of minutes, so it is a good time to go for a second cup of coffee. When the installation is done, you can press OK in the popup which appears, and the Raspberry Pi will reboot. Because Raspbian is a Linux OS, you will see text scrolling by for the services which are being started by the OS. When all services are started, the Raspberry Pi will start the default graphical environment called LXDE, which is one of the Linux window managers.

Configuring Raspbian

Now that we have installed Raspbian and have it booting into the graphical environment, we can start configuring the Raspberry Pi for our purposes.
To be able to configure the Raspberry Pi, the graphical environment has a utility installed which eases up the configuration, called Raspberry Pi Configuration. To open this tool, use the mouse and click on the Menu button on the top left, navigate to Preferences, and press the Raspberry Pi Configuration menu option, as shown in the screenshot:

When you click on the Raspberry Pi Configuration tool menu option, a popup will appear with the graphical version of the well-known raspi-config command-line tool. In the graphical popup, we see four tabs covering different parts of the possible configuration options. We first focus on the System tab, which allows us to:

Change the Password
Change the Hostname, which helps to identify the Raspberry Pi on the network
Change the Boot method to the desktop or to the CLI, which is the command-line interface
Set the Network at Boot option

With the system newly installed, the default username is pi and the password is set to raspberry. Because these are the default settings, it is recommended that we change the password to a new one. Press the Change Password button and enter a newly chosen password twice: one time to set the password and the second time to make sure we have entered the new password correctly. Press OK when the password has been entered twice. Try to come up with a password which contains capital letters, numbers, and some special characters, as this will make it more difficult to guess. Now that we have set a new password, we are going to change the hostname of the Raspberry Pi. With the hostname, we are able to identify the device on the network. I have changed the hostname to RASPI3JAVA, which helps me to identify this Raspberry Pi as the one used for the article. The hostname is shown on the command-line interface, so you will immediately identify this Raspberry Pi when you log in. By default, the Raspberry Pi boots into the graphical user interface which you are looking at right now.
Because a future project will require us to make use of a display with our application, we will choose to boot into CLI mode. Click on the radio button which says To CLI. The next time we reboot, we will be shown the command-line interface. Because we will be using the integrated Wi-Fi connection on this Raspberry Pi, we are going to change the Network at Boot option to have it wait for the network. Tick the box which says Wait for network. We are now done with the default settings, which set some primary options to help us identify the Raspberry Pi, and we have changed the boot mode. We will now change some advanced settings which enable us to make use of the hardware provided by the Raspberry Pi. Click on the Interfaces tab, which will give us a list of the available hardware. This list consists of:

Camera: The official Raspberry Pi camera interface
SSH: To be able to log in from remote locations
SPI: Serial Peripheral Interface bus for communicating with hardware
I2C: Serial communication bus, mostly used between chips
Serial: The serial communication interface
1-Wire: Low-data-rate bus interface that can also supply power, conceptually based on I2C
Remote GPIO

A future project we will be working on will require some kind of camera interface. This project will be able to use both the locally attached official Raspberry Pi camera module as well as a USB-connected webcam. If you have the camera module, tick the Enabled radio box behind Camera. We will be deploying our applications immediately from the editor, which means we need to enable the SSH option. By default, it already is enabled, so we leave the setting as it is. If the SSH option is not enabled, tick the radio button Enabled behind the SSH option. For now, you can leave the other interfaces disabled, as we will only enable them when we need them. Now that we have enabled the default interfaces we are going to need, we are going to do some performance tweaking.
Click on the Performance tab to open the performance options. We will not need to overclock the Raspberry Pi, so you can leave this option as it is. Later on in the article, we will be interfacing with the Raspberry Pi's display and doing some neat tricks with it. For this, we need some amount of memory for the Graphics Processing Unit, the GPU. By default, this is set to 64 MB. We will ask for the largest amount of memory possible to be assigned to the GPU, which is 512 MB. Put 512 behind the GPU Memory option; there is no need to enter the text "MB". The memory on the Raspberry Pi is shared between the system and the GPU. Having this option set to 512 MB results in only 512 MB available for the system; I can assure you this is more than sufficient. Now that we are done with the system configuration, we will make sure we can work with the Raspberry Pi. Click on the Localisation tab to show the options applicable to the location where the Raspberry Pi resides. We have the following options:

Set Locale: Where you set your locale settings
Timezone: The time zone you are currently in
Keyboard: The layout of the keyboard
Wi-Fi Country: The country in which you will be making the Wi-Fi connection

This article is focused on US English with a broad character set. Unless you prefer to continue with your own personal preferences, change the following by pressing the Set Locale button:

Language to en (English)
Country to US (USA)
Character Set to UTF-8

Press the OK button to continue. As this needs to build up the locale settings, it can take about 20 seconds, and you will be notified with a small window until this process is finished. The next step is to set the timezone. This is needed as we want times and dates shown correctly. Click on the Set Timezone button and select your appropriate Area and Location from the drop-down menus. When done, press the OK button.
To make sure that the text we enter in any input field is correct, we are going to set the layout of our keyboard. There are a lot of layouts available, so you need to check yours. The Raspberry Pi is quite helpful in presenting the keyboard options. Press the Set Keyboard button to open up a popup showing the keyboard options. Here you are able to select your country and the keyboard layouts available for this country. In my case, I have to select United States as the Country and the Variant as English (US, with euro on 5). After you have made the selection, you can test your keyboard setup in the input field below the Country and Variant selection lists. Press the OK button when you are satisfied with your selection. Unless you are connecting your Raspberry Pi with a wired network connection, we are going to set the country in which we will make the Wi-Fi connection, so we are able to connect remotely to the Raspberry Pi. Press the Set Wi-Fi Country button to show the Wi-Fi Country Code popup, which provides the list of available countries for the Wi-Fi connection. Press the OK button after you have made your selection. We are now done with the minimal Raspberry Pi system configuration. Press OK in the settings window to have all our settings stored, and press No in the popup that follows, which says a reboot is needed to have the settings applied, as we are not completely done yet. Our final step is to set up the local Wi-Fi chip on the Raspberry Pi. Unless you want your Raspberry Pi connected with a network cable, in which case you can skip this section and head over to the Set fixed IP section:

To set up the Wi-Fi, click on the Network icon which is shown at the top of the screen, between the Bluetooth and Speaker Volume icons. When you click this button, the Raspberry Pi will start to scan for available wireless networks. Give it a couple of seconds if your network does not appear immediately.
When you see your network appearing, click on it and, if your network is secured, you will be asked to supply the credentials to be able to connect to your wireless network. If you have trouble connecting to your wireless network without any message appearing, log in to your router and change the Wi-Fi channel to a channel lower than channel 11. When you have entered your credentials and pressed the OK button, you will see the icon change from the two networked computers to the wireless icon trying to connect to the wireless network. Now that the Raspberry Pi has rebooted and we have configured the wireless network, we will make sure the wireless network keeps its connection. As the Raspberry Pi is an embedded device targeting low power consumption, the Wi-Fi connection is possibly put into sleep mode after a specific period of no network usage. To make sure the Wi-Fi does not go into power-save mode, we will change a setting through the command line. To open a command-line interface, we need to open a terminal. A terminal is a window which shows the command prompt, where we are able to enter commands. When you look at the graphical interface, you will notice a small computer screen icon at the top in the menu bar. When we hover over this icon, it shows the text Terminal. Press this icon to open up a terminal. A popup will open with a large black screen showing the command prompt, as shown in the screenshot:

Do you notice the hostname we set earlier? This is the same prompt as we will see when we log in remotely. Now that we have a command line open, we need to enter a command to make sure the wireless network will not go to sleep after a period of no network activity. Enter the following in the terminal:

sudo iw dev wlan0 set power_save off

Press Enter. This command sets the power-save mode to off, so the wlan0 (wireless) device won't enter power-save mode and stays connected to the network.
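Note that the iw setting above does not survive a reboot. One way to reapply it at boot, assuming a stock Raspbian image where /etc/rc.local is still executed at startup, is to add the command just before the final exit 0 (a sketch; the interface name wlan0 is taken from the text):

```shell
# /etc/rc.local (excerpt) -- runs as root at the end of the boot sequence.
# Reapply the Wi-Fi power-save setting on every boot; the interface
# name wlan0 matches the one used in the text.
iw dev wlan0 set power_save off

exit 0
```

After editing the file, reboot and run iw dev wlan0 get power_save to confirm the setting reads off.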
We are almost done with setting up the Raspberry Pi. To be able to connect to the Raspberry Pi from a remote location, we need to know the IP address of the Raspberry Pi. This final configuration step involves setting a fixed IP address in the Raspberry Pi settings. To open the settings for a fixed IP configuration, right-click the wireless network icon and press the Wi-Fi Networks (dhcpcdui) Settings option. A popup will appear providing the settings we can change. As we will only change the settings of the Wi-Fi connection, we select interface next to the Configure option. When interface is selected, we are able to select the wlan0 option in the drop-down menu next to the interface selection. If you have chosen to use a wired connection instead of the wireless one, you can select the eth0 option next to the interface option. We now have a couple of options available to enter IP address-related information. Please refer to the documentation of your router to find out which IP address is available for you to use. My advice is to only enter the IP address in the available fields, which leaves the other options automatically configured, as in the screenshot below. Note that the entered IP address applies only to my configuration, which could differ from yours:

After you have entered the IP address, you can click Apply and Close. It is now time to restart our Raspberry Pi and have it boot to the CLI. While rebooting, you will see a lot of text scrolling by, showing the services starting, and at the end, instead of the graphical interface, we are now shown the text-based command-line interface, as shown in the screenshot:

If you want to return to the graphical interface, just type in:

startx

Press Enter and wait a couple of seconds for the graphical user interface to appear again. We are now ready to install the Oracle JDK, which we will be using to run our Java applications.
Summary

In this article, we learned how to get started with the Raspberry Pi and how to install Raspbian.

Resources for Article:

Further resources on this subject:

The Raspberry Pi and Raspbian
Raspberry Pi Gaming Operating Systems
Sending Notifications using Raspberry Pi Zero

Packt
23 Jun 2017
12 min read

Spatial Data

In this article by Dominik Mikiewicz, the author of the book Mastering PostGIS, we will look at exporting data from PostgreSQL/PostGIS to files or other data sources. Sharing data via the Web is no less important, but it has its own specific process. There may be different reasons for having to export data from a database, but certainly sharing it with others is among the most popular ones. Backing the data up or transferring it to other software packages for further processing are other common reasons for learning the export techniques. (For more resources related to this topic, see here.) In this article, we'll have a closer look at the following:

Exporting data using COPY (and \copy)
Exporting vector data using pgsql2shp
Exporting vector data using ogr2ogr
Exporting data using GIS clients
Outputting rasters using GDAL
Outputting rasters using psql
Using the PostgreSQL backup functionality

We just do the steps the other way round as when importing. In other words, this article may give you a bit of a déjà vu feeling.

Exporting data using COPY in psql

When we were importing data, we used the psql \copy FROM command to copy data from a file to a table. This time, we'll do it the other way round, from a table to a file, using the \copy TO command. \copy TO can not only copy a full table, but also the result of a SELECT query, which means we can output filtered subsets of the source tables. Similarly to the method we used to import, we can execute COPY or \copy in different scenarios: we'll use psql in interactive and non-interactive mode, and we'll also do the very same thing in PgAdmin. It is worth remembering that COPY can only read/write files that can be accessed by an instance of the server, so usually files that reside on the same machine as the database server.
For detailed information on COPY syntax and parameters, type:

\h copy

Exporting data in psql interactively

In order to export the data in interactive mode, we first need to connect to the database using psql:

psql -h localhost -p 5434 -U postgres

Then type the following:

\c mastering_postgis

Once connected, we can execute a simple command:

\copy data_import.earthquakes_csv TO earthquakes.csv WITH DELIMITER ';' CSV HEADER

The preceding command exported the data_import.earthquakes_csv table to a file named earthquakes.csv, with ';' as a column separator. A header in the form of column names has also been added to the beginning of the file. The output should be similar to the following:

COPY 50

Basically, the database told us how many records have been exported. The content of the exported file should exactly resemble the content of the table we exported from:

time;latitude;longitude;depth;mag;magtype;nst;gap;dmin;rms;net;id;updated;place;type;horizontalerror;deptherror;magerror;magnst;status;locationsource;magsource
2016-10-08 14:08:08.71+02;36.3902;-96.9601;5;2.9;mb_lg;;60;0.029;0.52;us;us20007csd;2016-10-08 14:27:58.372+02;15km WNW of Pawnee, Oklahoma;earthquake;1.3;1.9;0.1;26;reviewed;us;us

As mentioned, \copy can also output the results of a SELECT query. This means we can tailor the output to very specific needs, as required. In the next example, we'll export data from a 'spatialized' earthquakes table, but the geometry will be converted to a WKT (well-known text) representation. We'll also export only a subset of columns:

\copy (select id, ST_AsText(geom) FROM data_import.earthquakes_subset_with_geom) TO earthquakes_subset.csv WITH CSV DELIMITER '|' FORCE QUOTE * HEADER

Once again, the output just specifies the number of records exported:

COPY 50

The executed command exported only the id column and a WKT-encoded geometry column. The export force-wrapped the data in quote symbols, with a pipe (|) symbol used as a delimiter.
The file has a header:

id|st_astext
"us20007csd"|"POINT(-96.9601 36.3902)"
"us20007csa"|"POINT(-98.7058 36.4314)"

Exporting data in psql non-interactively

If you're still in psql, you can execute a script by simply typing the following:

\i path/to/the/script.sql

For example:

\i code/psql_export.sql

The output will not surprise us, as it will simply state the number of records outputted:

COPY 50

If you have already quit psql, the command-line equivalent of \i is the -f switch, so the command should look like this:

psql -h localhost -p 5434 -U postgres -d mastering_postgis -f code/psql_export.sql

Not surprisingly, the output is once again the following:

COPY 50

Exporting data in PgAdmin

In PgAdmin, the command is COPY rather than \copy. The rest of the code remains the same. Another difference is that we need to use an absolute path, while in psql we can use paths relative to the directory we started psql in. So the first psql query 'translated' to the PgAdmin SQL version looks like this:

copy data_import.earthquakes_csv TO 'F:\mastering_postgis\chapter06\earthquakes.csv' WITH DELIMITER ';' CSV HEADER

The second query looks like this:

copy (select id, ST_AsText(geom) FROM data_import.earthquakes_subset_with_geom) TO 'F:\mastering_postgis\chapter06\earthquakes_subset.csv' WITH CSV DELIMITER '|' FORCE QUOTE * HEADER

Both produce a similar output, but this time it is logged in the 'Messages' tab of PgAdmin's query output pane:

Query returned successfully: 50 rows affected, 55 msec execution time.

It is worth remembering that COPY is executed as part of an SQL command, so it is effectively the DB server that tries to write to the file. Therefore, it may be the case that the server is not able to access a specified directory. If your DB server is on the same machine as the directory that you are trying to write to, relaxing the directory access permissions should help.
Exporting vector data using pgsql2shp

pgsql2shp is a command-line tool that can be used to output PostGIS data into shapefiles. Similarly to COPY TO, it can either export a full table or the result of a query, which gives us flexibility when we only need a subset of data to be outputted and we do not want to either modify the source tables or create temporary, intermediate ones.

pgsql2shp command line

In order to get some help with the tool, just type the following in the console:

pgsql2shp

The general syntax for the tool is as follows:

pgsql2shp [<options>] <database> [<schema>.]<table>
pgsql2shp [<options>] <database> <query>

Shapefile is a format that is made up of a few files. The minimum set is SHP, SHX, and DBF. If PostGIS is able to determine the projection of the data, it will also export a PRJ file containing the SRS information, which should be understood by software able to consume a shapefile. If a table does not have a geometry column, then only a DBF file, the equivalent of the table data, will be exported. Let's export a full table first:

pgsql2shp -h localhost -p 5434 -u postgres -f full_earthquakes_dataset mastering_postgis data_import.earthquakes_subset_with_geom

The following output should be expected:

Initializing...
Done (postgis major version: 2).
Output shape: Point
Dumping: X [50 rows]

Now let's do the same, but this time with the result of a query:

pgsql2shp -h localhost -p 5434 -u postgres -f full_earthquakes_dataset mastering_postgis "select * from data_import.earthquakes_subset_with_geom limit 1"

To avoid being prompted for a password, try providing it within the command via the -P switch. The output will be very similar to what we have already seen:

Initializing...
Done (postgis major version: 2).
Output shape: Point
Dumping: X [1 rows]

In the data we previously imported, we do not have examples that would manifest shapefile limitations. It is worth knowing about them, though.
You will find a decent description at https://en.wikipedia.org/wiki/Shapefile#Limitations. The most important ones are as follows:
Column name length limit: The shapefile can only handle column names with a maximum length of 10 characters; pgsql2shp will not produce duplicate columns, though; if there were column names that would result in duplicates when truncated, then the tool will add a sequence number.
Maximum field length: The maximum field length is 255; pgsql2shp will simply truncate the data upon exporting.
In order to demonstrate the preceding limitations, let's quickly create a test PostGIS dataset. First, create the export schema if it does not exist:
CREATE SCHEMA IF NOT EXISTS data_export;

CREATE TABLE IF NOT EXISTS data_export.bad_bad_shp (
    id character varying,
    "time" timestamp with time zone,
    depth numeric,
    mag numeric,
    very_very_very_long_column_that_holds_magtype character varying,
    very_very_very_long_column_that_holds_place character varying,
    geom geometry
);

INSERT INTO data_export.bad_bad_shp
SELECT * FROM data_import.earthquakes_subset_with_geom LIMIT 1;

UPDATE data_export.bad_bad_shp
SET very_very_very_long_column_that_holds_magtype = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce id mauris eget arcu imperdiet tristique eu sed est. Quisque suscipit risus eu ante vestibulum hendrerit ut sed nulla. Nulla sit amet turpis ipsum. Curabitur nisi ante, luctus nec dignissim ut, imperdiet id tortor. In egestas, tortor ac condimentum sollicitudin, nisi lacus porttitor nibh, a tempus ex tellus in ligula. Donec pharetra laoreet finibus. Donec semper aliquet fringilla. Etiam faucibus felis ac neque facilisis vestibulum. Vivamus scelerisque at neque vel tincidunt.
Phasellus gravida, ipsum vulputate dignissim laoreet, augue lacus congue diam, at tempus augue dolor vitae elit.';
Having prepared a deliberately problematic dataset, let's now export it to SHP to see if our shapefile warnings were right:
pgsql2shp -h localhost -p 5434 -u postgres -f bad_bad_shp mastering_postgis data_export.bad_bad_shp
When you now open the exported shapefile in a GIS client of your choice, you will see our very, very long column names renamed to VERY_VERY_ and VERY_VE_01. The content of the very_very_very_long_column_that_holds_magtype field has also been truncated to 255 characters, and is now 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce id mauris eget arcu imperdiet tristique eu sed est. Quisque suscipit risus eu ante vestibulum hendrerit ut sed nulla. Nulla sit amet turpis ipsum. Curabitur nisi ante, luctus nec dignissim ut'.
For the sake of completeness, we'll also export a table without geometry, so we are certain that pgsql2shp exports only a DBF file:
pgsql2shp -h localhost -p 5434 -u postgres -f a_lonely_dbf mastering_postgis "select id, place from data_import.earthquakes_subset_with_geom limit 1"
pgsql2shp GUI
We have already seen PgAdmin's GUI for importing shapefiles. As you surely remember, that GUI also has an Export tab. If you happen to encounter difficulties locating it in pgAdmin 4, try calling it from the shell/command line by executing shp2pgsql-gui. If for some reason it is not recognized, try to locate the utility in your DB directory under bin/postgisgui/shp2pgsql-gui.exe.
In order to export a shapefile from PostGIS, go to Plugins | PostGIS Shapefile and DBF Loader 2.2 (the version may vary); then you have to switch to the Export tab:
It is worth mentioning that you have some options to choose from when exporting. They are rather self-explanatory:
When you press the Export button, you can choose the output destination.
The log is displayed in the 'Log Window' area of the exporter GUI:
Exporting vector data using ogr2ogr
We have already seen a little preview of ogr2ogr exporting data when we made sure that our KML import had actually brought in the proper data. This time we'll expand on the subject a bit and also export a few more formats to give you an idea of how versatile a tool ogr2ogr is. In order to get some information on the tool, simply type the following in the console:
ogr2ogr
Alternatively, if you would like to get some more descriptive info, visit http://www.gdal.org/ogr2ogr.html. You could also type the following:
ogr2ogr --long-usage
The nice thing about ogr2ogr is that the tool is very flexible and offers some options that allow us to export exactly what we are after:
You can specify what data you would like to select by specifying the columns in a -select parameter.
The -where parameter lets you specify the filtering for your dataset in case you want to output only a subset of data.
Should you require more sophisticated output preparation logic, you can use a -sql parameter.
This is obviously not all there is. The usual gdal/ogr2ogr parameters are available too. You can reproject the data on the fly using the -t_srs parameter, and if, for some reason, the SRS of your data has not been clearly defined, you can use -s_srs to instruct ogr2ogr what the source coordinate system of the dataset being processed is. There are obviously advanced options too. Should you wish to clip your dataset to a specified bounding box, polygon, or coordinate system, have a look at the -clipsrc and -clipdst parameters and their variations. The last important parameter to know is -dsco, the dataset creation options parameter. It accepts values in the form of NAME=VALUE. When you want to pass more than one option this way, simply repeat the parameter.
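Putting several of these switches together, here is a hypothetical invocation assembled in Python (the output path, column names, filter, and the placeholder creation option are my own assumptions for illustration; valid NAME=VALUE pairs depend on the output driver):

```python
# Sketch: assemble an ogr2ogr command that exports a filtered, reprojected
# subset of a PostGIS table to GeoJSON, with a -dsco creation option.
def build_ogr2ogr_cmd(out_path, table, columns, where, target_srs):
    return [
        "ogr2ogr",
        "-f", "GeoJSON", out_path,
        "PG:host=localhost port=5434 user=postgres dbname=mastering_postgis",
        table,
        "-select", ",".join(columns),   # export only these columns
        "-where", where,                # filter rows on the fly
        "-t_srs", target_srs,           # reproject during export
        "-dsco", "SOME_OPTION=VALUE",   # placeholder; depends on the driver
    ]

cmd = build_ogr2ogr_cmd(
    "earthquakes.geojson",
    "data_import.earthquakes_subset_with_geom",
    ["id", "mag"],
    "mag > 5",
    "EPSG:4326",
)
print(" ".join(cmd))
# The list can be handed to subprocess.run(cmd) once ogr2ogr is on the PATH.
```

Building the command as a list of arguments (rather than one shell string) keeps quoting of the -where expression and the PG connection string unambiguous.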
The actual dataset creation options depend on the format used, so it is advised that you consult the appropriate format information pages available via the ogr2ogr website.
Summary
There are many ways of getting data out of a database. Some are PostgreSQL specific, some are PostGIS specific. The point is that you can use and mix any tools you prefer. There will be scenarios where simple data extraction procedures will do just fine; some other cases will require a more specialized setup, SQL or psql, or even writing custom code in external languages. I do hope this article gives you a toolset you can use with confidence in your daily activities.
Resources for Article:
Further resources on this subject:
Spatial Data Services [article]
Data Around Us [article]
R ─ Classification and Regression Trees [article]
Use in Real World Application

Packt
23 Jun 2017
11 min read
In this article by Giorgio Zarrelli, author of the book Mastering Bash, we are taking a step into the real world, creating something that can turn out handy for your daily routine; during this process we will have a look at the common pitfalls in coding and how to make our scripts reliable. Be it a short or a long script, we must always ask ourselves the same questions:
What do we really want to accomplish?
How much time do we have?
Do we have all the resources needed?
Do we have the knowledge required for the task?
(For more resources related to this topic, see here.)
We will start coding with a Nagios plugin, which will give us a broad understanding of how this monitoring system works and of how to make a script dynamically interact with other programs.
What is Nagios
Nagios is one of the most widely adopted open source IT infrastructure monitoring tools, and its most interesting feature is the fact that it does not know how to monitor anything. Well, it may sound like a joke, but Nagios can actually be defined as an evaluating core, which takes some information as input and reacts accordingly. How is this information gathered? That is not the main concern of this tool, and this leads us to an interesting point: Nagios leaves the task of getting the monitored data to an external plugin, which:
Knows how to connect to the monitored services
Knows how to collect the data from the monitored services
Knows how to evaluate the data
Informs Nagios if the values gathered are beyond or within the boundaries that raise an alarm
So, a plugin does a lot of things, and one might ask: what does Nagios itself do, then?
Imagine it as an exchange pod where information flows in and out and decisions are taken based on the configuration set; the core triggers the plugin to monitor a service, the plugin returns some information, and Nagios takes a decision about:
Whether to raise an alarm
Whether to send a notification
Whom to notify
For how long
Which action, if any, is taken in order to get back to normality
The core Nagios program does everything except actually knock at the door of a service, ask for information, and decide whether this information shows some issues or not. Planning must be done, but it can be fun.
Active and passive checks
To understand how to code a plugin, we first have to grasp how, on a broad scale, a Nagios check works. There are two different kinds of checks:
Active check
Based on a time range, or manually triggered, an active check sees a plugin actively connecting to a service and collecting information. A typical example could be a plugin to check the disk space: once invoked, it interfaces with (usually) the operating system, executes a df command, works on the output, extracts the value related to the disk space, evaluates it against some thresholds, and reports back a status, like OK, WARNING, CRITICAL, or UNKNOWN.
Passive check
In this case, Nagios does not trigger anything but waits to be contacted, by some means, by the service that must be monitored. It may seem quite confusing, so let's take a real-life example. How would you monitor whether a disk backup has completed successfully? One quick answer would be: knowing when the backup task starts and how long it lasts, we can define a time and invoke a script to check the task at that given hour. Nice, but when we plan something we must have a full understanding of how real life goes, and a backup is not our little pet in the living room; it's rather a beast that does what it wants. A backup can last a variable amount of time depending on unpredictable factors.
For instance, your typical backup task would copy 1 TB of data in 2 hours, starting at 03:00, out of a 6 TB disk. So, the next backup task would start at 03:00+02:00=05:00 AM, give or take some minutes; you set up an active check for it at 05:30 and it works well for a couple of months. Then, one early morning you receive a notification on your smartphone: the backup is CRITICAL. You wake up, connect to the backup console, and see that at 06:00 in the morning the backup task has not even been started by the console. Then you have to wait until 08:00 AM, when one of your colleagues shows up at the office, to find out that the day before, the disk you back up was filled with 2 extra TB of data due to an unscheduled data transfer. So, the backup task preceding the one you are monitoring lasted not a couple of hours but 6 hours, and the task you are monitoring then started at 09:30 AM. Long story short, your active check was fired too early; that is why it failed. Maybe you are tempted to move your schedule some hours ahead, but simply do not do it: these time slots are not sliding frames. If you move your check ahead, you should then move all the checks for the subsequent tasks ahead too. You do it; then in one week the project manager asks someone to delete the 2 TB in excess (they are of no more use for the project), your schedules are 2 hours ahead, and your monitoring is useless again. So, as we stressed before, planning and analyzing the context are the key factors in making a good script and, in this case, a good plugin. We have a service that does not run 24/7 like a web service or a mail service; what is specific to the backup is that it runs periodically, but we do not know exactly when. The best approach to this kind of monitoring is letting the service itself notify us when it has finished its task and what its outcome was.
That is usually accomplished using the ability of most backup programs to send a Simple Network Management Protocol (SNMP) trap to a destination to inform it of the outcome; in our case that destination would be the Nagios server, which would have been configured to receive the trap and analyze it. Add to this an event horizon, so that if we do not receive that specific trap in, let's say, 24 hours, we raise an alarm anyway, and you are covered: whenever the backup task gets completed, or when it times out, we receive a notification.
Nagios notifications flowchart
Return codes and thresholds
Before coding a plugin we must face some concepts that will be the stepping stones of our Nagios plugin coding, one of these being the return codes of the plugin itself. As we already discussed, once the plugin collects the data about how the service is doing, it evaluates this data and determines whether the situation falls under one of the following statuses:

Return code | Status   | Description
0           | OK       | The plugin checked the service and the results are inside the acceptable range
1           | WARNING  | The plugin checked the service and the results are above a warning threshold. We must keep an eye on the service
2           | CRITICAL | The plugin checked the service and the results are above a critical threshold, or the service is not responding. We must react now
3           | UNKNOWN  | Either we passed the wrong arguments to the plugin or there is some internal error in it

So, our plugin will check a service, evaluate the results, and, based on a threshold, will return to Nagios one of the values listed in the table and a meaningful message, as we can see in the description column in the following image. Notice the service check in red and the message in the figure above. In the image we can see that some checks are green, meaning OK, and they have an explicative message in the description section: what we see in this section is the output of the plugin written to stdout, and it is what we will craft as a response to Nagios.
Pay attention to the ssh check: it is red, and it is failing because it is checking the service at the default port, which is 22, but on this server the ssh daemon is listening on a different port. This leads us to a consideration: our plugin will need a command-line parser able to receive some configuration options and some threshold limits as well, because we need to know what to check, where to check it, and what the acceptable working limits for a service are:
Where: In Nagios there can be a host without service checks (except for the implicit host-alive check carried out by a ping), but no services without a host to be performed on. So any plugin must receive on the command line an indication of the host it must be run against; be it even a dummy host, there must be one.
How: This is where our coding comes in; we will have to write the lines of code that instruct the plugin how to connect to the server, query it, and collect and parse the answer.
What: We must instruct the plugin, usually with some meaningful options on the command line, on what the acceptable working limits are, so that it can evaluate them and decide whether to reply with an OK, WARNING, or CRITICAL message.
That is all for our script; whom to notify, when, how, for how many times, and so forth are tasks carried out by the core. A Nagios plugin is unaware of all of this. What it really must know for effective monitoring is what the correct values are that identify a working service. We can pass our script two different kinds of values:
Range: A series of numeric values with a starting and ending point, like from 3 to 7, or from one number to infinity.
Threshold: A range with an associated alert level.
So, when our plugin performs its check, it collects a numeric value that is within or outside a range; based on the threshold we impose and on the evaluation, it replies to Nagios with a return code and a message. How do we specify ranges on the command line?
Essentially in the following way:
[@]start_value:end_value
If the range starts from 0, the part to the left of the : can be omitted.
start_value must always be a lower number than end_value.
If the range is given as start_value: alone, it means from that number to infinity.
Negative infinity can be specified using ~.
An alert is generated when the collected value resides outside the specified range, endpoints included.
If @ is specified, the alert is generated if the value resides inside the range.
Let's see some practical examples of how we would call our script, imposing some thresholds:

Plugin call               | Meaning
./my_plugin -c 10         | CRITICAL if less than 0 or higher than 10
./my_plugin -w 10:20      | WARNING if less than 10 or higher than 20
./my_plugin -w ~:15 -c 16 | WARNING if higher than 15, CRITICAL if less than 0 or higher than 16
./my_plugin -c 35:        | CRITICAL if the value collected is below 35
./my_plugin -w @100:200   | WARNING if the value is from 100 to 200, OK otherwise

We covered the basic requirements for our plugin, which in its simplest form should be called with the following syntax:
./my_plugin -h hostaddress|hostname -w value -c value
We already talked about the need to relate a check to a host, and we can do this using either a host name or a host address. It is up to us which to use, but we will not fill in this piece of information ourselves, because it will be drawn from the service configuration as a standard macro. We just introduced a new concept, the service configuration, which is essential in making our script work in Nagios.
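To make the range grammar and the return codes concrete, here is a hedged sketch in Python (Nagios plugins may be written in any language that can print a status line to stdout and exit with the right code; the function names and the hard-coded sample value are mine for illustration, and a real plugin would finish with sys.exit(code)):

```python
# Nagios return codes, as listed in the table above
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def parse_range(spec):
    """Parse a Nagios-style range: [@]start:end; a bare number N means 0:N."""
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec            # bare number means 0..N
    start = float("-inf") if start_s == "~" else float(start_s or 0)
    end = float("inf") if end_s == "" else float(end_s)
    return inside, start, end

def alert(value, spec):
    """True if the collected value should raise an alert for this range."""
    inside, start, end = parse_range(spec)
    in_range = start <= value <= end          # endpoints included
    return in_range if inside else not in_range

def check(value, warn_spec, crit_spec):
    """Evaluate a collected value against -w and -c thresholds, Nagios style."""
    if alert(value, crit_spec):
        return CRITICAL, "CRITICAL - value is %s" % value
    if alert(value, warn_spec):
        return WARNING, "WARNING - value is %s" % value
    return OK, "OK - value is %s" % value

# ./my_plugin -w 10:20 -c 0:30 with a collected value of 25:
code, message = check(25, "10:20", "0:30")
print(code, message)  # → 1 WARNING - value is 25
```

The message printed to stdout is what shows up in the description column of the Nagios console, while the exit code is what the core acts upon.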
Getting Started with Python and Machine Learning

Packt
23 Jun 2017
34 min read
In this article by Ivan Idris, Yuxi (Hayden) Liu, and Shoahoa Zhang, authors of the book Python Machine Learning By Example, we cover basic machine learning concepts. If you need more information, then your local library, the Internet, and Wikipedia in particular should help you further. The topics that will be covered in this article are as follows:
(For more resources related to this topic, see here.)
What is machine learning and why do we need it?
A very high level overview of machine learning
Generalizing with data
Overfitting and the bias variance trade off
Dimensions and features
Preprocessing, exploration, and feature engineering
Combining models
Installing software and setting up
Troubleshooting and asking for help
What is machine learning and why do we need it?
Machine learning is a term coined around 1960, composed of two words: machine, corresponding to a computer, robot, or other device, and learning, an activity which most humans are good at. So we want machines to learn, but why not leave learning to the humans? It turns out that there are many problems involving huge datasets, for instance, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning, or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that there are routine boring tasks more suitable for computers, and creative tasks more suitable for humans. According to this philosophy, machines are able to learn, but not everything. Machine learning doesn't involve the traditional type of programming that uses business rules. A popular myth says that the majority of the code in the world has to do with simple rules, possibly programmed in Cobol, which cover the bulk of all the possible scenarios of client interactions. So why can't we just hire many coders and continue programming new rules?
One reason is that the cost of defining and maintaining rules becomes very expensive over time. A lot of the rules involve matching strings or numbers, and change frequently. It's much easier and more efficient to let computers figure out everything themselves from data. Also, the amount of data is increasing, and actually the rate of growth is itself accelerating. Nowadays the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things is a recent development of a new kind of Internet, which interconnects everyday devices. The Internet of Things will bring data from household appliances and autonomous cars to the forefront. The average company these days has mostly human clients, but, for instance, social media companies tend to have many bot accounts. This trend is likely to continue, and we will have more machines talking to each other. An application of machine learning that you may be familiar with is the spam filter, which filters e-mails considered to be spam. Another is online advertising, where ads are served automatically based on information advertisers have collected about us. Yet another application of machine learning is search engines. Search engines extract information about web pages, so that you can efficiently search the Web. Many online shops and retail companies have recommender engines, which recommend products and services using machine learning. The list of applications is very long and also includes fraud detection, medical diagnosis, sentiment analysis, and financial market analysis. In the 1983 movie War Games, a computer made life-and-death decisions that could have resulted in World War III. As far as I know, technology wasn't able to pull off such feats at the time. However, in 1997 the Deep Blue supercomputer did manage to beat a world chess champion. In 2005, a Stanford self-driving car drove by itself for more than 130 kilometers in a desert.
In 2007, the car of another team drove through regular traffic for more than 50 kilometers. In 2011, the Watson computer won a quiz against human opponents. In 2016, the AlphaGo program beat one of the best Go players in the world. If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. Ray Kurzweil did just that, and according to him, we can expect human-level intelligence around 2029.
A very high level overview of machine learning
Machine learning is a subfield of artificial intelligence, a field of computer science concerned with creating systems which mimic human intelligence. Software engineering is another field of computer science, and we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We use optimization and statistics to find the best models which explain our data. If you are not feeling confident about your mathematical knowledge, you are probably wondering how much time you should spend learning or brushing up. Fortunately, to get machine learning to work for you, most of the time you can ignore the mathematical details, as they are implemented by reliable Python libraries. You do need to be able to program. As far as I know, if you want to study machine learning, you can enroll in computer science, artificial intelligence, and, more recently, data science master's programs. There are various data science bootcamps; however, the selection for those is stricter and the course duration is often just a couple of months. You can also opt for the free massive open online courses available on the Internet. Machine learning is not only a skill, but also a bit of a sport. You can compete in several machine learning competitions, sometimes for decent cash prizes.
However, to win these competitions, you may need to utilize techniques which are only useful in the context of competitions, and not in the context of trying to solve a business problem. A machine learning system requires inputs; these can be numerical, textual, visual, or audiovisual data. The system usually has outputs, for instance, a floating point number, an integer representing a category (also called a class), or the acceleration of a self-driving car. We need to define (or have defined for us by an algorithm) an evaluation function called a loss or cost function, which tells us how well we are learning. In this setup, we create an optimization problem with the goal of learning in the most efficient way. For instance, if we fit data to a line, also called linear regression, the goal is to have our line be as close as possible to the data points on average. We can have unlabeled data, which we want to group or explore; this is called unsupervised learning. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment. We can also have labeled examples to use for training; this is called supervised learning. The labels are usually provided by human experts, but if the problem is not too hard, they may also be produced by members of the public through crowdsourcing, for instance. Supervised learning is very common, and we can further subdivide it into regression and classification. Regression trains on continuous target variables, while classification attempts to find the appropriate class label. If not all examples are labeled, but some still are, we have semi-supervised learning. A chess playing program usually applies reinforcement learning: a type of learning where the program evaluates its own performance by, for instance, playing against itself or earlier versions of itself.
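The line-fitting objective mentioned above can be made concrete with a few lines of pure Python; this is the closed-form least-squares solution for a single feature (the data points are made up for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope that minimizes the average squared distance to the points
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # points lying on y = 2x + 1
print(a, b)  # → 2.0 1.0
```

In practice a library such as scikit-learn would do this (and much more) for us, but the underlying idea is exactly this minimization of a loss function.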
We can roughly categorize machine learning algorithms into logic-based learning, statistical learning, artificial neural networks, and genetic algorithms. In fact, we have a whole zoo of algorithms with popularity varying over time. The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules the systems tried to reason using formal logic. In the mid-1980s, artificial neural networks came to the foreground, only to be pushed aside by statistical learning systems in the 1990s. Artificial neural networks imitate animal brains, and consist of interconnected neurons that are in turn an imitation of biological neurons. Genetic algorithms were pretty popular in the 1990s (or at least that was my impression). Genetic algorithms mimic the biological process of evolution. We are currently (2016) seeing a revolution in deep learning, which we may consider to be a rebranding of neural networks. The term deep learning was coined around 2006, and it refers to deep neural networks with many layers. The breakthrough in deep learning is, among other things, caused by the availability of graphics processing units (GPUs), which speed up computation considerably. GPUs were originally intended to render video games, and are very good at parallel matrix and vector algebra. It is believed that deep learning resembles the way humans learn, and therefore it may be able to deliver on the promise of sentient machines. You may have heard of Moore's law, an empirical law which claims that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law, the number of transistors on a chip should double every two years. In the following graph, you can see that the law holds up nicely (the size of the bubbles corresponds to the average transistor count in GPUs): The consensus seems to be that Moore's law should continue to be valid for a couple of decades.
This gives some credibility to Ray Kurzweil's prediction of achieving true machine intelligence in 2029. We will encounter many of the types of machine learning later in this book. Most of the content is about supervised learning, with some examples of unsupervised learning. The popular Python libraries support all these types of learning.
Generalizing with data
The good thing about data is that we have a lot of it in the world. The bad thing is that it is hard to process. The challenges stem from the diversity and noisiness of the data. We as humans usually process data coming into our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level, computers and robots also work with electrical signals. These electrical signals are then translated into ones and zeroes. However, we program in Python in this book, and on that level we normally represent the data as numbers, images, or text. Actually, images and text are not very convenient, so we need to transform them into numerical values. Especially in the context of supervised learning, we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exam. We should be able to answer exam questions without knowing the answers to them. This is called generalization: we learn something from our practice questions and hopefully are able to apply this knowledge to other similar questions. Finding good representative examples is not always an easy task, depending on the complexity of the problem we are trying to solve and how generic the solution needs to be. An old-fashioned programmer would talk to a business analyst or another expert, and then implement a rule that adds a certain value multiplied by another value, corresponding, for instance, to tax rules. In a machine learning setup, we can give the computer example input values and example output values.
Or, if we are more ambitious, we can feed the program the actual tax texts and let the machine process the data further, just like an autonomous car doesn't need a lot of human input. This means implicitly that there is some function, for instance a tax formula, that we are trying to find. In physics we have almost the same situation. We want to know how the universe works, and formulate laws in a mathematical language. Since we don't know the actual function, all we can do is measure what our error is, and try to minimize it. In supervised learning we compare our results against the expected values. In unsupervised learning we measure our success with related metrics. For instance, we want clusters of data to be well defined. In reinforcement learning, a program evaluates its moves, for example in a chess game, using some predefined function.
Overfitting and the bias variance trade off
Overfitting (one word) is such an important concept that I decided to start discussing it very early in this book. If you go through many practice questions for an exam, you may start to find ways to answer questions which have nothing to do with the subject material. For instance, you may find that if you have the word potato in the question, the answer is A, even if the subject has nothing to do with potatoes. Or, even worse, you may memorize the answers to each question verbatim. We can then score high on the practice questions, which are called the train set in machine learning. However, we will score very low on the exam questions, which are called the test set in machine learning. This phenomenon of memorization leads to overfitting. Overfitting almost always means that we are trying too hard to extract information from the data (its random quirks), and using more training data will not help. The opposite scenario is called underfitting. When we underfit, we don't have a very good result on the train set, but we also don't have a very good result on the test set.
Underfitting may indicate that we are not using enough data, or that we should try harder to do something useful with the data. Obviously, we want to avoid both scenarios. Machine learning errors are the sum of bias and variance. Variance measures how much the error varies. Imagine that we are trying to decide what to wear based on the weather outside. If you have grown up in a country with a tropical climate, you may be biased towards always wearing summer clothes. If you lived in a cold country, you may decide to go to a beach in Bali wearing winter clothes. Both decisions have high bias. You may also just wear random clothes that are not appropriate for the weather outside; this is an outcome with high variance. It turns out that when we try to minimize bias or variance, we tend to increase the other: a concept known as the bias-variance trade-off. The expected error is given by the following equation:
Expected error = Bias^2 + Variance + Irreducible error
The last term is the irreducible error.
Cross-validation
Cross-validation is a technique which helps to avoid overfitting. Just like we would have a separation of practice questions and exam questions, we also have training sets and test sets. It's important to keep the data separated and only use the test set in the final stage. There are many cross-validation schemes in use. The most basic setup splits the data at a specified percentage, for instance 75% train data and 25% test data. We can also leave out a fixed number of observations in multiple rounds, so that these items are in the test set and the rest of the data is in the train set. For instance, we can apply leave-one-out cross-validation (LOOCV) and let each datum be in the test set once. For a large dataset, LOOCV requires as many rounds as there are data items, and can therefore be too slow. k-fold cross-validation is computationally cheaper than LOOCV; it randomly splits the data into k (a positive integer) folds. One of the folds becomes the test set, and the rest of the data becomes the train set.
We repeat this process k times, with each fold being the designated test set once. Finally, we average the k train and test errors for the purpose of evaluation. Common values for k are five and ten. The following table illustrates the setup for five folds:

Iteration  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5
1          Test    Train   Train   Train   Train
2          Train   Test    Train   Train   Train
3          Train   Train   Test    Train   Train
4          Train   Train   Train   Test    Train
5          Train   Train   Train   Train   Test

We can also randomly split the data into train and test sets multiple times. The problem with this scheme is that some items may never end up in the test set, while others may be selected for the test set multiple times. Nested cross-validation is a combination of cross-validations and consists of the following parts:

- The inner cross-validation performs optimization to find the best fit, and can be implemented as k-fold cross-validation.
- The outer cross-validation is used to evaluate performance and perform statistical analysis.

Regularization

Regularization, like cross-validation, is a general technique to fight overfitting. Regularization adds extra terms to the error function we are trying to minimize, in order to penalize complex models. For instance, if we are trying to fit a curve with a high-order polynomial, we may use regularization to reduce the influence of the higher degrees, thereby effectively reducing the order of the polynomial. Simpler methods are to be preferred, at least according to the principle of Occam's razor. William of Ockham was a monk and philosopher who, around 1320, came up with the idea that the simplest hypothesis that fits the data should be preferred. One justification is that there are fewer simple models than complex models. For instance, intuitively we know that there are more higher-order polynomial models than linear ones. The reason is that a line (f(x) = ax + b) is governed by only two numbers – the intercept b and the slope a.
The possible coefficients for a line span a two-dimensional space. A quadratic polynomial adds an extra coefficient for the quadratic term, so its coefficients span a three-dimensional space. Therefore, it is less likely that we find a linear model than a more complex model, because the search space for linear models is much smaller (although it is infinite). And of course, simpler models are just easier to use and require less computation time.

We can also stop a program early as a form of regularization. If we give a machine learner less time, it is more likely to produce a simpler model, and we hope it is less likely to overfit. Of course, we have to do this in an ordered fashion, which means that the algorithm should be aware of the possibility of early termination.

Dimensions and features

We typically represent the data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning one of the variables is actually not a feature, but the label that we are trying to predict, and each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high-dimensional, while stock market data has relatively few dimensions. Fitting high-dimensional data is computationally expensive and is also prone to overfitting due to the high complexity. Higher dimensions are also impossible to visualize, so we can't use simple diagnostic methods. Not all features are useful – some may only add randomness to our results. It is therefore often important to do good feature selection. Feature selection is a form of dimensionality reduction. Some machine learning algorithms are actually able to perform feature selection automatically.
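As a sketch of what such automatic selection can look like, scikit-learn's recursive feature elimination (RFE) repeatedly fits a model and drops the weakest features. The dataset here is synthetic, generated so that only three of the ten features carry information:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=0,
                           random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the features that were kept
```

Because RFE wraps a full model, it evaluates features in combination rather than judging each feature's correlation with the target in isolation.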
We should be careful not to omit features that do contain information. Some practitioners evaluate the correlation of a single feature with the target variable. In my opinion you shouldn't rely on that, because the correlation of one variable with the target doesn't tell you much in itself; instead, use an off-the-shelf algorithm and evaluate the results of different feature subsets. In principle, feature selection boils down to multiple binary decisions about whether to include a feature or not. For n features we get 2^n feature sets, which can be a very large number. For example, for 10 features we have 1,024 possible feature sets (for instance, if we are deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we are going, and so on). At a certain point brute-force evaluation becomes infeasible. We will discuss better methods in this book. Basically, we have two options: we either start with all the features and remove features iteratively, or we start with a minimal set of features and add features iteratively. We then take the best feature sets from each iteration and compare them.

Another common dimensionality reduction approach is to transform high-dimensional data into a lower-dimensional space. This transformation leads to information loss, but we can keep the loss to a minimum. We will cover this in more detail later on.

Preprocessing, exploration, and feature engineering

Data mining, a buzzword in the 1990s, is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called the cross-industry standard process for data mining (CRISP-DM). CRISP-DM was created in 1996 and is still used today. I am not endorsing CRISP-DM; however, I like its general framework.
CRISP-DM consists of the following phases, which are not mutually exclusive and can occur in parallel:

- Business understanding: This phase is often taken care of by specialized domain experts. Usually a business person formulates a business problem, such as selling more units of a certain product.
- Data understanding: This phase may also require input from domain experts; however, a technical specialist often needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book I usually call this phase exploration.
- Data preparation: This is also a phase where a domain expert with only Excel know-how may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing.
- Modeling: This is the phase that most people associate with machine learning. In this phase we formulate a model and fit our data.
- Evaluation: In this phase, we evaluate the model and the data to check whether we were able to solve our business problem.
- Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done by a specialized team.

When we learn, we require high-quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects that try to automatically explore the data and do something intelligent, like producing a report.
For now, unfortunately, we don't have a solid solution, so you need to do some manual work. We can do two things, which are not mutually exclusive: first, scan the data, and second, visualize the data. This also depends on the type of data we are dealing with – whether we have a grid of numbers, images, audio, text, or something else. In the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. We want to know whether features miss values, how the values are distributed, and what types of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical, pertaining to a category – for instance, continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered, for instance high, medium, and low. Features can also be quantitative, for example temperature in degrees or price in dollars.

Feature engineering is the process of creating or improving features. It is more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically.

Missing values

Quite often we are missing values for certain features. This can happen for various reasons: it can be inconvenient, expensive, or even impossible to always have a value. Maybe we were not able to measure a certain quantity in the past because we didn't have the right equipment, or we just didn't know that the feature was relevant. However, we are stuck with missing values from the past.
Sometimes it's easy to figure out that we are missing values: we can discover this just by scanning the data, or by counting the number of values we have for a feature and comparing it to the number of values we expect based on the number of rows. Certain systems encode missing values with a special value, for example 999999. This makes sense if the valid values are much smaller than 999999. If you are lucky, you will have information about the features provided by whoever created the data, in the form of a data dictionary or metadata. Once we know that we are missing values, the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can't deal with missing values, and the program will just refuse to continue. In other circumstances, ignoring missing values will lead to inaccurate results. The second solution is to substitute missing values with a fixed value – this is called imputing. We can impute the arithmetic mean, median, or mode of the valid values of a certain feature. Ideally, we will have a somewhat reliable relation between features, or within a variable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values given a date.

Label encoding

Humans are able to deal with various types of values. Machine learning algorithms, with some exceptions, need numerical values. If we offer a string such as Ivan, the program will not know what to do unless we are using specialized software. In this example we are dealing with a categorical feature – names, probably. We can consider each unique value to be a label. (In this particular example, we also need to decide what to do with the case – is Ivan the same as ivan?) We can then replace each label with an integer – label encoding. This approach can be problematic, because the learner may conclude that there is an ordering.
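A minimal label encoding sketch with scikit-learn (the names here are made up for illustration, and we resolve the case question by lowercasing first):

```python
from sklearn.preprocessing import LabelEncoder

names = ['Ivan', 'Anna', 'ivan', 'Anna']  # note the case difference
names = [n.lower() for n in names]        # decide: 'Ivan' equals 'ivan'

encoder = LabelEncoder()
encoded = encoder.fit_transform(names)
print(list(encoder.classes_))  # ['anna', 'ivan']
print(list(encoded))           # [1, 0, 1, 0]
```

Note the pitfall mentioned above: a learner fed these integers may conclude that anna < ivan, even though no such ordering exists.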
One-hot encoding

The one-of-K or one-hot encoding scheme uses dummy variables to encode categorical features. Originally it was applied to digital circuits. The dummy variables have binary values like bits, so they take the value zero or one (equivalent to true or false). For instance, if we want to encode continents, we will have dummy variables such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables as there are unique labels minus one. We can determine one of the labels automatically from the dummy variables, because the dummy variables are exclusive: if the dummy variables all have a false value, then the correct label is the label for which we don't have a dummy variable. The following table illustrates the encoding for continents:

                is_africa  is_asia  is_europe  is_south_america  is_north_america
Africa          True       False    False      False             False
Asia            False      True     False      False             False
Europe          False      False    True       False             False
South America   False      False    False      True              False
North America   False      False    False      False             True
Other           False      False    False      False             False

The encoding produces a matrix (grid of numbers) with lots of zeroes (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the SciPy package and shouldn't be an issue. We will discuss the SciPy package later in this article.

Scaling

Values of different features can differ by orders of magnitude. Sometimes this means that the larger values dominate the smaller values. Whether this is a problem depends on the algorithm we are using; for certain algorithms to work properly, we are required to scale the data. There are several common strategies that we can apply. Standardization removes the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, we will get a Gaussian centered around zero with a variance of one.
If the feature values are not normally distributed, we can instead remove the median and divide by the interquartile range, which is the range between the first and third quartiles (the 25th and 75th percentiles). Another common strategy is scaling features to a range, typically between zero and one.

Polynomial features

If we have two features, a and b, we can suspect that there is a polynomial relation, such as a² + ab + b². We can consider each term in the sum to be a feature – in this example we have three features. The product ab in the middle is called an interaction. An interaction doesn't have to be a product – although this is the most common choice, it can also be a sum, a difference, or a ratio. If we are using a ratio, then to avoid dividing by zero we should add a small constant to the divisor and dividend. The number of features and the order of the polynomial for a polynomial relation are not limited. However, if we follow Occam's razor, we should avoid higher-order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and don't add much value, but if you really need better results, they may be worth considering.

Power transformations

Power transforms are functions that we can use to transform numerical features into a more convenient form, for instance to conform better to a normal distribution. A very common transform for values that vary by orders of magnitude is to take the logarithm. Taking the logarithm of zero or of negative values is not defined, so we may need to add a constant to all the values of the related feature before taking the logarithm. We can also take the square root of positive values, square the values, or compute any other power we like. Another useful transform is the Box-Cox transform, named after its creators. The Box-Cox transform attempts to find the best power needed to transform the original data into data that is closer to the normal distribution.
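These transformations are available off the shelf; a sketch assuming scikit-learn and SciPy are installed (RobustScaler implements the median/interquartile-range variant described above, and scipy.stats.boxcox searches for the best power automatically):

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [4.0], [8.0], [1000.0]])  # values spanning orders of magnitude

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(RobustScaler().fit_transform(X).ravel())    # median removed, scaled by IQR
print(MinMaxScaler().fit_transform(X).ravel())    # scaled to the [0, 1] range

# Box-Cox requires strictly positive values; it returns the transformed
# data and the power parameter it found.
transformed, best_lambda = boxcox(X.ravel())
print(best_lambda)
```

Each scaler is fitted on the data first (computing means, medians, or ranges) and then applied, so the same fitted scaler can later transform test data consistently.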
The transform is defined as follows, for strictly positive values y and a power parameter λ:

y(λ) = (y^λ − 1) / λ   if λ ≠ 0
y(λ) = ln(y)           if λ = 0

Binning

Sometimes it's useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize them, so that we get a true value if the precipitation value is not zero, and a false value otherwise. We can also use statistics to divide values into high, medium, and low bins. The binning process inevitably leads to loss of information. However, depending on your goals, this may not be an issue, and it may actually reduce the chance of overfitting. Certainly there will be improvements in speed and in memory or storage requirements.

Combining models

In (high) school we sit together with other students and learn together, but we are not supposed to work together during the exam. The reason is, of course, that teachers want to know what we have learned, and if we just copy exam answers from friends, we may have not learned anything. Later in life we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams. Clearly a team can produce better results than a single person. However, this goes against Occam's razor, since a single person can come up with simpler theories than what a team will produce. In machine learning we nevertheless prefer to have our models cooperate, with the following schemes:

- Bagging
- Boosting
- Stacking
- Blending
- Voting and averaging

Bagging

Bootstrap aggregating, or bagging, is an algorithm introduced by Leo Breiman in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure that creates datasets from existing data by sampling with replacement. Bootstrapping can be used to analyze the possible values that the arithmetic mean, the variance, or another quantity can assume.
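Bootstrapping the arithmetic mean, for instance, takes only a few lines of NumPy (the sample here is synthetic):

```python
import numpy as np

rng = np.random.RandomState(42)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # original sample

# Draw 1,000 bootstrap samples (sampling with replacement, same size
# as the original sample) and record the mean of each.
means = [rng.choice(data, size=len(data), replace=True).mean()
         for _ in range(1000)]

low, high = np.percentile(means, [2.5, 97.5])
print(low, high)  # a rough 95% interval for the mean
```

The spread of the bootstrap means tells us how much the estimate would vary if we could collect the sample again.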
The algorithm aims to reduce the chance of overfitting with the following steps:

1. Generate new training sets from the input training data by sampling with replacement.
2. Fit a model to each generated training set.
3. Combine the results of the models by averaging or majority voting.

Boosting

In the context of supervised learning, we define weak learners as learners that are just a little better than a baseline, such as randomly assigning classes or average values. Although weak learners are weak individually, like ants, together they can do amazing things, just as ants can. It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. There are many boosting algorithms, which differ mostly in their weighting scheme. If you have studied for an exam, you may have applied a similar technique by identifying the types of practice questions you had trouble with and focusing on the hard problems. Face detection in images is based on a specialized framework that also uses boosting. Detecting faces in images or videos is a supervised learning problem: we give the learner examples of regions containing faces. There is an imbalance, since we usually have far more regions (about ten thousand times more) that don't contain faces. A cascade of classifiers progressively filters out negative image areas stage by stage. In each progressive stage, the classifiers use progressively more features on fewer image windows. The idea is to spend the most time on image patches that do contain faces. In this context boosting is used to select features and combine results.

Stacking

Stacking takes the outputs of machine learning estimators and uses those as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to yet another predictor. It is possible to use any arbitrary topology but, for practical reasons, you should try a simple setup first, as also dictated by Occam's razor.
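A minimal stacking sketch with scikit-learn (the choice of base models is arbitrary, and for brevity the meta-model is fitted and scored on the same held-out data; in practice you would keep a separate test set for the final evaluation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit the base models on one half of the data.
base_models = [DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier()]
for model in base_models:
    model.fit(X_train, y_train)

# The base models' class probabilities on the other half become
# the input features of the higher-level model.
meta_features = np.column_stack(
    [model.predict_proba(X_holdout)[:, 1] for model in base_models])

meta_model = LogisticRegression().fit(meta_features, y_holdout)
print(meta_model.score(meta_features, y_holdout))
```

Splitting the data this way matters: if the base models produced their predictions on the same data they were trained on, the meta-model would learn from overly optimistic inputs.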
Blending

Blending was introduced by the winners of the one-million-dollar Netflix prize. Netflix organized a contest with the challenge of finding the best model to recommend movies to their users. Netflix users can rate a movie with one to five stars. Obviously, each user wasn't able to rate each movie, so the user-movie matrix is sparse. Netflix published an anonymized training and test set. Later, researchers found a way to correlate the Netflix data with IMDb data, so for privacy reasons the Netflix data is no longer available. The grand prize was won in 2009 by a group of teams combining their models. Blending is a form of stacking; the final estimator in blending, however, trains only on a small portion of the training data.

Voting and averaging

We can arrive at our final answer through majority voting or averaging. It's also possible to assign different weights to each model in the ensemble. For averaging, we can also use the geometric mean or the harmonic mean instead of the arithmetic mean. Usually, combining the results of models that are highly correlated with each other doesn't lead to spectacular improvements. It's better to somehow diversify the models by using different features or different algorithms. If we find that two models are strongly correlated, we may, for example, decide to remove one of them from the ensemble and increase the weight of the other model proportionally.

Installing software and setting up

For most projects in this book we need scikit-learn (see http://scikit-learn.org/stable/install.html) and matplotlib (see http://matplotlib.org/users/installing.html). Both packages require NumPy, but we also need SciPy for sparse matrices, as mentioned before. The scikit-learn library is a machine learning package which is optimized for performance, as a lot of the code runs almost as fast as equivalent C code. The same statement is true for NumPy and SciPy.
There are various ways to speed up the code; however, they are out of the scope of this book, so if you want to know more, please consult the documentation. Matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization; seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. Matplotlib and seaborn are mostly useful for the visualization of small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array that can be one-dimensional or multi-dimensional. This class also has several subclasses, representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling.

In this book we will use Python 3. As you may know, Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2, you should still be able to modify the example code to work for you. In my opinion, the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice; Miniconda contains the conda package manager and Python. The procedures to install Anaconda and Miniconda are similar; obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version.
Sometimes you can choose between a GUI and a command-line installer. I used the Python 3 installer, although my system Python version is 2.7; this is possible since Anaconda comes with its own Python. On my machine the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively, install NumPy with pip as follows:

$ [sudo] pip install numpy

The command for Anaconda users is:

$ conda install numpy

To install the other dependencies, substitute NumPy with the appropriate package. Please read the documentation carefully; not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html.

Troubleshooting and asking for help

Currently the best forum is at http://stackoverflow.com. You can also reach out on mailing lists or IRC chat. The following is a list of mailing lists:

- scikit-learn: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
- NumPy and SciPy: https://www.scipy.org/scipylib/mailing-lists.html

IRC channels:

- #scikit-learn @ freenode
- #scipy @ freenode

Summary

In this article we covered the basic concepts of machine learning: a high-level overview, generalizing with data, overfitting, dimensions and features, preprocessing, combining models, installing the required software, and some places where you can ask for help.
To Optimize Scans

Packt
23 Jun 2017
20 min read
In this article by Paulino Calderon Pale, author of the book Nmap Network Exploration and Security Auditing Cookbook, Second Edition, we will explore the following topics:

- Skipping phases to speed up scans
- Selecting the correct timing template
- Adjusting timing parameters
- Adjusting performance parameters

(For more resources related to this topic, see here.)

One of my favorite things about Nmap is how customizable it is. If configured properly, Nmap can be used to scan from single targets to millions of IP addresses in a single run. However, we need to be careful, understand the configuration options and scanning phases that can affect performance and, most importantly, really think about our scan objective beforehand. Do we need the information from the reverse DNS lookup? Do we know all targets are online? Is the network congested? Do targets respond fast enough? These and many more aspects can really add to your scanning time. Therefore, optimizing scans is important and can save us hours if we are working with many targets. This article starts by introducing the different scanning phases, timing, and performance options. Unless we have a solid understanding of what goes on behind the curtains during a scan, we won't be able to completely optimize our scans. Timing templates are designed to work in common scenarios, but we want to go further and shave off those extra seconds per host during our scans. Remember that this can not only improve performance but accuracy as well. Maybe those targets marked as offline were just too slow to respond to the probes after all.

Skipping phases to speed up scans

Nmap scans can be broken into phases. When we are working with many hosts, we can save time by skipping tests or phases that return information we don't need or that we already have. By carefully selecting our scan flags, we can significantly improve the performance of our scans.
This explains the process that takes place behind the curtains when scanning, and how to skip certain phases to speed up scans.

How to do it...

To perform a full port scan with the timing template set to aggressive, and without the reverse DNS resolution (-n) or ping (-Pn), use the following command:

# nmap -T4 -n -Pn -p- 74.207.244.221

Note the scanning time at the end of the report:

Nmap scan report for 74.207.244.221
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 60.84 seconds

Now, compare this with the running time we get if we don't skip any tests:

# nmap -p- scanme.nmap.org

Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 77.45 seconds

Although the time difference isn't very drastic, it really adds up when you work with many hosts. I recommend that you think about your objectives and the information you need, and consider the possibility of skipping some of the scanning phases described next.

How it works...

Nmap scans are divided into several phases. Some of them require arguments to be set in order to run, but others, such as the reverse DNS resolution, are executed by default. Let's review the phases that can be skipped and their corresponding Nmap flags:

- Target enumeration: In this phase, Nmap parses the target list. This phase can't exactly be skipped, but you can save DNS forward lookups by using only IP addresses as targets.
- Host discovery: This is the phase where Nmap establishes whether the targets are online and in the network. By default, Nmap sends an ICMP echo request and some additional probes, but it supports several host discovery techniques that can even be combined.
To skip the host discovery phase (no ping), use the flag -Pn. We can easily see which probes we skipped by comparing the packet traces of the two scans:

$ nmap -Pn -p80 -n --packet-trace scanme.nmap.org
SENT (0.0864s) TCP 106.187.53.215:62670 > 74.207.244.221:80 S ttl=46 id=4184 iplen=44 seq=3846739633 win=1024 <mss 1460>
RCVD (0.1957s) TCP 74.207.244.221:80 > 106.187.53.215:62670 SA ttl=56 id=0 iplen=44 seq=2588014713 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

For scanning without skipping host discovery, we use the command:

$ nmap -p80 -n --packet-trace scanme.nmap.org
SENT (0.1099s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=59 id=12270 iplen=28
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:443 S ttl=59 id=38710 iplen=44 seq=1913383349 win=1024 <mss 1460>
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:80 A ttl=44 id=10665 iplen=40 seq=0 win=1024
SENT (0.1102s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=51 id=42939 iplen=40
RCVD (0.2120s) ICMP 74.207.244.221 > 106.187.53.215 Echo reply (type=0/code=0) ttl=56 id=2147 iplen=28
SENT (0.2731s) TCP 106.187.53.215:43199 > 74.207.244.221:80 S ttl=51 id=34952 iplen=44 seq=2609466214 win=1024 <mss 1460>
RCVD (0.3822s) TCP 74.207.244.221:80 > 106.187.53.215:43199 SA ttl=56 id=0 iplen=44 seq=4191686720 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds

- Reverse DNS resolution: Host names often reveal additional information by themselves, and Nmap uses reverse DNS lookups to obtain them. This step can be skipped by adding the argument -n to your scan arguments.
Let's see the traffic generated by the two scans, with and without reverse DNS resolution. First, let's skip reverse DNS resolution by adding -n to the command:

$ nmap -n -Pn -p80 --packet-trace scanme.nmap.org
SENT (0.1832s) TCP 106.187.53.215:45748 > 74.207.244.221:80 S ttl=37 id=33309 iplen=44 seq=2623325197 win=1024 <mss 1460>
RCVD (0.2877s) TCP 74.207.244.221:80 > 106.187.53.215:45748 SA ttl=56 id=0 iplen=44 seq=3220507551 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.32 seconds

Now let's try the same command without skipping reverse DNS resolution:

$ nmap -Pn -p80 --packet-trace scanme.nmap.org
NSOCK (0.0600s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.0600s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.0600s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.0600s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.0600s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.0600s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.0600s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: =............221.244.207.74.in-addr.arpa.....
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.0600s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.0620s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.0620s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.0620s) nsi_delete() (IOD #1)
NSOCK (0.0620s) msevent_cancel() on event #66 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #2)
NSOCK (0.0620s) msevent_cancel() on event #34 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #3)
NSOCK (0.0620s) msevent_cancel() on event #50 (type READ)
SENT (0.0910s) TCP 106.187.53.215:46089 > 74.207.244.221:80 S ttl=42 id=23960 iplen=44 seq=1992555555 win=1024 <mss 1460>
RCVD (0.1932s) TCP 74.207.244.221:80 > 106.187.53.215:46089 SA ttl=56 id=0 iplen=44 seq=4229796359 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

Port scanning: In this phase, Nmap determines the state of the ports. By default, it uses SYN or TCP Connect scanning depending on the user's privileges, but several other port scanning techniques are supported. Although this may not be so obvious, Nmap can do a few different things with targets without port scanning them, such as resolving their DNS names or checking whether they are online.
For this reason, this phase can be skipped with the argument -sn:

$ nmap -sn -R --packet-trace 74.207.244.221
SENT (0.0363s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=56 id=36390 iplen=28
SENT (0.0364s) TCP 106.187.53.215:53376 > 74.207.244.221:443 S ttl=39 id=22228 iplen=44 seq=155734416 win=1024 <mss 1460>
SENT (0.0365s) TCP 106.187.53.215:53376 > 74.207.244.221:80 A ttl=46 id=36835 iplen=40 seq=0 win=1024
SENT (0.0366s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=50 id=2630 iplen=40
RCVD (0.1377s) TCP 74.207.244.221:443 > 106.187.53.215:53376 RA ttl=56 id=0 iplen=40 seq=0 win=0
NSOCK (0.1660s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.1660s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.1660s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.1660s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.1660s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.1660s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: [............221.244.207.74.in-addr.arpa.....
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.1660s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.1660s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.1660s) nsi_delete() (IOD #1)
NSOCK (0.1660s) msevent_cancel() on event #66 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #2)
NSOCK (0.1660s) msevent_cancel() on event #34 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #3)
NSOCK (0.1660s) msevent_cancel() on event #50 (type READ)
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
Nmap done: 1 IP address (1 host up) scanned in 0.17 seconds

In the previous example, we can see that an ICMP echo request and a reverse DNS lookup were performed (we forced DNS lookups with the option -R), but no port scanning was done.

There's more...

I recommend that you also run a couple of test scans to measure the speeds of different DNS servers. I've found that ISPs tend to have the slowest DNS servers, but you can make Nmap use different DNS servers by specifying the argument --dns-servers. For example, to use Google's DNS servers, use the following command:

# nmap -R --dns-servers 8.8.8.8,8.8.4.4 -O scanme.nmap.org

You can test your DNS server speed by comparing the scan times. The following command tells Nmap not to ping or port scan the host, and only perform a reverse DNS lookup:

$ nmap -R -Pn -sn 74.207.244.221
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up.
Nmap done: 1 IP address (1 host up) scanned in 1.01 seconds

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix, Scanning Phases, for more information.
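When comparing scan phases, it can help to tally the SENT and RCVD lines of the --packet-trace output instead of reading them one by one. Here is a small Python sketch, an illustration written for this recipe rather than part of Nmap, that counts trace lines by direction and protocol:

```python
import re
from collections import Counter

def count_probes(trace):
    """Tally --packet-trace lines by (direction, protocol)."""
    counts = Counter()
    for line in trace.splitlines():
        # Trace lines look like: SENT (0.0864s) TCP 1.2.3.4:62670 > 5.6.7.8:80 ...
        m = re.match(r"(SENT|RCVD) \([\d.]+s\) (TCP|UDP|ICMP)", line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# Two lines taken from the -Pn scan shown earlier in this recipe:
trace = """\
SENT (0.0864s) TCP 106.187.53.215:62670 > 74.207.244.221:80 S ttl=46 id=4184 iplen=44
RCVD (0.1957s) TCP 74.207.244.221:80 > 106.187.53.215:62670 SA ttl=56 id=0 iplen=44
"""
print(count_probes(trace))  # one SENT and one RCVD TCP packet
```

Feeding it the full trace of the default scan versus the -Pn scan makes the extra discovery probes (ICMP echo, ICMP timestamp, and the TCP pings) immediately visible.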
Selecting the correct timing template

Nmap includes six templates that set different timing and performance arguments to optimize your scans based on network conditions. Even though Nmap automatically adjusts some of these values, it is recommended that you set the correct timing template to hint to Nmap about the speed of your network connection and the target's response time. The following will teach you about Nmap's timing templates and how to choose the most appropriate one.

How to do it...

Open your terminal and type the following command to use the aggressive timing template (-T4). Let's also use debugging (-d) to see what the option -T4 sets:

# nmap -T4 -d 192.168.4.20
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------
<Scan output removed for clarity>

You may use the integers between 0 and 5, that is, -T[0-5].

How it works...

The option -T is used to set the timing template in Nmap. Nmap provides six timing templates to help users tune the timing and performance arguments.
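Before looking at each template in detail, note that their configuration values can be collected into a small lookup table. The following Python sketch is an informal reference; the values are transcribed from the timing reports shown in this recipe, and the authoritative source remains Nmap's own -d output:

```python
# Key settings per -T level, transcribed from the timing reports in this recipe.
# Times are in milliseconds; a parallelism of 0 means Nmap adjusts it dynamically.
TIMING_TEMPLATES = {
    0: {"name": "paranoid",   "init_rtt": 300000, "max_retries": 10, "max_tcp_scan_delay": 1000},
    1: {"name": "sneaky",     "init_rtt": 15000,  "max_retries": 10, "max_tcp_scan_delay": 1000},
    2: {"name": "polite",     "init_rtt": 1000,   "max_retries": 10, "max_tcp_scan_delay": 1000},
    3: {"name": "normal",     "init_rtt": 1000,   "max_retries": 10, "max_tcp_scan_delay": 1000},
    4: {"name": "aggressive", "init_rtt": 500,    "max_retries": 6,  "max_tcp_scan_delay": 10},
    5: {"name": "insane",     "init_rtt": 250,    "max_retries": 2,  "max_tcp_scan_delay": 5},
}

def template_flag(level):
    """Return the Nmap flag for a timing level, for example 4 -> '-T4'."""
    if level not in TIMING_TEMPLATES:
        raise ValueError("timing level must be 0-5")
    return "-T%d" % level

print(template_flag(4), TIMING_TEMPLATES[4]["init_rtt"])  # -T4 500
```

The table makes the trade-off plain: each step up shortens the initial RTT timeout and the scan delay, and from -T4 onward it also cuts the retry budget.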
The available timing templates and their initial configuration values are as follows:

Paranoid (-T0): This template is useful for avoiding detection systems, but it is painfully slow because only one port is scanned at a time, and the timeout between probes is 5 minutes:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 300000, min 100, max 300000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Sneaky (-T1): This template is useful for avoiding detection systems, but is still very slow:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 15000, min 100, max 15000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Polite (-T2): This template is used when scanning is not supposed to interfere with the target system; it is a very conservative and safe setting:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Normal (-T3): This is Nmap's default timing template, which is used when the argument -T is not set:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Aggressive (-T4): This is the recommended timing template for broadband and Ethernet connections:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Insane (-T5): This timing template sacrifices accuracy for speed:
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 250, min 50, max 300
max-scan-delay: TCP 5, UDP 1000, SCTP 5
parallelism: min 0, max 0
max-retries: 2, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

There's more...

An interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, this hasn't been implemented yet. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

Adjusting timing parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it can also be fine-tuned using timing options to improve performance. Nmap automatically calculates packet round trip, timeout, and delay values, but these values can also be set manually through specific settings. The following describes the timing parameters supported by Nmap.

How to do it...
Enter the following command to adjust the initial round trip timeout, the delay between probes, and a timeout for each scanned host:

# nmap -T4 --scan-delay 1s --initial-rtt-timeout 150ms --host-timeout 15m -d scanme.nmap.org
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 150, min 100, max 1250
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 6, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

How it works...

Nmap supports different timing arguments that can be customized. However, setting these values incorrectly will most likely hurt performance rather than improve it. Let's examine each timing parameter more closely and learn its Nmap option name.

The Round Trip Time (RTT) value is used by Nmap to know when to give up on or retransmit a probe. Nmap estimates this value by analyzing previous responses, but you can set the initial RTT timeout with the argument --initial-rtt-timeout, as shown in the following command:

# nmap -A -p- --initial-rtt-timeout 150ms <target>

In addition, you can set the minimum and maximum RTT timeout values with --min-rtt-timeout and --max-rtt-timeout, respectively, as shown in the following command:

# nmap -A -p- --min-rtt-timeout 200ms --max-rtt-timeout 600ms <target>

Another very important setting we can control in Nmap is the waiting time between probes. Use the arguments --scan-delay and --max-scan-delay to set the waiting time and the maximum amount of time allowed to wait between probes, respectively, as shown in the following commands:

# nmap -A --max-scan-delay 10s scanme.nmap.org
# nmap -A --scan-delay 1s scanme.nmap.org

Note that the arguments shown previously are very useful for avoiding detection mechanisms. Be careful not to set --max-scan-delay too low, because it will most likely cause you to miss ports that are open.

There's more...
If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout:

# nmap -sV -A -p- --host-timeout 5m <target>

Estimating round trip times with Nping

To use Nping to estimate the round trip time between you and the target, the following command can be used:

# nping -c30 <target>

This will make Nping send 30 ICMP echo request packets, and after it finishes, it will show the average, minimum, and maximum RTT values obtained:

# nping -c30 scanme.nmap.org
...
SENT (29.3569s) ICMP 50.116.1.121 > 74.207.244.221 Echo request (type=8/code=0) ttl=64 id=27550 iplen=28
RCVD (29.3576s) ICMP 74.207.244.221 > 50.116.1.121 Echo reply (type=0/code=0) ttl=63 id=7572 iplen=28
Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms
Raw packets sent: 30 (840B) | Rcvd: 30 (840B) | Lost: 0 (0.00%)
Tx time: 29.09096s | Tx bytes/s: 28.87 | Tx pkts/s: 1.03
Rx time: 30.09258s | Rx bytes/s: 27.91 | Rx pkts/s: 1.00
Nping done: 1 IP address pinged in 30.47 seconds

Examine the round trip times and use the maximum to set the correct --initial-rtt-timeout and --max-rtt-timeout values. The official documentation recommends using double the maximum RTT value for --initial-rtt-timeout, and as high as four times the maximum round trip time for --max-rtt-timeout.

Displaying the timing settings

Enable debugging to make Nmap inform you about the timing settings before scanning:

$ nmap -d <target>
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix, Scanning Phases, for more information.
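The rule of thumb just mentioned (double the maximum RTT for --initial-rtt-timeout, and up to four times it for --max-rtt-timeout) can be sketched as a small helper. This Python snippet is an illustration based on that recommendation, not an official Nmap utility:

```python
def rtt_timeout_args(max_rtt_ms):
    """Suggest --initial-rtt-timeout / --max-rtt-timeout values from a measured
    maximum round trip time, following the rule of thumb in this recipe."""
    initial = int(round(2 * max_rtt_ms))   # double the maximum RTT
    maximum = int(round(4 * max_rtt_ms))   # up to four times the maximum RTT
    return "--initial-rtt-timeout %dms --max-rtt-timeout %dms" % (initial, maximum)

# Using the maximum RTT reported by the Nping run above (10.170 ms):
print(rtt_timeout_args(10.170))  # --initial-rtt-timeout 20ms --max-rtt-timeout 41ms
```

For the scanme.nmap.org measurements above this yields timeouts far below Nmap's defaults, which is exactly why measuring first pays off on fast links.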
Adjusting performance parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it also supports several parameters that affect its behavior, such as the number of hosts scanned concurrently, the number of retries, and the number of allowed probes. Learning how to adjust these parameters properly can greatly reduce your scanning time. The following explains the Nmap parameters that can be adjusted to improve performance.

How to do it...

Enter the following command, adjusting the values for your target condition:

$ nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

How it works...

The command shown previously tells Nmap to scan and report by grouping no fewer than 100 (--min-hostgroup 100) and no more than 500 hosts (--max-hostgroup 500). It also tells Nmap to retry only twice before giving up on any port (--max-retries 2):

# nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

It is important to note that setting these values incorrectly will most likely hurt performance or accuracy rather than improve them.

Nmap sends many probes during its port scanning phase due to the ambiguity of a lack of response: either the packet got lost, the service is filtered, or the service is not open. By default, Nmap adjusts the number of retries based on the network conditions, but you can set this value with the argument --max-retries. By increasing the number of retries, we can improve Nmap's accuracy, but keep in mind that this sacrifices speed:

$ nmap --max-retries 10 <target>

The arguments --min-hostgroup and --max-hostgroup control the number of hosts that we probe concurrently. Keep in mind that reports are also generated based on this value, so adjust it depending on how often you would like to see the scan results.
Larger groups are optimal for improving performance, but you may prefer smaller host groups on slow networks:

# nmap -A -p- --min-hostgroup 100 --max-hostgroup 500 <target>

There is also a very important argument that can be used to limit the number of packets sent per second by Nmap. The arguments --min-rate and --max-rate need to be used carefully to avoid undesirable effects. These rates are set automatically by Nmap if the arguments are not present:

# nmap -A -p- --min-rate 50 --max-rate 100 <target>

Finally, the arguments --min-parallelism and --max-parallelism can be used to control the number of probes for a host group. By setting these arguments, Nmap will no longer adjust the values dynamically:

# nmap -A --max-parallelism 1 <target>
# nmap -A --min-parallelism 10 --max-parallelism 250 <target>

There's more...

If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout, as shown in the following command:

# nmap -sV -A -p- --host-timeout 5m <target>

The interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, this hasn't been implemented yet. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix, Scanning Phases, for more information.

Summary

In this article, we learned how to implement and optimize scans, and how distributing Nmap scans among several clients allows us to save time and take advantage of extra bandwidth and CPU resources.
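When driving Nmap from scripts, the performance flags covered in this recipe can be assembled programmatically. The following Python sketch is a hypothetical helper (the flag names are from this recipe; the function itself is our own) that builds an argument list and omits unset values so Nmap keeps its dynamic defaults:

```python
def performance_args(min_hostgroup=None, max_hostgroup=None, max_retries=None,
                     min_rate=None, max_rate=None):
    """Build a list of Nmap performance arguments from the given settings.
    Any setting left as None is omitted so Nmap keeps adjusting it dynamically."""
    flags = {
        "--min-hostgroup": min_hostgroup,
        "--max-hostgroup": max_hostgroup,
        "--max-retries": max_retries,
        "--min-rate": min_rate,
        "--max-rate": max_rate,
    }
    args = []
    for flag, value in flags.items():
        if value is not None:
            args += [flag, str(value)]
    return args

# Reproduces the example from this recipe:
print(" ".join(performance_args(min_hostgroup=100, max_hostgroup=500, max_retries=2)))
# --min-hostgroup 100 --max-hostgroup 500 --max-retries 2
```

A list like this can be passed straight to subprocess alongside the nmap binary and the target, which avoids shell-quoting mistakes.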
This article is short but full of tips for optimizing your scans. Prepare to dig deep into Nmap's internals and its timing and performance parameters!

Resources for Article:

Further resources on this subject:
Introduction to Network Security
Implementing OpenStack Networking and Security
Communication and Network Security

Packt
23 Jun 2017
5 min read

Getting Started with Ansible 2

In this article by Jonathan McAllister, author of the book Implementing DevOps with Ansible 2, we will learn what Ansible is, how users can leverage it, its architecture, and the key differentiators that set Ansible apart from other configuration management tools. We will also see the organizations that have successfully been able to leverage Ansible. (For more resources related to this topic, see here.)

What is Ansible?

Ansible is a relatively new addition to the DevOps and configuration management space. Its simplicity, structured automation format, and development paradigm have caught the eyes of small and large corporations alike. Organizations as large as Twitter have successfully managed to leverage Ansible for highly scaled deployments and configuration management across thousands of servers simultaneously. And Twitter isn't the only organization that has managed to implement Ansible at scale; other well-known organizations that have successfully leveraged it include Logitech, NASA, NEC, Microsoft, and hundreds more. As it stands today, Ansible is in use by major players around the world, managing thousands of deployments and configuration management solutions worldwide.

Ansible's Automation Architecture

Ansible was created with an incredibly flexible and scalable automation engine. It allows users to leverage it in many diverse ways and can be conformed to the way that best suits your specific needs. Since Ansible is agentless (meaning there is no permanently running daemon on the systems it manages or executes from), it can be used locally to control a single system (without any network connectivity), or leveraged to orchestrate and execute automation against many systems via a control server. In addition to the aforementioned architectures, Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically.
This type of solution basically allows the Ansible user to bootstrap their hardware or infrastructure provisioning by running one or more Ansible playbooks. If you happen to be a Vagrant user, there are instructions for Ansible provisioning in HashiCorp's documentation, located at https://www.vagrantup.com/docs/provisioning/ansible.html.

Ansible is open source, module based, pluggable, and agentless. These key differentiators from other configuration management solutions give Ansible a significant edge. Let's take a look at each of these differentiators in detail and see what it actually means for Ansible developers and users:

Open source: It is no secret that successful open source solutions are usually extraordinarily feature rich. This is because instead of a simple 8-person (or even 100-person) engineering team, there are potentially thousands of developers. Each development and enhancement has been designed to fit a unique need. As a result, the end product provides the consumers of Ansible with a very well-rounded solution that can be adapted or leveraged in numerous ways.

Module based: Ansible has been developed with the intention of integrating with numerous other open and closed source software solutions. This means that Ansible is currently compatible with multiple flavors of Linux, Windows, and cloud providers. Aside from its OS-level support, Ansible currently integrates with hundreds of other software solutions, including EC2, Jira, Jenkins, Bamboo, Microsoft Azure, DigitalOcean, Docker, Google, and many, many more.

For a complete list of Ansible modules, please consult the official module support list located at http://docs.ansible.com/ansible/modules_by_category.html.

Agentless: One of the key differentiators that gives Ansible an edge over the competition is the fact that it is 100% agentless.
This means there are no daemons that need to be installed on remote machines, no firewall ports that need to be opened (besides traditional SSH), no monitoring that needs to be performed on the remote machines, and no management that needs to be performed on the infrastructure fleet. In effect, this makes Ansible very self-sufficient.

Since Ansible can be implemented in a few different ways, the aim of this section is to highlight these options and help us get familiar with the architecture types that Ansible supports. Generally, the architecture of Ansible can be categorized into three distinct architecture types. These are described next.

Pluggable: While Ansible comes out of the box with a wide spectrum of software integration support, it is oftentimes a requirement to integrate the solution with a company's internal software, or with a solution that has not already been integrated into Ansible's robust playbook suite. The answer to such a requirement would be to create a plugin-based solution for Ansible, thus providing the custom functionality necessary.

Summary

In this article, we discussed the architecture of Ansible and the key differentiators that set Ansible apart from other configuration management tools. We learned that Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically. We also saw that Ansible has been successfully leveraged by large organizations like Twitter, Microsoft, and many more.

Resources for Article:

Further resources on this subject:
Getting Started with Ansible
Introduction to Ansible
System Architecture and Design of Ansible

Packt
23 Jun 2017
9 min read

Working with Basic Elements – Threads and Runnables

In this article by Javier Fernández González, the author of the book Mastering Concurrency Programming with Java 9 - Second Edition, we will see that execution threads are the core of concurrent applications. When you implement a concurrent application, no matter the language, you have to create different execution threads that run in parallel in a non-deterministic order unless you use a synchronization element (such as a semaphore). In Java, you can create execution threads in two ways:

Extending the Thread class
Implementing the Runnable interface

In this article, you will learn how to use these elements to implement concurrent applications in Java. (For more resources related to this topic, see here.)

Introduction

Nowadays, computer users (and mobile and tablet users too) use different applications at the same time when they work with their computers. They can be writing a document with a word processor while reading the news, posting on a social network, and listening to music. They can do all these things at the same time because modern operating systems support multiprocessing; they can execute different tasks at the same time. Inside an application, you can also do different things at the same time. For example, if you're working with your word processor, you can save the file while you're applying a bold style to some text. You can do this because the modern programming languages used to write those applications allow programmers to create multiple execution threads inside an application. Each execution thread executes a different task, so you can do different things at the same time. Java implements execution threads using the Thread class.
You can create an execution thread in your application using the following mechanisms:

You can extend the Thread class and override the run() method
You can implement the Runnable interface and pass an object of that class to the constructor of a Thread object

In both cases, you will have a Thread object, but the second approach is recommended over the first one. Its main advantages are:

Runnable is an interface: You can implement other interfaces and extend another class. With the Thread class, you can only extend that class.
Runnable objects can be executed not only with threads, but also with other Java concurrency objects such as executors. This gives you more flexibility to change your concurrent applications.
You can use the same Runnable object with different threads.

Once you have a Thread object, you must use the start() method to create a new execution thread and execute the run() method of the Thread. If you call the run() method directly, you will be calling a normal Java method and no new execution thread will be created. Let's see the most important characteristics of threads in the Java programming language.

Threads in Java: characteristics and states

The first thing we have to say about threads in Java is that all Java programs, concurrent or not, have one thread called the main thread. As you may know, a Java SE program starts its execution with the main() method. When you execute such a program, the Java Virtual Machine (JVM) creates a new Thread and executes the main() method in that thread. This is the only thread in non-concurrent applications and the first one in concurrent ones. In Java, as with other programming languages, threads share all the resources of the application, including memory and open files. This is a powerful tool because they can share information in a fast and easy way, but it must be done using adequate synchronization elements to avoid data race conditions. All the threads in Java have a priority.
It's an integer value between the Thread.MIN_PRIORITY and Thread.MAX_PRIORITY values (actually, 1 and 10). By default, all threads are created with the priority Thread.NORM_PRIORITY (actually, its value is 5). You can use the setPriority() method to change the priority of a Thread (it can throw a SecurityException if you are not allowed to perform that operation) and the getPriority() method to get the priority of a Thread. This priority is a hint to the Java Virtual Machine and to the underlying operating system about the preference between the threads, but it's not a contract. There's no guarantee about the order of execution of the threads. Normally, threads with a higher priority will be executed before threads with a lower priority but, as mentioned before, there's no guarantee about this.

You can create two kinds of threads in Java:

Daemon threads
Non-daemon threads

The difference between them is how they affect the end of a program. A Java program ends its execution when one of the following circumstances occurs:

The program executes the exit() method of the Runtime class and the user has authorization to execute that method
All the non-daemon threads of the application have ended their execution, no matter whether there are daemon threads still running

With these characteristics, daemon threads are usually used to execute auxiliary tasks in applications, such as garbage collectors or cache managers. You can use the isDaemon() method to check whether a thread is a daemon thread or not, and the setDaemon() method to establish a thread as a daemon one. Take into account that you must call this method before the thread starts its execution with the start() method. Finally, threads can pass through different states depending on the situation. All the possible states are defined in the Thread.State class, and you can use the getState() method to get the status of a Thread.
Obviously, you can't change the status of a thread directly. These are the possible statuses of a thread:

NEW: The Thread has been created but hasn't started its execution yet
RUNNABLE: The Thread is running in the Java Virtual Machine
BLOCKED: The Thread is waiting for a lock
WAITING: The Thread is waiting for the action of another thread
TIMED_WAITING: The Thread is waiting for the action of another thread, but this waiting has a time limit
TERMINATED: The Thread has finished its execution

Now that we know the most important characteristics of threads in the Java programming language, let's see the most important methods of the Runnable interface and the Thread class.

The Thread class and the Runnable interface

As we mentioned before, you can create a new execution thread using one of the following two mechanisms:

Extend the Thread class and override its run() method
Implement the Runnable interface and pass an instance of that object to the constructor of a Thread object

Java good practices recommend the second approach over the first one, and that will be the approach we use in this article and in the whole book. The Runnable interface defines only one method: the run() method. This is the main method of every thread. When you create a new execution thread with the start() method, it will call the run() method (of the Thread class, or of the Runnable object passed as a parameter in the constructor of the Thread class). The Thread class, on the contrary, has a lot of different methods. It has a run() method that you must override if you implement your thread by extending the Thread class, and the start() method that you must call to create a new execution thread. These are other interesting methods of the Thread class:

Methods to get and set information about a Thread:

getId(): This method returns the identifier of the Thread. The thread identifier is a positive integer number assigned when a thread is created. It is unique during its lifetime and can't be changed.
getName()/setName(): These methods allow you to get or set the name of the Thread. This name is a String that can also be established in the constructor of the Thread class.
getPriority()/setPriority(): You can use these methods to obtain and establish the priority of the Thread. We explained earlier in this article how Java manages the priority of its threads.
isDaemon()/setDaemon(): These methods allow you to obtain and establish the daemon condition of the Thread. We explained how this condition works earlier.
getState(): This method returns the state of the Thread. We explained all the possible states of a Thread earlier.
interrupt()/interrupted()/isInterrupted(): The first method is used to indicate to a Thread that you're requesting the end of its execution. The other two methods can be used to check the interrupt status. The main difference between those methods is that one clears the value of the interrupted flag when it's called and the other one does not. A call to the interrupt() method doesn't end the execution of a Thread. It is the responsibility of the Thread to check the status of that flag and respond accordingly.
sleep(): This method allows you to suspend the execution of the Thread for a period of time. It receives a long value, that is, the number of milliseconds that you want to suspend the execution of the Thread for.
join(): This method suspends the execution of the thread that makes the call until the end of the execution of the Thread used to call the method. You can use this method to wait for the finalization of another Thread.
setUncaughtExceptionHandler(): This method is used to establish the controller of unchecked exceptions that can occur while you're executing the threads.
currentThread(): This is a static method of the Thread class that returns the Thread object that is actually executing this code.
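To tie these methods together, here is a minimal example (a sketch for illustration; the class and variable names are our own) that creates two threads from a single Runnable object, starts them with start(), and waits for them with join():

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RunnableDemo {

    // Two threads share this counter; AtomicInteger keeps the increments safe.
    static final AtomicInteger counter = new AtomicInteger(0);

    public static int runDemo() throws InterruptedException {
        counter.set(0);  // reset so the demo can be run repeatedly

        // A single Runnable shared by both threads, as discussed above.
        Runnable task = () -> counter.incrementAndGet();

        // The recommended approach: pass the Runnable to the Thread constructor.
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");

        t1.start();  // start() creates a new execution thread that runs run()
        t2.start();

        t1.join();   // wait for both threads to finish their execution
        t2.join();

        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Counter after both threads: " + runDemo());
    }
}
```

Because both Thread objects execute the same Runnable, the shared counter is an AtomicInteger to avoid the data race conditions mentioned earlier in this article.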
Summary In this article, you learned about threads in Java and how the Thread class and the Runnable interface work. Resources for Article: Further resources on this subject: Thread synchronization and communication [article] Multithreading with Qt [article] Concurrency and Parallelism with Swift 2 [article]
User Story Map – The First User Experience Map in a Product’s Life

Packt
23 Jun 2017
18 min read
In this article by Peter W. Szabo, the author of the book User Experience Mapping, we will explore user story maps. User story maps solve the user's problems in the form of a discussion. Your job as a product manager or user experience consultant should be to make the world better through user-centric products: essentially, solving the user's problems. Contrary to popular belief, user story maps are not just cash cows for agile experts. They will help a product to succeed by increasing the team's understanding of the system. Not just what's inside it, but what will happen to the world as a result. By focusing on the opportunity and outcomes, the team can prioritize development. In reality, this often means stopping the proliferation of features, and underdoing your competition. Wait a minute, did you just read underdoing? As in, fewer features, not making bold promises and significantly less customizability and options? Yes indeed. The founders of Basecamp (formerly 37signals) are the champions of building less. In their book ReWork: Change the Way You Work Forever, they tell Basecamp's success story while giving vital advice to anyone trying to build a product or a startup: "When things aren't working, the natural inclination is to throw more at the problem. More people, time, and money. All that ends up doing is making the problem bigger. The right way to go is the opposite direction: Cut back. So do less. Your project won't suffer nearly as much as you fear. In fact, there's a good chance it'll end up even better." (Jason Fried) User story maps will help you to throw less at the problem, chopping down extras until you reach an awesome product, which is actually done. One of the problems with long product backlogs or nightmarish requirement documents is that they never get done. Literally never. Once I had to work on improving the user experience of a bank's backend.
It was a gargantuan task, as this backend was a large collection of distributed microservices, which meant hundreds of different forms with hard-to-understand functions and a badly designed multi-level menu that connected them together. I knew almost nothing about banking, and they knew almost nothing about UX, so this was a match made in heaven. They gave me a twelve-page document. That was just the non-disclosure agreement. The project had many 100+ page documents, detailing various things and how they are done, complete with business processes and banking jargon. They wanted us to compile an expert review on what needed to be redesigned and create a detailed strategy for that. I found a better use of their money than wasting time on expert reviews and redesign strategies at that stage. Recording or even watching bank employees while they used the system during their work was out of the question. So we went for the quick win and did user story mapping in the first week of the project. Among the attendees of the user story mapping sessions, there were a few non-manager level bank employees, who used the backend extensively. One of them was quite new to her job, but fortunately quite talkative about it. It was immediately evident that most employees almost never used at least 95% of the functionality. Those functions were reserved for specially trained people, usually managers. After creating the user story map with the most essential and frequently used features, I suggested a backend interface which at first contained only about 1% of the functionality of the old system, with the mention of other features to be added later. (As a UX consultant you should avoid saying no; instead, try saying later. It has the same effect for the project but keeps everyone happy.) No one in the room believed that such a crazy idea would go through senior management, although they supported the proposal. Quite the contrary: it went extremely well with senior management.
The senior managers understood that by creating a simple and fast backend user interface, they would be able to reduce the queues without hiring new employees. Moreover, if they needed to hire people, training would be easier and faster. The new UI could also reduce the number of human errors. Almost all of the old backend was still online two years later, although used only by a few employees. This made both the product and the information security team happy, not to mention HR. The functionality of the new application extended only slightly in 24 months. Nobody complained, and the bank's customers were happy with smaller queues. All this was achieved with a pack of colored sticky notes, some markers and, much more importantly, a discussion and shared understanding. This is just one example of how a simple technique like user story mapping can save millions of dollars for a company. (For more resources related to this topic, see here.) Just tell the story Drawing a map, any map, will lead you towards solving the problem. User story maps aim to replace document hand-overs with discussions and collaboration. Enterprises tend to have some sort of formal approval process, usually with a sign-off. That's perfectly fine, and most of the time unavoidable. Just make sure that the sign-off happens after the mapping and story discussions. Ideally, right after the discussion, not days or weeks later. There is a reason why product managers, UX experts and all stakeholders love stories: they are humans. As such, we all have a natural tendency to love an emotionally satisfying tale. Most of our entertainment revolves around stories, and we want to hear good stories. A great story revolves around conflicts in a memorable and exciting way. How to tell a story? Telling a story is an easy task. We all did that as kids, yet we tend to forget that skill we possess when we get into a serious product management discussion. How to tell a great story?
There are a few rules to consider; the most important one is that you should talk about something that captivates the audience. The audience You should focus on the audience. What are their problems? What would make them listen actively, instead of texting or catching Pokémon, while at a user story discussion? Even if the project is about scratching your own itch, you should spin the story so it's their itch that is scratched. Engaging the audience can indeed be a challenge. Once upon a time I wrote a sci-fi novel. Actually, it was published in 2000, with the title Tűzsugár, in Hungarian. The English title would be Ray of Fire, but fortunately for my future writing career, it was never translated into English. The book had everything my 15-16-year-old self considered fun: for instance, a great space war or passionate love between members of different sapient spacefaring races. The characters were miscellaneous human and non-human life-forms stuck in a spaceship for most of the story. Some of my characters had a keen resemblance to miscellaneous video-game characters, from games like Mortal Kombat 2 or Might & Magic VI. They certainly lacked emotional struggles over insignificant things like mass murder or the end of the universe. As I certainly hope you will never read that book, I will spoil the ending for you. A whole planet died, hinting that the entire galaxy might share the same fate, with a faint hope for salvation. This could have led to a sequel, but fortunately for all sci-fi lovers, I stopped writing the sequel after nine chapters. The book seemed to be a success. A national TV channel did an interview with me, if that's any measure of success. More importantly, I had lots of fun writing it. But the book itself was hard to understand and probably impossible to appreciate. My biggest mistake was writing only what I considered fun. To be honest, I still write for fun, but now I have an audience in mind.
I tell the story of my passion for user experience mapping to a great audience: you. I try to find things that are fun to write and still interesting to my target audience. As a true UX geek, I create the main persona of my audience before writing anything and tell a story to her. This article's main persona is called Maya, and she shares many traits with my daughter. Could I say I'm writing this book to my daughter? Of course I do, but I keep in mind all the other personas. Hopefully one of them is a bit like you. Before a talk at a conference, I always ask the organizers about the audience. Even if the topic stays the same, the audience completely changes the story and the way I present it. I might tell the same user story differently to one of my team members, to the leader of another team, or to a chief executive. Differently internally, to a client or to a third party. When telling a story, contrary to a written story, you will see immediate feedback, or the lack of it, from your audience. You should go even further and shape the story based on this feedback. Telling a story is an interactive experience. Engage with the audience. Ask them questions, and let them ask you questions as a start, then turn this into a shared storytelling experience, where the story is being told by all participants together (not at the same time, though, unless you turn the workshop into a UX carol). When you tell a fairy tale to your daughter, she might ask you why the princess can't escape using her sharp wits and cunning, instead of relying on a prince challenging the dragon to a bloody duel. (Then you might start appreciating the story of My Little Pony, where the girl ponies solve challenges mostly non-violently while working together as a team of friends, instead of acting as a prize to be won.) So why not spin a tale of heroic princesses with fluffy pink cats?
Start with action Beginning in medias res, as in starting the narrative in the midst of action, is a technique used by masters such as Shakespeare or Homer, and it is also a powerful tool in your user story arsenal. While telling a story, always try to add as little background as possible, and start with drama or something to catch the attention of the audience, whenever possible. At the beginning of The Odyssey, quite a few unruly men want to marry Telemachus' mother, while his father has still not returned home from the Trojan War. There is no lengthy introduction explaining how those men ended up in Ithaca, or why the goddess, flashing-eyed Athena, cares about Odysseus. The poem was composed in an oral tradition and was more likely to be heard than read at the time of composition. While literacy has skyrocketed since Homer's time, you want to tell and discuss your user stories. Therefore you should consider a similar start. (Maybe not mentioning the user's mother or her rascally suitors.) Simplify In literary fiction, a complex story can be entertaining. A Game of Thrones and its sequels in the A Song of Ice and Fire series are a good example of that. The thing is, George R. R. Martin writes those novels, and he certainly has no intention of discussing them during sprint planning meetings with stakeholders. User story maps are more similar to sagas, folktales and other stories formed in an oral tradition. They develop in a discussion, and their understandability is granted by their simplicity. We need to create a map as simple and small as possible, with as few story cards as possible. So how big should the story map be? Jim Shore defines story card hell as something that happens when you have 300 story cards and you have to keep track of them. Madness, huh? This is not Sparta! Sorry Jim for the bad pun, but you are absolutely right: in the 300 range, you will not understand the map, and the whole discussion part will completely fail.
The user stories will be lost, and the audience will not even try to understand what's happening. There is no ideal number of cards in a story map, but aim low. Then eliminate most of the cards. Clutter will destroy the conversation. In most card games you will have from two to seven cards in hand, with some rare exceptions. The most popular card game both online and offline is Texas Hold 'em Poker. In that game, each player is dealt only two cards. This is because human thought processes and discussions work best with a small number of objects. Sometimes the number of objects in the real world is high. Our mind is good at simplifying, classifying and grouping things into manageable units. With that said, most books and conference presentations about user story mapping show us a photo of a wall covered with sticky notes. The viewer will have absolutely no idea what's on them, but one thing is certain: it looks like a complex project. I have bad news for you: projects with a complex user story map never get finished, and if they do get finished to a degree, they will fail. The abundance of sticky notes means that the communication and simplification process needs one or more iterations. Throw away most of the sticky notes! To do that, you need to understand the problem better. Tell the story of your passion Seriously. Find someone, and tell her the user story of the next big thing. The app or hardware which will change the world. Try it now. Be bold and let your imagination flow. I believe that in this century we will be able to digitalize human beings. This will be the key to both humankind's survival as a species and our exploration of space. The digital society would have no famine, no plagues and no poverty. This would solve all the major problems we face today. Digital humans would even defeat death. Sounds like a wild sci-fi idea? It is, but then again, smartphones were also a wild sci-fi idea a few decades ago.
Now I will tell you the story of something we can build today. The grocery surplus webshop We will create the user story map for a grocery surplus webshop. Using this eCommerce site, we will sell clearance food and drink at a discount price. This means food that would be thrown away at a regular store, for example food past its expiry date or with damaged packaging. This idea is popular in developed countries, like Denmark or the UK, and it might help cut down on the amount of food wasted every year, totaling 1.3 billion metric tonnes worldwide. We are trying to create the online-only version of WeFood (https://donate.danchurchaid.org/wefood). Our users can be environmentally conscious shoppers or low-income families with limited budgets, just to give two examples. In this article I will not introduce personas and treat them separately, so for now, we will only think about them as shoppers. The opportunity to scratch your own itch Mapping will help you to achieve the most with as little as possible. In other words: maximize the opportunity, while minimizing the outputs. To use the mapping lingo: the outputs are products of the map's usage, while the outcomes are the results. The opportunity is the desired outcome we plan to achieve with the aid of the map. This is how we want to change the world. We should start the map with the opportunity. The opportunity should not be defined as selling surplus food and drink to our visitors. If you approach a project or a business without solving the users' problem, the project might become a failure. The best way to find out what our users want is through valid user research, remote and lab-based user experience testing. Sometimes we need to settle for the second-best solution, which happens to be free. That's solving your own problem; in other words, scratch your own itch.
Probably the best summary of this mantra comes from Jason Fried, the founder and CEO of Basecamp: "When you solve your own problem, you create a tool that you're passionate about. And passion is key. Passion means you'll truly use it and care about it. And that's the best way to get others to feel passionate about it too." (Getting Real: The Smarter, Faster, Easier Way to Build a Successful Web Application) We will create the web store we would love to use. Although, as the cliché goes, there is no I in team, there is certainly an I in writer. My ideal eCommerce site could be different from yours. When following the examples of this article, please try to think of your itch, your ideal web store, and use my words only as guidance. You can create the user story map for any other project, ideally something you are passionate about. I would encourage you to pick something that's not a webshop, or maybe not even a digital product, if you feel adventurous. You need to tell the story of your passion. (No, not that passion. This is not an adults-only website.) My passion is reducing food waste (that's also the poor excuse I'm using when looking at the bathroom scale). Here is my attempt to phrase the opportunity. The opportunity: Our shoppers want to save money while reducing global food waste. They understand and accept what surplus food and drink means, and they are happy to shop with us. Actually, the first sentence would be enough. Remember, you want to have a simple one- or two-sentence opportunity definition. I ended up working for two tapestry webshops as a consultant. Not at the same time, though, and the second company approached me mostly as a result of how successful the first one was. It's a relatively small industry in Europe, and business owners and decision-makers know each other by name. I still recall the pleasant experience I had meeting the owners of the first webshop.
They invited me to dinner at a homely restaurant in Budapest. We had a great discussion and they shared their passion. They were an elderly couple, so they must have spent most of their lives in the communist era. In the early 90s they decided to start a business, selling tapestry in a brick and mortar store. Obviously, they had no background in management or running a capitalist business, but that didn't matter; they only wanted to help people make their homes beautiful. They loved tapestry, so they started importing and selling it. When I visited their physical store I saw them talking to a customer. They spent more than an hour discussing interior decoration with someone who had just popped by to ask the square meter price of tapestry. Tapestry is not sold per square meter, but they did the math for the customer, among many other things. They showed her many different patterns and types, and discussed application methods. After leaving the shop the customer knew more about tapestry than most other people ever will. Fast forward to the second contract. I only talked to the client on Skype, and that's perfectly fine, because most of my clients don't invite me to dinner. I saw many differences between this client's approach and the previous one's. At some point, I asked him, "Why do you sell tapestry? Is tapestry your passion?" He was a bit startled by the question, but he promptly replied: "To make money, why else? You need to be pretty crazy to have tapestry as a passion." Seven years later the second business no longer exists, yet the first one is still successful. Treating your work as your passion works wonders. Passion is an important contributor to the success of an idea. Whenever possible, pour your passion into a product and summarize it as the opportunity. What's next? If you buy my new book, User Experience Mapping, you will find more about user story maps in the second chapter.
In that chapter, we will explore user story maps, and how they help you to create requirements through collaboration (and a few sticky notes): We will create user stories and arrange them as a user story map. We will discuss the reasons behind creating them. We will learn how to tell a story. The grocery surplus webshop's user story map will be the example I will create in that chapter. To do this, we will explore user story templates, the characteristics of a good user story (INVEST) and epics. With the 3 Cs (Card, Conversation and Confirmation) process we will turn the stories into reality. We will create a user story map on a wall with sticky notes, then digitally using StoriesOnBoard. And that's just the second chapter; each of the eleven chapters contains different user experience maps. The book reveals two advanced mapping techniques for the first time in print: the behavioural change map and the 4D UX map. You will also explore user story maps, task models and journey maps. You will create wireflows, mental model maps, ecosystem maps and solution maps. In this book, we will show you how to use insights from real users to create and improve your maps and your product. Start mapping your products now to change your users' lives! Resources for Article: Further resources on this subject: Learning D3.js Mapping [article] Data Acquisition and Mapping [article] Creating User Interfaces [article]
Getting Started with Metasploit

Packt
22 Jun 2017
10 min read
In this article by Nipun Jaswal, the author of the book Metasploit Bootcamp, we will be covering the following topics:
Fundamentals of Metasploit
Benefits of using Metasploit
(For more resources related to this topic, see here.)
Penetration testing is the art of performing a deliberate attack on a network, web application, server or any device that requires a thorough check-up from the security perspective. The idea of a penetration test is to uncover flaws while simulating real-world threats. A penetration test is performed to figure out vulnerabilities and weaknesses in systems so that vulnerable systems can stay immune to threats and malicious activities. Achieving success in a penetration test largely depends on using the right set of tools and techniques. A penetration tester must choose the right set of tools and methodologies in order to complete a test. While talking about the best tools for penetration testing, the first one that comes to mind is Metasploit. It is considered one of the most practical tools to carry out penetration testing today. Metasploit offers a wide variety of exploits, a great exploit development environment, information gathering and web testing capabilities, and much more.
The fundamentals of Metasploit
Now that we have completed the setup of Kali Linux, let us talk about the big picture: Metasploit. Metasploit is a security project that provides exploits and tons of reconnaissance features to aid a penetration tester. Metasploit was created by H. D. Moore back in 2003, and since then its rapid development has led it to be recognized as one of the most popular penetration testing tools. Metasploit is an entirely Ruby-driven project and offers a great deal of exploits, payloads, encoding techniques, and loads of post-exploitation features.
Metasploit comes in various editions, as follows:
Metasploit Pro: This is a commercial edition. It offers tons of great features, such as web application scanning and exploitation and automated exploitation, and is quite suitable for professional penetration testers and IT security teams. The Pro edition is used for advanced penetration tests and enterprise security programs.
Metasploit Express: The Express edition is used for baseline penetration tests. Features in this version of Metasploit include smart exploitation, automated brute forcing of credentials, and much more. This version is quite suitable for IT security teams in small to medium-sized companies.
Metasploit Community: This is a free version with reduced functionality compared to the Express edition. However, for students and small businesses, this edition is a favorable choice.
Metasploit Framework: This is a command-line version with all manual tasks, such as manual exploitation, third-party import, and so on. This release is entirely suitable for developers and security researchers.
You can download Metasploit from the following link: https://www.rapid7.com/products/metasploit/download/editions/
We will be using the Metasploit Community and Framework versions. Metasploit also offers various types of user interfaces, as follows:
The graphical user interface (GUI): This has all the options available at the click of a button. It offers a user-friendly interface that helps to provide cleaner vulnerability management.
The console interface: This is the most preferred interface and the most popular one as well. This interface provides an all-in-one approach to all the options offered by Metasploit. It is also considered one of the most stable interfaces.
The command-line interface: This is the more potent interface that supports the launching of exploits and activities such as payload generation.
However, remembering each and every command while using the command-line interface is a difficult job.
Armitage: Armitage by Raphael Mudge added a neat hacker-style GUI interface to Metasploit. Armitage offers easy vulnerability management, built-in NMAP scans, exploit recommendations, and the ability to automate features using the Cortana scripting language.
Basics of the Metasploit framework
Before we put our hands on the Metasploit framework, let us understand the basic terminology used in Metasploit. However, the following are not just terminologies but modules that are the heart and soul of the Metasploit project:
Exploit: This is a piece of code which, when executed, will trigger the vulnerability at the target.
Payload: This is a piece of code that runs at the target after successful exploitation. It defines the type of access and actions we need to gain on the target system.
Auxiliary: These are modules that provide additional functionalities such as scanning, fuzzing, sniffing, and much more.
Encoder: Encoders are used to obfuscate modules to avoid detection by a protection mechanism such as an antivirus or a firewall.
Meterpreter: This is a payload that uses in-memory stagers based on DLL injection. It provides a variety of functions to perform at the target, which makes it a popular choice.
Architecture of Metasploit
Metasploit comprises various components, such as extensive libraries, modules, plugins, and tools. A diagrammatic view of the structure of Metasploit is as follows:
Let's see what these components are and how they work. It is best to start with the libraries that act as the heart of Metasploit.
Let's understand the use of the various libraries, as explained in the following table:
REX: Handles almost all core functions, such as setting up sockets, connections, formatting, and all other raw functions
MSF CORE: Provides the underlying API and the actual core that describes the framework
MSF BASE: Provides friendly API support to modules
We have many types of modules in Metasploit, and they differ regarding their functionality. We have payload modules for creating access channels to exploited systems. We have auxiliary modules to carry out operations such as information gathering, fingerprinting, fuzzing an application, and logging into various services. Let's examine the basic functionality of these modules, as shown in the following table:
Payloads: Payloads are used to carry out operations such as connecting to or from the target system after exploitation, or performing a particular task such as installing a service, and so on. Payload execution is the next step after the system is exploited successfully.
Auxiliary: Auxiliary modules are a special kind of module that performs specific tasks such as information gathering, database fingerprinting, scanning the network to find a particular service, enumeration, and so on.
Encoders: Encoders are used to encode payloads and attack vectors to (or intending to) evade detection by antivirus solutions or firewalls.
NOPs: NOP generators are used for alignment, which results in making exploits stable.
Exploits: The actual code that triggers a vulnerability.
Metasploit framework console and commands
Having gathered knowledge of the architecture of Metasploit, let us now run Metasploit to get hands-on knowledge of the commands and the different modules. To start Metasploit, we first need to establish a database connection so that everything we do can be logged into the database. The usage of databases also speeds up Metasploit's load time by making use of cache and indexes for all modules.
Therefore, let us start the postgresql service by typing the following command at the terminal:
root@beast:~# service postgresql start
Now, to initialize Metasploit's database, let us initialize msfdb as shown in the following screenshot:
It is clearly visible in the preceding screenshot that we have successfully created the initial database schema for Metasploit. Let us now start Metasploit's database using the following command:
root@beast:~# msfdb start
We are now ready to launch Metasploit. Let us issue msfconsole in the terminal to start Metasploit, as shown in the following screenshot:
Welcome to the Metasploit console. Let us run the help command to see what other commands are available to us:
The commands in the preceding screenshot are core Metasploit commands, which are used to set/get variables, load plugins, route traffic, unset variables, print the version, find the history of commands issued, and much more. These commands are pretty general. Let's see the module-based commands, as follows:
Everything related to a particular module in Metasploit comes under the module controls section of the help menu. Using the preceding commands, we can select a particular module, load modules from a particular path, get information about a module, show core and advanced options related to a module, and even edit a module inline.
Let us learn some basic commands in Metasploit and familiarize ourselves with their syntax and semantics:
use [auxiliary/exploit/payload/encoder]: To select a particular module. Examples: msf> use exploit/unix/ftp/vsftpd_234_backdoor, msf> use auxiliary/scanner/portscan/tcp
show [exploits/payloads/encoders/auxiliary/options]: To see the list of available modules of a particular type. Examples: msf> show payloads, msf> show options
set [option/payload]: To set a value for a particular object. Examples: msf> set payload windows/meterpreter/reverse_tcp, msf> set LHOST 192.168.10.118, msf> set RHOST 192.168.10.112, msf> set LPORT 4444, msf> set RPORT 8080
setg [option/payload]: To assign a value to a particular object globally, so the value does not change when a module is switched. Example: msf> setg RHOST 192.168.10.112
run: To launch an auxiliary module after all the required options are set. Example: msf> run
exploit: To launch an exploit. Example: msf> exploit
back: To unselect a module and move back. Example: msf(ms08_067_netapi)> back
info: To list the information related to a particular exploit/module/auxiliary. Examples: msf> info exploit/windows/smb/ms08_067_netapi, msf(ms08_067_netapi)> info
search: To find a particular module. Example: msf> search hfs
check: To check whether a particular target is vulnerable to the exploit or not. Example: msf> check
sessions: To list the available sessions. Example: msf> sessions [session number]
The following are some useful Meterpreter commands:
sysinfo: To list the system information of the compromised host. Example: meterpreter> sysinfo
ifconfig: To list the network interfaces on the compromised host. Examples: meterpreter> ifconfig, meterpreter> ipconfig (Windows)
arp: To list the IP and MAC addresses of hosts connected to the target. Example: meterpreter> arp
background: To send an active session to the background. Example: meterpreter> background
shell: To drop a cmd shell on the target. Example: meterpreter> shell
getuid: To get the current user details. Example: meterpreter> getuid
getsystem: To escalate privileges and gain system access. Example: meterpreter> getsystem
getpid: To get the process ID of the Meterpreter session. Example: meterpreter> getpid
ps: To list all the processes running on the target. Example: meterpreter> ps
If you are using Metasploit for the very first time, refer to http://www.offensive-security.com/metasploit-unleashed/Msfconsole_Commands for more information on basic commands.
Benefits of using Metasploit
Metasploit is an excellent choice when compared to traditional manual techniques because of certain factors, which are listed as follows:
The Metasploit framework is open source
Metasploit supports large testing networks by making use of CIDR identifiers
Metasploit offers quick generation of payloads, which can be changed or switched on the fly
Metasploit leaves the target system stable in most cases
The GUI environment provides a fast and user-friendly way to conduct penetration testing
Summary
Throughout this article, we learned the basics of Metasploit. We learned about the syntax and semantics of various Metasploit commands. We also learned the benefits of using Metasploit. Resources for Article: Further resources on this subject: Approaching a Penetration Test Using Metasploit [article] Metasploit Custom Modules and Meterpreter Scripting [article] So, what is Metasploit? [article]

Packt
22 Jun 2017
21 min read

String Encryption and Decryption

In this article by Brenton J.W. Blawat, author of the book Enterprise PowerShell Scripting Bootcamp, we will learn about string encryption and decryption. Large enterprises often have very strict security standards that are required by industry-specific regulations. When you are creating your Windows server scanning script, you will need to approach the script carefully with certain security concepts in mind. One of the most common situations you may encounter is the need to leverage sensitive data, such as credentials, in your script. While you could prompt for sensitive data during runtime, most enterprises want to automate the full script using zero-touch automation.

(For more resources related to this topic, see here.)

Zero-touch automation requires that the scripts are self-contained and have all of the required credentials and components to successfully run. The problem with incorporating sensitive data in the script, however, is that the data can be obtained in clear text. The usage of clear text passwords in scripts is a bad practice, and violates many regulatory and security standards. As a result, PowerShell scripters need a method to securely store and retrieve sensitive data for use in their scripts. One of the popular methods to secure sensitive data is to encrypt the sensitive strings. This article explores RijndaelManaged symmetric encryption, and how to use it to encrypt and decrypt strings using PowerShell. In this article, we will cover the following topics:

- Learn about RijndaelManaged symmetric encryption
- Understand the salt, init, and password for the encryption algorithm
- Script a method to create randomized salt, init, and password values
- Encrypt and decrypt strings using RijndaelManaged encryption
- Create an encoding and data separation security mechanism for encryption passwords

The examples in this article build upon each other. You will need to execute the scripts sequentially to have the final script in this article work properly. 
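Before looking at encryption itself, it is worth seeing why simple encoding is no substitute for it. The short Python sketch below (a cross-language illustration with a made-up secret, not part of this article's PowerShell scripts) shows that a Base64-"protected" value can be recovered by anyone who has the script, with no password involved:

```python
import base64

# A secret "protected" only by encoding -- no key or password is involved.
secret = "SuperSecretPassword!"
encoded = base64.b64encode(secret.encode("utf-8")).decode("ascii")

# Anyone holding the encoded value can reverse it just as easily.
recovered = base64.b64decode(encoded).decode("utf-8")
print(recovered == secret)  # True -- encoding alone offers no real protection
```

This is precisely why the rest of the article relies on a keyed algorithm rather than on encoding alone.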
RijndaelManaged encryption

When you are creating your scripts, it is best practice to leverage some sort of obfuscation or encryption for sensitive data. There are many different strategies that you can use to secure your data. One is leveraging string and script encoding. Encoding takes your human-readable string or script and scrambles it to make it more difficult for someone to see what the actual code is. The downsides of encoding are that you must decode the script to make changes to it, and that decoding does not require the use of a password or passphrase. Thus, someone can easily decode your sensitive data using the same method you would use to decode the script. The alternative to encoding is leveraging an encryption algorithm. Encryption algorithms provide multiple mechanisms to secure your scripts and strings. While you can encrypt your entire script, encryption is most commonly used to protect the sensitive data in the scripts themselves, or in answer files. One of the most popular encryption algorithms to use with PowerShell is RijndaelManaged. RijndaelManaged is a symmetric block cipher algorithm that was selected by the United States National Institute of Standards and Technology (NIST) for its implementation of the Advanced Encryption Standard (AES). When used as the implementation of AES, RijndaelManaged supports 128-bit, 192-bit, and 256-bit encryption. In contrast to encoding, encryption algorithms require additional information to be able to properly encrypt and decrypt the string. When implementing RijndaelManaged in PowerShell, the algorithm requires a salt, a password, and an initialization vector (IV). The salt is typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of salt in a traditional encryption scenario is to change the encrypted value each time the encryption function is used. This is important in scenarios where you are encrypting multiple passwords or strings with the same value. 
If two users are using the same password, the encrypted value in the database would also be the same. By changing the salt each time, the passwords, though the same value, would have different encrypted values in the database. In this article, we will be leveraging a static salt value. The password is typically a value that is manually entered by a user, or fed into the script using a parameter block. You can also derive the password value from a certificate, Active Directory attribute values, or a multitude of other sources. In this article, we will be leveraging three sources for the password. The initialization vector (IV) is a hash generated from the IV string and is used with the encryption key. The IV string is also typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of the IV string is to strengthen the hash created by the encryption algorithm. It was created to thwart rainbow attacks, which rely on hash tables precalculated with no IV strings or with commonly used strings. Since you are setting the IV string, the number of hash combinations increases exponentially, which reduces the effectiveness of a rainbow attack. In this article, we will be using a static initialization vector value. The randomization of the salt and initialization vector strings becomes more important in scenarios where you are encrypting a large set of data. An attacker can intercept hundreds of thousands of packets, or strings, which reveal an increasing amount of information about your IV. With this, the attacker can guess the IV and derive the password. The most notable hack of IVs was with the Wired Equivalent Privacy (WEP) wireless protocol, which used a weak, or small, initialization vector. After capturing enough packets, an IV hash could be guessed and a hacker could easily obtain the passphrase used on the wireless network. 
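The effect of the salt described above is easy to demonstrate. The following Python sketch (a cross-language illustration with hypothetical salt values; Python's standard pbkdf2_hmac plays a role comparable to, though not byte-compatible with, .NET's PasswordDeriveBytes) derives keys from the same password using two different salts, and the resulting keys differ:

```python
import hashlib

password = b"A_Complex_Password_With_A_Lot_Of_Characters"

# Same password, two different salts, 50000 iterations, 32-byte (256-bit) keys.
key_a = hashlib.pbkdf2_hmac("sha1", password, b"SaltNumberOne", 50000, dklen=32)
key_b = hashlib.pbkdf2_hmac("sha1", password, b"SaltNumberTwo", 50000, dklen=32)

print(key_a != key_b)  # True -- identical passwords no longer encrypt identically
print(len(key_a) * 8)  # 256 -- 32 bytes is a 256-bit key
```

The same principle is why two users with the same password end up with different stored values once each record carries its own salt.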
Creating random salt, initialization vector, and passwords

As you are creating your scripts, you will want to make sure you use complex random values for the salt, IV string, and password. This is to prevent dictionary attacks, where an individual may use common passwords and phrases to guess the salt, IV string, and password. When you create your salt and IV strings, make sure they are a minimum of 10 random characters each. It is also recommended that you use a minimum of 30 random characters for the password. To create random passwords in PowerShell, you can do the following:

Function create-password {
    # Declare password variable outside of loop.
    $password = ""
    # For numbers between 33 and 126
    For ($a=33; $a -le 126; $a++) {
        # Add the ASCII text for the ASCII number referenced.
        $ascii += ,[char][byte]$a
    }
    # Generate a random character from the $ascii character set.
    # Repeat 30 times, or create 30 random characters.
    1..30 | ForEach { $password += $ascii | get-random }
    # Return the password
    return $password
}

# Create four 30 character passwords
create-password
create-password
create-password
create-password

The output of this command would look like the following:

This function will create a string with 30 random characters for use with random password creation. You first start by declaring the create-password function. You then declare the $password variable for use within the function by setting it equal to "". The next step is creating a For loop to iterate through a set of numbers. These numbers represent the ASCII character numbers that you can select from for the password. You create the For loop by writing For ($a=33; $a -le 126; $a++). This means: start at the number 33, increase the value by one ($a++), and continue while the number is less than or equal to 126. You then declare the $ascii variable and build it up using the += assignment operator. As the For loop goes through its iterations, it adds a character to the array values. 
The script then leverages the [char], or character value, of the [byte] number contained in $a. After this section, the $ascii array will contain all the ASCII characters with byte values between 33 and 126. You then continue to the random character generation. You declare the 1..30 command, which means: for the numbers 1 to 30, repeat the following command. You pipe this to ForEach {, which executes the script block for each of the 30 iterations. You then call the $ascii array and pipe it to the | get-random cmdlet. The get-random cmdlet will randomly select one of the characters in the $ascii array. This value is then joined to the existing values in the $password string using the assignment operator +=. After the 30 iterations, there will be 30 random values in the $password variable. Lastly, you leverage return $password to return this value to the script. After declaring the function, you call the function four times using create-password. This creates four random passwords for use. To create strings that are fewer than 30 random characters in length, you can modify the 1..30 range to be any value that you want. If you want a 15 random character salt and initialization vector, you would use 1..15 instead.

Encrypting and decrypting strings

To start using RijndaelManaged encryption, you need to import the .NET System.Security assembly into your script. Much like importing a module to provide additional cmdlets, using .NET assemblies provides an extension to a variety of classes you wouldn't normally have access to in PowerShell. Importing the assembly isn't persistent. This means you will need to import the assembly each time you want to use it in a PowerShell session, or each time you want to run the script. To load the .NET assembly, you can use the Add-Type cmdlet with the -AssemblyName parameter and the System.Security argument. Since the cmdlet doesn't actually output anything to the screen, you may choose to print the successful importing of the assembly to the screen. 
To import the System.Security assembly with display information, you can do the following:

Write-host "Loading the .NET System.Security Assembly For Encryption"
Add-Type -AssemblyName System.Security -ErrorAction SilentlyContinue -ErrorVariable err
if ($err) {
    Write-host "Error Importing the .NET System.Security Assembly."
    PAUSE
    EXIT
}
# if err is not set, it was successful.
if (!$err) {
    Write-host "Successfully loaded the .NET System.Security Assembly For Encryption"
}

The output from this command looks like the following:

In this example, you successfully import the .NET System.Security assembly for use with PowerShell. You first start by writing "Loading the .NET System.Security Assembly For Encryption" to the screen using the Write-host cmdlet. You then leverage the Add-Type cmdlet with the -AssemblyName parameter set to the System.Security argument, the -ErrorAction parameter set to the SilentlyContinue argument, and the -ErrorVariable parameter set to the err argument. You then create an if statement to see whether $err contains data. If it does, it will use the Write-host cmdlet to print "Error Importing the .NET System.Security Assembly." to the screen. It will PAUSE the script so the error can be read. Finally, it will exit the script. If $err is $null, designated by if (!$err) {, it will use the Write-host cmdlet to print "Successfully loaded the .NET System.Security Assembly For Encryption" to the screen. At this point, the script or PowerShell window is ready to leverage the System.Security assembly.

After you load the System.Security assembly, you can start creating the encryption function. RijndaelManaged encryption requires a four-step process to encrypt the strings, which is represented in the preceding diagram. The RijndaelManaged encryption process is as follows: The process starts by creating the encryptor. The encryptor is derived from the encryption key (password and salt) and the initialization vector. 
After you define the encryptor, you need to create a new memory stream using the IO.MemoryStream object. A memory stream is what stores values in memory for use by the encryption assembly. Once the memory stream is open, you define a System.Security.Cryptography.CryptoStream object. The CryptoStream is the mechanism that uses the memory stream and the encryptor to transform the unencrypted data into encrypted data. In order to leverage the CryptoStream, you need to write data to it. The final step is to use the IO.StreamWriter object to write the unencrypted value into the CryptoStream. The output from this transformation is placed into the MemoryStream. To access the encrypted value, you read the data in the memory stream. To learn more about the System.Security.Cryptography.RijndaelManaged class, you can view the following MSDN article: https://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged(v=vs.110).aspx.

To create a script that encrypts strings using RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security

function Encrypt-String {
    param($String, $Pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
    try {
        $r = new-Object System.Security.Cryptography.RijndaelManaged
        $pass = [Text.Encoding]::UTF8.GetBytes($pass)
        $salt = [Text.Encoding]::UTF8.GetBytes($salt)
        $init = [Text.Encoding]::UTF8.GetBytes($init)
        $r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
        $r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
        $c = $r.CreateEncryptor()
        $ms = new-Object IO.MemoryStream
        $cs = new-Object Security.Cryptography.CryptoStream $ms,$c,"Write"
        $sw = new-Object IO.StreamWriter $cs
        $sw.Write($String)
        $sw.Close()
        $cs.Close()
        $ms.Close()
        $r.Clear()
        [byte[]]$result = $ms.ToArray()
    }
    catch {
        $err = "Error Occurred Encrypting String: $_"
    }
    if($err) {
        # Report Back Error
        return $err
    }
    else {
        return [Convert]::ToBase64String($result)
    }
}

Encrypt-String "Encrypt This String" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script would look like the following:

This function displays how to encrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly by leveraging the Add-Type cmdlet, using the -AssemblyName parameter with the System.Security argument. You then declare the Encrypt-String function. You include a parameter block to accept and set values in the function. The first value is $String, which is the unencrypted text. The second value is $Pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit". After the parameter block, you declare try { to handle any errors in the .NET assembly. The first step is to declare the encryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the UTF8 character encoding standard and store the character byte values in each variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable. After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, the $salt variable, the "SHA1" hash name, and 50000 iterations of the derivation. 
Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key. The RijndaelManaged 256-bit encryption is a derivative of the 32 bytes in the key: 32 bytes times 8 bits per byte is 256 bits. To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages Security.Cryptography.SHA1Managed and computes the hash based on the $init value. When you invoke the [0..15] range operator, it will obtain the first 16 bytes of the hash and place them into the $r.IV attribute. The RijndaelManaged default block size for the initialization vector is 128 bits: 16 bytes times 8 bits per byte is 128 bits. After setting up the required attributes, you are now ready to start encrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateEncryptor() method to generate the encryptor. Once you've generated the encryptor, you have to create a memory stream to do the encryption in memory. This is done by declaring the new-Object cmdlet with the IO.MemoryStream class, and placing the memory stream object in the $ms variable. Next, you create the CryptoStream. The CryptoStream is used to transform the unencrypted data into the encrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream argument. You also define the memory stream of $ms, the encryptor of $c, and the operator of "Write" to tell the class to write unencrypted data to the encryption stream in memory. After creating the CryptoStream, you are ready to write the unencrypted data into it. This is done using the IO.StreamWriter class. You declare a new-Object cmdlet with the IO.StreamWriter argument, and define the CryptoStream of $cs for writing. Lastly, you take the unencrypted string stored in the $String variable and pass it into the StreamWriter $sw with $sw.Write($String). The encrypted value is now stored in the memory stream. To stop the writing of data to the CryptoStream and MemoryStream, you close the StreamWriter with $sw.Close(), close the CryptoStream with $cs.Close(), and close the memory stream with $ms.Close(). For security purposes, you also clear out the encryptor data by declaring $r.Clear(). After the encryption process is done, you need to export the memory stream to a byte array. This is done by calling the $ms.ToArray() method and setting it to the $result variable with the [byte[]] data type. The contents are stored in a byte array in $result. Next, you declare the catch { statement. If there were any errors in the encryption process, the script will execute this section. You declare the $err variable with the "Error Occurred Encrypting String: $_" argument. The $_ will be the pipeline error that occurred during the try {} section. You then create an if statement to determine whether there is data in the $err variable. If there is data in $err, it returns the error string to the script. If there were no errors, the script will enter the else { section of the script. It will convert the $result byte array to a Base64 string by leveraging [Convert]::ToBase64String($result). This converts the byte array to a string for use in your scripts. After defining the encryption function, you call the function for use. You first start by calling Encrypt-String followed by "Encrypt This String". You also declare the second argument as the password for the encryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, this example receives the encrypted value of hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c= generated from the function. 
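The key and IV sizes used above, and the final Base64 conversion, can be checked with a short Python sketch (a cross-language illustration with hypothetical values; it mirrors the sizes and the Base64 round trip only, not the exact bytes produced by the .NET classes):

```python
import base64
import hashlib

password, salt, init = b"pass", b"CreateAUniqueSalt", b"CreateAUniqueInit"

# A 32-byte key (256 bits) and the first 16 bytes (128 bits) of a SHA1 hash.
key = hashlib.pbkdf2_hmac("sha1", password, salt, 50000, dklen=32)
iv = hashlib.sha1(init).digest()[:16]
print(len(key) * 8, len(iv) * 8)  # 256 128

# Encrypted bytes travel as a Base64 string, as [Convert]::ToBase64String does.
ciphertext = b"\x01\x02\x03"  # stand-in bytes, not real AES output
as_text = base64.b64encode(ciphertext).decode("ascii")
print(base64.b64decode(as_text) == ciphertext)  # True
```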
Your results will vary depending on the salt, init, and password you use for the encryption algorithm.

Decrypting strings

The decryption of strings is very similar to the process you performed for encrypting strings. Instead of writing data to the memory stream, the function reads the data in the memory stream. Also, instead of using the .CreateEncryptor() method, the decryption process leverages the .CreateDecryptor() method. To create a script that decrypts strings encrypted with RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security

function Decrypt-String {
    param($Encrypted, $pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
    if($Encrypted -is [string]){
        $Encrypted = [Convert]::FromBase64String($Encrypted)
    }
    $r = new-Object System.Security.Cryptography.RijndaelManaged
    $pass = [System.Text.Encoding]::UTF8.GetBytes($pass)
    $salt = [System.Text.Encoding]::UTF8.GetBytes($salt)
    $init = [Text.Encoding]::UTF8.GetBytes($init)
    $r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
    $r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
    $d = $r.CreateDecryptor()
    $ms = new-Object IO.MemoryStream @(,$Encrypted)
    $cs = new-Object Security.Cryptography.CryptoStream $ms,$d,"Read"
    $sr = new-Object IO.StreamReader $cs
    try {
        $result = $sr.ReadToEnd()
        $sr.Close()
        $cs.Close()
        $ms.Close()
        $r.Clear()
        Return $result
    }
    Catch {
        Write-host "Error Occurred Decrypting String: Wrong String Used In Script."
    }
}

Decrypt-String "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script would look like the following:

This function displays how to decrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly by leveraging the Add-Type cmdlet, using the -AssemblyName parameter with the System.Security argument. 
You then declare the Decrypt-String function. You include a parameter block to accept and set values for the function. The first value is $Encrypted, which is the encrypted text. The second value is $pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit". After the parameter block, you check to see whether the encrypted value is formatted as a string by using if ($Encrypted -is [string]) {. If this evaluates to True, you convert the string to bytes using [Convert]::FromBase64String($Encrypted) and place the decoded value in the $Encrypted variable. Next, you declare the decryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the UTF8 character encoding standard and store the character byte values in each variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable. After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, the $salt variable, the "SHA1" hash name, and 50000 iterations of the derivation. Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key. To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages the Security.Cryptography.SHA1Managed class and computes the hash based on the $init value. When you invoke the [0..15] range operator, the first 16 bytes of the hash are obtained and placed into the $r.IV attribute. After setting up the required attributes, you are now ready to start decrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateDecryptor() method to generate the decryptor. Once you've generated the decryptor, you have to create a memory stream to do the decryption in memory. This is done by declaring the new-Object cmdlet with the IO.MemoryStream class argument. You then reference the $Encrypted values to place in the memory stream object with @(,$Encrypted), and store the populated memory stream in the $ms variable. Next, you create the CryptoStream, which is used to transform the encrypted data into the decrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream class argument. You also define the memory stream of $ms, the decryptor of $d, and the operator of "Read" to tell the class to read the encrypted data from the encryption stream in memory. After creating the CryptoStream, you are ready to read the decrypted data from it. This is done using the IO.StreamReader class. You declare a new-Object cmdlet with the IO.StreamReader class argument, and define the CryptoStream of $cs to read from. At this point, you use try { to catch any error messages that are generated from reading the data in the StreamReader. You call $sr.ReadToEnd(), which reads the complete decrypted value from the StreamReader and places the data in the $result variable. To stop the reading of data from the CryptoStream and MemoryStream, you close the StreamReader with $sr.Close(), close the CryptoStream with $cs.Close(), and close the memory stream with $ms.Close(). For security purposes, you also clear out the decryptor data by declaring $r.Clear(). If the decryption is successful, you return the value of $result to the script. After defining the decryption function, you call the function for use. You first start by calling Decrypt-String followed by "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=". You also declare the second argument as the password for the decryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, you will receive the decrypted value of "Encrypt This String" generated from the function.

Summary

In this article, we learned about RijndaelManaged 256-bit encryption. We first started with the basics of the encryption process. Then, we proceeded to learn how to create randomized salt, init, and password values in scripts. We ended the article by learning how to encrypt and decrypt strings.

Resources for Article:

Further resources on this subject:

- WLAN Encryption Flaws [article]
- Introducing PowerShell Remoting [article]
- SQL Server with PowerShell [article]
Packt
22 Jun 2017
4 min read

Inbuilt Data Types in Python

This article by Benjamin Baka, author of the book Python Data Structures and Algorithms, explains the inbuilt data types in Python. Python data types can be divided into three categories: numeric, sequence, and mapping. There is also the None object, which represents a Null, or absence of a value. It should not be forgotten that other objects, such as classes, files, and exceptions, can also properly be considered types; however, they will not be considered here.

(For more resources related to this topic, see here.)

Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python's inbuilt data types are outlined in the following table:

- None: None (the null object)
- Numeric: int (integer), float (floating point number), complex (complex number), bool (Boolean: True, False)
- Sequences: str (string of characters), list (list of arbitrary objects), tuple (group of arbitrary items), range (a range of integers)
- Mapping: dict (dictionary of key-value pairs), set (mutable, unordered collection of unique items), frozenset (immutable set)

None type

The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value, and it evaluates to False in Boolean expressions. It is often used as the default value for optional arguments, to allow the function to detect whether the caller has passed a value.

Numeric types

All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double-precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. 
They are assigned using the j operator to signify the imaginary part of the complex number. For example:

a = 2+3j

We can access the real and imaginary parts with a.real and a.imag respectively.

Representation error

It should be noted that the native double-precision representation of floating point numbers leads to some unexpected results. For example, consider the following:

In [14]: 1-0.9
Out[14]: 0.09999999999999998

In [15]: 1-0.9 == 0.1
Out[15]: False

This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue, Python provides a decimal module. This module allows for the exact representation of decimal numbers and facilitates greater control over properties such as rounding behaviour, the number of significant digits, and precision. It defines two objects: a Decimal type, representing decimal numbers, and a Context type, representing various computational parameters such as precision, rounding, and error handling. An example of its usage can be seen in the following:

In [1]: import decimal
In [2]: x = decimal.Decimal(3.14); y = decimal.Decimal(2.74)
In [3]: x*y
Out[3]: Decimal('8.60360000000001010036498883')
In [4]: decimal.getcontext().prec = 4
In [5]: x * y
Out[5]: Decimal('8.604')

Here we have created a global context and set the precision to 4. The Decimal object can be treated pretty much as you would treat an int or a float. It is subject to all the same mathematical operations, can be used as a dictionary key, placed in a set, and so on. In addition, Decimal objects also have several methods for mathematical operations, such as natural exponents, x.exp(), natural logarithms, x.ln(), and base 10 logarithms, x.log10(). Python also has a fractions module that implements a rational number type. 
The following shows several ways to create fractions:

```python
In [62]: import fractions

In [63]: fractions.Fraction(3, 4)   # creates the fraction 3/4
Out[63]: Fraction(3, 4)

In [64]: fractions.Fraction(0.5)    # creates a fraction from a float
Out[64]: Fraction(1, 2)

In [65]: fractions.Fraction('.25')  # creates a fraction from a string
Out[65]: Fraction(1, 4)
```

It is also worth mentioning the NumPy extension here. This has types for mathematical objects such as arrays, vectors, and matrices, and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations, and much more.

Summary

We have looked at the built-in data types and some internal Python modules, most notably the collections module. There are also a number of external libraries, such as the SciPy stack.

Resources for Article:

Further resources on this subject:

- Python Data Structures [article]
- Getting Started with Python Packages [article]
- An Introduction to Python Lists and Dictionaries [article]
Packt
22 Jun 2017
19 min read

Understanding Microservices

This article by Tarek Ziadé, author of the book Python Microservices Development, explains the benefits and implementation of microservices with Python. While the microservices architecture looks more complicated than its monolithic counterpart, its advantages are multiple. It offers the following benefits.

(For more resources related to this topic, see here.)

Separation of concerns

First of all, each microservice can be developed independently by a separate team. For instance, building a reservation service can be a full project on its own. The team in charge can build it in whatever programming language and with whatever database they want, as long as it has a well-documented HTTP API. That also means the evolution of the app is more under control than with monoliths. For example, if the payment system changes its underlying interactions with the bank, the impact is localized inside that service, and the rest of the application stays stable and under control. This loose coupling considerably improves the overall project velocity, as we are applying at the service level a philosophy similar to the single responsibility principle.

The single responsibility principle was defined by Robert Martin to explain that a class should have only one reason to change; in other words, each class should provide a single, well-defined feature. Applied to microservices, it means that we want to make sure that each microservice focuses on a single role.

Smaller projects

The second benefit is breaking down the complexity of the project. When you are adding a feature to an application, like PDF reporting, even if you are doing it cleanly, you are making the code base bigger, more complicated, and sometimes slower. Building that feature in a separate application avoids this problem and makes it easier to write it with whatever tools you want. You can refactor it often, shorten your release cycles, and stay on top of things. The growth of the application remains under your control.
Dealing with a smaller project also reduces risks when improving the application: if a team wants to try out the latest programming language or framework, they can iterate quickly on a prototype that implements the same microservice API, try it out, and decide whether or not to stick with it. One real-life example is the Firefox Sync storage microservice. There are currently some experiments to switch from the current Python+MySQL implementation to a Go-based one that stores users' data in standalone SQLite databases. That prototype is highly experimental, but since we have isolated the storage feature in a microservice with a well-defined HTTP API, it is easy enough to give it a try with a small subset of the user base.

Scaling and deployment

Lastly, having your application split into components makes it easier to scale depending on your constraints. Let's say you are starting to get a lot of customers who are booking hotels daily, and the PDF generation is starting to heat up the CPUs. You can deploy that specific microservice on servers with bigger CPUs. Another typical example is RAM-consuming microservices, like the ones that interact with in-memory databases such as Redis or Memcache. You could tweak your deployments accordingly by deploying them on servers with less CPU and a lot more RAM.

To summarize, microservices bring the following benefits:

- A team can develop each microservice independently, and use whatever technological stack makes sense. They can define a custom release cycle. The tip of the iceberg is its language-agnostic HTTP API.
- Developers break the application complexity into logical components. Each microservice focuses on doing one thing well.
- Since microservices are standalone applications, there's finer control over deployments, which makes scaling easier.

Microservices architectures are good at solving a lot of the problems that may arise once your application starts to grow.
However, we need to be aware of some of the new issues they also bring in practice.

Implementing microservices with Python

Python is an amazingly versatile language. As you probably already know, it's used to build many different kinds of applications, from simple system scripts that perform tasks on a server to large object-oriented applications that run services for millions of users. According to a study conducted by Philip Guo in 2014, published on the Association for Computing Machinery (ACM) website, Python has surpassed Java in top U.S. universities and is the most popular language for learning Computer Science. This trend is also true in the software industry. Python now sits in the top 5 languages in the TIOBE index (http://www.tiobe.com/tiobe-index/), and it's probably even bigger in web development, since languages like C are rarely used as main languages to build web applications. However, some developers criticize Python for being slow and unfit for building efficient web services. Python is slow, and this is undeniable. But it is still a language of choice for building microservices, and many major companies are happily using it. This section will give you some background on the different ways you can write microservices using Python, some insights on asynchronous versus synchronous programming, and conclude with some details on Python performance. It is composed of five parts:

- The WSGI standard
- Greenlet & Gevent
- Twisted & Tornado
- asyncio
- Language performances

The WSGI standard

What strikes web developers starting with Python the most is how easy it is to get a web application up and running. The Python web community has created a standard, inspired by the Common Gateway Interface (CGI), called the Web Server Gateway Interface (WSGI). It greatly simplifies how you can write a Python application whose goal is to serve HTTP requests.
When your code is using that standard, your project can be executed by standard web servers like Apache or NGinx, using WSGI extensions like uwsgi or mod_wsgi. Your application just has to deal with incoming requests and send back JSON responses, and Python includes all that goodness in its standard library. You can create a fully functional microservice that returns the server's local time with a vanilla Python module of fewer than ten lines:

```python
import json
import time

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    # WSGI expects an iterable of bytestrings
    return [bytes(json.dumps({'time': time.time()}), 'utf8')]
```

Since its introduction, the WSGI protocol has become an essential standard, and the Python web community has widely adopted it. Developers wrote middlewares, which are functions you can hook before or after the WSGI application function itself, to do something within the environment. Some web frameworks, like Bottle (http://bottlepy.org), were created specifically around that standard, and soon enough, every framework out there could be used through WSGI in one way or another. The biggest problem with WSGI, though, is its synchronous nature. The application function you see above is called exactly once per incoming request, and when the function returns, it has to send back the response. That means that every time you call the function, it will block until the response is ready. And writing microservices means your code will be waiting for responses from various network resources all the time. In other words, your application will idle and just block the client until everything is ready. That's an entirely okay behavior for HTTP APIs; we're not talking about building bidirectional applications like WebSocket-based ones. But what happens when you have several incoming requests calling your application at the same time? WSGI servers will let you run a pool of threads to serve several requests concurrently.
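A middleware of the kind described above is just a callable that wraps the application function. The sketch below (the names are illustrative, not from the book) adds request timing around the time service, and shows that the whole thing can be exercised by hand, without any web server:

```python
import json
import time

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    return [bytes(json.dumps({'time': time.time()}), 'utf8')]

def timing_middleware(app):
    # Hooks before and after the wrapped WSGI application,
    # storing the elapsed time in the WSGI environ.
    def wrapper(environ, start_response):
        start = time.time()
        result = app(environ, start_response)
        environ['x-elapsed'] = time.time() - start
        return result
    return wrapper

application = timing_middleware(application)

# Calling the WSGI callable directly, as a server would
def start_response(status, headers):
    print(status)

environ = {'PATH_INFO': '/'}
body = b''.join(application(environ, start_response))
print(json.loads(body.decode('utf8')))
print('elapsed: %.6fs' % environ['x-elapsed'])
```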
But you can't run thousands of them, and as soon as the pool is exhausted, the next request will block, even if your microservice is doing nothing but idling and waiting for backend services' responses. That's one of the reasons why non-WSGI frameworks like Twisted and Tornado, and in JavaScript land Node.js, became very successful: they are fully async. When you're coding a Twisted application, you can use callbacks to pause and resume the work done to build a response. That means you can accept new requests and start to treat them. That model dramatically reduces the idling time in your process. It can serve thousands of concurrent requests. Of course, that does not mean the application will return each single response faster. It just means one process can accept more concurrent requests and juggle between them as the data is getting ready to be sent back. There's no simple way with the WSGI standard to introduce something similar, and the community has debated for years to come up with a consensus, and failed. The odds are that the community will eventually drop the WSGI standard for something else. In the meantime, building microservices with synchronous frameworks is still possible and completely fine, if your deployments take into account the one request == one thread limitation of the WSGI standard. There is, however, one trick to boost synchronous web applications: greenlets.

Greenlet & Gevent

The general principle of asynchronous programming is that the process deals with several concurrent execution contexts to simulate parallelism. Asynchronous applications use an event loop that pauses and resumes execution contexts when an event is triggered; only one context is active at a time, and they take turns. An explicit instruction in the code tells the event loop where it can pause the execution. When that occurs, the process will look for some other pending work to resume.
Eventually, the process will come back to your function and continue it where it stopped; moving from one execution context to another is called switching. The Greenlet project (https://github.com/python-greenlet/greenlet) is a package based on the Stackless project, a particular CPython implementation, and provides greenlets. Greenlets are pseudo-threads that are very cheap to instantiate, unlike real threads, and that can be used to call Python functions. Within those functions, you can switch and give back control to another function. The switching is done with an event loop, and allows you to write an asynchronous application using a thread-like interface paradigm. Here's an example adapted from the Greenlet documentation:

```python
from greenlet import greenlet

def test1(x, y):
    z = gr2.switch(x + y)
    print(z)

def test2(u):
    print(u)
    gr1.switch(42)

gr1 = greenlet(test1)
gr2 = greenlet(test2)
gr1.switch("hello", " world")
```

The two greenlets explicitly switch from one to the other. For building microservices based on the WSGI standard, if the underlying code used greenlets, we could accept several concurrent requests and just switch from one to another when we know a call is going to block the request, like performing a SQL query. However, switching from one greenlet to another has to be done explicitly, and the resulting code can quickly become messy and hard to understand. That's where Gevent becomes very useful. The Gevent project (http://www.gevent.org/) is built on top of Greenlet and offers, among other things, an implicit and automatic way of switching between greenlets. It provides a cooperative version of the socket module that uses greenlets to automatically pause and resume execution when some data is made available in the socket. There's even a monkey patch feature that automatically replaces the standard library socket with Gevent's version. That makes your standard synchronous code magically asynchronous every time it uses sockets, with just one extra line:
```python
from gevent import monkey; monkey.patch_all()

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    # ...do something with sockets here...
    return result
```

This implicit magic comes with a price, though. For Gevent to work well, all the underlying code needs to be compatible with the patching Gevent is doing. Some packages from the community will continue to block, or even have unexpected results, because of this; in particular, if they use C extensions that bypass some of the features of the standard library Gevent patched. But for most cases, it works well. Projects that play well with Gevent are dubbed "green", and when a library is not functioning well and the community asks its authors to "make it green", it usually happens. That's what was used to scale the Firefox Sync service at Mozilla, for instance.

Twisted and Tornado

If you are building microservices where increasing the number of concurrent requests you can hold is important, it's tempting to drop the WSGI standard and just use an asynchronous framework like Tornado (http://www.tornadoweb.org/) or Twisted (https://twistedmatrix.com/trac/). Twisted has been around for ages.
To implement the same microservice, you need to write slightly more verbose code:

```python
import json
import time

from twisted.web import server, resource
from twisted.internet import reactor, endpoints

class Simple(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        request.responseHeaders.addRawHeader(b"content-type",
                                             b"application/json")
        return bytes(json.dumps({'time': time.time()}), 'utf8')

site = server.Site(Simple())
endpoint = endpoints.TCP4ServerEndpoint(reactor, 8080)
endpoint.listen(site)
reactor.run()
```

While Twisted is an extremely robust and efficient framework, it suffers from a few problems when building HTTP microservices:

- You need to implement each endpoint in your microservice with a class derived from a Resource class, which implements each supported method. For a few simple APIs, that adds a lot of boilerplate code.
- Twisted code can be hard to understand and debug due to its asynchronous nature.
- It's easy to fall into callback hell when you chain too many functions that get triggered successively one after the other, and the code can get messy.
- Properly testing your Twisted application is hard, and you have to use a Twisted-specific unit testing model.

Tornado is based on a similar model, but does a better job in some areas. It has a lighter routing system and does everything possible to make the code closer to plain Python. Tornado also uses a callback model, so debugging can be hard. But both frameworks are working hard at bridging the gap to rely on the new async features introduced in Python 3.

asyncio

When Guido van Rossum started to work on adding async features in Python 3, part of the community pushed for a Gevent-like solution, because it made a lot of sense to write applications in a synchronous, sequential fashion rather than having to add explicit callbacks like in Tornado or Twisted. But Guido picked the explicit technique and experimented in a project called Tulip, which Twisted inspired.
Eventually, asyncio was born out of that side project and added into Python. In hindsight, implementing an explicit event loop mechanism in Python instead of going the Gevent way makes a lot of sense. The way the Python core developers coded asyncio, and how they elegantly extended the language with the async and await keywords to implement coroutines, made asynchronous applications built with vanilla Python 3.5+ code look very elegant and close to synchronous programming. By doing this, Python did a great job of avoiding the callback syntax mess we sometimes see in Node.js or Twisted (Python 2) applications. And beyond coroutines, Python 3 has introduced a full set of features and helpers in the asyncio package to build asynchronous applications; see https://docs.python.org/3/library/asyncio.html. Python is now as expressive as languages like Lua for creating coroutine-based applications, and there are now a few emerging frameworks that have embraced those features and will only work with Python 3.5+ to benefit from them. KeepSafe's aiohttp (http://aiohttp.readthedocs.io) is one of them, and building the same microservice with it, fully asynchronous, would simply need these few elegant lines:

```python
from aiohttp import web
import time

async def handle(request):
    return web.json_response({'time': time.time()})

if __name__ == '__main__':
    app = web.Application()
    app.router.add_get('/', handle)
    web.run_app(app)
```

In this small example, we're very close to how we would implement a synchronous app. The only hint that we're async is the async keyword, marking the handle function as a coroutine. And that's what's going to be used at every level of an async Python app going forward. Here's another example using aiopg, a PostgreSQL library for asyncio.
From the project documentation:

```python
import asyncio
import aiopg

dsn = 'dbname=aiopg user=aiopg password=passwd host=127.0.0.1'

async def go():
    pool = await aiopg.create_pool(dsn)
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")
            ret = []
            async for row in cur:
                ret.append(row)
            assert ret == [(1,)]

loop = asyncio.get_event_loop()
loop.run_until_complete(go())
```

With a few async and await prefixes, the function that performs a SQL query and sends back the result looks a lot like a synchronous function. But asynchronous frameworks and libraries based on Python 3 are still emerging, and if you are using asyncio or a framework like aiohttp, you will need to stick with particular asynchronous implementations for each feature you need. If you need to use a library that is not asynchronous in your code, using it from your asynchronous code means you will need to go through some extra and challenging work if you want to prevent blocking the event loop. If your microservices deal with a limited number of resources, it could be manageable. But it's probably a safer bet at this point (2017) to stick with a synchronous framework that's been around for a while, rather than an asynchronous one. Let's enjoy the existing ecosystem of mature packages, and wait until the asyncio ecosystem gets more sophisticated. And there are many great synchronous frameworks to build microservices with Python, like Bottle, Pyramid with Cornice, or Flask.

Language performances

In the previous sections, we've been through the two different ways to write microservices, asynchronous and synchronous, and whatever technique you use, the speed of Python directly impacts the performance of your microservice. Of course, everyone knows Python is slower than Java or Go, but execution speed is not always the top priority.
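One way to get a feel for the cost of interpreted bytecode is the standard library's timeit module. This micro-benchmark (illustrative only, and timings vary per machine) compares a plain Python loop with the equivalent builtin, which runs as C code inside the interpreter:

```python
import timeit

# An interpreted loop: each iteration goes through the
# bytecode evaluation loop.
py_loop = timeit.timeit(
    'total = 0\n'
    'for i in range(1000):\n'
    '    total += i',
    number=2000)

# The same work done by a builtin implemented in C.
builtin = timeit.timeit('sum(range(1000))', number=2000)

print('interpreted loop: %.4fs' % py_loop)
print('builtin sum:      %.4fs' % builtin)
```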
A microservice is often a thin layer of code that spends most of its life waiting for network responses from other services. Its core speed is usually less important than how fast your SQL queries return from your Postgres server, because the latter will represent most of the time spent building the response. But wanting an application that's as fast as possible is legitimate. One controversial topic in the Python community around speeding up the language is how the Global Interpreter Lock (GIL) mutex can ruin performance, because multi-threaded applications cannot fully use several cores. The GIL has good reasons to exist. It protects non-thread-safe parts of the CPython interpreter, and it exists in other languages like Ruby. And all attempts to remove it so far have failed to produce a faster CPython implementation. Larry Hastings is working on a GIL-free CPython project called Gilectomy (https://github.com/larryhastings/gilectomy); its minimal goal is to come up with a GIL-free implementation that can run a single-threaded application as fast as CPython. As of today (2017), this implementation is still slower than CPython. But it's interesting to follow this work and see if it reaches speed parity one day. That would make a GIL-free CPython very appealing. For microservices, besides preventing the usage of multiple cores in the same process, the GIL will slightly degrade performance under high load, because of the system call overhead introduced by the mutex. However, all the scrutiny around the GIL has had one beneficial impact: work has been done in the past years to reduce its contention in the interpreter, and in some areas, Python's performance has improved a lot. Bear in mind that even if the core team removes the GIL, Python is an interpreted language, and the produced code will never be very efficient at execution time. Python provides the dis module if you are interested in seeing how the interpreter decomposes a function.
In the example below, the interpreter decomposes a simple function that yields incremented values from a sequence in no fewer than 29 steps!

```python
>>> def myfunc(data):
...     for value in data:
...         yield value + 1
...
>>> import dis
>>> dis.dis(myfunc)
  2           0 SETUP_LOOP              23 (to 26)
              3 LOAD_FAST                0 (data)
              6 GET_ITER
        >>    7 FOR_ITER                15 (to 25)
             10 STORE_FAST               1 (value)

  3          13 LOAD_FAST                1 (value)
             16 LOAD_CONST               1 (1)
             19 BINARY_ADD
             20 YIELD_VALUE
             21 POP_TOP
             22 JUMP_ABSOLUTE            7
        >>   25 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             29 RETURN_VALUE
```

A similar function written in a statically compiled language would dramatically reduce the number of operations required to produce the same result. There are ways to speed up Python execution, though. One is to write part of your code as compiled code, by building C extensions or using a static extension of the language like Cython (http://cython.org/), but that makes your code more complicated. Another solution, which is the most promising one, is simply running your application with the PyPy interpreter (http://pypy.org/). PyPy implements a Just-In-Time (JIT) compiler. This compiler replaces, at run time, pieces of Python with machine code that can be used directly by the CPU. The whole trick for the JIT is to detect in real time, ahead of the execution, when and how to do it. Even if PyPy is always a few Python versions behind CPython, it has reached a point where you can use it in production, and its performance can be quite amazing. In one of our projects at Mozilla that needs fast execution, the PyPy version was almost as fast as the Go version, and we decided to use Python there instead. The PyPy Speed Center website is a great place to look at how PyPy compares to CPython: http://speed.pypy.org/. However, if your program uses C extensions, you will need to recompile them for PyPy, and that can be a problem, in particular if other developers maintain some of the extensions you are using.
But if you are building your microservice with a standard set of libraries, the chances are that it will work out of the box with the PyPy interpreter, so it's worth a try. In any case, for most projects, the benefits of Python and its ecosystem largely surpass the performance issues described in this section, because the overhead in a microservice is rarely a problem.

Summary

In this article, we saw that Python is considered to be one of the best languages to write web applications, and therefore microservices; for the same reasons, it's a language of choice in other areas, and also because it provides tons of mature frameworks and packages to do the work.

Resources for Article:

Further resources on this subject:

- Inbuilt Data Types in Python [article]
- Getting Started with Python Packages [article]
- Layout Management for Python GUI [article]