
How-To Tutorials

Getting started with Q-learning using TensorFlow

Savia Lobo
14 Mar 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box] In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning. Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter. The Q-Learning algorithm is as follows: initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes Repeat  (for  each  episode) observe  current  state  s Repeat select  an  action  a  (apply  explore  or  exploit  strategy) observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached Until  a  number  of  episodes  or  a  condition  is  reached (such  as  max  consecutive  wins) Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update: This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state. Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action. Initializing and discretizing for Q-Learning The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize. If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy. 
We use the following helper functions to discretize the state-space of the pole-cart environment:

```python
# discretize a continuous value into one of n_states bins
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) /
                                  (bounds[1] - bounds[0]))))
    return discrete_val

def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    return np.array(discrete_vals, dtype=np.int)
```

We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:

```python
env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10, 10, 10, 10])  # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low,
                             env.observation_space.high)))
# the velocity and angular velocity bounds are
# too high, so we bound them between -1 and +1
s_bounds[1] = (-1.0, 1.0)
s_bounds[3] = (-1.0, 1.0)
```

Q-Learning with Q-Table

Since our discretized space has the dimensions [10,10,10,10], our Q-Table has the dimensions [10,10,10,10,2]:

```python
# create a Q-Table of shape (10,10,10,10,2) representing S X A -> R
q_table = np.zeros(shape=np.append(n_s, n_a))
```

We define a Q-Table policy that exploits or explores based on the explore_rate:

```python
def policy_q_table(state, env):
    # Exploration strategy - select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - select the action with the highest Q-value
    else:
        action = np.argmax(q_table[tuple(state)])
    return action
```

Define the episode() function that runs a single episode as follows:

1. Start by initializing the variables and the first state:

```python
obs = env.reset()
state_prev = discretize_state(obs, s_bounds, n_s)

episode_reward = 0
done = False
t = 0
```

2. Select the action and observe the next state:

```python
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs, s_bounds, n_s)
```

3. Update the Q-Table:

```python
best_q = np.amax(q_table[tuple(state_new)])
bellman_q = reward + discount_rate * best_q
indices = tuple(np.append(state_prev, action))
q_table[indices] += learning_rate * (bellman_q - q_table[indices])
```

4. Set the next state as the previous state and add the reward to the episode's rewards:

```python
state_prev = state_new
episode_reward += reward
```

The experiment() function calls the episode() function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:

```python
# collect observations and rewards for each episode
def experiment(env, policy, n_episodes, r_max=0, t_max=0):
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        val = episode(env, policy, r_max, t_max)
        rewards[i] = val
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```
Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

```python
learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 1000
experiment(env, policy_q_table, n_episodes)
```

For 1000 episodes, the Q-Table-based policy's maximum reward is 180 with our simple implementation:

```
Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592
```

Our implementation of the algorithm is deliberately simple. However, you can modify the code to set the explore rate high initially and then decay it as the time-steps pass. Similarly, you can implement decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes, as our Q-function learns faster.

Q-Learning with Q-Network or Deep Q-Network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that learns to respond with the optimal action as we train it continuously on the explored states and their Q-values. Thus, for training the network, we need a place to store the game memory:

1. Implement the game memory using a deque of size 1000:

```python
memory = deque(maxlen=1000)
```

2. Next, build a simple neural network model with one hidden layer, q_nn:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.summary()

q_nn = model
```

The Q-Network looks like this:

```
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
```

The episode() function that executes one episode of the game incorporates the following changes for the Q-Network-based algorithm:

1. After generating the next state, add the states, action, and rewards to the game memory:

```python
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs, s_bounds, n_s)

# add the state_prev, action, reward, state_next, done to memory
memory.append([state_prev, action, reward, state_next, done])
```

2. Generate and update the q_values with the maximum future rewards using the Bellman function:

```python
states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])

q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)

for i in range(len(memory)):
    state_prev, action, reward, state_next, done = memory[i]
    if done:
        q_values[i, action] = reward
    else:
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i, action] = bellman_q
```

3. Train the q_nn with the states and the q_values we received from memory:

```python
q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)
```

The process of saving gameplay in memory and using it to train the model is also known as memory replay in the deep reinforcement learning literature.
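The excerpt above never shows the policy function that the next step passes to experiment(). A minimal epsilon-greedy policy over the network's predictions might look like the following sketch; the name policy_q_nn matches the call below, but the body is our assumption rather than the book's exact code:

```python
# a sketch of an epsilon-greedy policy over the Q-Network (our assumption,
# not the book's original implementation)
def policy_q_nn(state, env):
    # explore: pick a random action with probability explore_rate
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # exploit: pick the action with the highest predicted Q-value
    else:
        action = np.argmax(q_nn.predict(np.array([state]))[0])
    return action
```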
Let us run our DQN-based gameplay as follows:

```python
learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100
experiment(env, policy_q_nn, n_episodes)
```

We get a maximum reward of 150, which you can improve upon with hyperparameter tuning, network tuning, and by using rate decay for the discount and explore rates:

```
Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27
```

To summarize, we calculated and trained the model at every step. One could change the code to discard the memory replay and retrain the model only on the episodes that return smaller rewards. However, implement this option with caution, as it may slow down your learning, since initial gameplay generates smaller rewards more often.

Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.
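As a closing aside, the rate decay suggested above can be sketched as a simple exponential schedule; the decay factor and floor below are our own assumptions to tune per game:

```python
# a sketch of per-episode explore-rate decay (hypothetical constants)
explore_rate_max = 1.0
explore_rate_min = 0.01
explore_decay = 0.995

explore_rate = explore_rate_max
for i_episode in range(n_episodes):
    # an episode would run here, using the current explore_rate
    explore_rate = max(explore_rate_min, explore_rate * explore_decay)
```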

Feature Improvement: Identifying missing values using EDA (Exploratory Data Analysis) technique

Pravin Dhandre
13 Mar 2018
9 min read
Today, we will work towards developing a better sense of data by identifying missing values in a dataset using the Exploratory Data Analysis (EDA) technique and Python packages.

Identifying missing values in data

Our first method of identifying missing values is meant to give us a better understanding of how to work with real-world data. Often, data can have missing values for a variety of reasons; for example, with survey data, some observations may not have been recorded. It is important for us to analyze our data and get a sense of what the missing values are, so we can decide how we want to handle them for our machine learning.

To start, let's dive into a dataset: the Pima Indian Diabetes Prediction dataset. This dataset is available on the UCI Machine Learning Repository at: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

From the main website, we can learn a few things about this publicly available dataset. We have nine columns and 768 instances (rows). The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21, given medical details about their bodies. The dataset is meant to correspond to a binary (2-class) classification machine learning problem, namely, the answer to the question: will this person develop diabetes within five years?

The column names are provided as follows (in order):

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skinfold thickness (mm)
- 2-hour serum insulin measurement (mu U/ml)
- Body mass index (weight in kg / (height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (zero or one)

The goal of the dataset is to predict the final column, the class variable, which indicates whether the patient developed diabetes, using the other eight features as inputs to a machine learning function.

There are two very important reasons we will be working with this dataset:

- We will have to work with missing values
- All of the features we will be working with will be quantitative

The first point makes more sense for now as a reason, because the point of this chapter is to deal with missing values. As far as only choosing to work with quantitative data goes, this will only be the case for this chapter. We do not yet have enough tools to deal with missing values in categorical columns. In the next chapter, when we talk about feature construction, we will deal with that procedure.

The exploratory data analysis (EDA)

To identify our missing values, we will begin with an EDA of our dataset. We will be using some useful Python packages, pandas and numpy, to store our data and make some simple calculations, as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports:

```python
# import packages we need for exploratory data analysis (EDA)
import pandas as pd               # to store tabular data
import numpy as np                # to do some math
import matplotlib.pyplot as plt   # a popular data visualization tool
import seaborn as sns             # another popular data visualization tool
%matplotlib inline
plt.style.use('fivethirtyeight')  # a popular data visualization theme
```

We will import our tabular data through a CSV, as follows:

```python
# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')
pima.head()
```

The head method allows us to see the first few rows in our dataset.
The output shows that something's not right here: there are no column names. The CSV must not have the column names built into the file. No matter, we can use the data source's website to fill this in, as shown in the following code:

```python
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration',
                     'diastolic_blood_pressure', 'triceps_thickness',
                     'serum_insulin', 'bmi', 'pedigree_function',
                     'age', 'onset_diabetes']

pima = pd.read_csv('../data/pima.data', names=pima_column_names)
pima.head()
```

Now, using the head method again, we can see our columns with the appropriate headers. Much better; now we can use the column names to do some basic stats, selecting, and visualizations. Let's first get our null accuracy, as follows:

```python
pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes
```

```
0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
```

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those who developed diabetes and those who did not. Our hope is that the histogram will reveal some sort of pattern, or an obvious difference in values, between the classes of prediction:

```python
# get a histogram of the plasma_glucose_concentration column for
# both classes
col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()
```

It seems that this histogram shows a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns, as follows:

```python
for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()
```

The preceding code produces three histograms. The first shows the distributions of bmi for the two class variables (non-diabetes and diabetes). The next histogram again shows contrastingly different distributions of a feature across our two class variables; this time we are looking at diastolic_blood_pressure. The final graph shows the plasma_glucose_concentration differences between our two class variables.

We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool seaborn, which we imported at the beginning of this chapter, for our correlation matrix, as follows:

```python
# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())
# plasma_glucose_concentration definitely seems to be an interesting feature here
```

The result is the correlation matrix of our dataset.
This shows us the correlation amongst the different columns in our Pima dataset. The correlation matrix shows a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code:

```python
pima.corr()['onset_diabetes']  # numerical correlation matrix
# plasma_glucose_concentration definitely seems to be an interesting feature here
```

```
times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64
```

We will explore the powers of correlation later, in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.

Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:

```python
pima.isnull().sum()
```

```
times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64
```

Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:

```python
pima.shape  # (# rows, # cols)
```

```
(768, 9)
```

This confirms that we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:

```python
pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes
```

```
0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
```

This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:

```python
pima.describe()  # get some basic descriptive statistics
```

This quickly shows us some basic stats such as the mean, standard deviation, and various percentile measurements of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible; there must be a reason for this. Perhaps the number zero has been encoded as a missing value instead of the None value or a missing cell. Upon closer inspection, we see that the value 0 appears as a minimum value for the following columns:

- times_pregnant
- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi
- onset_diabetes

Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:

- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi

So, we actually do have missing values! It was obviously not luck that we happened upon the zeros as missing values; we knew it beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets, in case it mentions any missing values.
If no documentation is available, some common values used instead of missing values are:

- 0 (for numerical values)
- unknown or Unknown (for categorical variables)
- ? (for categorical variables)

To summarize, we have five columns where missing values have been encoded with a sentinel symbol, in this case the number 0.

[box type="note" align="" class="" width=""]You just read an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. To learn more about missing values and manipulating features, do check out Feature Engineering Made Easy and develop expert proficiency in Feature Selection, Learning, and Optimization.[/box]
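As a practical follow-up to the zero-encoding discovery above (a sketch, not part of the excerpt): a common pandas idiom is to replace the sentinel zeros with NaN so that isnull() can count them. The column list comes from the analysis we just did:

```python
# convert sentinel zeros to proper NaN in the suspect columns
columns_with_zero_as_missing = [
    'plasma_glucose_concentration', 'diastolic_blood_pressure',
    'triceps_thickness', 'serum_insulin', 'bmi'
]
pima[columns_with_zero_as_missing] = (
    pima[columns_with_zero_as_missing].replace(0, np.nan)
)
pima.isnull().sum()  # the encoded zeros now show up as missing values
```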

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box] Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game. OpenAI Gym 101 OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 opensource contributed environments at the time of writing. With OpenAI, you can also create your own environment. The biggest advantage is that OpenAI provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms. Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540 You can install OpenAI Gym using the following command: pip3  install  gym Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/ gym#installation  Let us print the number of available environments in OpenAI Gym: all_env  =  list(gym.envs.registry.all()) print('Total  Environments  in  Gym  version  {}  :  {}' .format(gym.     version     ,len(all_env))) Total  Environments  in  Gym  version  0.9.4  :  777  Let us print the list of all environments: for  e  in  list(all_env): print(e) The partial list from the output is as follows: EnvSpec(Carnival-ramNoFrameskip-v0) EnvSpec(EnduroDeterministic-v0) EnvSpec(FrostbiteNoFrameskip-v4) EnvSpec(Taxi-v2) EnvSpec(Pooyan-ram-v0) EnvSpec(Solaris-ram-v4) EnvSpec(Breakout-ramDeterministic-v0) EnvSpec(Kangaroo-ram-v4) EnvSpec(StarGunner-ram-v4) EnvSpec(Enduro-ramNoFrameskip-v4) EnvSpec(DemonAttack-ramDeterministic-v0) EnvSpec(TimePilot-ramNoFrameskip-v0) EnvSpec(Amidar-v4) Each environment, represented by the env object, has a standardized interface, for example: An env object can be created with the env.make(<game-id-string>) function by passing the id string. Each env object contains the following main functions: The step() function takes an action object as an argument and returns four objects: observation: An object implemented by the environment, representing the observation of the environment. reward: A signed float value indicating the gain (or loss) from the previous action. done: A Boolean value representing if the scenario is finished. info: A Python dictionary object representing the diagnostic information. The render() function creates a visual representation of the environment. The reset() function resets the environment to the original state. Each env object comes with well-defined actions and observations, represented by action_space and observation_space. One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics. The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. 
The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of the observations is not necessary to learn to maximize the rewards of the game.

Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:

1. Create the env object with the standard make function:

```python
env = gym.make('CartPole-v0')
```

2. The number of episodes is the number of game plays. We shall set it to one for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep:

```python
n_episodes = 1
env_vis = []
```

3. Run two nested loops: an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render():

```python
for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t + 1))
            break
```

4. Render the environment using the helper function:

```python
env_render(env_vis)
```

5. The code for the helper function is as follows:

```python
# relies on matplotlib.animation (imported as anm) and IPython display
# helpers not shown in this excerpt
def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')

    def animate(i):
        plot.set_data(env_vis[i])

    anim = anm.FuncAnimation(plt.gcf(),
                             animate,
                             frames=len(env_vis),
                             interval=20,
                             repeat=True,
                             repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))
```

We get the following output when we run this example:

```
[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22
```

It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action stochastically by using env.action_space.sample().
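A small aside, not from the book: if you want repeatable runs while experimenting, environments in Gym versions of this era can be seeded; the exact mechanics depend on your Gym version:

```python
# seeding for reproducibility (behavior varies by Gym version)
env = gym.make('CartPole-v0')
env.seed(42)        # fixes the environment's internal randomness
np.random.seed(42)  # older Gym versions sample actions via numpy's global RNG
obs = env.reset()
```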
Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for keeping the pole straight for a longer number of time-steps that you can use, such as Hill Climbing, Random Search, and Policy Gradients.

Note: Some of the algorithms for solving the Cartpole game are available at the following links:
https://openai.com/requests-for-research/#cartpole
http://kvfrans.com/simple-algoritms-for-solving-cartpole/
https://github.com/kvfrans/openai-cartpole

Applying simple policies to a cartpole game

So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, the pole is tilting right, so we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:

We define two policy functions as follows:

```python
def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()
```

Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop, as we do not wish to run the experiment forever:

```python
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```

We run the experiment for 100 episodes, capping each episode's reward at rewards_max, which is set to 10,000:

```python
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that the logically selected actions do better than the randomly selected ones, but not by much:

```
Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81
```

Now let us modify the process of selecting the action further, to be based on parameters. The parameters will be multiplied by the observations, and the action will be chosen based on whether the multiplication result is zero or one. Let us modify the random search method so that we initialize the parameters randomly. The code looks as follows:

```python
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward
```
```python
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that random search does improve the results:

```
Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03
```

With random search, we have improved our results to get the maximum reward of 200. On average, the rewards for random search are lower, because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first:

```python
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```
We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

```python
n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
```

We find that the trained parameters give us the best result of 200:

```
trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94
```

We may optimize the training code to continue training until we reach a maximum reward. To summarize, we learnt the basics of OpenAI Gym and applied them to a cartpole game for relevant output.

If you found this post useful, do check out the book Mastering TensorFlow 1.x to build, scale, and deploy deep neural network models using star libraries in Python.
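The "continue training until we reach a maximum reward" idea from the summary could be sketched like this, reusing the train() function above; the batch size and target are our own assumptions:

```python
# a sketch: train in small batches of episodes until the best observed
# reward reaches CartPole-v0's cap of 200
target_reward = 200
reward_best, theta_best = 0, None
while reward_best < target_reward:
    reward, theta = train(policy_random, 10, 10000)
    if reward > reward_best:
        reward_best, theta_best = reward, theta
print('best theta: {}, best reward: {}'.format(theta_best, reward_best))
```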

How to perform data partitioning in PostgreSQL 10

Sugandha Lahoti
09 Mar 2018
11 min read
Partitioning refers to splitting: logically, it means breaking one large table into smaller physical pieces. PostgreSQL supports basic table partitioning. It can store up to 32 TB of data inside a single table using the default 8k blocks. In fact, if we compile PostgreSQL with 32k blocks, we can even put up to 128 TB into a single table. However, large tables like these are not necessarily convenient, and it makes sense to partition tables to make processing easier and, in some cases, a bit faster. With PostgreSQL 10.0, partitioning data has improved and offers significantly easier handling to end users.

In this article, we will cover both the classic way to partition data and the new features available in PostgreSQL 10.0 for data partitioning.

Creating partitions

First, we will learn the old method to partition data. Before digging deeper into the advantages of partitioning, I want to show how partitions can be created. The entire thing starts with a parent table:

```sql
test=# CREATE TABLE t_data (id serial, t date, payload text);
CREATE TABLE
```

In this example, the parent table has three columns. The date column will be used for partitioning, but more on that a bit later. Now that the parent table is in place, the child tables can be created. This is how it works:

```sql
test=# CREATE TABLE t_data_2016 () INHERITS (t_data);
CREATE TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   | Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
Inherits: t_data
```

The table is called t_data_2016 and inherits from t_data. The () means that no extra columns are added to the child table. As you can see, inheritance means that all columns from the parent are available in the child table. Also note that the id column will inherit the sequence from the parent, so that all children can share the very same numbering.

Let's create more tables:

```sql
test=# CREATE TABLE t_data_2015 () INHERITS (t_data);
CREATE TABLE
test=# CREATE TABLE t_data_2014 () INHERITS (t_data);
CREATE TABLE
```

So far, all the tables are identical and just inherit from the parent. However, there is more: child tables can actually have more columns than their parents. Adding more fields is simple:

```sql
test=# CREATE TABLE t_data_2013 (special text) INHERITS (t_data);
CREATE TABLE
```

In this case, a special column has been added. It has no impact on the parent; it just enriches the children and makes them capable of holding more data.

After creating a handful of tables, a row can be added:

```sql
test=# INSERT INTO t_data_2015 (t, payload) VALUES ('2015-05-04', 'some data');
INSERT 0 1
```

The most important thing now is that the parent table can be used to find all the data in the child tables:

```sql
test=# SELECT * FROM t_data;
 id |     t      |  payload
----+------------+-----------
  1 | 2015-05-04 | some data
(1 row)
```

Querying the parent allows you to gain access to everything below the parent in a simple and efficient manner.
To understand how PostgreSQL does partitioning, it makes sense to take a look at the plan:

```sql
test=# EXPLAIN SELECT * FROM t_data;
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..84.10 rows=4411 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
   -> Seq Scan on t_data_2016 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2015 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2014 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2013 (cost=0.00..18.10 rows=810 width=40)
(6 rows)
```

Actually, the process is quite simple. PostgreSQL will simply unify all tables and show us all the content from all the tables inside and below the partition we are looking at. Note that all tables are independent and are just connected logically through the system catalog.

Applying table constraints

What happens if filters are applied?

```sql
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..95.12 rows=23 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2015 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2014 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2013 (cost=0.00..20.12 rows=4 width=40)
        Filter: (t = '2016-01-04'::date)
(11 rows)
```

PostgreSQL will apply the filter to all the partitions in the structure. It does not know that the table name is somehow related to the content of the tables. To the database, names are just names and have nothing to do with what you are looking for. This makes sense, of course, as there is no mathematical justification for doing anything else.

The point now is: how can we teach the database that the 2016 table only contains 2016 data, the 2015 table only contains 2015 data, and so on? Table constraints are here to do exactly that. They teach PostgreSQL about the content of those tables and therefore allow the planner to make smarter decisions than before. The feature is called constraint exclusion and helps to dramatically speed up queries in many cases.

The following listing shows how table constraints can be created:

```sql
test=# ALTER TABLE t_data_2013 ADD CHECK (t < '2014-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2014 ADD CHECK (t >= '2014-01-01' AND t < '2015-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2015 ADD CHECK (t >= '2015-01-01' AND t < '2016-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2016 ADD CHECK (t >= '2016-01-01' AND t < '2017-01-01');
ALTER TABLE
```

For each table, a CHECK constraint can be added. PostgreSQL will only create the constraint if all the data in those tables is perfectly correct and every single row satisfies the constraint. In contrast to MySQL, constraints in PostgreSQL are taken seriously and honored under any circumstances. In PostgreSQL, those constraints can overlap; this is not forbidden and can make sense in some cases. However, it is usually better to have non-overlapping constraints, because PostgreSQL then has the option to prune more tables.
Here is what happens after adding those table constraints:

```sql
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..25.00 rows=7 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
(5 rows)
```

The planner is able to remove many of the tables from the query and keep only those which potentially contain the data. The query can greatly benefit from a shorter and more efficient plan. In particular, if those tables are really large, removing them can boost speed considerably.

Modifying inherited structures

Once in a while, data structures have to be modified. The ALTER TABLE clause is here to do exactly that. The question is: how can partitioned tables be modified? Basically, all you have to do is tackle the parent table and add or remove columns. PostgreSQL will automatically propagate those changes to the child tables and ensure that the changes are made to all the relations, as follows:

```sql
test=# ALTER TABLE t_data ADD COLUMN x int;
ALTER TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   | Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
 x       | integer |
Check constraints:
    "t_data_2016_t_check" CHECK (t >= '2016-01-01'::date AND t < '2017-01-01'::date)
Inherits: t_data
```

As you can see, the column is added to the parent and automatically shows up in the child table. Note that this works for columns, and so on. Indexes are a totally different story. In an inherited structure, every table has to be indexed separately. If you add an index to the parent table, it will only be present on the parent; it won't be deployed on the child tables. Indexing all those columns in all those tables is your task, and PostgreSQL is not going to make those decisions for you. Of course, this can be seen as a feature or as a limitation. On the upside, you could say that PostgreSQL gives you all the flexibility to index things separately and therefore potentially more efficiently. However, people might also argue that deploying all those indexes one by one is a lot more work.

Moving tables in and out of partitioned structures

Suppose you have an inherited structure. Data is partitioned by date, and you want to provide the most recent years to the end user. At some point, you might want to remove some data from the scope of the user without actually touching it; you might want to put the data into some sort of archive, for example. PostgreSQL provides a simple means to achieve exactly that. First, a new parent can be created:

```sql
test=# CREATE TABLE t_history (LIKE t_data);
CREATE TABLE
```

The LIKE keyword allows you to create a table which has exactly the same layout as the t_data table. If you have forgotten which columns the t_data table actually has, this might come in handy, as it saves you a lot of work. It is also possible to include indexes, constraints, and defaults. Then, the table can be moved away from the old parent table and put below the new one. Here is how it works:

```sql
test=# ALTER TABLE t_data_2013 NO INHERIT t_data;
ALTER TABLE
test=# ALTER TABLE t_data_2013 INHERIT t_history;
ALTER TABLE
```

The entire process can of course be done in a single transaction to ensure that the operation stays atomic.
Cleaning up data

One advantage of partitioned tables is the ability to clean up data quickly. Suppose we want to delete an entire year. If the data is partitioned accordingly, a simple DROP TABLE clause can do the job:

```sql
test=# DROP TABLE t_data_2014;
DROP TABLE
```

As you can see, dropping a child table is easy. But what about the parent table? There are dependent objects, and therefore PostgreSQL naturally errors out to make sure that nothing unexpected happens:

```sql
test=# DROP TABLE t_data;
ERROR:  cannot drop table t_data because other objects depend on it
DETAIL: default for table t_data_2013 column id depends on sequence t_data_id_seq
        table t_data_2016 depends on table t_data
        table t_data_2015 depends on table t_data
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
```

The DROP TABLE clause warns us that there are dependent objects and refuses to drop the tables. The CASCADE clause is needed to force PostgreSQL to actually remove those objects, along with the parent table:

```sql
test=# DROP TABLE t_data CASCADE;
NOTICE: drop cascades to 3 other objects
DETAIL: drop cascades to default for table t_data_2013 column id
        drop cascades to table t_data_2016
        drop cascades to table t_data_2015
DROP TABLE
```

Understanding PostgreSQL 10.0 partitioning

For many years, the PostgreSQL community has been working on built-in partitioning. Finally, PostgreSQL 10.0 offers the first implementation of in-core partitioning, which will be covered in this chapter. For now, the partitioning functionality is still pretty basic. However, a lot of infrastructure for future improvements is already in place.

To show you how partitioning works, I have compiled a simple example featuring range partitioning:

```sql
CREATE TABLE data (
    payload integer
) PARTITION BY RANGE (payload);

CREATE TABLE negatives PARTITION OF data
    FOR VALUES FROM (MINVALUE) TO (0);
CREATE TABLE positives PARTITION OF data
    FOR VALUES FROM (0) TO (MAXVALUE);
```

In this example, one partition will hold all negative values while the other one will take care of the positive values. While creating the parent table, you simply specify how you want to partition the data. In PostgreSQL 10.0, there is range partitioning and list partitioning. Support for hash partitioning and the like might be available as soon as PostgreSQL 11.0.

Once the parent table has been created, it is time to create the partitions. To do that, the PARTITION OF clause has been added. At this point, there are still some limitations. The most important one is that a tuple (= a row) cannot move from one partition to another, for example:

```sql
UPDATE data SET payload = -10 WHERE payload = 5;
```

If there were rows satisfying this condition, PostgreSQL would simply error out and refuse to change the value. However, with a good design, it is a bad idea to change the partitioning key anyway. Also, keep in mind that you have to think about indexing each partition.

We learnt both the old way of partitioning data and the new partitioning features introduced in PostgreSQL 10.0.

[box type="note" align="" class="" width=""]You read an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig. To learn about query optimization, stored procedures, and other techniques in PostgreSQL 10.0, you may check out the book Mastering PostgreSQL 10.[/box]

Learning the Salesforce Analytics Query Language (SAQL)

Amey Varangaonkar
09 Mar 2018
6 min read
Salesforce Einstein offers its own query language for retrieving your data from various sources, called the Salesforce Analytics Query Language (SAQL). The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use the Salesforce Analytics Query Language effectively.

Using SAQL

There are three ways to use SAQL in Einstein Analytics:

- Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this chapter, we will use this method for SAQL.
- Analytics REST API: Using this API, the user can access the datasets, lenses, dashboards, and so on. This is a programmatic approach in which you send queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating the user without asking them for credentials. The first step in using the Analytics REST API to access Analytics is to authenticate the user using OAuth 2.0.
- Using Dashboard JSON: We can use SAQL while editing the Dashboard JSON. We have already seen the Dashboard JSON in previous chapters. To access the Dashboard JSON, you can open the dashboard in edit mode and press Ctrl + E.

The simplest way of using SAQL is while creating a step or lens; a user can switch between the modes here. To use SAQL for a lens, perform the following steps:

1. Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot:

In SAQL, a query is made up of multiple statements. In the first statement, the query loads the input data from the dataset; subsequent statements operate on it and finally return the result. The user can use the Run Query button to see the results and errors after changing or adding statements. Errors appear at the bottom of the Query editor.

SAQL is made up of statements that take the input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain ordering rules that need to be followed while creating these statements, and those rules are as follows:

- There can be only one offset in the foreach statement
- The limit statement must come after offset
- The offset statement must come after filter and order
- The order and filter statements can be swapped, as there is no rule governing their relative order

In SAQL, we can perform all the usual mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators.

Using foreach in SAQL

The foreach statement applies a set of expressions to every row, which is called projection. The foreach statement is mandatory for getting the output of a query. The following is the syntax of the foreach statement:

```
q = foreach q generate expression as 'expression name';
```

Let's look at one example of using the foreach statement:

1. Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner.
3. In the Query editor, you will see the following code:

```
q = load "opportunity";
q = group q by all;
q = foreach q generate count() as 'count';
q = limit q 2000;
```

You can see the result of this query just below the Query editor.

4. Now replace the third statement with the following statement:

```
q = foreach q generate sum('Amount') as 'Sum Amount';
```

5. Click on the Run Query button and observe the result.

Using grouping in SAQL

The user can group records with the same value into one group by using group statements. Use the following syntax:

```
q = group rows by fieldName
```

Let's see how to use grouping in SAQL by performing the following steps:

1. Replace the second and third statements with the following statements:

```
q = group q by 'StageName';
q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount';
```

2. Click on the Run Query button and view the result.

Using filters in SAQL

Filters in SAQL behave just like the where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, a filter selects the rows from the dataset that satisfy the condition added. The syntax for a filter is as follows:

```
q = filter q by fieldName 'Operator' value
```

Click on Run Query and view the result.

Using functions in SAQL

The beauty of a function is its reusability. Once a function is created, it can be used multiple times. In SAQL, we can use different types of functions, such as string functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and can be reused as often as needed. Let's use the math function power. The syntax is power(m, n); the function returns the value of m raised to the nth power. Replace the fourth statement with the following statement:

```
q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount';
```

Click on the Run Query button.

We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome.

[box type="note" align="" class="" width=""]The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.[/box]
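For completeness, here is a rough Python sketch of the Analytics REST API route mentioned at the start of this article. The /wave/query endpoint, API version, and payload shape are assumptions to verify against the official Analytics REST API documentation; note also that via the REST API the load statement typically references a dataset ID and version rather than a plain name:

```python
# a hedged sketch: running a SAQL query through the Analytics REST API;
# endpoint path, API version, and dataset reference are assumptions
import requests

instance_url = 'https://yourInstance.salesforce.com'  # hypothetical org URL
access_token = '<OAUTH2_ACCESS_TOKEN>'                # obtained via OAuth 2.0

saql = ('q = load "<datasetId>/<versionId>"; '
        "q = group q by 'StageName'; "
        "q = foreach q generate 'StageName' as 'StageName', "
        "sum('Amount') as 'Sum Amount';")

response = requests.post(
    instance_url + '/services/data/v42.0/wave/query',
    headers={'Authorization': 'Bearer ' + access_token,
             'Content-Type': 'application/json'},
    json={'query': saql},
)
print(response.json())
```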

Spam Filtering - Natural Language Processing Approach
Packt
08 Mar 2018
16 min read

In this article by Jalaj Thanaki, the author of the book Python Natural Language Processing, we discuss how to develop a natural language processing (NLP) application. In this article, we will be developing a spam filter. In order to develop the spam filter we will be using a supervised machine learning (ML) algorithm named logistic regression. You can also use decision trees, Naive Bayes, or support vector machines (SVM). To make this happen, the following steps will be covered: Understand the logistic regression ML algorithm Data collection and exploration Split the dataset into a training dataset and a testing dataset Understanding the logistic regression ML algorithm Let's understand the logistic regression algorithm first. For this classification algorithm, I will give you an intuition for how logistic regression works, and we will see some basic mathematics related to it. Then we will see the spam filtering application. First, we consider binary classes like spam or not-spam, good or bad, win or lose, 0 or 1, and so on, for understanding the algorithm and its application. Suppose I want to classify emails into spam and non-spam (ham) categories, so spam and non-spam are the discrete output labels, or target concepts, here. Our goal is to predict whether a new email is spam or not-spam. Not-spam is also known as ham. In order to build this NLP application we are going to use logistic regression. Let's step back for a while and understand the technicalities of the algorithm first. Here I'm stating the facts related to the mathematics of this algorithm in a very simple manner, so everyone can understand the logic. The general approach for understanding this algorithm is as follows. If you know some part of ML then you can connect the dots, and if you are new to ML then don't worry, because we are going to understand every part, described as follows: We define our hypothesis function, which helps us generate our target output or target concept We define the cost function, or error function, and we choose the error function in such a way that we can derive its partial derivative easily, so we can calculate gradient descent easily Over time we try to minimize the error, so we can generate more accurate labels and classify data accurately In statistics, logistic regression is also called logit regression or the logit model. This algorithm is mostly used as a binary class classifier, which means there should be two different classes into which you want to classify the data. The binary logistic model is used to estimate the probability of a binary response, and it generates the response based on one or more predictor or independent variables, or features. By the way, this algorithm's basic mathematical concepts are used in deep learning (DL) as well. First I want to explain why this algorithm is called logistic regression. The reason is that the algorithm uses the logistic function, or sigmoid function, and that is why it is called logistic regression. Logistic function and sigmoid function are synonyms of each other. We use the sigmoid function as the hypothesis function, and this function belongs to the hypothesis class. Now, what do we mean by a hypothesis function? Well, as we have seen earlier, a machine has to learn the mapping between data attributes and given labels in such a way that it can predict the label for new data. This can be achieved by the machine if it learns this mapping using a mathematical function.
So the mathematical function is called the hypothesis function, which the machine will use to classify the data and predict the labels or target concept. Here, as I said, we want to build a binary classifier, so our label is either spam or ham. Mathematically, I can assign 0 for ham or not-spam and 1 for spam, or vice versa, as per your choice. These mathematically assigned labels are our dependent variables. Now we need our output labels to be either zero or one. Mathematically, we can say that the label is y and y ∈ {0, 1}. So we need to choose the kind of hypothesis function that converts our output value to either zero or one, and the logistic function, or sigmoid function, does exactly that; this is the main reason why logistic regression uses the sigmoid function as the hypothesis function. Logistic or Sigmoid Function Let me provide you with the mathematical equation for the logistic or sigmoid function (Figure 1: Logistic or sigmoid function): g(z) = 1 / (1 + e^(-z)) You can see the plot showing g(z). Here, g(z) = Φ(z). Refer to Figure 2 (Graph of sigmoid or logistic function). From the graph you can see the following facts: If the value of z is greater than or equal to zero, then the logistic function gives an output value of at least 0.5, which we map to the label one. If the value of z is less than zero, then the logistic function generates an output below 0.5, which we map to the label zero. This thresholding condition for the logistic function is summarized in Figure 3 (Logistic function mathematical property). Because of the preceding mathematical property, we can use this function to perform binary classification. Now it's time to show how this sigmoid function is represented as the hypothesis function. Refer to Figure 4 (Hypothesis function for logistic regression). If we take the preceding equation and substitute the value of z with θTx, then the equation given in Figure 1 gets converted to the following (Figure 5: Actual hypothesis function after mathematical manipulation): hθ(x) = 1 / (1 + e^(-θTx)) Here hθ(x) is the hypothesis function, θT is the transpose of the matrix of parameters for the features or independent variables, and x stands for all independent variables, or all possible feature sets. In order to generate the hypothesis equation, we replace the z value of the logistic function with θTx. By using the hypothesis equation the machine actually tries to learn the mapping between input variables, or input features, and output labels. Let's talk a bit about the interpretation of this hypothesis function. For logistic regression, can you think of the best way to predict the class label? We can predict the target class label by using the concept of probability. We need to generate the probability for both classes, and whichever class has the higher probability is assigned to that particular instance of features. So in binary classification, the value of y, the target class, is either zero or one. If you are familiar with probability, you can represent the probability equation as given in Figure 6 (Interpretation of hypothesis function using probabilistic representation). For those not familiar with probability, P(y=1|x;θ) can be read like this: the probability of y = 1, given x, parameterized by θ. In simple language, the hypothesis function generates the probability value for target output 1, where we give the feature matrix x and some parameter θ. This is an intuitive concept, so for a while, you can keep all of this in mind.
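To make the hypothesis function concrete, here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis. The parameter values and feature vector below are made-up illustrations, not from the book:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x): the probability that y = 1 given x
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -0.25])  # hypothetical learned parameters
x = np.array([1.0, 2.0])        # a feature vector (first entry is the bias term)
p = hypothesis(theta, x)
print(p, "-> spam" if p >= 0.5 else "-> ham")

Note how the 0.5 threshold implements exactly the z >= 0 property discussed above.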
Later on, I will give you the reason why we need to generate probabilities, as well as show how we can generate probability values for each of the classes. Here we complete the first step of the general approach to understanding logistic regression. Cost or Error function for logistic regression First, let's understand what a cost function, or error function, is. Cost function, loss function, and error function are all the same thing. In ML it is a very important concept, so here we look at the definition of the cost function and the purpose of defining it. The cost function is the function we use to check how accurately our ML classifier performs. In our training dataset we have data and we have labels. When we use the hypothesis function and generate the output, we need to check how near we are to the actual prediction. If we predict the actual output label, then the difference between our hypothesis function output and the actual label is zero or minimal, and if our hypothesis function output and the actual label are not the same, then we have a big difference between them. Suppose the actual label of an email is spam, which is 1; if our hypothesis function also generates the result 1, then the difference between the actual target value and the predicted output value is zero, and therefore the error in prediction is also zero. If our predicted output is 1 and the actual output is zero, then we have the maximum error between our actual target concept and the prediction. So it is important for us to have minimum error in our prediction. This is the very basic concept of the error function. We will get into the mathematics in a few minutes. There are several types of error functions available, such as r-squared error, sum of squared error, and so on. As per the ML algorithm and the hypothesis function, the error function also changes. Now you will want to know what the error function for logistic regression will be, and, since I have put θ in our hypothesis function, you will also want to know what θ is and how to approach choosing its value. Here I will give all the answers. Let me give you some background on what we used to do in linear regression, as it will help you to understand logistic regression. In linear regression we generally use the sum of squared errors, or residuals, as the cost function. Just to give you background on the sum of squared error: in linear regression we try to generate the line of best fit for our dataset. As in the example stated earlier, given height, I want to predict weight. We first draw a line and measure the distance from each of the data points to the line. We square these distances, sum them, and try to minimize this error function. Refer to Figure 7 (Sum of squared error representation for reference). You can see the distance of each data point from the line, denoted using a red line; we take these distances, square them, and sum them. We use this error function in linear regression, and we generate its partial derivatives with respect to the slope of the line m and with respect to the intercept b. Every time, we calculate the error and update the values of m and b, so we can generate the line of best fit. The process of updating m and b is called gradient descent. By using gradient descent we update m and b in such a way that our error function reaches its minimum error value, so we can generate the line of best fit.
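As an illustration of that update loop, here is a minimal gradient descent sketch for linear regression on made-up height/weight numbers. The data, learning rate, and iteration count are all our own assumptions:

import numpy as np

# toy data: heights in cm (x) and weights in kg (y) -- made-up numbers
x = np.array([150.0, 160.0, 170.0, 180.0])
y = np.array([50.0, 56.0, 63.0, 70.0])

m, b = 0.0, 0.0
alpha = 0.00001  # learning rate
for _ in range(10000):
    error = (m * x + b) - y
    # partial derivatives of the mean squared error with respect to m and b
    m -= alpha * (2.0 / len(x)) * np.dot(error, x)
    b -= alpha * (2.0 / len(x)) * np.sum(error)
print("slope m:", m, "intercept b:", b)

Each pass moves m and b a small step against the gradient, which is exactly the direction that reduces the squared error.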
Gradient descent gives us the direction in which we need to move the line so we can generate the line of best fit. You can find a detailed example in Chapter 9, Deep Learning for NLU and NLG Problems. So, by defining the error function and generating partial derivatives, we can apply the gradient descent algorithm, which helps us minimize our error or cost function. Now back to the main question: which error function can we use for logistic regression? Do you think we can use the same sum of squared error function for logistic regression as well? If you know functions and calculus very well, then your answer is probably no. That is the correct answer. Let me explain this for those who aren't familiar with functions and calculus. This is important, so be careful. In linear regression our hypothesis function is linear, so it is very easy for us to calculate the sum of squared errors, but here we are using the sigmoid function, which is a non-linear function. If you apply the same function that we used in linear regression, it will not turn out well, because if you take the sigmoid function, put it into the sum of squared error function, and try to visualize all possible values, then you will get a non-convex curve. Refer to Figure 8 (Non-convex and convex functions; image credit: http://www.yuthon.com/images/non-convex_and_convex_function.png). In machine learning we mostly use functions that produce a convex curve, because then we can use the gradient descent algorithm to minimize the error function and reach the global minimum with certainty. As you saw in Figure 8, a non-convex curve has many local minima, so reaching the global minimum is very challenging and very time consuming, because you would then need to apply second-order or nth-order optimization; with a convex curve you can reach the global minimum with certainty, and quickly as well. So if we plug our sigmoid function into the sum of squared error, we get a non-convex function, and therefore we are not going to use the error function we used in linear regression. Instead, we need to define a different cost function that is convex, so we can apply gradient descent and find the global minimum. Here we use the statistical concept called likelihood. To derive the likelihood function, we use the probability equation given in Figure 6 and consider all the data points in the training set; this gives the likelihood function shown in Figure 9 (Likelihood function for logistic regression; image credit: http://cs229.stanford.edu/notes/cs229-notes1.pdf). Now, in order to simplify the derivative process, we convert the likelihood function into a monotonically increasing function by taking the natural logarithm of the likelihood function; this is called the log likelihood. This log likelihood is our cost function for logistic regression. See the equation given in Figure 10 (Cost function for logistic regression). To gain some intuition about the given cost function, we will plot it and understand what benefit it provides us. Here, on the x axis, we have our hypothesis function. Our hypothesis function's range is 0 to 1, so we have these two points on the x axis. Start with the first case, where y = 1.
You can see the generated curve on the top right-hand side of Figure 11 (Logistic function cost function graphs). If you look at any log function plot and then flip that curve (because here we have a negative sign), you get the same curve as plotted in Figure 11. You can see the log graph as well as the flipped graph in Figure 12 (comparing the log(x) and -log(x) graphs for a better understanding of the cost function; image credit: http://www.sosmath.com/algebra/logs/log4/log42/log422/gl30.gif). Here we are interested in the values 0 and 1, so we consider the part of the graph depicted in Figure 11. This cost function has some interesting and useful properties. If the predicted, or candidate, label is the same as the actual target label, then the cost is zero. You can put it like this: if y = 1 and the hypothesis function predicts hθ(x) = 1, then the cost is 0; but if hθ(x) tends towards 0, then the cost function blows up to ∞. For y = 0, you can see the graph on the top left-hand side of Figure 11. This case has the same kinds of advantages and properties that we saw earlier: the cost goes to ∞ when the actual value is 0 but the hypothesis function predicts 1, and if the hypothesis function predicts 0 and the actual target is also 0, then cost = 0. As I told you earlier, here is the reason we choose this cost function: it makes our optimization easy, as we use the maximum log likelihood, and the function has a convex curve that helps us run gradient descent. In order to apply gradient descent, we need to generate the partial derivative with respect to θ, which yields the equation given in Figure 13 (Partial derivative for performing gradient descent; image credit: http://2.bp.blogspot.com). This equation is used for updating the parameter values of θ, and α here defines the learning rate. This is the parameter that controls how fast or slow your algorithm learns, or trains. If you set the learning rate too high, then the algorithm overshoots and cannot learn, and if you set it too low, then it takes a lot of time to train. So you need to choose the learning rate wisely. Now let's start building the spam filtering application. Data loading and exploration To build the spam filtering application we need a dataset. Here we are using a small dataset, which is simple and straightforward. This dataset has two attributes. The first attribute is the label and the second attribute is the text content of the email. Let's discuss the first attribute a bit more. The presence of the label makes this a tagged dataset. The label indicates whether the email content belongs to the spam category or the ham category. Let's jump into the practical part. Here we are using numpy, pandas, and scikit-learn as dependency libraries. Let's explore our dataset first. We read the dataset using the pandas library. I have also checked how many total data records we have, along with basic details of the dataset. Once we load the data, we check its first ten records, and then we replace the spam and ham categories with numbers.
As we have seen, a machine can understand only numerical formats, so here all ham labels are converted into 0 and all spam labels are converted into 1 (Figure 14: Code snippet for converting labels into numerical format). Split dataset into training dataset and testing dataset In this part we divide our dataset into two parts: one part is called the training set and the other part is called the testing set (Figure 15: Code snippet for dividing the dataset into a training dataset and a testing dataset). We divide the dataset into two parts because we perform training using the training dataset; once our ML algorithm is trained on that dataset, it generates an ML model. After that, we feed the testing dataset into the generated ML model, and the model produces predictions. Based on those predictions, we evaluate our ML model.
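These steps can be sketched end to end as follows. This is a minimal illustration, not the book's exact code: the file name spam.csv, the column names, and the use of CountVectorizer to turn raw text into numeric features are all our own assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the tagged dataset (hypothetical file and column names)
df = pd.read_csv("spam.csv", names=["label", "text"])
print(df.head(10))                                    # inspect the first ten records
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # ham -> 0, spam -> 1

# split into a training dataset and a testing dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# turn the email text into numeric features
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# train logistic regression and evaluate on the held-out set
model = LogisticRegression()
model.fit(X_train_vec, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test_vec)))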

Learning Dependency Injection (DI)
Packt
08 Mar 2018
15 min read

In this article by Sherwin John Calleja Tragura, author of the book Spring 5.0 Cookbook, we will learn about the implementation of the Spring container using XML and JavaConfig, and also about managing beans in an XML-based container. In this article, you will learn how to: Implement the Spring container using XML Implement the Spring container using JavaConfig Manage the beans in an XML-based container Implementing the Spring container using XML Let us begin with the creation of the Spring Web Project using the Maven plugin of our STS Eclipse 8.3. This web project will implement our first Spring 5.0 container using the XML-based technique. This is the most conventional, but robust, way of creating the Spring container. The container is where the objects are created, managed, wired together with their dependencies, and monitored from their initialization up to their destruction. This recipe will mainly highlight how to create an XML-based Spring container. Getting ready Create a Maven project ready for development using STS Eclipse 8.3. Be sure to have installed the correct JRE. Let us name the project ch02-xml. How to do it… After creating the project, certain Maven errors will be encountered. Fix the Maven issues of our ch02-xml project in order to use the XML-based Spring 5.0 container by performing the following steps: Open pom.xml of the project and add the following properties, which contain the Spring build version and Servlet container to utilize: <properties> <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version> <servlet.api.version>3.1.0</servlet.api.version> </properties> Add the following Spring 5 dependencies inside pom.xml. These dependencies are essential in providing us with the interfaces and classes to build our Spring container: <dependencies> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-context</artifactId> <version>${spring.version}</version> </dependency> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-core</artifactId> <version>${spring.version}</version> </dependency> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-beans</artifactId> <version>${spring.version}</version> </dependency> </dependencies> It is required to add the following repositories, from which the Spring 5.0 dependencies in Step 2 will be downloaded: <repositories> <repository> <id>spring-snapshots</id> <name>Spring Snapshots</name> <url>https://repo.spring.io/libs-snapshot</url> <snapshots> <enabled>true</enabled> </snapshots> </repository> </repositories> Then add the Maven plugin for deployment, but be sure that web.xml is recognized as the deployment descriptor. This can be done by enabling <failOnMissingWebXml> or just deleting the <configuration> tag, as follows: <plugin> <artifactId>maven-war-plugin</artifactId> <version>2.3</version> </plugin> Follow the Tomcat Maven plugin for deployment, as explained in Chapter 1. After the Maven configuration details, check if there is a WEB-INF folder inside src/main/webapp. If there is none, create one. This is mandatory for this project since we will be using a deployment descriptor (or web.xml). Inside the WEB-INF folder, create a deployment descriptor or drop a web.xml template into the src/main/webapp/WEB-INF directory. Then, create an XML-based Spring container named ch02-beans.xml inside the ch02-xml/src/main/java/ directory.
The configuration file must contain the following namespaces and tags: <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:context="http://www.springframework.org/schema/context" xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd"> </beans> You can generate this file using the STS Eclipse Wizard (Ctrl-N), under the Spring module's Spring Bean Configuration File option. Save all the files. Clean and build the Maven project. Do not deploy yet, because this is just a standalone project at the moment. How it works… This project just imported three major Spring 5.0 libraries, namely Spring-Core, Spring-Beans, and Spring-Context, because the major classes and interfaces for creating the container are found in these libraries. This shows that Spring, unlike other frameworks, does not need an entire load of libraries just to set up the initial platform. Spring can be perceived as a huge enterprise framework nowadays, but internally it is still lightweight. The basic container that manages objects in Spring is provided by the org.springframework.beans.factory.BeanFactory interface and can only be found in the Spring-Beans module. Once additional features are needed, such as message resource handling, AOP capabilities, application-specific contexts, and listener implementation, the sub-interface of BeanFactory, namely the org.springframework.context.ApplicationContext interface, is used instead. This ApplicationContext, found in the Spring-Context module, is the one that provides an enterprise-specific container for all applications, because it encompasses a larger scope of Spring components than the BeanFactory interface. The container created, ch02-beans.xml, an ApplicationContext, is an XML-based configuration that contains XSD schemas from the three main libraries imported. These schemas have tag libraries and bean properties which are essential in managing the whole framework. But beware of runtime errors once libraries are removed from the dependencies, because using these tags is equivalent to using the libraries per se. The final Spring Maven project directory structure must look like this: Implementing the Spring container using JavaConfig Another option for implementing the Spring 5.0 container is through the use of Spring JavaConfig. This is a technique that uses pure Java classes to configure the framework's container. This technique eliminates the use of bulky and tedious XML metadata and also provides a type-safe and refactoring-free approach to configuring entities or collections of objects in the container. This recipe will showcase how to create the container using JavaConfig in a web.xml-less approach. Getting ready Create another Maven project and name the project ch02-jc. This STS Eclipse project will use a pure Java class approach, including for its deployment descriptor. How to do it… To get rid of the usual Maven bugs, immediately open the pom.xml of ch02-jc and add <properties>, <dependencies>, and <repositories> equivalent to what was added in the Implementing the Spring container using XML recipe. Next, get rid of web.xml. Since the Servlet 3.0 specification was implemented, servlet containers can now support projects without web.xml. This is done by implementing the handler interface called org.springframework.web.WebApplicationInitializer to programmatically configure the ServletContext.
Create a SpringWebinitializer class and override its onStartup() method, without any implementation yet: public class SpringWebinitializer implements WebApplicationInitializer { @Override public void onStartup(ServletContext container) throws ServletException { } } The lines in Step 2 will generate some errors until you add the following Maven dependency: <dependency> <groupId>org.springframework</groupId> <artifactId>spring-web</artifactId> <version>${spring.version}</version> </dependency> In pom.xml, disable <failOnMissingWebXml>. After the Maven details, create a class named BeanConfig, the ApplicationContext definition, bearing the annotation @Configuration at the top of it. The class must be inside the org.packt.starter.ioc.context package and must be an empty class at the moment: @Configuration public class BeanConfig { } Save all the files, and clean and build the Maven project. How it works… The Maven project ch02-jc makes use of both JavaConfig and ServletContainerInitializer, meaning there will be no XML configuration from the servlet to the Spring 5.0 containers. The BeanConfig class is the ApplicationContext of the project; it bears the annotation @Configuration, indicating that the class is used by JavaConfig as a source of bean definitions. This is handy compared with creating an XML-based configuration with lots of metadata. On the other hand, ch02-jc implemented org.springframework.web.WebApplicationInitializer, which is a handler for org.springframework.web.SpringServletContainerInitializer, the framework's implementation class for the servlet's ServletContainerInitializer. SpringServletContainerInitializer delegates to each WebApplicationInitializer during the execution of its onStartup(ServletContext), allowing the programmatic registration of filters, servlets, and listeners provided by the ServletContext. Eventually, the servlet container will acknowledge the status reported by SpringServletContainerInitializer, thus eliminating the use of web.xml. On Maven's side, the deployment plugin must be notified that the project will not use web.xml. This is done by setting <failOnMissingWebXml> to false inside its <configuration> tag. The final Spring Web Project directory structure must look like the following structure: Managing the beans in an XML-based container Frameworks become popular because of the principles behind the architectures they are built on. Each framework is built from different design patterns that manage the creation and behavior of the objects it manages. This recipe will detail how Spring 5.0 manages the objects of applications and how it shares a set of methods and functions across the platform. Getting ready The two Maven projects previously created will be utilized to illustrate how Spring 5.0 loads objects into heap memory. We will also be utilizing the ApplicationContext rather than the BeanFactory container, in preparation for the coming recipes involving more Spring components. How to do it… With our ch02-xml, let us demonstrate how Spring loads objects using the XML-based ApplicationContext container: Create a package layer, org.packt.starter.ioc.model, for our model classes. Our model classes will be typical Plain Old Java Objects (POJO), for which the Spring 5.0 architecture is known.
Inside the newly created package, create the classes Employee and Department, which contain the following blueprints: public class Employee { private String firstName; private String lastName; private Date birthdate; private Integer age; private Double salary; private String position; private Department dept; public Employee(){ System.out.println(" an employee is created."); } public Employee(String firstName, String lastName, Date birthdate, Integer age, Double salary, String position, Department dept) { this.firstName = firstName; this.lastName = lastName; this.birthdate = birthdate; this.age = age; this.salary = salary; this.position = position; this.dept = dept; System.out.println(" an employee is created."); } // getters and setters } public class Department { private Integer deptNo; private String deptName; public Department() { System.out.println("a department is created."); } // getters and setters } Afterwards, open the ApplicationContext ch02-beans.xml. Register, using the <bean> tag, our first set of Employee and Department objects as follows: <bean id="empRec1" class="org.packt.starter.ioc.model.Employee" /> <bean id="dept1" class="org.packt.starter.ioc.model.Department" /> The beans in Step 3 contain private instance variables that have zero and null default values. To update them, our classes have mutators, or setter methods, that can be used to avoid the NullPointerException that always happens when we immediately use empty objects. In Spring, calling these setters is tantamount to injecting data into the <bean>, similar to how the following objects are created: <bean id="empRec2" class="org.packt.starter.ioc.model.Employee"> <property name="firstName"><value>Juan</value></property> <property name="lastName"><value>Luna</value></property> <property name="age"><value>70</value></property> <property name="birthdate"><value>October 28, 1945</value></property> <property name="position"> <value>historian</value></property> <property name="salary"><value>150000</value></property> <property name="dept"><ref bean="dept2"/></property> </bean> <bean id="dept2" class="org.packt.starter.ioc.model.Department"> <property name="deptNo"><value>13456</value></property> <property name="deptName"> <value>History Department</value></property> </bean> A <property> tag is equivalent to a setter definition accepting an actual value or an object reference. The name attribute defines the name of the setter, minus the set prefix, converted to camel-case notation. The value attribute, or the <value> tag, pertains to supported Spring-type values (for example, int, double, float, Boolean, String). The ref attribute, or <ref>, provides a reference to another loaded <bean> in the container. Another way of writing the bean object empRec2 is through the use of the ref and value attributes, as follows: <bean id="empRec3" class="org.packt.starter.ioc.model.Employee"> <property name="firstName" value="Jose"/> <property name="lastName" value="Rizal"/> <property name="age" value="101"/> <property name="birthdate" value="June 19, 1950"/> <property name="position" value="scriber"/> <property name="salary" value="90000"/> <property name="dept" ref="dept3"/> </bean> <bean id="dept3" class="org.packt.starter.ioc.model.Department"> <property name="deptNo" value="56748"/> <property name="deptName" value="Communication Department" /> </bean> Another way of updating the private instance variables of the model objects is to make use of the constructors.
Actual Spring data and object references can be injected into the bean through constructor metadata: <bean id="empRec5" class="org.packt.starter.ioc.model.Employee"> <constructor-arg><value>Poly</value></constructor-arg> <constructor-arg><value>Mabini</value></constructor-arg> <constructor-arg><value> August 10, 1948</value></constructor-arg> <constructor-arg><value>67</value></constructor-arg> <constructor-arg><value>45000</value></constructor-arg> <constructor-arg><value>Linguist</value></constructor-arg> <constructor-arg><ref bean="dept3"></ref></constructor-arg> </bean> After all the modifications, save ch02-beans.xml. Create a TestBeans class inside the src/test/java directory. This class will load the XML configuration resource into the ApplicationContext container through org.springframework.context.support.ClassPathXmlApplicationContext and fetch all the objects created through its getBean() method. public class TestBeans { public static void main(String args[]){ ApplicationContext context = new ClassPathXmlApplicationContext("ch02-beans.xml"); System.out.println("application context loaded."); System.out.println("****The empRec1 bean****"); Employee empRec1 = (Employee) context.getBean("empRec1"); System.out.println("****The empRec2*****"); Employee empRec2 = (Employee) context.getBean("empRec2"); Department dept2 = empRec2.getDept(); System.out.println("First Name: " + empRec2.getFirstName()); System.out.println("Last Name: " + empRec2.getLastName()); System.out.println("Birthdate: " + empRec2.getBirthdate()); System.out.println("Salary: " + empRec2.getSalary()); System.out.println("Dept. Name: " + dept2.getDeptName()); System.out.println("****The empRec5 bean****"); Employee empRec5 = context.getBean("empRec5", Employee.class); Department dept3 = empRec5.getDept(); System.out.println("First Name: " + empRec5.getFirstName()); System.out.println("Last Name: " + empRec5.getLastName()); System.out.println("Dept. Name: " + dept3.getDeptName()); } } The expected output after running the main() thread will be: an employee is created. an employee is created. a department is created. an employee is created. a department is created. an employee is created. a department is created. application context loaded. *********The empRec1 bean *************** *********The empRec2 bean *************** First Name: Juan Last Name: Luna Birthdate: Sun Oct 28 00:00:00 CST 1945 Salary: 150000.0 Dept. Name: History Department *********The empRec5 bean *************** First Name: Poly Last Name: Mabini Dept. Name: Communication Department How it works… The principle behind creating <bean> objects in the container is called the Inversion of Control design pattern. In order to use the objects, their dependencies, and also their behavior, these must be placed within the framework per se. After registering them in the container, Spring takes care of their instantiation and their availability to other objects. Developers can just "fetch" them if they want to include them in their software modules, as shown in the following diagram: The IoC design pattern can be seen as synonymous with the Hollywood Principle ("Don't call us, we'll call you!"), a popular principle in most object-oriented frameworks. The framework does not care whether the developer needs the objects or not, because the lifespan of the objects lies in the framework's rules.
In the case of setting new values or updating the values of an object's private variables, IoC has an implementation that can be used for "injecting" new actual values or object references into the beans, popularly known as the Dependency Injection (DI) design pattern. This principle exposes all the bean's properties to the public through its setter methods or its constructors. Injecting Spring values and object references into setter method signatures using the <property> tag, without knowing the implementation, is called the Method Injection type of DI. On the other hand, if we create the bean with initialized values injected through its constructor using <constructor-arg>, it is known as Constructor Injection. To create the ApplicationContext container, we need to instantiate ClassPathXmlApplicationContext or FileSystemXmlApplicationContext, depending on the location of the XML definition file. Since the file is found in ch02-xml/src/main/java/, the ClassPathXmlApplicationContext implementation is the best option. This shows that the ApplicationContext is an object too, bearing all that XML metadata. It has several overloaded getBean() methods used to fetch the objects loaded into it. Summary In this article we went over how to create an XML-based Spring container, how to create the container using JavaConfig in a web.xml-less approach, and how Spring 5.0 manages the objects of applications and shares a set of methods and functions across the platform.
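As a closing illustration, here is a sketch of how the empty BeanConfig class from the JavaConfig recipe could register the same kind of beans without any XML. This is our own example, not code from the book; the setter calls simply assume the standard getters and setters mentioned for the model classes:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.packt.starter.ioc.model.Department;
import org.packt.starter.ioc.model.Employee;

@Configuration
public class BeanConfig {

    @Bean
    public Department dept2() {
        // the Java equivalent of the <bean id="dept2"> XML definition
        Department dept = new Department();
        dept.setDeptNo(13456);
        dept.setDeptName("History Department");
        return dept;
    }

    @Bean
    public Employee empRec2() {
        // setter (method) injection expressed in plain Java
        Employee emp = new Employee();
        emp.setFirstName("Juan");
        emp.setLastName("Luna");
        emp.setDept(dept2());
        return emp;
    }
}

Such a configuration would then be loaded with new AnnotationConfigApplicationContext(BeanConfig.class), after which getBean() works exactly as in the XML-based TestBeans example.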

4 common challenges in Web Scraping and how to handle them
Amarabha Banerjee
08 Mar 2018
13 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. 
The following is one example: 2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached How it works The spider is defined as the following: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] def parse(self, response): print("Parsing: ", response) print (response.request.meta.get('redirect_urls')) This is identical to our previous NASA sitemap based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page. The crawling process is configured with the following code: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500 }, 'REDIRECT_ENABLED': True, 'REDIRECT_MAX_TIMES': 2 }) Redirect is enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20. Waiting for content to be available in Selenium A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there still may be content that we need to access later, as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously into the page. Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button: When pressing the button, we are presented with a progress bar for five seconds: And when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait. How do we do this? How to do it We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following: <div id='start'> <button>Start</button> </div> When the button is pressed and the load completes, the following HTML is added to the document: <div id='finish'> <h4>Hello World!</h4> </div> We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following: clicked Hello World! Now let's see how it worked. How it works Let us break down the explanation: We start by importing the required items from Selenium: from selenium import webdriver from selenium.webdriver.support import ui Now we load the driver and the page: driver = webdriver.PhantomJS() driver.get("http://the-internet.herokuapp.com/dynamic_loading/2") With the page loaded, we can retrieve the button: button = driver.find_element_by_xpath("//*/div[@id='start']/button") And then we can click the button: button.click() print("clicked") Next we create a WebDriverWait object: wait = ui.WebDriverWait(driver, 10) With this object, we can request that Selenium's UI wait for certain events. This also sets a maximum wait of 10 seconds.
Now using this, we can wait until we meet a criterion: that an element is identifiable using the following XPath: wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']")) When this completes, we can retrieve the h4 element and get its enclosed text: finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4") print(finish_element.text) Limiting crawling to a single domain We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class. How to do it The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov. How it works The code is the same as previous NASA site crawlers except that we include allowed_domains=['nasa.gov']: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] allowed_domains=['nasa.gov'] def parse(self, response): print("Parsing: ", response) The NASA site is fairly consistent with staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites. Processing infinitely scrolling pages Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web pages' Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example. Getting ready Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page: Screenshot of the quotes to scrape Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel: When we click on one of the links, we can see the following JSON: { "has_next": true, "page": 2, "quotes": [{ "author": { "goodreads_link": "/author/show/82952.Marilyn_Monroe", "name": "Marilyn Monroe", "slug": "Marilyn-Monroe" }, "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "“This life is what you make it...." }, { "author": { "goodreads_link": "/author/show/1077326.J_K_Rowling", "name": "J.K. Rowling", "slug": "J-K-Rowling" }, "tags": ["courage", "friends"], "text": "“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”" }, This is great, because all we need to do is continually generate requests to /api/quotes?page=x, increasing x as long as the has_next tag exists in the reply document. If there are no more pages, then this tag will not be in the document. How to do it The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages.
Run it with your Python interpreter and you will see output similar to the following (the following shows multiple excerpts from the output): <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']} When this gets to page 10 it will stop, as it will see that there is no next page flag set in the content. How it works Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL: class Spider(scrapy.Spider): name = 'spidyquotes' quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes' start_urls = [quotes_base_url] download_delay = 1.5 The parse method then prints the response and also parses the JSON into the data variable: def parse(self, response): print(response) data = json.loads(response.body) Then it loops through all the items in the quotes element of the JSON objects. For each item, it yields a new Scrapy item back to the Scrapy engine: for item in data.get('quotes', []): yield { 'text': item.get('text'), 'author': item.get('author', {}).get('name'), 'tags': item.get('tags'), } It then checks to see if the data JSON variable has a 'has_next' property, and if so it gets the next page and yields a new request back to Scrapy to parse the next page: if data['has_next']: next_page = data['page'] + 1 yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page) There's more... It is also possible to process infinite scrolling pages using Selenium.
The following code is in 06/06_scrape_continuous_twitter.py: from selenium import webdriver import time driver = webdriver.PhantomJS() print("Starting") driver.get("https://twitter.com") scroll_pause_time = 1.5 # Get scroll height last_height = driver.execute_script("return document.body.scrollHeight") while True: print(last_height) # Scroll down to bottom driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Wait to load page time.sleep(scroll_pause_time) # Calculate new scroll height and compare with last scroll height new_height = driver.execute_script("return document.body.scrollHeight") print(new_height, last_height) if new_height == last_height: break last_height = new_height The output would be similar to the following: Starting 4882 8139 4882 8139 11630 8139 11630 15055 11630 15055 15055 15055 Process finished with exit code 0 This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits for a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if it is different from last_height, the code loops and continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page. We have discussed the common challenges faced in performing Web Scraping using Python and got to know how to work around them. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
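One more illustration before moving on: the JSON-API paging technique from the infinite scrolling recipe can also be exercised without Scrapy, using plain requests. This sketch is our own, not from the book:

import requests

base_url = "http://spidyquotes.herokuapp.com/api/quotes"
page = 1
while True:
    # the API returns JSON with a 'quotes' list and a 'has_next' flag
    data = requests.get(base_url, params={"page": page}).json()
    for item in data.get("quotes", []):
        print(item.get("author", {}).get("name"), "-", item.get("text", "")[:60])
    if not data.get("has_next"):
        break  # no more pages to fetch
    page += 1

The loop stops exactly where the Scrapy spider does: when the has_next flag disappears from the reply.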

How to perform Audio-Video-Image Scraping with Python
Amarabha Banerjee
08 Mar 2018
9 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more This class will be used for other tasks, such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file, and often it is good enough to save the file under a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially when there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the type implied by the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contenttype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information.
Determining the type of content for a URL

When performing a GET request for content, the web server will return a number of headers, one of which identifies the type of the content from the web server's perspective. In this recipe we learn to use that header to determine what the web server considers the type of the content to be.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows:

Execute the script for the recipe. It contains the following code:

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

With the following result:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed.

There's more

The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers:

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use when storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script. An extension for the media type can be found using the .extension_from_contenttype and .extension_from_url properties:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports both the extension determined from the content type and the one determined from the URL. These can be different, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The property then uses a Python dictionary, defined in the const module, which maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename.

To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
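One closing aside on this set of recipes: the standard library's mimetypes module provides a ready-made mapping from content types to extensions, and could stand in for a hand-rolled dictionary like ContentTypeToExtensions(). A minimal sketch (not the book's implementation):

import mimetypes

# guess_extension() may return an alternative extension such as
# '.jpe' for 'image/jpeg' on some Python versions and platforms.
print(mimetypes.guess_extension("image/png"))    # '.png'
print(mimetypes.guess_extension("image/jpeg"))   # '.jpg' or '.jpe'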
Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article by Aurobindo Sarkar, the author of the book Learning Spark SQL, we will cover the following points to introduce you to using Spark SQL for exploratory data analysis:

What is Exploratory Data Analysis (EDA)?
Why is EDA important?
Using Spark SQL for basic data analysis
Visualizing data with Apache Zeppelin

Introducing Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel", for the dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers.

EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between simple and more accurate models. Simple models may be much easier to interpret and understand, and can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data.

Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as queries can take a long time to execute and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning.

For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset.
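The article's own code runs in the Scala Spark shell and appears in figures; as a hedged equivalent, the following PySpark bootstrap (assuming Spark 2.x and a local installation) creates the session used in the sketches that follow:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the EDA exercises below.
spark = (SparkSession.builder
         .appName("bank-marketing-eda")
         .getOrCreate())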
We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file, which contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste in the initial set of statements in the Spark shell, as shown in the following figure.

After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset. In the next section, we will begin our data exploration by identifying missing data in our dataset.

Identifying missing data

Missing data can occur in datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it.

Here, we analyze the number of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing "unknown" values with empty strings. First, we create a DataFrame/Dataset from our edited file, as shown in the following figure. The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for the sample dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field. Next, we use describe to quickly compute the count, mean, standard deviation, min, and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics such as covariance and correlation, to create crosstabs, to examine items that occur most frequently in data columns, and to compute quantiles. These computations are shown in the following figure.

Next, we use the typed aggregation functions to summarize our data in order to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, along with the total customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal points. Executing a similar statement gives the same breakdown by customers' age. After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data.
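Since the original statements appear only as figures, here is a hedged PySpark sketch of the steps just described; the ';' delimiter and column names follow the public UCI bank-marketing CSV, and the file path is an assumption:

# Read the CSV into a DataFrame, inferring the schema for brevity.
df = (spark.read
      .option("header", True)
      .option("sep", ";")
      .option("inferSchema", True)
      .csv("bank-additional-full.csv"))

print(df.count())  # expect 41188 records

# Count rows where the edited 'job' field is an empty string.
print(df.filter(df["job"] == "").count())

# Basic statistics for a few numeric columns.
df.describe("age", "duration", "campaign").show()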
Identifying data outliers

An outlier, or an anomaly, is an observation that deviates significantly from other observations in the dataset. Erroneous outliers are observations that are distorted due to possible errors in the data-collection process. These outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods prior to performing data analysis.

Many algorithms find outliers as a side-product of clustering; these techniques define outliers as points that do not lie in clusters. The user has to model the data points using a statistical distribution, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution.

EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing.
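The k-means statements in the original are also shown as a figure; the following is a hedged PySpark sketch of that step using Spark ML, with an assumed choice of numeric feature columns:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble a few numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=["age", "duration", "campaign"],
                            outputCol="features")
features_df = assembler.transform(df)

# Fit two clusters, as in the article's example.
model = KMeans(k=2, seed=1).fit(features_df)
print(model.clusterCenters())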
Visualizing data with Apache Zeppelin

Typically, we generate many graphs to verify our hunches about the data. A lot of these quick-and-dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points, so we have to summarize, sample, or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization.

Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:

Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$ bin/zeppelin-daemon.sh start

You should see the following message:

Zeppelin start                                             [ OK ]

You should be able to see the Zeppelin home page at http://localhost:8080/. Click on the Create new note link, and specify a path and name for your notebook, as shown in the following figure. In the next step, we paste the same code as at the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrame operations, as shown in the following figure. Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements can be charted by clicking on the appropriate chart type. Here, we create bar charts as an illustrative example of summarizing and visualizing data. We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures.

Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure we create a textbox that can accept different values for the age parameter, and the bar chart is updated accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or chart automatically gets updated.

Summary

In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.
How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei.  This book covers popular Python libraries such as Tensorflow, Keras, and more, along with tips to train, deploy and optimize deep learning models in the best possible manner.[/box] Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS). Setup from scratch We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI) which already has a number of software packages installed—making it easier to set up an end-end deep learning system. We will use a publicly available AMI Image ami-b03ffedf, which has following pre-installed Packages: CUDA 8.0 Anaconda 4.20 with Python 3.0 Keras / Theano The first step to setting up the system is to set up an AWS account and spin a new EC2 GPU instance using the AWS web console as (http://console.aws.amazon.com/) shown in figure Choose EC2 AMI: 2. We pick a g2.2xlarge instance type from the next page as shown in figure Choose instance type: 3. After adding a 30 GB of storage as shown in figure Choose storage, we now launch a cluster and assign an EC2 key pair that can allow us to ssh and log in to the box using the provided key pair file: 4. Once the EC2 box is launched, next step is to install relevant software packages.To ensure proper GPU utilization, it is important to ensure graphics drivers are installed first. We will upgrade and install NVIDIA drivers as follows: $ sudo add-apt-repository ppa:graphics-drivers/ppa -y $ sudo apt-get update $ sudo apt-get install -y nvidia-375 nvidia-settings While NVIDIA drivers ensure that host GPU can now be utilized by any deep learning application, it does not provide an easy interface to application developers for easy programming on the device. Various different software libraries exist today that help achieve this task reliably. Open Computing Language (OpenCL) and CUDA are more commonly used in industry. In this book, we use CUDA as an application programming interface for accessing NVIDIA graphics drivers. To install CUDA driver, we first SSH into the EC2 instance and download CUDA 8.0 to our $HOME folder and install from there: $ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-r epo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb $ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64-deb $ sudo apt-get update $ sudo apt-get install -y cuda nvidia-cuda-toolkit Once the installation is finished, you can run the following command to validate the installation: $ nvidia-smi Now your EC2 box is fully configured to be used for a deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano. 
Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda:

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh

Finally, Keras and Theano are installed using Python's package manager, pip:

$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
$ pip install keras

Once the pip installation has completed successfully, the box is fully set up for deep learning development.

Setup using Docker

The previous section describes getting started from scratch, which can be tricky given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning frameworks to get you quickly started with deep learning application development:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

1. We now install Docker Community Edition as follows:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce

2. We then install NVIDIA-Docker and its plugin:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

3. To validate whether the installation happened correctly, we use the following command:

$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

4. Once it's set up correctly, we can use the official TensorFlow or Theano Docker image:

$ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

5. We can run a simple Python program to check that TensorFlow works properly:

import tensorflow as tf

a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)

with tf.Session() as sess:
    # output is 10.0
    output = sess.run(tf.add(a, b))
    print("Output of graph computation is = ", output)

You should see the TensorFlow output on the screen now, as shown in the figure Tensorflow sample output.

We saw how to set up a deep learning system on AWS from scratch and on Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.
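As one final sanity check for the from-scratch path, the short sketch below (not from the book; it assumes the pip installs above succeeded) builds and fits a tiny Keras model to confirm the whole stack imports and runs:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# A minimal two-layer model on random data.
model = Sequential()
model.add(Dense(8, input_dim=4, activation="relu"))
model.add(Dense(1))
model.compile(loss="mse", optimizer="sgd")

x = np.random.rand(16, 4)
y = np.random.rand(16, 1)
model.fit(x, y, epochs=1, verbose=0)  # older Keras 1.x uses nb_epoch instead of epochs
print("Keras is working")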
Working with Forensic Evidence Container Recipes

Packt
07 Mar 2018
13 min read
In this article by Preston Miller and Chapin Bryce, authors of Learning Python for Forensics, we introduce a recipe from our upcoming book, Python Digital Forensics Cookbook. In Python Digital Forensics Cookbook, each chapter is comprised of many scripts, or recipes, falling under specific themes. The "Iterating Through Files" recipe shown here is from our chapter that introduces the Sleuth Kit's Python bindings, pytsk3, and other libraries, to programmatically interact with forensic evidence containers. Specifically, this recipe shows how to access a forensic evidence container and iterate through all of its files to create an active file listing of its contents.

If you are reading this article, it goes without saying that Python is a key tool in DFIR investigations. However, most examiners are not familiar with, or do not take advantage of, the Sleuth Kit's Python bindings. Imagine being able to run your existing scripts against forensic containers without needing to mount them or export loose files. This recipe continues to introduce the library, pytsk3, that will allow us to do just that and take our scripting capabilities to the next level.

In this recipe, we learn how to recurse through the filesystem and create an active file listing. Oftentimes, one of the first questions we, as the forensic examiners, are asked is "What data is on the device?". An active file listing comes in handy here. Creating a file listing of loose files is a very straightforward task in Python. However, this will be slightly more complicated because we are working with a forensic image rather than loose files. This recipe will be a cornerstone for future scripts, as it will allow us to recursively access and process every file in the image. As we continue to introduce new concepts and features from the Sleuth Kit, we will add new functionality to our previous recipes in an iterative process. In a similar way, this recipe will become integral in future recipes to iterate through directories and process files.

Getting started

Refer to the Getting started section in the Opening Acquisitions recipe for information on the build environment and setup details for pytsk3 and pyewf. All other libraries used in this script are present in Python's standard library.

How to do it...

We perform the following steps in this recipe:

1. Import argparse, csv, datetime, os, pytsk3, pyewf, and sys.
2. Identify whether the evidence container is a raw (DD) image or an EWF (E01) container.
3. Access the forensic image using pytsk3.
4. Recurse through all directories in each partition.
5. Store file metadata in a list.
6. Write the active file list to a CSV.

How it works...

This recipe's command-line handler takes three positional arguments: EVIDENCE_FILE, TYPE, and OUTPUT_CSV, which represent the path to the evidence file, the type of evidence file, and the output CSV file, respectively. Similar to the previous recipe, the optional p switch can be supplied to specify a partition type. We use the os.path.dirname() method to extract the desired output directory path for the CSV file and, with the os.makedirs() function, create the necessary output directories if they do not exist.
if __name__ == '__main__':
    # Command-line Argument Parser
    parser = argparse.ArgumentParser()
    parser.add_argument("EVIDENCE_FILE", help="Evidence file path")
    parser.add_argument("TYPE", help="Type of Evidence", choices=("raw", "ewf"))
    parser.add_argument("OUTPUT_CSV", help="Output CSV with lookup results")
    parser.add_argument("-p", help="Partition Type", choices=("DOS", "GPT", "MAC", "SUN"))
    args = parser.parse_args()

    directory = os.path.dirname(args.OUTPUT_CSV)
    if not os.path.exists(directory) and directory != "":
        os.makedirs(directory)

Once we have validated the input evidence file by checking that it exists and is a file, the four arguments are passed to the main() function. If there is an issue with the initial validation of the input, an error is printed to the console before the script exits.

    if os.path.exists(args.EVIDENCE_FILE) and os.path.isfile(args.EVIDENCE_FILE):
        main(args.EVIDENCE_FILE, args.TYPE, args.OUTPUT_CSV, args.p)
    else:
        print("[-] Supplied input file {} does not exist or is not a file".format(args.EVIDENCE_FILE))
        sys.exit(1)

In the main() function, we instantiate the volume variable with None to avoid errors when referencing it later in the script. After printing a status message to the console, we check if the evidence type is an E01 in order to properly process it and create a valid pyewf handle, as demonstrated in more detail in the Opening Acquisitions recipe. Refer to that recipe for more details, as this part of the function is identical. The end result is the creation of the pytsk3 handle, img_info, for the user-supplied evidence file.

def main(image, img_type, output, part_type):
    volume = None
    print "[+] Opening {}".format(image)
    if img_type == "ewf":
        try:
            filenames = pyewf.glob(image)
        except IOError, e:
            print "[-] Invalid EWF format:\n {}".format(e)
            sys.exit(2)
        ewf_handle = pyewf.handle()
        ewf_handle.open(filenames)
        # Open PYTSK3 handle on EWF Image
        img_info = ewf_Img_Info(ewf_handle)
    else:
        img_info = pytsk3.Img_Info(image)

Next, we attempt to access the volume of the image using the pytsk3.Volume_Info() method by supplying it with the image handle. If the partition type argument was supplied, we add its attribute ID as the second argument. If we receive an IOError when attempting to access the volume, we catch the exception as e and print it to the console. Notice, however, that we do not exit the script as we often do when we receive an error. We'll explain why in the next function. Ultimately, we pass the volume, img_info, and output variables to the openFS() method.

    try:
        if part_type is not None:
            attr_id = getattr(pytsk3, "TSK_VS_TYPE_" + part_type)
            volume = pytsk3.Volume_Info(img_info, attr_id)
        else:
            volume = pytsk3.Volume_Info(img_info)
    except IOError, e:
        print "[-] Unable to read partition table:\n {}".format(e)

    openFS(volume, img_info, output)
The openFS() method tries to access the filesystem of the container in one of two ways. If the volume variable is not None, it iterates through each partition and, if the partition meets certain criteria, attempts to open it. If, however, the volume variable is None, it instead tries to directly call the pytsk3.FS_Info() method on the image handle, img. As we saw, this latter method works and gives us filesystem access for logical images, whereas the former works for physical images. Let's look at the differences between these two methods. Regardless of the method, we create a recursed_data list to hold our active file metadata.

In the first instance, where we have a physical image, we iterate through each partition and check that it is greater than 2,048 sectors and does not contain the words "Unallocated", "Extended", or "Primary Table" in its description. For partitions meeting these criteria, we attempt to access the filesystem using the FS_Info() function by supplying the pytsk3 img object and the offset of the partition in bytes. If we are able to access the filesystem, we use the open_dir() method to get the root directory and pass it, along with the partition address ID, the filesystem object, two empty lists, and an empty string, to the recurseFiles() method. These empty lists and string will come into play in recursive calls to this function, as we will see shortly. Once the recurseFiles() method returns, we append the active file metadata to the recursed_data list. We repeat this process for each partition.

def openFS(vol, img, output):
    print "[+] Recursing through files.."
    recursed_data = []
    # Open FS and Recurse
    if vol is not None:
        for part in vol:
            if part.len > 2048 and "Unallocated" not in part.desc and "Extended" not in part.desc and "Primary Table" not in part.desc:
                try:
                    fs = pytsk3.FS_Info(img, offset=part.start * vol.info.block_size)
                except IOError, e:
                    print "[-] Unable to open FS:\n {}".format(e)
                root = fs.open_dir(path="/")
                data = recurseFiles(part.addr, fs, root, [], [], [""])
                recursed_data.append(data)

We employ a similar method in the second instance, where we have a logical image and the volume is None. In this case, we attempt to directly access the filesystem and, if successful, we pass it to the recurseFiles() method and append the returned data to our recursed_data list. Once we have our active file list, we send it and the user-supplied output file path to the csvWriter() method. Let's dive into the recurseFiles() method, which is the meat of this recipe.

    else:
        try:
            fs = pytsk3.FS_Info(img)
        except IOError, e:
            print "[-] Unable to open FS:\n {}".format(e)
        root = fs.open_dir(path="/")
        data = recurseFiles(1, fs, root, [], [], [""])
        recursed_data.append(data)

    csvWriter(recursed_data, output)

The recurseFiles() function is based on an example in the FLS tool (https://github.com/py4n6/pytsk/blob/master/examples/fls.py) and David Cowen's Automating DFIR series tool dfirwizard (https://github.com/dlcowen/dfirwizard/blob/master/dfirwizard-v9.py). To start this function, we append the root directory inode to the dirs list. This list is used later to avoid unending loops. Next, we begin to loop through each object in the root directory and check that it has the attributes we would expect and that its name is not either "." or "..".

def recurseFiles(part, fs, root_dir, dirs, data, parent):
    dirs.append(root_dir.info.fs_file.meta.addr)
    for fs_object in root_dir:
        # Skip ".", ".." or directory entries without a name.
        if not hasattr(fs_object, "info") or not hasattr(fs_object.info, "name") or not hasattr(fs_object.info.name, "name") or fs_object.info.name.name in [".", ".."]:
            continue

If the object passes that test, we extract its name using the info.name.name attribute. Next, we use the parent variable, which was supplied as one of the function's inputs, to manually build the file path for this object; there is no built-in method or attribute to do this automatically for us. We then check if the file is a directory or not and set the f_type variable to the appropriate type. If the object is a file and it has an extension, we extract it and store it in the file_ext variable.
If we encounter an AttributeError when attempting to extract this data, we continue on to the next object.

        try:
            file_name = fs_object.info.name.name
            file_path = "{}/{}".format("/".join(parent), fs_object.info.name.name)
            try:
                if fs_object.info.meta.type == pytsk3.TSK_FS_META_TYPE_DIR:
                    f_type = "DIR"
                    file_ext = ""
                else:
                    f_type = "FILE"
                    if "." in file_name:
                        file_ext = file_name.rsplit(".")[-1].lower()
                    else:
                        file_ext = ""
            except AttributeError:
                continue

We create variables for the object size and timestamps. However, notice that we pass the dates to a convertTime() method. This function exists to convert the UNIX timestamps into a human-readable format. With these attributes extracted, we append them to the data list, using the partition address ID to ensure we keep track of which partition the object is from.

            size = fs_object.info.meta.size
            create = convertTime(fs_object.info.meta.crtime)
            change = convertTime(fs_object.info.meta.ctime)
            modify = convertTime(fs_object.info.meta.mtime)
            data.append(["PARTITION {}".format(part), file_name, file_ext, f_type, create, change, modify, size, file_path])

If the object is a directory, we need to recurse through it to access all of its sub-directories and files. To accomplish this, we append the directory name to the parent list. Then, we create a directory object using the as_directory() method. We use the inode here, which is for all intents and purposes a unique number, and check that the inode is not already in the dirs list. If it were, then we would not process this directory, as it would have already been processed. If the directory does need to be processed, we call the recurseFiles() method on the new sub_directory and pass it the current dirs, data, and parent variables. Once we have processed a given directory, we pop that directory from the parent list. Failing to do this would result in false file path details, as all of the former directories would continue to be referenced in the path unless removed.

Most of this function sits under a large try-except block; we pass on any IOError exceptions generated during this process. Once we have iterated through all of the subdirectories, we return the data list to the openFS() function.

            if f_type == "DIR":
                parent.append(fs_object.info.name.name)
                sub_directory = fs_object.as_directory()
                inode = fs_object.info.meta.addr

                # This ensures that we don't recurse into a directory
                # above the current level and thus avoid circular loops.
                if inode not in dirs:
                    recurseFiles(part, fs, sub_directory, dirs, data, parent)
                parent.pop(-1)

        except IOError:
            pass
    dirs.pop(-1)
    return data

Let's briefly look at the convertTime() function. We've seen this type of function before: if the UNIX timestamp is not 0, we use the datetime.utcfromtimestamp() method to convert the timestamp into a human-readable format.

def convertTime(ts):
    if str(ts) == "0":
        return ""
    return datetime.utcfromtimestamp(ts)
With the active file listing data in hand, we are now ready to write it to a CSV file using the csvWriter() method. If we did find data (that is, the list is not empty), we open the output CSV file, write the headers, and loop through each list in the data variable. We use the writerows() method to write each nested list structure to the CSV file.

def csvWriter(data, output):
    if data == []:
        print "[-] No output results to write"
        sys.exit(3)
    print "[+] Writing output to {}".format(output)
    with open(output, "wb") as csvfile:
        csv_writer = csv.writer(csvfile)
        headers = ["Partition", "File", "File Ext", "File Type", "Create Date", "Modify Date", "Change Date", "Size", "File Path"]
        csv_writer.writerow(headers)
        for result_list in data:
            csv_writer.writerows(result_list)

The screenshot below demonstrates the type of data this recipe extracts from forensic images.

There's more...

For this recipe, there are a number of improvements that could further increase its utility:

Use tqdm, or another library, to create a progress bar that informs the user of the current execution progress.
Learn about the additional metadata values that can be extracted from filesystem objects using pytsk3 and add them to the output CSV file.

Summary

In summary, we have learned how to use pytsk3 to recursively iterate through any filesystem supported by the Sleuth Kit. This forms the basis of how we can use the Sleuth Kit to programmatically process forensic acquisitions. With this recipe, we will now be able to further interact with these files in future recipes.
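As a quick illustration of the first improvement suggested above, tqdm can wrap any iterable to report progress. A minimal standalone sketch (tqdm is a third-party package, installed with pip install tqdm):

from tqdm import tqdm
import time

# In the recipe, you would wrap the partition or fs_object loop;
# the range/sleep here just stands in for per-file processing work.
for item in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)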
Administering ArcGIS Enterprise through the REST administrative directories

Chad Cooper
07 Mar 2018
8 min read
This is a guest post written by Chad Cooper. Chad has worked in the geospatial industry over the last 15 years as a technician, analyst, and developer, serving state and local government, oil and gas, and academia. He is also the author of the title Mastering ArcGIS Enterprise Administration, which aims to help you learn to install, configure, secure, and fully utilize the ArcGIS Enterprise system.

ArcGIS Enterprise is one of the most widely used GIS packages in the world. With the 10.5 release, Portal for ArcGIS became a first-class citizen, living alongside ArcGIS Server and playing a major role in the management and administration of the web GIS. Data Store for ArcGIS allows for local storage of hosted feature services and is also a major player in the ArcGIS Enterprise ecosystem. The ArcGIS Web Adaptor completes ArcGIS Enterprise as the fourth major component. These components are new to most users (Portal and Data Store), and they come with an increased level of configuration, complexity, and administration. Luckily, there are many ways to administer and manage the ArcGIS Enterprise system. In this article, we will look at a few of those methods.

How to access the ArcGIS Server REST Administrator Directory

ArcGIS Server exposes its functionality through web services using REST. With this architecture comes the ArcGIS Server REST Application Programming Interface, or API, which, in addition to exposing ArcGIS Server services, exposes every administrative task that ArcGIS Server supports. In the API, ArcGIS Server administrative tasks are considered resources and are accessed through URLs (which are Uniform Resource Locators, after all). Operations act on these resources and update their information or state. Resources and their operations are hierarchical and standardized and have unique URLs. Like the web, the REST API is stateless, meaning that it does not retain information from one request to another by either the sender or receiver. Each request that is sent is expected to contain all the necessary information to process that request. If it does, the server processes the request and sends back a well-defined response. As it is accessed over the web, the ArcGIS Server REST API can also be invoked from any language that can make a web service call, such as Python.

Accessing the ArcGIS Server Administrator Directory can be done in several ways, depending upon your Web Adaptor configuration. From the ArcGIS Server machine, the Server Administrator Directory can be accessed at https://localhost:6443/arcgis/admin. There is no shortcut to this URL in the Windows Start menu. From another machine on the internal network, the Server Administrator Directory can be accessed by using the fully qualified domain name, or FQDN, instead of localhost, such as https://server.domain.com:6443/arcgis/admin. If, during your Web Adaptor configuration, you chose to Enable administrative access to your site through the Web Adaptor, you will also be able to access the Server Administrator Directory through your Web Adaptor URL, such as https://www.masteringageadmin.com/arcgis/admin. As with Server Manager, you will log in as the primary site administrator (PSA) designated during installation, or with other administrator credentials.

Prior to ArcGIS 10.1, server configuration was held in plain-text configuration files in the configuration store. These files are no longer part of the ArcGIS Server architecture; the ArcGIS Server REST Administrator Directory now exposes these settings.
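As a hedged illustration (not from the book) of scripting this API, the sketch below requests an admin token and lists the site's services; the host name, credentials, and the third-party requests library are all assumptions, and verify=False merely sidesteps the self-signed certificate common on port 6443:

import requests

admin = "https://server.domain.com:6443/arcgis/admin"

# generateToken expects the credentials plus a client type and format.
creds = {"username": "siteadmin", "password": "secret",
         "client": "requestip", "f": "json"}
token = requests.post(admin + "/generateToken", data=creds,
                      verify=False).json()["token"]

# List services in the root folder.
services = requests.get(admin + "/services",
                        params={"f": "json", "token": token},
                        verify=False).json()
print(services.get("services", []))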
How to use the ArcGIS Server REST Administrator Directory

The ArcGIS Server REST Administrator Directory, or "REST Admin" as it will herein be referred to, is a powerful way to manage all aspects of ArcGIS Server administration, as it exposes every administrative task that ArcGIS Server supports. Remember from earlier that the API is organized into resources and operations. Resources are settings within ArcGIS Server, and operations act on those resources to update their information or change their well-defined state, usually through an HTTP GET or POST method. HTTP GET requests data from a resource, while HTTP POST submits data to be processed by a resource. In other words, GET retrieves data, POST inserts or updates data.

An example of a resource is a service. An existing service can have a well-defined state of stopped or started; it must be one or the other. Operations available on the service resource in the REST API include Start Service, Stop Service, Edit Service, and Delete Service. The Start, Stop, and Delete operations change the state of the service (from stopped to started, started to stopped, and either stopped or started to deleted, respectively; technically, if the service is started, it is first stopped before it is deleted). The Edit Service operation changes the information in the resource.

Resources can also have child resources, which can in turn have their own set of operations and child resources. Remember that the API is hierarchical, so, for example, a service resource has the child resource Item Information, which has the Edit Item Information operation. To get to this operation in the REST Admin, we would log in to the REST Admin and go to services | SampleWorldCities.MapServer | iteminfo | edit, which would resemble the following in URL form:

https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo/edit

In the REST Admin, we could now edit the service Description, Summary, Tags, and Thumbnail. By updating the Item Information in the above example and clicking the Update button, we would be sending an edit HTTP POST operation to the https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo resource. The ArcGIS Server Manager equivalent for this process would be to go to Services | Manage Services | Edit Service pencil button to the right of the service name | Item Description. Hopefully this gives you a better understanding of how the REST API works and how actions carried out in Server Manager and Server are executed by the API on the backend.
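To make the stop and start operations concrete, here is a continuation of the earlier sketch (same assumptions, reusing the admin URL and token from it) that POSTs to a service's stop and start operations:

# Stop and then restart the sample service via its operations.
svc = admin + "/services/SampleWorldCities.MapServer"

for op in ("stop", "start"):
    result = requests.post(svc + "/" + op,
                           data={"f": "json", "token": token},
                           verify=False).json()
    print(op, result.get("status"))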
Administering Portal for ArcGIS through the Portal REST administrative directory

Just like ArcGIS Server, Portal has a REST backend from which all administrative tasks can be performed. We previously covered how the web interface for ArcGIS Server is a frontend to the ArcGIS Server REST API, and Portal is no different. We also covered services and how REST calls are made to the API.

The Portal Administrative Directory, referred to herein as "Portal Admin", can be accessed from within the internal network (bypassing the Web Adaptor) at a URL such as:

https://<FQDN>:7443/arcgis/portaladmin/

If administrative access is enabled on the Portal Web Adaptor, then we can access Portal Admin outside of our internal network at the Web Adaptor URL, such as:

https://www.your-domain.com/portal/portaladmin/

To log in to Portal Admin as an administrator, enter the Username and Password of an account with administrator privileges at the Portal Administrative Directory Login page and click the Login button. Let us now look at one administrative action that can be performed in the Portal REST Admin.

Portal licensing

Information on current Portal licensing can be viewed by going to Home | System | Licenses. Here, information on the validity and expiration of licensing and on registered members can be viewed. The Import Entitlements operation allows for the import of entitlements for ArcGIS Pro and additional products such as Business Analyst or Insights. For ArcGIS Pro, the operation requires an entitlements file exported out of My Esri. Once the entitlements have been imported, licenses can be assigned to users within Portal. Entitlements can have parts that are effective immediately and parts that become effective on a certain date. These all get imported, with the effective parts available immediately and the non-effective parts placed into a queue that Portal will automatically apply once they become effective. To import entitlements for ArcGIS Pro, do the following:

1. Have your entitlements file ready.
2. In Portal Admin, go to Home | System | Licenses | Import Entitlements.
3. Choose your entitlements file under Choose File.
4. For Application, choose ArcGISPro.
5. For Format, choose JSON or HTML (this is only the response format).
6. Click Import.

Once the entitlements are imported, the licenses can be assigned to users in Portal under My Organization | Manage Licenses.

At its latest release, ArcGIS Enterprise has more components than ever before, resulting in additional setup, configuration, administration, and management requirements. Here, we looked at several ways to access the ArcGIS Server and Portal for ArcGIS REST administrative interfaces. These are a few of the many methods available to interact with your ArcGIS Enterprise system. Check out Mastering ArcGIS Enterprise Administration to learn how to administer ArcGIS Server, Portal, and Data Store through user interfaces, the REST API, and Python scripts.
Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
The matrix multiplication of A and B is calculated as follows: the product C = AB is an n x p matrix whose entries are given by C[i][j] = A[i][1]*B[1][j] + A[i][2]*B[2][j] + ... + A[i][m]*B[m][j], that is, the dot product of the ith row of A with the jth column of B.

The matrix operation is performed by using the built-in dot function available in NumPy, as follows:

1. Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

2. Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix should be equal to the number of rows in the second matrix.

Matrix transposition

Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows:

1. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

2. Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion

While we performed most of the basic arithmetic operations on matrices earlier, we have not yet performed any of the specialist functions used within scientific computing and analysis, for example, matrix inversion, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine (over and above the previously discussed functions), in scenarios where more data manipulation is required beyond the standard operations. Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows:

1. Import the relevant package and the classes/functions within it:

from scipy import linalg

2. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

3. Pass the initialized matrix through the inverse function in the package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how easily we can implement all the basic matrix operations with Python's scientific library, SciPy. You may check out the book SciPy Recipes to perform advanced computing tasks like Discrete Fourier Transform and K-means with the SciPy stack.
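One closing aside, not from the book: when the real goal of an inversion is to solve a linear system Ax = b, scipy.linalg.solve is generally preferable to computing the inverse explicitly, for both speed and numerical stability:

import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

# Solve Ax = b directly instead of computing linalg.inv(A).dot(b).
x = linalg.solve(A, b)
print(x)                         # [-4.   4.5]
print(np.allclose(A.dot(x), b))  # True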
Introduction to ASP.NET Core Web API

Packt
07 Mar 2018
13 min read
In this article by Mithun Pattankar and Malendra Hurbuns, the authors of the book Mastering ASP.NET Web API, we will start with a quick recap of MVC. We will be looking at the following topics:

A quick recap of the MVC framework
Why Web APIs were incepted and how they evolved
An introduction to .NET Core
An overview of the ASP.NET Core architecture

Quick recap of the MVC framework

Model-View-Controller (MVC) is a powerful and elegant way of separating concerns within an application, and it applies itself extremely well to web applications. With ASP.NET MVC, it's translated roughly as follows:

Models (M): These are the classes that represent the domain you are interested in. These domain objects often encapsulate data stored in a database as well as code that manipulates the data and enforces domain-specific business logic. With ASP.NET MVC, this is most likely a Data Access Layer of some kind, using a tool like Entity Framework or NHibernate, or classic ADO.NET.
View (V): This is a template to dynamically generate HTML.
Controller (C): This is a special class that manages the relationship between the View and the Model. It responds to user input, talks to the Model, and decides which view to render (if any). In ASP.NET MVC, this class is conventionally denoted by the suffix Controller.

Why Web APIs were incepted and how they evolved

Looking back to the days when the ASP.NET ASMX-based XML web service was widely used for building service-oriented applications, it was the easiest way to create a SOAP-based service that could be used by both .NET and non-.NET applications. It was available only over HTTP. Around 2006, Microsoft released Windows Communication Foundation (WCF). WCF was, and still is, a powerful technology for building SOA-based applications; it was a giant leap in the Microsoft .NET world. WCF was flexible enough to be configured as an HTTP service, a Remoting service, a TCP service, and so on. Using WCF Contracts, we could keep the entire business logic code base the same and expose the service as HTTP-based or non-HTTP-based, via SOAP or otherwise.

Until 2010, ASMX-based XML web services and WCF services were widely used in client-server applications; in fact, everything was running smoothly. But developers in both the .NET and non-.NET communities started to feel the need for a completely new SOA technology for client-server applications. Some of the reasons behind this were as follows:

With applications in production, the amount of data exchanged during communication started to explode, and transferring it over the network was bandwidth-consuming. SOAP, although lightweight to some extent, started to show signs of payload bloat: SOAP packets of a few KB were becoming a few MB of data transfer.
Consuming SOAP services led to huge application sizes because of WSDL and proxy generation. This was even worse when they were used in web applications.
Any change to a SOAP service meant repeating the proxy generation on the consumer side. This wasn't an easy task for any developer.
JavaScript-based web frameworks were being released and gaining ground as a much simpler way of doing web development. Consuming SOAP-based services was not the most optimal approach here.
Hand-held devices like tablets and smartphones were becoming popular. They ran more focused applications and needed a very lightweight, service-oriented approach.
Browser-based Single Page Applications (SPAs) were gaining ground very rapidly, and SOAP-based services were quite heavy for such applications.
Microsoft released REST-based WCF components, which could be configured to respond in JSON or XML, but even then it was WCF, a heavy technology to use.
Applications were no longer just large enterprise services; there was a need for more focused, lightweight services that could be up and running in a few days and were much easier to use.

Any developer who had seen the evolving nature of SOA technologies like ASMX, WCF, or anything SOAP-based felt the need for much lighter, HTTP-based services. HTTP-only, JSON-compatible, POCO-based lightweight services were the need of the hour, and the concept of the Web API started gaining momentum.

What is Web API?

A Web API is a programmatic interface to a system that is accessed via standard HTTP methods and headers. A Web API can be accessed by a variety of HTTP clients, including browsers and mobile devices. For a Web API to be a successful HTTP-based service, it needs a strong web infrastructure for hosting, caching, concurrency, logging, security, and so on. One of the best web infrastructures was none other than ASP.NET. ASP.NET, in the form of either Web Forms or MVC, was widely adopted, so this solid, mature web infrastructure could be extended to serve Web APIs.

Microsoft responded to community needs by creating ASP.NET Web API: a super-simple yet very powerful framework for building HTTP-only, JSON-by-default web services without all the fuss of WCF. ASP.NET Web API can be used to build REST-based services in a matter of minutes, and these services can easily be consumed with any frontend technology. Because it used IIS (mostly) for hosting, caching, concurrency, and other features, it became quite popular. It was launched in 2012 with the basic needs for HTTP-based services, like convention-based routing and HTTP Request and Response messages. Later, Microsoft released the much bigger and better ASP.NET Web API 2 along with ASP.NET MVC 5 in Visual Studio 2013. ASP.NET Web API 2 evolved at a much faster pace with the following features.

Installed via NuGet

Installing Web API 2 was made simpler by using NuGet: either create an empty ASP.NET or MVC project and then run the following command in the NuGet Package Manager Console:

Install-Package Microsoft.AspNet.WebApi

Attribute Routing

The initial release of Web API was based on convention-based routing, meaning we define one or more route templates and work around them. This is simple and without much fuss, as the routing logic sits in a single place and is applied across all controllers. Real-world applications are more complicated, with resources (controllers/actions) having child resources, such as customers having orders, books having authors, and so on. In such cases, convention-based routing is not scalable. Web API 2 introduced the new concept of Attribute Routing, which uses attributes in the programming language to define routes. One straightforward advantage is that the developer has full control over how the URIs for the Web API are formed. Here is a quick snippet of Attribute Routing:

[Route("customers/{customerId}/orders")]
public IEnumerable<Order> GetOrdersByCustomer(int customerId) { ... }

For more understanding on this, read Attribute Routing in ASP.NET Web API 2 (https://www.asp.net/web-api/overview/web-api-routing-and-actions/attribute-routing-in-web-api-2).

OWIN self-host

ASP.NET Web API lives on the ASP.NET framework, which may lead you to think that it can be hosted on IIS only. However, Web API 2 came with a new hosting package:

Microsoft.AspNet.WebApi.OwinSelfHost

With this package, it can be self-hosted outside IIS using OWIN/Katana.
CORS (Cross-Origin Resource Sharing)

If a Web API, whether developed with .NET or non-.NET technologies, is meant to be consumed across different web frameworks and origins, then enabling CORS is a must. A must-read on CORS and ASP.NET Web API 2 is available at https://www.asp.net/web-api/overview/security/enabling-cross-origin-requests-in-web-api. A minimal sketch of enabling CORS follows.
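Here is a minimal sketch of what enabling CORS looks like in Web API 2, assuming the Microsoft.AspNet.WebApi.Cors package is installed. The OrdersController and the http://example-client.com origin are hypothetical.

// A minimal, hypothetical CORS sketch for ASP.NET Web API 2.
// Requires the Microsoft.AspNet.WebApi.Cors NuGet package.
using System.Web.Http;
using System.Web.Http.Cors;

public static class WebApiConfig
{
    public static void Register(HttpConfiguration config)
    {
        // Enable CORS support globally; controllers then opt in with [EnableCors].
        config.EnableCors();
        config.MapHttpAttributeRoutes();
    }
}

// Allow a specific (hypothetical) origin to call this controller.
[EnableCors(origins: "http://example-client.com", headers: "*", methods: "*")]
public class OrdersController : ApiController
{
    // GET api/orders
    public IHttpActionResult Get()
    {
        return Ok(new[] { "order1", "order2" });
    }
}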
IHttpActionResult and Web API OData improvements are a few other notable features that helped Web API 2 evolve into a strong technology for developing HTTP-based services. ASP.NET Web API 2 has become more powerful over the years with C# language improvements such as asynchronous programming using async/await, LINQ, Entity Framework integration, dependency injection with DI frameworks, and so on.

ASP.NET into the Open Source world

Every technology has to evolve with growing needs and advancements in the hardware, network, and software industries, and ASP.NET Web API is no exception. Some of the evolution that ASP.NET Web API needed to undergo, from the perspectives of the developer community, enterprises, and end users, is as follows:

- ASP.NET MVC and Web API are both part of the ASP.NET stack, but their implementations and code bases are different. A unified code base reduces the burden of maintaining them.
- Web APIs are consumed by various clients, such as web applications, native apps, hybrid apps, and desktop applications, built with different technologies (.NET or non-.NET). But how about developing Web APIs in a cross-platform way, without always relying on the Windows OS and the Visual Studio IDE?
- Open sourcing the ASP.NET stack so that it is adopted on a much bigger scale; end users benefit from open source innovations.

We have seen why Web APIs were incepted, how they evolved into a powerful HTTP-based service, and what evolution was still required. With these thoughts, Microsoft made an entry into the world of open source by launching .NET Core and ASP.NET Core 1.0.

What is .NET Core?

.NET Core is a cross-platform, free, and open-source managed software framework similar to the .NET Framework. It consists of CoreCLR, a complete cross-platform runtime implementation of the CLR. .NET Core 1.0 was released on 27 June 2016 along with Visual Studio 2015 Update 3, which enables .NET Core development. In simpler terms, .NET Core applications can be developed, tested, and deployed on cross platforms such as Windows, Linux flavors, and macOS. With .NET Core, we don't really need the Windows OS, or in particular the Visual Studio IDE, to develop ASP.NET web applications, command-line apps, libraries, and UWP apps.

In short, these are the .NET Core components:

- CoreCLR: A virtual machine that manages the execution of .NET programs. CoreCLR means Core Common Language Runtime; it includes the garbage collector, the JIT compiler, the base .NET data types, and many low-level classes.
- CoreFX: The .NET Core foundational libraries, with classes for collections, file systems, the console, XML, async operations, and many others.
- CoreRT: The .NET Core runtime optimized for AOT (ahead-of-time compilation) scenarios, with the accompanying .NET Native compiler toolchain. Its main responsibility is native compilation of code written in any of our favorite .NET programming languages.

.NET Core shares a subset of the original .NET Framework, plus it comes with its own set of APIs that are not part of the .NET Framework. This results in some shared APIs that can be used by both .NET Core and the .NET Framework. A .NET Core application can easily work on the existing .NET Framework, but not vice versa.

.NET Core provides a CLI (command-line interface) as an execution entry point for operating systems, and it provides developer services such as compilation and package management.

The following are interesting points to know about .NET Core:

- .NET Core can be installed cross-platform on Windows, Linux, and macOS. It can be used in device, cloud, and embedded/IoT scenarios.
- The Visual Studio IDE is not mandatory for working with .NET Core, but when working on the Windows OS we can leverage our existing IDE knowledge.
- .NET Core is modular, meaning that instead of assemblies, developers deal with NuGet packages.
- .NET Core relies on its package manager to receive updates, because a cross-platform technology can't rely on Windows Update. To learn .NET Core, we just need a shell, a text editor, and the runtime installed.
- .NET Core comes with flexible deployment: it can be included in your app or installed side by side, user-wide or machine-wide. .NET Core apps can also be self-hosted/run as standalone apps.

.NET Core supports four cross-platform scenarios: ASP.NET Core web apps, command-line apps, libraries, and Universal Windows Platform apps. It does not implement Windows Forms or WPF, which render the standard GUI for desktop software on Windows. At present, only the C# programming language can be used to write .NET Core apps; F# and VB support are on the way. We will primarily focus on ASP.NET Core web apps, which include MVC and Web API; CLI apps and libraries will be covered briefly.

What is ASP.NET Core?

ASP.NET Core is a new open-source and cross-platform framework for building modern cloud-based web applications using .NET. It is completely open source; you can download it from GitHub. Being cross-platform means you can develop ASP.NET Core apps on Linux and macOS, and of course on the Windows OS.

ASP.NET was first released almost 15 years back with the .NET Framework. Since then, it has been adopted by millions of developers for large and small applications alike, and it has evolved with many capabilities. With cross-platform .NET Core, ASP.NET took a huge leap beyond the boundaries of the Windows OS environment for the development and deployment of web applications.

ASP.NET Core overview

(Figure: ASP.NET Core architecture overview)

The ASP.NET Core high-level overview provides the following insights:

- ASP.NET Core runs on both the full .NET Framework and .NET Core.
- ASP.NET Core applications on the full .NET Framework can be developed and deployed only on the Windows OS/Server.
- When using .NET Core, applications can be developed and deployed on the platform of your choice; the Windows, Linux, and macOS logos indicate that you can work with ASP.NET Core on all of them.
- On a non-Windows machine, ASP.NET Core uses the .NET Core libraries to run applications. Obviously you won't have all of the full .NET libraries, but most of them are available.
- Developers working on ASP.NET Core can easily switch to working on any machine; they are not confined to the Visual Studio 2015 IDE.
- ASP.NET Core can run with different versions of .NET Core.

ASP.NET Core has many more foundational improvements apart from being cross-platform; we gain the following advantages from using it:

- Totally modular: ASP.NET Core takes a totally modular approach to application development; every component needed to build an application is well factored into NuGet packages. Only add the required packages through NuGet, to keep the overall application lightweight. ASP.NET Core is no longer based on System.Web.dll.
- Choose your editors and tools: The Visual Studio IDE was used to develop ASP.NET applications on a Windows OS box; now that we have moved beyond the Windows world, we need IDEs, editors, and tools for developing ASP.NET applications on Linux/macOS. Microsoft developed a powerful, lightweight code editor for almost any type of web application, called Visual Studio Code. In fact, ASP.NET Core does not require Visual Studio IDE or Visual Studio Code at all; we can also use code editors such as Sublime or Vim. To work with C# code in these editors, install and use the OmniSharp plugin. OmniSharp is a set of tooling, editor integrations, and libraries that together create an ecosystem giving you a great programming experience no matter what your editor and operating system of choice may be.
- Integration with modern web frameworks: ASP.NET Core has powerful, seamless integration with modern web frameworks such as Angular, Ember, NodeJS, and Bootstrap. Using Bower and NPM, we can work with these modern web frameworks.
- Cloud ready: ASP.NET Core apps are cloud ready thanks to the configuration system; they transition seamlessly from on-premises to the cloud.
- Built-in dependency injection.
- Can be hosted on IIS, self-hosted in your own process, or hosted on nginx.
- New lightweight and modular HTTP request pipeline.
- Unified code base for the web UI and Web APIs. We will see more on this when we explore the anatomy of an ASP.NET Core application.

A minimal sketch of an ASP.NET Core application follows this list.
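To ground these points, here is a minimal sketch of an ASP.NET Core 1.0 application showing the middleware pipeline, the built-in dependency injection container, and Kestrel self-hosting. The fallback message and the overall shape of the app are assumptions for illustration, not the book's own sample.

// A minimal, hypothetical ASP.NET Core 1.0 sketch: Startup plus Kestrel host.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    // Register services with the built-in dependency injection container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc(); // one service registration covers both MVC views and Web APIs
    }

    // Compose the lightweight HTTP request pipeline from middleware.
    public void Configure(IApplicationBuilder app)
    {
        app.UseMvc();

        // Fallback middleware for requests that no MVC route handles.
        app.Run(async context =>
        {
            await context.Response.WriteAsync("Hello from ASP.NET Core!");
        });
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        // Kestrel is the cross-platform web server; IIS is optional here.
        var host = new WebHostBuilder()
            .UseKestrel()
            .UseStartup<Startup>()
            .Build();

        host.Run();
    }
}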
Summary

In this article, we recapped the MVC framework, looked at why Web APIs were incepted and how they evolved into a powerful HTTP-based service, and introduced .NET Core and the ASP.NET Core architecture.