Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-fingerprint-detection-using-opencv

07 Oct 2015

11 min read

Fingerprint detection using OpenCV 3

07 Oct 2015

In this article by Joseph Howse, Quan Hua, Steven Puttemans, and Utkarsh Sinha, the authors of OpenCV Blueprints, we delve into the aspect of fingerprint detection using OpenCV. (For more resources related to this topic, see here.) Fingerprint identification, how is it done? We have already discussed the use of the first biometric, which is the face of the person trying to login to the system. However since we mentioned that using a single biometric can be quite risky, we suggest adding secondary biometric checks to the system, like the fingerprint of a person. There are many of the shelf fingerprint scanners which are quite cheap and return you the scanned image. However you will still have to write your own registration software for these scanners and which can be done by using OpenCV software. Examples of such fingerprint images can be found below. Examples of single individual thumb fingerprint in different scanning positions This dataset can be downloaded from the FVC2002 competition website released by the University of Bologna. The website (http://bias.csr.unibo.it/fvc2002/databases.asp) contains 4 databases of fingerprints available for public download of the following structure: Four fingerprint capturing devices DB1 - DB4. For each device, the prints of 10 individuals are available. For each person, 8 different positions of prints were recorded. We will use this publicly available dataset to build our system upon. We will focus on the first capturing device, using up to 4 fingerprints of each individual for training the system and making an average descriptor of the fingerprint. Then we will use the other 4 fingerprints to evaluate our system and make sure that the person is still recognized by our system. You could apply exactly the same approach on the data grabbed from the other devices if you would like to investigate the difference between a system that captures almost binary images and one that captures grayscale images. However we will provide techniques for doing the binarization yourself. Implement the approach in OpenCV 3 The complete fingerprint software for processing fingerprints derived from a fingerprint scanner can be found at https://github.com/OpenCVBlueprints/OpenCVBlueprints/tree/master/chapter_6/source_code/fingerprint/fingerprint_process/. In this subsection we will describe how you can implement this approach in the OpenCV interface. We will start by grabbing the image from the fingerprint system and apply binarization. This will enable us to remove any desired noise from the image as well as help us to make the contrast better between the kin and the wrinkled surface of the finger. // Start by reading in an image Mat input = imread("/data/fingerprints/image1.png", CV_LOAD_GRAYSCALE); // Binarize the image, through local thresholding Mat input_binary; threshold(input, input_binary, 0, 255, CV_THRESH_BINARY | CV_THRESH_OTSU); The Otsu thresholding will automatically choose the best generic threshold for the image to obtain a good contrast between foreground and background information. This is because the image contains a bimodal distribution (which means that we have an image with a 2 peak histogram) of pixel values. For that image, we can approximately take a value in the middle of those peaks as threshold value. (for images which are not bimodal, binarization won't be accurate.) Otsu allows us to avoid using a fixed threshold value, and thus making the system more general for any capturing device. However we do acknowledge that if you have only a single capturing device, then playing around with a fixed threshold value could result in a better image for that specific setup. The result of the thresholding can be seen below. In order to make thinning from the next skeletization step as effective as possible we need the inverse binary image. Comparison between grayscale and binarized fingerprint image Once we have a binary image we are actually already set to go to calculate our feature points and feature point descriptors. However, in order to improve the process a bit more, we suggest to skeletize the image. This will create more unique and stronger interest points. The following piece of code can apply the skeletization on top of the binary image. The skeletization is based on the Zhang-Suen line thinning approach. Special thanks to @bsdNoobz of the OpenCV Q&A forum who supplied this iteration approach. #include <opencv2/imgproc.hpp> #include <opencv2/highgui.hpp> using namespace std; using namespace cv; // Perform a single thinning iteration, which is repeated until the skeletization is finalized void thinningIteration(Mat& im, int iter) { Mat marker = Mat::zeros(im.size(), CV_8UC1); for (int i = 1; i < im.rows-1; i++) { for (int j = 1; j < im.cols-1; j++) { uchar p2 = im.at<uchar>(i-1, j); uchar p3 = im.at<uchar>(i-1, j+1); uchar p4 = im.at<uchar>(i, j+1); uchar p5 = im.at<uchar>(i+1, j+1); uchar p6 = im.at<uchar>(i+1, j); uchar p7 = im.at<uchar>(i+1, j-1); uchar p8 = im.at<uchar>(i, j-1); uchar p9 = im.at<uchar>(i-1, j-1); int A = (p2 == 0 && p3 == 1) + (p3 == 0 && p4 == 1) + (p4 == 0 && p5 == 1) + (p5 == 0 && p6 == 1) + (p6 == 0 && p7 == 1) + (p7 == 0 && p8 == 1) + (p8 == 0 && p9 == 1) + (p9 == 0 && p2 == 1); int B = p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9; int m1 = iter == 0 ? (p2 * p4 * p6) : (p2 * p4 * p8); int m2 = iter == 0 ? (p4 * p6 * p8) : (p2 * p6 * p8); if (A == 1 && (B >= 2 && B <= 6) && m1 == 0 && m2 == 0) marker.at<uchar>(i,j) = 1; } } im &= ~marker; } // Function for thinning any given binary image within the range of 0-255. If not you should first make sure that your image has this range preset and configured! void thinning(Mat& im) { // Enforce the range tob e in between 0 - 255 im /= 255; Mat prev = Mat::zeros(im.size(), CV_8UC1); Mat diff; do { thinningIteration(im, 0); thinningIteration(im, 1); absdiff(im, prev, diff); im.copyTo(prev); } while (countNonZero(diff) > 0); im *= 255; } The code above can then simply be applied to our previous steps by calling the thinning function on top of our previous binary generated image. The code for this is: Apply thinning algorithm Mat input_thinned = input_binary.clone(); thinning(input_thinned); This will result in the following output: Comparison between binarized and thinned fingerprint image using skeletization techniques. When we got this skeleton image, the following step would be to look for crossing points on the ridges of the fingerprint, which are being then called minutiae points. We can do this by a keypoint detector that looks at a large change in local contrast, like the Harris corner detector. Since the Harris corner detector is both able to detect strong corners and edges, this is ideally for the fingerprint problem, where the most important minutiae are short edges and bifurcation, the positions where edges come together. More information about minutae points and Harris Corner detection can be found in the following publications: Ross, Arun A., Jidnya Shah, and Anil K. Jain. "Toward reconstructing fingerprints from minutiae points." Defense and Security. International Society for Optics and Photonics, 2005. Harris, Chris, and Mike Stephens. "A combined corner and edge detector." Alvey vision conference. Vol. 15. 1988. Calling the Harris Corner operation on a skeletonized and binarized image in OpenCV is quite straightforward. The Harris corners are stored as positions corresponding in the image with their cornerness response value. If we want to detect points with a certain cornerness, than we should simply threshold the image. Mat harris_corners, harris_normalised; harris_corners = Mat::zeros(input_thinned.size(), CV_32FC1); cornerHarris(input_thinned, harris_corners, 2, 3, 0.04, BORDER_DEFAULT); normalize(harris_corners, harris_normalised, 0, 255, NORM_MINMAX, CV_32FC1, Mat()); We now have a map with all the available corner responses rescaled to the range of [0 255] and stored as float values. We can now manually define a threshold, that will generate a good amount of keypoints for our application. Playing around with this parameter could improve performance in other cases. This can be done by using the following code snippet: float threshold = 125.0; vector<KeyPoint> keypoints; Mat rescaled; convertScaleAbs(harris_normalised, rescaled); Mat harris_c(rescaled.rows, rescaled.cols, CV_8UC3); Mat in[] = { rescaled, rescaled, rescaled }; int from_to[] = { 0,0, 1,1, 2,2 }; mixChannels( in, 3, &harris_c, 1, from_to, 3 ); for(int x=0; x<harris_normalised.cols; x++){ for(int y=0; y<harris_normalised.rows; y++){ if ( (int)harris_normalised.at<float>(y, x) > threshold ){ // Draw or store the keypoint location here, just like you decide. In our case we will store the location of the keypoint circle(harris_c, Point(x, y), 5, Scalar(0,255,0), 1); circle(harris_c, Point(x, y), 1, Scalar(0,0,255), 1); keypoints.push_back( KeyPoint (x, y, 1) ); } } } Comparison between thinned fingerprint and Harris corner response, as well as the selected Harris corners. Now that we have a list of keypoints we will need to create some of formal descriptor of the local region around that keypoint to be able to uniquely identify it among other keypoints. Chapter 3, Recognizing facial expressions with machine learning, discusses in more detail the wide range of keypoints out there. In this article, we will mainly focus on the process. Feel free to adapt the interface with other keypoint detectors and descriptors out there, for better or for worse performance. Since we have an application where the orientation of the thumb can differ (since it is not a fixed position), we want a keypoint descriptor that is robust at handling these slight differences. One of the mostly used descriptors for that is the SIFT descriptor, which stands for scale invariant feature transform. However SIFT is not under a BSD license and can thus pose problems to use in commercial software. A good alternative in OpenCV is the ORB descriptor. In OpenCV you can implement it in the following way. Ptr<Feature2D> orb_descriptor = ORB::create(); Mat descriptors; orb_descriptor->compute(input_thinned, keypoints, descriptors); This will enable us to calculate only the descriptors using the ORB approach, since we already retrieved the location of the keypoints using the Harris corner approach. At this point we can retrieve a descriptor for each detected keypoint of any given fingerprint. The descriptors matrix will contain a row for each keypoint containing the representation. Let us now start from the case where we have only a single reference image for each fingerprint. In that case we will have a database containing a set of feature descriptors for the training persons in the database. We then have a single new entry, consisting of multiple descriptors for the keypoints found at registration time. We now have to match these descriptors to the descriptors stored in the database, to see which one has the best match. The most simple way is by performing a brute force matching using the hamming distance criteria between descriptors of different keypoints. // Imagine we have a vector of single entry descriptors as a database // We will still need to fill those once we compare everything, by using the code snippets above vector<Mat> database_descriptors; Mat current_descriptors; // Create the matcher interface Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming"); // Now loop over the database and start the matching vector< vector< DMatch > > all_matches; for(int entry=0; i<database_descriptors.size();entry++){ vector< DMatch > matches; matcheràmatch(database_descriptors[entry], current_descriptors, matches); all_matches.push_back(matches); } We now have all the matches stored as DMatch objects. This means that for each matching couple we will have the original keypoint, the matched keypoint and a floating point score between both matches, representing the distance between the matched points. The idea about finding the best match seems pretty straightforward. We take a look at the amount of matches that have been returned by the matching process and weigh them by their Euclidean distance in order to add some certainty. We then look for the matching process that yielded the biggest score. This will be our best match and the match we want to return as the selected one from the database. If you want to avoid an imposter getting assigned to the best matching score, you can again add a manual threshold on top of the scoring to avoid matches that are not good enough, to be ignored. However it is possible, and should be taken into consideration, that if you increase the score to high, that people with a little change will be rejected from the system, like for example in the case where someone cuts his finger and thus changing his pattern to drastically. Fingerprint matching process visualized. To summarize, we saw how to detect fingerprints and implement it using OpenCV 3. Resources for Article: Further resources on this subject: Making subtle color shifts with curves [article] Tracking Objects in Videos [article] Hand Gesture Recognition Using a Kinect Depth Sensor [article]

0
1
62522

article-image-manipulating-text-data-using-python-regular-expressions-regex

Sugandha Lahoti

16 Feb 2018

8 min read

Manipulating text data using Python Regular Expressions (regex)

Sugandha Lahoti

16 Feb 2018

8 min read

0
0
62366

article-image-the-cap-theorem-in-practice-the-consistency-vs-availability-trade-off-in-distributed-databases

Richard Gall

12 Sep 2019

7 min read

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Richard Gall

12 Sep 2019

7 min read

When you choose a database you are making a design decision. One of the best frameworks for understanding what this means in practice is the CAP Theorem. What is the CAP Theorem? The CAP Theorem, developed by computer scientist Eric Brewer in the late nineties, states that databases can only ever fulfil two out of three elements: Consistency - that reads are always up to date, which means any client making a request to the database will get the same view of data. Availability - database requests always receive a response (when valid). Partition tolerance - that a network fault doesn’t prevent messaging between nodes. In the context of distributed (NoSQL) databases, this means there is always going to be a trade-off between consistency and availability. This is because distributed systems are always necessarily partition tolerant (ie. it simply wouldn’t be a distributed database if it wasn’t partition tolerant.) Read next: Different types of NoSQL databases and when to use them How do you use the CAP Theorem when making database decisions? Although the CAP Theorem can feel quite abstract, it has practical, real-world consequences. From both a technical and business perspective the trade-offs will lead you to some very important questions. There are no right answers. Ultimately it will be all about the context in which your database is operating, the needs of the business, and the expectations and needs of users. You will have to consider things like: Is it important to avoid throwing up errors in the client? Or are we willing to sacrifice the visible user experience to ensure consistency? Is consistency an actual important part of the user’s experience Or can we actually do what we want with a relational database and avoid the need for partition tolerance altogether? As you can see, these are ultimately user experience questions. To properly understand those, you need to be sensitive to the overall goals of the project, and, as said above, the context in which your database solution is operating. (Eg. Is it powering an internal analytics dashboard? Or is it supporting a widely used external-facing website or application?) And, as the final bullet point highlights, it’s always worth considering whether the consistency v availability trade-off should matter at all. Avoid the temptation to think a complex database solution will always be better when a simple, more traditional solution will do the job. Of course, it’s important to note that systems that aren’t partition tolerant are a single point of failure in a system. That introduces the potential for unreliability. Prioritizing consistency in a distributed database It’s possible to get into a lot of technical detail when talking about consistency and availability, but at a really fundamental level the principle is straightforward: you need consistency (or what is called a CP database) if the data in the database must always be up to date and aligned, even in the instance of a network failure (eg. the partitioned nodes are unable to communicate with one another for whatever reason). Particular use cases where you would prioritize consistency is when you need multiple clients to have the same view of the data. For example, where you’re dealing with financial information, personal information, using a database that gives you consistency and confidence that data you are looking at is up to date in a situation where the network is unreliable or fails. Examples of CP databases MongoDB Learning MongoDB 4 [Video] MongoDB 4 Quick Start Guide MongoDB, Express, Angular, and Node.js Fundamentals Redis Build Complex Express Sites with Redis and Socket.io [Video] Learning Redis HBase Learn by Example : HBase - The Hadoop Database [Video] HBase Design Patterns Prioritizing availability in a distributed database Availability is essential when data accumulation is a priority. Think here of things like behavioral data or user preferences. In scenarios like these, you will want to capture as much information as possible about what a user or customer is doing, but it isn’t critical that the database is constantly up to date. It simply just needs to be accessible and available even when network connections aren’t working. The growing demand for offline application use is also one reason why you might use a NoSQL database that prioritizes availability over consistency. Examples of AP databases Cassandra Learn Apache Cassandra in Just 2 Hours [Video] Mastering Apache Cassandra 3.x - Third Edition DynamoDB Managed NoSQL Database In The Cloud - Amazon AWS DynamoDB [Video] Hands-On Amazon DynamoDB for Developers [Video] Limitations and criticisms of CAP Theorem It’s worth noting that the CAP Theorem can pose problems. As with most things, in truth, things are a little more complicated. Even Eric Brewer is circumspect about the theorem, especially as what we expect from distributed databases. Back in 2012, twelve years after he first put his theorem into the world, he wrote that: “Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.” So, this means we must think about the trade-off between consistency and availability as a balancing act, rather than a binary design decision. Elsewhere, there have been more robust criticisms of CAP Theorem. Software engineer Martin Kleppmann, for example, pleaded Please stop calling databases CP or AP in 2015. In a blog post he argues that CAP Theorem only works if you adhere to specific definitions of consistency, availability, and partition tolerance. “If your use of words matches the precise definitions of the proof, then the CAP theorem applies to you," he writes. “But if you’re using some other notion of consistency or availability, you can’t expect the CAP theorem to still apply.” The consequences of this are much like those described in Brewer’s piece from 2012. You need to take a nuanced approach to database trade-offs in which you think them through on your own terms and up against your own needs. The PACELC Theorem One of the developments of this line of argument is an extension to the CAP Theorem: the PACELC Theorem. This moves beyond thinking about consistency and availability and instead places an emphasis on the trade-off between consistency and latency. The PACELC Theorem builds on the CAP Theorem (the ‘PAC’) and adds an else (the ‘E’). What this means is that while you need to choose between availability and consistency if communication between partitions has failed in a distributed system, even if things are running properly and there are no network issues, there is still going to be a trade-off between consistency and latency (the ‘LC’). Conclusion: Learn to align context with technical specs Although the CAP Theorem might seem somewhat outdated, it is valuable in providing a way to think about database architecture design. It not only forces engineers and architects to ask questions about what they want from the technologies they use, but it also forces them to think carefully about the requirements of a given project. What are the business goals? What are user expectations? The PACELC Theorem builds on CAP in an effective way. However, the most important thing about these frameworks is how they help you to think about your problems. Of course the CAP Theorem has limitations. Because it abstracts a problem it is necessarily going to lack nuance. There are going to be things it simplifies. It’s important, as Kleppmann reminds us - to be mindful of these nuances. But at the same time, we shouldn’t let an obsession with nuance and detail allow us to miss the bigger picture.

0
0
60879

article-image-visualizing-3d-plots-matplotlib-2-0

Sugandha Lahoti

16 Nov 2017

7 min read

Visualizing 3D plots in Matplotlib 2.0

Sugandha Lahoti

16 Nov 2017

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example.[/box] By transitioning to the three-dimensional space, you may enjoy greater creative freedom when creating visualizations. The extra dimension can also accommodate more information in a single plot. However, some may argue that 3D is nothing more than a visual gimmick when projected to a 2D surface (such as paper) as it would obfuscate the interpretation of data points. In Matplotlib version 2, despite significant developments in the 3D API, annoying bugs or glitches still exist. We will discuss some workarounds toward the end of this article. More powerful Python 3D visualization packages do exist (such as MayaVi2, Plotly, and VisPy), but it's good to use Matplotlib's 3D plotting functions if you want to use the same package for both 2D and 3D plots, or you would like to maintain the aesthetics of its 2D plots. For the most part, 3D plots in Matplotlib have similar structures to 2D plots. As such, we will not go through every 3D plot type in this section. We will put our focus on 3D scatter plots and bar charts. 3D scatter plot Let's try to create a 3D scatter plot. Before doing that, we need some data points in three dimensions (x, y, z): import pandas as pd source = "https://raw.githubusercontent.com/PointCloudLibrary/data/master/tutorials/ ism_train_cat.pcd" cat_df = pd.read_csv(source, skiprows=11, delimiter=" ", names=["x","y","z"], encoding='latin_1') cat_df.head() To declare a 3D plot, we first need to import the Axes3D object from the mplot3d extension in mpl_toolkits, which is responsible for rendering 3D plots in a 2D plane. After that, we need to specify projection='3d' when we create subplots: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z) plt.show() Behold, the mighty sCATter plot in 3D. Cats are currently taking over the internet. According to the New York Times, cats are "the essential building block of the Internet" (https://www.nytimes.com/2014/07/23/upshot/what-the-internet-can-see-from-your-cat-pictures.html). Undoubtedly, they deserve a place in this chapter as well. Contrary to the 2D version of scatter(), we need to provide X, Y, and Z coordinates when we are creating a 3D scatter plot. Yet the parameters that are supported in 2D scatter() can be applied to 3D scatter() as well: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Change the size, shape and color of markers ax.scatter(cat_df.x, cat_df.y, cat_df.z, s=4, c="g", marker="o") plt.show() To change the viewing angle and elevation of the 3D plot, we can make use of view_init(). The azim parameter specifies the azimuth angle in the X-Y plane, while elev specifies the elevation angle. When the azimuth angle is 0, the X-Y plane would appear to the north from you. Meanwhile, an azimuth angle of 180 would show you the south side of the X-Y plane: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z,s=4, c="g", marker="o") # elev stores the elevation angle in the z plane azim stores the # azimuth angle in the x,y plane ax.view_init(azim=180, elev=10) plt.show() 3D bar chart We introduced candlestick plots for showing Open-High-Low-Close (OHLC) financial data. In addition, a 3D bar chart can be employed to show OHLC across time. The next figure shows a typical example of plotting a 5-day OHLC bar chart: import matplotlib.pyplot as plt import numpy as np from mpl_toolkits.mplot3d import Axes3D # Get 1 and every fifth row for the 5-day AAPL OHLC data ohlc_5d = stock_df[stock_df["Company"]=="AAPL"].iloc[1::5, :] fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8, width=5) plt.show() The method for setting ticks and labels is similar to other Matplotlib plotting functions: fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) # Set the z-axis label ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Caveats to consider while visualizing 3D plots in Matplotlib Due to the lack of a true 3D graphical rendering backend (such as OpenGL) and proper algorithm for detecting 3D objects' intersections, the 3D plotting capabilities of Matplotlib are not great but just adequate for typical applications. In the official Matplotlib FAQ (https://matplotlib.org/mpl_toolkits/mplot3d/faq.html), the author noted that 3D plots may not look right at certain angles. Besides, we also reported that mplot3d would failed to clip bar charts if zlim is set (https://github.com/matplotlib/matplotlib/ issues/8902; see also https://github.com/matplotlib/matplotlib/issues/209). Without improvements in the 3D rendering backend, these issues are hard to fix. To better illustrate the latter issue, let's try to add ax.set_zlim3d(bottom=110, top=150) right above plt.tight_layout() in the previous 3D bar chart: Clearly, something is going wrong, as the bars overshoot the lower boundary of the axes. We will try to address the latter issue through the following workaround: # FuncFormatter to add 110 to the tick labels def major_formatter(x, pos): return "{}".format(x+110) fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) # Truncate the y-values by 110 ax.bar(xs, ys-110, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') # Set the z-axis label ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) ax.zaxis.set_major_formatter(FuncFormatter(major_formatter)) ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Basically, we truncated the y values by 110, and then we used a tick formatter (major_formatter) to shift the tick value back to the original. For 3D scatter plots, we can simply remove the data points that exceed the boundary of set_zlim3d() in order to generate a proper figure. However, these workarounds may not work for every 3D plot type. Conclusion We didn't go into too much detail of the 3D plotting capability of Matplotlib, as it is yet to be polished. For simple 3D plots, Matplotlib already suffices. The learning curve can be reduced if we use the same package for both 2D and 3D plots. You are advised to take a look at MayaVi2, Plotly, and VisPy if you require more powerful 3D plotting functions. If you enjoyed this excerpt, be sure to check out the book it is from.

0
0
59052

Packt

25 May 2015

15 min read

Cleaning Data in PDF Files

Packt

25 May 2015

15 min read

In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. Step three – make a smaller version of the file Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file. To do this in Preview on MacOSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like. Another technique to try – pdfMiner Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly. Step one – install pdfMiner Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command: pip install pdfminer This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/. Step two – pull text from the PDF file We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py –p149 <filename> to specify that you only want page 149. Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command: pdf2txt.py –p3 pewReport.pdf After running this command, the extracted preface of the Pew Research report appears in our command-line window: To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows: pdf2txt.py –p3 pewReport.pdf > pewPreface.txt But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command: pdf2txt.py pewReport149.pdf The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary. Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it. Third choice – Tabula Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file. Step one – download Tabula Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions. Step two – run Tabula Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this: The warning that auto-detecting tables takes a long time is true. For the single-page perResearch149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file. Step three – direct Tabula to extract the data Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows: Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle. Step four – copy the data out We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV. Step five – more cleaning Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks: We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2. Remove blank lines between rows. When these procedures are finished, the data looks like this: Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it. When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]

0
1
59000

article-image-build-generative-chatbot-using-recurrent-neural-networks-lstm-rnns

Savia Lobo

15 Feb 2018

8 min read

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

Savia Lobo

15 Feb 2018

8 min read

In today’s tutorial we will learn to build generative chatbot using recurrent neural networks. The RNN used here is Long Short Term Memory(LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieving in nature; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et. al. (https://arxiv.org/pdf/1406.1078.pdf). [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box] Getting ready... The A.L.I.C.E Artificial Intelligence Foundation dataset bot.aiml Artificial Intelligence Markup Language (AIML), which is customized syntax such as XML file has been used to train the model. In this file, questions and answers are mapped. For each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save as bot.txt to read in Python: >>> import os """ First change the following directory link to where all input files do exist """ >>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL") >>> import numpy as np >>> import pandas as pd # File reading >>> with open('bot.txt', 'r') as content_file: ... botdata = content_file.read() >>> Questions = [] >>> Answers = [] AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer. Hence, we are extracting respectively: >>> for line in botdata.split("</pattern>"): ... if "<pattern>" in line: ... Quesn = line[line.find("<pattern>")+len("<pattern>"):] ... Questions.append(Quesn.lower()) >>> for line in botdata.split("</template>"): ... if "<template>" in line: ... Ans = line[line.find("<template>")+len("<template>"):] ... Ans = Ans.lower() ... Answers.append(Ans.lower()) >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"]) >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"] >>> print(QnAdata.head()) The question and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into numeric representation. The reason is the same as mentioned before—deep learning models can't read English and everything is in numbers for the model. How to do it... After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results: Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training. Model building and validation: Develop deep learning models and validate the data. Prediction of answers from trained model: The trained model will be used to predict answers for given questions. How it works... The question and answers are utilized to create the vocabulary of words to index mapping, which will be utilized for converting words into vector mappings: # Creating Vocabulary >>> import nltk >>> import collections >>> counter = collections.Counter() >>> for i in range(len(QnAdata)): ... for word in nltk.word_tokenize(QnAdata.iloc[i][2]): ... counter[word]+=1 >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} >>> idx2word = {v:k for k,v in word2idx.items()} >>> idx2word[0] = "PAD" >>> vocab_size = len(word2idx)+1 >>> print (vocab_size) Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, Deep learning models work on numeric values rather than text or character data: >>> def encode(sentence, maxlen,vocab_size): ... indices = np.zeros((maxlen, vocab_size)) ... for i, w in enumerate(nltk.word_tokenize(sentence)): ... if i == maxlen: break ... indices[i, word2idx[w]] = 1 ... return indices >>> def decode(indices, calc_argmax=True): ... if calc_argmax: ... indices = np.argmax(indices, axis=-1) ... return ' '.join(idx2word[x] for x in indices) The following code is used to vectorize the question and answers with the given maximum length for both questions and answers. Both might be different lengths. In some pieces of data, the question length is greater than answer length, and in a few cases, it's length is less than answer length. Ideally, the question length is good to catch the right answers. Unfortunately in this case, question length is much less than the answer length, which is a very bad example to develop generative models: >>> question_maxlen = 10 >>> answer_maxlen = 20 >>> def create_questions(question_maxlen,vocab_size): ... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size)) ... for q in range(len(Questions)): ... question = encode(Questions[q],question_maxlen,vocab_size) ... question_idx[q] = question ... return question_idx >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size) >>> def create_answers(answer_maxlen,vocab_size): ... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size)) ... for q in range(len(Answers)): ... answer = encode(Answers[q],answer_maxlen,vocab_size) ... answer_idx[q] = answer ... return answer_idx >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size) >>> from keras.layers import Input,Dense,Dropout,Activation >>> from keras.models import Model >>> from keras.layers.recurrent import LSTM >>> from keras.layers.wrappers import Bidirectional >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization The following code is an important part of the chatbot. Here we have used recurrent networks, repeat vector, and time-distributed networks. The repeat vector used to match dimensions of input to output values. Whereas time-distributed networks are used to change the column vector to the output dimension's vocabulary size: >>> n_hidden = 128 >>> question_layer = Input(shape=(question_maxlen,vocab_size)) >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2) (question_layer) >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn) >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode) >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer) >>> softmax_layer = Activation('softmax')(regularized_layer) >>> model = Model([question_layer],[softmax_layer]) >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) >>> print (model.summary()) The following model summary describes the change in flow of model size across the model. The input layer matches the question's dimension and the output matches the answer's dimension: # Model Training >>> quesns_train_2 = quesns_train.astype('float32') >>> answs_train_2 = answs_train.astype('float32') >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05) The results are a bit tricky in the following screenshot even though the accuracy is significantly higher. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is less: # Model prediction >>> ans_pred = model.predict(quesns_train_2[0:3]) >>> print (decode(ans_pred[0])) >>> print (decode(ans_pred[1])) The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models: Our model did not work well in this case, but still some areas of improvement are possible going forward with generative chatbot models. Readers can give it a try: Have a dataset with lengthy questions and answers to catch signals well Create a larger architecture of deep learning models and train over longer iterations Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably. Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.

0
4
58183

article-image-deploy-rethinkdb-using-docker

Vijin Boricha

14 Feb 2018

7 min read

How to deploy RethinkDB using Docker

Vijin Boricha

14 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. # - 8080: web UI # - 28015: process # - 29015: cluster EXPOSE 8080 EXPOSE 28015 EXPOSE 29015 The first line is our entry point to the Ubuntu operating system, then we are performing an update of the system and using the installation commands recommended by RethinkDB here: h t t p s ://w w w . r e t h i n k d b . c o m /d o c s /i n s t a l l /u b u n t u /. Once the installation is complete, we install the rethinkdb python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and telling RethinkDB to use that volume. The next command runs rethinkdb by binding all the ports and exposing the ports to be used by the client driver and web console. In order to make this a docker image, save the file and run the following command within the project directory: docker build -t docker-rethinkdb. Here, we are building our docker image and giving it a name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're on. The representation of the previous steps is shown here: Once everything works, and I am sure it will, you will see a success message in the console, as shown here: Congratulations! You have successfully created a docker image for RethinkDB. If you want to see your image and its properties, run the following command: docker images And this will list all the images of Docker, as shown in the following screenshot: Awesome! Now let's run it. To access the web portal, we need to run our docker image and bind port 8080 of the docker image to some port of our machine; here is the command to do so: docker run -p 3000:8080 -d docker-rethinkdb As per the command above, -p is used to specify port binding, the first is the target and second port is source, that is, Docker port and -d is used to run it in the background or Daemon. This will run the docker image in the background; to extract more information about this process, we need to run the following command: docker ps This will list all the running images called as a container, along with the information, as shown in the following screenshot: You can also check the logs of specific containers using the following command: docker logs <container id> Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command: docker-machine ip default This will print out the IP. Copy the IP and hit IP:3000 from the browser to view the RethinkDB web console, as shown here: So we have docker running and accessible from the browser. In order to import and export the data, we need to log in to our Docker image. To do that, run the following command: docker exec -i -t <container-id> /bin/bash This will log in to the docker image running Ubuntu; refer to the following screenshot: You can now run the rethinkdb command to perform the data import to the existing RethinkDB cluster. Deploying the Docker image Almost every PaaS service we have covered in earlier sections provides support for Docker. You can submit your Dockerfile to git and clone it anywhere if you want to create Docker image. You can submit the whole docker image (not Dockerfile) to Dockerhub and pull your docker image directly using the docker pull command, which is no doubt an easy way because you will be directly working on the image running on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.

0
0
57987

article-image-exploring-structure-motion-using-opencv

Packt

09 Jan 2017

20 min read

Exploring Structure from Motion Using OpenCV

Packt

09 Jan 2017

20 min read

0
1
57530

article-image-install-elasticsearch-ubuntu-windows

Fatema Patrawala

16 Feb 2018

3 min read

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala

16 Feb 2018

3 min read

[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.

0
0
56496

Amey Varangaonkar

30 Apr 2018

9 min read

12 most common MySQL errors you should be aware of

Amey Varangaonkar

30 Apr 2018

9 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide written by Chintan Mehta, Ankit Bhavsar, Subhash Shah and Hetal Oza. This book provides tips and tricks to tackle problems you might encounter while administering MySQL solution.[/box] While using MySQL 8 there can be few scenarios where you would not be able to access or use MySQL properly. These situations can be very annoying, but are easily fixable. However, before you look for the solution, you must know the problem! Here are some of the common errors you might come across when using MySQL 8. 1. Access denied MySQL provides a privilege system that authenticates the user who connects from a host, and associates the user with access privileges on a database. The privileges include SELECT, INSERT, UPDATE, and DELETE and are able to identify anonymous users and grant privileges for MySQL specific functions, such as LOAD DATA INFILE and administrative operations. The access denied error may occur because of many causes. In many cases, the problem is caused because of MySQL accounts that the client programs use to connect with the MySQL server with permission from the server. 2. Lost connection to MySQL server The lost connection to MySQL server error can occur because of one of the three likely causes explained in this section. One potential reason for the error is that the network connectivity is troublesome. Network conditions should be checked if this is a frequent error. If an error message like “Lost connection to MySQL server” appears while querying the database, it is certain that the error has occurred because of network connection issues. The connection_timeout system variable defines the number of seconds that the mysqld server waits for a connection packet before connection timeout response. Infrequently, this error may occur when a client is trying for the initial connection to the server and the connection_timeout value is set to a few seconds. In this case, the problem can be resolved by increasing the connection_timeout value based on the the distance and connection speed. SHOW GLOBAL STATUS LIKE and Aborted_connects can be used to determine if we are experiencing this more frequently. It can be certainly said that increasing the connection_timeout value is the solution if the error message contains reading authorization packet. It is possible that the problem may be faced because of larger Binary Large OBject (BLOB) values than max_allowed_packet. This can cause a lost connection to the MySQL server error with clients. If the ER_NET_PACKET_TOO_LARGE error is observed, it confirms that the max_allowed_packet value should be increased. 3. Password fails when entered incorrectly MySQL clients ask for a password when the client program is invoked with the -- password or -p option without the password value. The following is the command: > mysql -u user_name -p Enter password: On a few systems, it may happen that the password works fine when specified in an option file or on the command line. But it does not work when entered interactively on the Command Prompt at the Enter password: prompt. It occurs because the system-provided library to read the passwords limits the password values to a small number of characters (usually eight). It is an issue with the system library and not with MySQL. As a workaround to this, change the MySQL password to a value that is eight or fewer characters or store the password in the option file. 4. Host host_name is blocked If the mysqld server receives too many connection requests from the host that is interrupted in the middle, the following error occurs: Host 'host_name' is blocked because of many connection errors. Unblock with 'mysqladmin flush-hosts' The max_connect_errors system variable determines the number of successive interrupted connection requests that are allowed. Once there are max_connect_errors failed requests without a successful connection, mysqld assumes that something is wrong and blocks the host from further connections until the FLUSH HOSTS statement or mysqladmin flush-hosts command is issued. mysqld blocks a host after 100 connection errors as a default. It can be adjusted by setting the max_connect_errors value on the server startup, as follows: > mysqld_safe --max_connect_errors=10000 This value can also be set up at runtime, as follows: mysql> SET GLOBAL max_connect_errors=10000; It should be checked first that there is nothing wrong with TCP/IP connections from the host if the host_name is blocked error is received for a particular host. Increasing the value of the max_connect_errors variable does not help if the network has problems. 5. Too many connections This error indicates that all available connection are in use for other client connections. The max_connections is the system variable that controls the number of connections to the server. The default value for the maximum number of connections is 151. We can set a larger value than 151 for the max_connections system variable to support more connections than 151. The mysqld server process actually allows one more than max_connections (max_connections + 1) value clients to connect. The additional one connection is kept reserved for accounts with CONNECTION_ADMIN or the SUPER privilege. The privilege can be granted to the administrators with access to the PROCESS privilege. With this access, the administrator can connect to the server using the reserved connection. They can execute the SHOW PROCESSLIST command to diagnose the problems even though the maximum number of client connections is exhausted. 6. Out of memory If the mysql does not have enough memory to store the entire request of the query issued by the MySQL client program, the server throws the following error: mysql: Out of memory at line 42, 'malloc.c' mysql: needed 8136 byte (8k), memory in use: 12481367 bytes (12189k) ERROR 2008: MySQL client ran out of memory In order to fix the problem, we must first check if the query is correct. Do we expect the query to return so many rows? If not, we should correct the query and execute it again. If the query is correct and needs no correction, we can connect mysql with the --quick option. Using the --quick option results in the mysql_use_result() C API function for fetching the result set. The function adds more load on the server and less load on the client. 7. Packet too large The communication packet is one of the following: A single SQL statement that the MySQL client sends to the MySQL server A single row that is sent to the MySQL client from the MySQL server A binary log event that is sent from a replication master server to the replication slave A 1 GB packet size is the largest possible packet size that can be transmitted to or from the MySQL 8 server or client. The MySQL server or client issues an ER_NET_PACKET_TOO_LARGE error and closes the connection if it receives a packet bigger than max_allowed_packet bytes. The default max_allowed_packet size is 16 MB for the MySQL client program. The following command can be used to set a larger value: > mysql --max_allowed_packet=32M The default value for the MySQL server is 64 MB. It should be noted that there is no harm in setting a larger value for this system variable, as the additional memory is allocated as needed. 8. The table is full The table-full error occurs in one of the following conditions: The disk is full The table has reached the maximum size The actual maximum table size in the MySQL database can be determined by the constraints imposed by the operating system on the file sizes. 9. Can't create/write to file This indicates that MySQL is unable to create a temporary file in the temporary directory for the result set if we get the following error while executing a query: Can't create/write to file 'sqla3fe_0.ism' The possible workaround for the error is to start the mysqld server with the --tmpdir option. The following is the command: > mysqld --tmpdir C:/temp 10. Commands out of sync If the client functions are called in the wrong order, the commands out of sync error is received. It means that the command cannot be executed in the client code. As an example, if we execute mysql_use_result() and try to execute another query before executing mysql_free_result(), this error may occur. It may also happen if we execute two queries that return a result set without calling the mysql_use_result() or mysql_store_result() functions in between. 11. Ignoring user The following error is received when an account in the user table is found with an invalid password upon the mysqld server startup or when the server reloads the grant tables: Found wrong password for user 'some_user'@'some_host'; ignoring user The account is ignored by the MySQL permission system as a result. To fix the problem, we should assign a new valid password for the account. 12. Table tbl_name doesn't exist The following error indicates that a specified table does not exist in the default database: Table 'tbl_name' doesn't exist Can't find file: 'tbl_name' (errno: 2) In some cases, the user may be referring to the table incorrectly. It is possible because the MySQL server uses directories and files for storing database tables. Depending upon the operating system file management, the database and table names can be case sensitive. For non case-sensitive file systems, such as Windows, the references to a specified table used within a query must use the same letter case. In addition to these, you might come across MySQL 8 server errors such as issue with permission, or client errors like problem with NULL values. To know how to deal with them, you may check out this book MySQL 8 Administrator’s Guide. MySQL 8.0 is generally available with added features Basic Website using Node.js and MySQL database

0
0
56288

article-image-5-reasons-why-you-should-use-an-open-source-data-analytics-stack-in-2020

Amey Varangaonkar

28 Jan 2020

7 min read

5 reasons why you should use an open-source data analytics stack in 2020

Amey Varangaonkar

28 Jan 2020

7 min read

Today, almost every company is trying to be data-driven in some sense or the other. Businesses across all the major verticals such as healthcare, telecommunications, banking, insurance, retail, education, etc. make use of data to better understand their customers, optimize their business processes and, ultimately, maximize their profits. This is a guest post sponsored by our friends at RudderStack. When it comes to using data for analytics, companies face two major challenges: Data tracking: Tracking the required data from a multitude of sources in order to get insights out of it. As an example, tracking customer activity data such as logins, signups, purchases, and even clicks such as bookmarks from platforms such as mobile apps and websites becomes an issue for many eCommerce businesses. Building a link between the Data and Business Intelligence: Once data is acquired, transforming it and making it compatible for a BI tool can often prove to be a substantial challenge. A well designed data analytics stack comes is essential in combating these challenges. It will ensure you're well-placed to use the data at your disposal in more intelligent ways. It will help you drive more value. What does a data analytics stack do? A data analytics stack is a combination of tools which when put together, allows you to bring together all of your data in one platform, and use it to get actionable insights that help in better decision-making. As seen the diagram above illustrates, a data analytics stack is built upon three fundamental steps: Data Integration: This step involves collecting and blending data from multiple sources and transforming them in a compatible format, for storage. The sources could be as varied as a database (e.g. MySQL), an organization’s log files, or event data such as clicks, logins, bookmarks, etc from mobile apps or websites. A data analytics stack allows you to use all of such data together and use it to perform meaningful analytics. Data Warehousing: This next step involves storing the data for the purpose of analytics. As the complexity of data grows, it is feasible to consolidate all the data in a single data warehouse. Some of the popular modern data warehouses include Amazon’s Redshift, Google BigQuery and platforms such as Snowflake and MarkLogic. Data Analytics: In this final step, we use a visualization tool to load the data from the warehouse and use it to extract meaningful insights and patterns from the data, in the form of charts, graphs and reports. Choosing a data analytics stack - proprietary or open-source? When it comes to choosing a data analytics stack, businesses are often left with two choices - buy it or build it. On one hand, there are proprietary tools such as Google Analytics, Amplitude, Mixpanel, etc. - where the vendors alone are responsible for their configuration and management to suit your needs. With the best in class features and services that come along with the tools, your primary focus can just be project management, rather than technology management. While using proprietary tools have their advantages, there are also some major cons to them that revolve mainly around cost, data sharing, privacy concerns, and more. As a result, businesses today are increasingly exploring the open-source alternatives to build their data analytics stack. The advantages of open source analytics tools Let's now look at the 5 main advantages that open-source tools have over these proprietary tools. Open source analytics tools are cost effective Proprietary analytics products can cost hundreds of thousands of dollars beyond their free tier. For small to medium-sized businesses, the return on investment does not often justify these costs. Open-source tools are free to use and even their enterprise versions are reasonably priced compared to their proprietary counterparts. So, with a lower up-front costs, reasonable expenses for training, maintenance and support, and no cost for licensing, open-source analytics tools are much more affordable. More importantly, they're better value for money. Open source analytics tools provide flexibility Proprietary SaaS analytics products will invariably set restrictions on the ways in which they can be used. This is especially the case with the trial or the lite versions of the tools, which are free. For example, full SQL is not supported by some tools. This makes it hard to combine and query external data alongside internal data. You'll also often find that warehouse dumps provide no support either. And when they do, they'll probably cost more and still have limited functionality. Data dumps from Google Analytics, for instance, can only be loaded into Google BigQuery. Also, these dumps are time-delayed. That means the loading process can be very slow.. With open-source software, you get complete flexibility: from the way you use your tools, how you combine to build your stack, and even how you use your data. If your requirements change - which, let's face it, they probably will - you can make the necessary changes without paying extra for customized solutions. Avoid vendor lock-in Vendor lock-in, also known as proprietary lock-in, is essentially a state where a customer becomes completely dependent on the vendor for their products and services. The customer is unable to switch to another vendor without paying a significant switching cost. Some organizations spend a considerable amount of money on proprietary tools and services that they heavily rely on. If these tools aren't updated and properly maintained, the organization using it is putting itself at a real competitive disadvantage. This is almost never the case with open-source tools. Constant innovation and change is the norm. Even if the individual or the organization handling the tool moves on, the community catn take over the project and maintain it. With open-source, you can rest assured that your tools will always be up-to-date without heavy reliance on anyone. Improved data security and privacy Privacy has become a talking point in many data-related discussions of late. This is thanks, in part, to data protection laws such as the GDPR and CCPA coming into force. High-profile data leaks have also kept the issue high on the agenda. An open-source stack analytics running inside your cloud or on-prem environment gives complete control of your data. This lets you decide which data is to be used when, and how. It lets you dictate how third parties can access and use your data, if at all. Open-source is the present It's hard to counter the fact that open-source is now mainstream. Companies like Microsoft, Apple, and IBM are now not only actively participating in the open-source community, they're also contributing to it. Open-source puts you on the front foot when it comes to innovation. With it, you'll be able to leverage the power of a vibrant developer community to develop better products in more efficient ways. How RudderStack helps you build an ideal open-source data analytics stack RudderStack is a completely open-source, enterprise-ready platform to simplify data management in the most secure and reliable way. It works as a perfect data integration platform by routing your event data from data sources such as websites, mobile apps and servers, to multiple destinations of your choice - thus helping you save time and effort. RudderStack integrates effortlessly with a multitude of destinations such as Google Analytics, Amplitude, MixPanel, Salesforce, HubSpot, Facebook Ads, and more, as well as popular data warehouses such as Amazon Redshift or S3. If performing efficient clickstream analytics is your goal, RudderStack offers you the perfect data pipeline to collect and route your data securely. Learn more about Rudderstack by visiting the RudderStack website, or check out its GitHub page to find out how it works.

0
0
55561

article-image-how-data-scientists-test-hypotheses-and-probability

Richard Gall

23 Apr 2018

4 min read

How data scientists test hypotheses and probability

Richard Gall

23 Apr 2018

4 min read

0
0
55468

article-image-how-to-build-a-gaussian-mixture-model

Gebin George

27 Jan 2018

8 min read

How to build a Gaussian Mixture Model

Gebin George

27 Jan 2018

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. This book will help you implement Bayesian analysis in your application and will guide you to build complex statistical problems using Python.[/box] Our article teaches you to build an end to end gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distribution but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables. Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians. clusters = 3 n_cluster = [90, 50, 75] n_total = sum(n_cluster) means = [9, 21, 35] std_devs = [2, 2, 2] mix = np.random.normal(np.repeat(means, n_cluster), np.repeat(std_devs, n_cluster)) sns.kdeplot(np.array(mix)) plt.xlabel('$x$', fontsize=14) In many real situations, when we wish to build models, it is often more easy, effective and productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking us with complex models/codes that are difficult to debug. So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or in general, k-Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we come to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given this assumptions the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We of course are going to take the Bayesian track and we are going to build a probabilistic model. To develop our model, we can get ideas from the coin-flipping problem. Remember that we have had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we use a beta prior distribution. Our current problem with the Gaussians mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k-outcomes is the categorical distribution and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? Intuitively, because the output of this distribution is a k-length vector, whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalize the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one with probability p and the other 1-p. In this sense we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars and . How does these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three element vector [p, q , r], where r=1 – (p+q). We could use three scalars to parameterize such Dirichlet and we may call them , , and ; however, it does not scale well to higher dimensions, so we just use a vector named with lenght k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities. To get an idea about this distribution pay attention to the following figure and try to relate each triangular subplot to a beta distribution with similar parameters. The preceding figure is the output of the code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details. Now that we have a better grasp of the Dirichlet distribution we have all the elements to build our mixture model. One way to visualize it, is as a k-side coin flip model on top of a Gaussian estimation model. Of course, instead of k-sided coins The rounded-corner box is indicating that we have k-Gaussian likelihoods (with their corresponding priors) and the categorical variables decide which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical, which is specially designed to sample discrete variables with pm.Model() as model_kg: p = pm.Dirichlet('p', a=np.ones(clusters)) category = pm.Categorical('category', p=p, shape=n_total) means = pm.math.constant([10, 20, 35]) y = pm.Normal('y', mu=means[category], sd=2, observed=mix) step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters)) step2 = pm.Metropolis(vars=[p]) trace_kg = pm.sample(10000, step=[step1, step2]) chain_kg = trace_kg[1000:] varnames_kg = ['p'] pm.traceplot(chain_kg, varnames_kg) Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and we are going to estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to the PyMC3 syntax. with pm.Model() as model_ug: p = pm.Dirichlet('p', a=np.ones(clusters)) category = pm.Categorical('category', p=p, shape=n_total) means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters) sd = pm.HalfCauchy('sd', 5) y = pm.Normal('y', mu=means[category], sd=sd, observed=mix) step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters)) step2 = pm.Metropolis(vars=[means, sd, p]) trace_ug = pm.sample(10000, step=[step1, step2]) Now we explore the trace we got: chain = trace[1000:] varnames = ['means', 'sd', 'p'] pm.traceplot(chain, varnames) And a tabulated summary of the inference: pm.df_summary(chain, varnames) mean sd mc_error hpd_2.5 hpd_97.5 means__0 21.053935 0.310447 0.012280 20.495889 21.735211 means__1 35.291631 0.246817 0.008159 34.831048 35.781825 means__2 8.956950 0.235121 0.005993 8.516094 9.429345 sd 2.156459 0.107277 0.002710 1.948067 2.368482 p__0 0.235553 0.030201 0.000793 0.179247 0.297747 p__1 0.349896 0.033905 0.000957 0.281977 0.412592 p__2 0.347436 0.032414 0.000942 0.286669 0.410189 Now we are going to do a predictive posterior check to see what our model learned from the data: ppc = pm.sample_ppc(chain, 50, model) for i in ppc['y']: sns.kdeplot(i, alpha=0.1, color='b') sns.kdeplot(np.array(mix), lw=2, color='k') plt.xlabel('$x$', fontsize=14) Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smaller and larger values of and is higher around the central Gaussian. This makes intuitive sense since the regions of higher uncertainty correspond to the regions where the Gaussian overlaps and hence it is harder to tell if a point belongs to one or the other Gaussian. I agree that this is a very simple problem and not that much of a challenge, but it is a problem that contributes to our intuition and a model that can be easily applied or extended to more complex problems. We saw how to build a gaussian mixture model using a very basic model as an example, which can be applied to solve more complex models. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.

0
0
54379

Packt

05 Jul 2017

14 min read

SQL Server basics

Packt

05 Jul 2017

14 min read

In this article by Jasmin Azemović, author of the book SQL Server 2017 for Linux, we will cover basic a overview of SQL server and learn about backup. Linux, or to be precise GNU/Linux, is one of the best alternatives to Windows; and in many cases, it is the first choice of environment for daily tasks such as system administration, running different kinds of services, or just a tool for desktop application Linux's native working interface is the command line. Yes, KDE and GNOME are great graphic user interfaces. From a user's perspective, clicking is much easier than typing; but this observation is relative. GUI is something that changed the perception of modern IT and computer usage. Some tasks are very difficult without a mouse, but not impossible. On the other hand, command line is something where you can solve some tasks quicker, more efficiently, and better than in GUI. You don't believe me? Imagine these situations and try to implement them through your favorite GUI tool: In a folder of 1000 files, copy only those the names of which start with A and end with Z, .txt extension Rename 100 files at the same time Redirect console output to the file There are many such examples; in each of them, Command Prompt is superior—Linux Bash, even more. Microsoft SQL Server is considered to be one the most commonly used systems for database management in the world. This popularity has been gained by high degree of stability, security, and business intelligence and integration functionality. Microsoft SQL Server for Linux is a database server that accepts queries from clients, evaluates them and then internally executes them, to deliver results to the client. The client is an application that produces queries, through a database provider and communication protocol sends requests to the server, and retrieves the result for client side processing and/or presentation. (For more resources related to this topic, see here.) Overview of SQL Server When writing queries, it's important to understand that the interaction between the tool of choice and the database based on client-server architecture, and the processes that are involved. It's also important to understand which components are available and what functionality they provide. With a broader understanding of the full product and its components and tools, you'll be able to make better use of its functionality, and also benefit from using the right tool for specific jobs. Client-server architecture concepts In a client-server architecture, the client is described as a user and/or device, and the server as a provider of some kind of service. SQL Server client-server communication As you can see in the preceding figure, the client is represented as a machine, but in reality can be anything. Custom application (desktop, mobile, web) Administration tool (SQL Server Management Studio, dbForge, sqlcmd…) Development environment (Visual Studio, KDevelop…) SQL Server Components Microsoft SQL Server consists of many different components to serve a variety of organizational needs of their data platform. Some of these are: Database Engine is the relational database management system (RDBMS), which hosts databases and processes queries to return results of structured, semi-structured, and non-structured data in online transactional processing solutions (OLTP). Analysis Services is the online analytical processing engine (OLAP) as well as the data mining engine. OLAP is a way of building multi-dimensional data structures for fast and dynamic analysis of large amounts of data, allowing users to navigate hierarchies and dimensions to reach granular and aggregated results to achieve a comprehensive understanding of business values. Data mining is a set of tools used to predict and analyse trends in data behaviour and much more. Integration Services supports the need to extract data from sources, transform it, and load it in destinations (ETL) by providing a central platform that distributes and adjusts large amounts of data between heterogeneous data destinations. Reporting Services is a central platform for delivery of structured data reports and offers a standardized, universal data model for information workers to retrieve data and model reports without the need of understanding the underlying data structures. Data Quality Services (DQS) is used to perform a variety data cleaning, correction and data quality tasks, based on knowledge base. DQS is mostly used in ETL process before loading DW. R services (advanced analytics) is a new service that actually incorporate powerful R language for advanced statistic analytics. It is part of database engine and you can combine classic SQL code with R scripts. While writing this book, only one service was actually available in SQL Server for Linux and its database engine. This will change in the future and you can expect more services to be available. How it works on Linux? SQL Server is a product with a 30-year-long history of development. We are speaking about millions of lines of code on a single operating system (Windows). The logical question is how Microsoft successfully ports those millions of lines of code to the Linux platform so fast. SQL Server@Linux, officially became public in the autumn of 2016. This process would take years of development and investment. Fortunately, it was not so hard. From version 2005, SQL Server database engine has a platform layer called SQL Operating system (SOS). It is a setup between SQL Server engine and the Windows operating systems. The main purpose of SOS is to minimize the number of system calls by letting SQL Server deal with its own resources. It greatly improves performance, stability and debugging process. On the other hand, it is platform dependent and does not provide an abstraction layer. That was the first big problem for even start thinking to make Linux version. Project Drawbridge is a Microsoft research project created to minimize virtualization resources when a host runs many VM on the same physical machine. The technical explanation goes beyond the scope of this book (https://www.microsoft.com/en-us/research/project/drawbridge/). Drawbridge brings us to the solution of the problem. Linux solution uses a hybrid approach, which combines SOS and Liberty OS from Drawbridge project to create SQL PAL (SQL Platform Abstraction Layer). This approach creates a set of SOS API calls which does not require Win32 or NT calls and separate them from platform depended code. This is a dramatically reduced process of rewriting SQL Server from its native environment to a Linux platform. This figure gives you a high-level overview of SQL PAL( https://blogs.technet.microsoft.com/dataplatforminsider/2016/12/16/sql-server-on-linux-how-introduction/). SQL PAL architecture Retrieving and filtering data Databases are one of the cornerstones of modern business companies. Data retrieval is usually made with SELECT statement and is therefore very important that you are familiar with this part of your journey. Retrieved data is often not organized in the way you want them to be, so they require additional formatting. Besides formatting, accessing very large amount of data requires you to take into account the speed and manner of query execution which can have a major impact on system performance Databases usually consist of many tables where all data are stored. Table names clearly describe entities whose data are stored inside and therefore if you need to create a list of new products or a list of customers who had the most orders, you need to retrieve those data by creating a query. A query is an inquiry into the database by using the SELECT statement which is the first and most fundamental SQL statement that we are going to introduce in this chapter. SELECT statement consists of a set of clauses that specifies which data will be included into query result set. All clauses of SQL statements are the keywords and because of that will be written in capital letters. Syntactically correct SELECT statement requires a mandatory FROM clause which specifies the source of the data you want to retrieve. Besides mandatory clauses, there are a few optional ones that can be used to filter and organize data: INTO enables you to insert data (retrieved by the SELECT clause) into a different table. It is mostly used to create table backup. WHERE places conditions on a query and eliminates rows that would be returned by a query without any conditions. ORDER BY displays the query result in either ascending or descending alphabetical order. GROUP BY provides mechanism for arranging identical data into groups. HAVING allows you to create selection criteria at the group level. SQL Server recovery models When it comes to the database, backup is something that you should consider and reconsider really carefully. Mistakes can cost you: money, users, data and time and I don't know which one has bigger consequences. Backup and restore are elements of a much wider picture known by the name of disaster recovery and it is science itself. But, from the database perspective and usual administration task these two operations are the foundation for everything else. Before you even think about your backups, you need to understand recovery models that SQL Server internally uses while the database is in operational mode. Recovery model is about maintaining data in the event of a server failure. Also, it defines amount of information that SQL Server writes in log file with purpose of recovery. SQL Server has three database recovery models: Simple recovery model Full recovery model Bulk-logged recovery model Simple recovery model This model is typically used for small databases and scenarios were data changes are infrequent. It is limited to restoring the database to the point when the last backup was created. It means that all changes made after the backup are gone. You will need to recreate all changes manually. Major benefit of this model is that it takes small amount of storage space for log file. How to use it and when, depends on business scenarios. Full recovery model This model is recommended when recovery from damaged storage is the highest priority and data loss should be minimal. SQL Server uses copies of database and log files to restore database. Database engine logs all changes to the database including bulk operation and most DDL commands. If the transaction log file is not damaged, SQL Server can recover all data except transaction which are in process at the time of failure (not committed in to database file). All logged transactions give you an opportunity of point in time recovery, which is a really cool feature. Major limitation of this model is the large size of the log files which leads you to performance and storage issues. Use it only in scenarios where every insert is important and loss of data is not an option. Bulk-logged recovery model This model is somewhere in the middle of simple and full. It uses database and log backups to recreate database. Comparing to full recovery model, it uses less log space for: CREATE INDEX and bulk load operations such as SELECT INTO. Let's look at this example. SELECT INTO can load a table with 1, 000, 000 records with a single statement. The log will only record occurrence of this operations but details. This approach uses less storage space comparing to full recovery model. Bulk-logged recovery model is good for databases which are used to ETL process and data migrations. SQL Server has system database model. This database is the template for each new one you create. If you use just CREATE DATABASE statement without any additional parameters it simply copies model database with all properties and metadata. It also inherits default recovery model which is full. So, conclusion is that each new database will be in full recovery mode. This can be changed during and after creation process. Elements of backup strategy Good backup strategy is not just about creating a backup. This is a process of many elements and conditions that should be filed to achieve final goal and this is the most efficient backup strategy plan. To create a good strategy, we need to answer the following questions: Who can create backups? Backup media Types of backups Who can create backups? Let's say that SQL Server user needs to be a member of security role which is authorized to execute backup operations. They are members of: sysadmin server role Every user with sysadmin permission can work with backups. Our default sa user is a member of the sysadmin role. db_owner database role Every user who can create databases by default can execute any backup/restore operations. db_backupoperator database role Some time you need just a person(s) to deal with every aspect of backup operation. This is common for large-scale organizations with tens or even hundreds of SQL Server instances. In those environments, backup is not trivial business. Backup media An important decision is where to story backup files and how to organize while backup files and devices. SQL Server gives you a large set of combinations to define your own backup media strategy. Before we explain how to store backups, let's stop for a minute and describe the following terms: Backup disk is a hard disk or another storage device that contains backup files. Back file is just ordinary file on the top of file system. Media set is a collection of backup media in ordered way and fixed type (example: three type devices, Tape1, Tape2, and Tape3). Physical backup device can be a disk file of tape drive. You will need to provide information to SQL Server about your backup device. A backup file that is created before it is used for a backup operation is called a backup device. Figure Backup devices The simplest way to store and handle database backups is by using a back disk and storing them as regular operating system files, usually with the extension .bak. Linux does not care much about extension, but it is good practice to mark those files with something obvious. This chapter will explain how to use backup disk devices because every reader of this book should have a hard disk with an installation of SQL Server on Linux; hope so! Tapes and media sets are used for large-scale database operations such as enterprise-class business (banks, government institutions and so on). Disk backup devices can anything such as a simple hard disk drive, SSD disk, hot-swap disk, USB drive and so on. The size of the disk determines the maximum size of the database backup file. It is recommended that you use a different disk as backup disk. Using this approach, you will separate database data and log disks. Imagine this. Database files and backup are on the same device. If that device fails, your perfect backup strategy will fall like a tower of cards. Don't do this. Always separate them. Some serious disaster recovery strategies (backup is only smart part of it) suggest using different geographic locations. This makes sense. A natural disaster or something else of that scale can knock down the business if you can't restore your system from a secondary location in a reasonably small amount of time. Summary Backup and restore is not something that you can leave aside. It requires serious analyzing and planning, and SQL Server gives you powerful backup types and options to create your disaster recovery policy on SQL Server on Linux. Now you can do additional research and expand your knowledge A database typically contains dozens of tables, and therefore it is extremely important that you master creating queries over multiple tables. This implies the knowledge of the functioning JOIN operators with a combination with elements of string manipulation. Resources for Article: Further resources on this subject: Review of SQL Server Features for Developers [article] Configuring a MySQL linked server on SQL Server 2008 [article] Exception Handling in MySQL for Python [article]

0
0
54334

article-image-recommendation-engines-explained

Packt

02 Jan 2017

10 min read

Recommendation Engines Explained

Packt

02 Jan 2017

10 min read

0
0
53536

How-To Tutorials - Data

Fingerprint detection using OpenCV 3

Manipulating text data using Python Regular Expressions (regex)

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Visualizing 3D plots in Matplotlib 2.0

Cleaning Data in PDF Files

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

How to deploy RethinkDB using Docker

Exploring Structure from Motion Using OpenCV

How to install Elasticsearch in Ubuntu and Windows

12 most common MySQL errors you should be aware of

Trending Topics

5 reasons why you should use an open-source data analytics stack in 2020

How data scientists test hypotheses and probability

How to build a Gaussian Mixture Model

SQL Server basics

Recommendation Engines Explained

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access