Data | 0 articles | Tech News, Tutorials & Expert Insights

07 Jul 2016

12 min read

Recommendation Systems

07 Jul 2016

0
0
11478

Packt

04 Jul 2016

16 min read

Data Science with R

Packt

04 Jul 2016

16 min read

0
0
3621

article-image-getting-started-tensorflow-api-primer

Sam Abrahams

19 Jun 2016

8 min read

Getting Started with TensorFlow: an API Primer

Sam Abrahams

19 Jun 2016

8 min read

TensorFlow has picked up a lot of steam over the past couple of months, and there's been more and more interest in learning how to use the library. I've seen tons of tutorials out there that just slap together TensorFlow code, roughly describe what some of the lines do, and call it a day. Conversely, I've seen really dense tutorials that mix together universal machine learning concepts and TensorFlow's API. There is value in both of these sorts of examples, but I find them either a little too sparse or too confusing respectively. In this post, I plan to focus solely on information related to the TensorFlow API, and not touch on general machine learning concepts (aside from describing computational graphs). Additionally, I will link directly to relevant portions of the TensorFlow API for further reading. While this post isn't going to be a proper tutorial, my hope is that focusing on the core components and workflows of the TensorFlow API will make working with other resources more accessible and comprehensible. As a final note, I'll be referring to the Python API and not the C++ API in this post. Definitions Let's start off with a glossary of key words you're going to see when using TensorFlow. Tensor: An n-dimensional matrix. For most practical purposes, you can think of them the same way you would a two-dimensional matrix for matrix algebra. In TensorFlow, the return value of any mathematical operation is a tensor. See here for more about TensorFlow Tensor objects. Graph: The computational graph that is defined by the user. It's constructed of nodes and edges, representing computations and connections between those computations respectively. For a quick primer on computation graphs and how they work in backpropagation, check out Chris Olah's post here. A TensorFlow user can define more than one Graph object and run them separately. Additionally, it is possible to define a large graph and run only smaller portions of it. See here for more information about TensorFlow Graphs. Op, Operation (Ops, Operations): Any sort of computation on tensors. Operations (or Ops) can take in zero or more TensorFlow Tensor objects, and output zero or more Tensor objects as a result of the computation. Ops are used all throughout TensorFlow, from doing simple addition to matrix multiplication to initializing TensorFlow variables. Operations run only when they are passed to the Session object, which I'll discuss below. For the most part, nodes and operations are interchangable concepts. In this guide, I'll try to use the term Operation or Op when referring to TensorFlow-specific operations and node when referring to general computation graph terminology. Here's the API reference for the Operation class. Node: A computation in the graph that takes as input zero or more tensors and outputs zero or more tensors. A node does not have to interact with any other nodes, and thus does not have to have any edges connected to it. Visually, these are usually depicted as ellipses or boxes. Edge: The directed connection between two nodes. In TensorFlow, each edge can be seen as one or more tensors, and usually represents the output of one node becoming the input of the next node. Visually, these are usually depicted as lines or arrows. Device: A CPU or GPU. In TensorFlow, computations can occur across many different CPUs and GPUs, and it must keep track of these devices in order to coordinate work properly. The Typical TensorFlow Coding Workflow Writing a working TensorFlow model boils down to two steps: Build the Graph using a series of Operations, placeholders, and Variables. Run the Graph with training data repeatedly using the Session (you'll want to test the model while training to make sure it's learning properly). Sounds simple enough, and once you get a hang of it, it really is! We talked about Ops in the section above, but now I want to put special emphasis on placeholders, Variables, and the Session. They are fairly easy to grasp, but getting these core fundamentals solidified will give context to learning the rest of the TensorFlow API. Placeholders A Placeholder is a node in the graph that must be fed data via the feed_dict parameter in Session.run (see below). In general, these are used to specify input data and label data. Basically, you use placeholders to tell TensorFlow, "Hey TF, the data here is going to change each time you run the graph, but it will always be a tensor of size [N] and data-type [D]. Use that information to make sure that my matrix/tensor calculations are set up properly." TensorFlow needs to have that information in order to compile the program, as it has to guarantee that you don't accidentally try to multiply a 5x5 matrix with an 8x8 matrix (amongst other things). Placeholders are easy to define. Just make a variable that is assigned to the result of tensorflow.placeholder(): import tensorflow as tf # Create a Placeholder of size 100x400 that will contain 32-bit floating point numbers my_placeholder = tf.placeholder(tf.float32, shape=(100, 400)) Read more about Placeholder objects here. Note: We are required to feed data to the placeholder when we run our graph. We'll cover this in the Session section below. Variables Variables are objects that contain tensor information but persist across multiple calls to Session.run(). That is, they contain information that can be altered during the run of a graph, and then that altered state can be accessed the next time the graph is run. Variables are used to hold the weights and biases of a machine learning model while it trains, and their final values are what define the trained model. Defining and using Variables is mostly straightforward. Define a Variable with tensorflow.Variable() and update its information with the assign() method: import tensorflow as tf # Create a variable with the value 0 and the name of 'my_variable' my_var = tf.Variable(0, name='my_variable') # Increment the variable by one my_var.assign(my_var + 1) One catch with Variable objects is that you can't run Ops with them until you initialize them within the Session object. This is usually done with the Operation returned from tf.initialize_all_variables(), as I'll describe in the next section. Variable API reference The official how-to for Variable objects The Session Finally, let's talk about running the Session. The TensorFlow Session object is in charge of keeping track of all Variables, coordinating computation across devices, and generally doing anything that involves running the graph. You generally start a Session by calling tensorflow.Session(), and either directly assign the value of that statement to a handle or use a with ... as statement. The most important method in the Session object is run(), which takes in as input fetches, a list of Operations and Tensors that the user wishes to calculate; and feed_dict, which is an optional dictionary mapping Tensors (often Placeholders) to values that should override that Tensor. This is how you specify which values you want returned from your computation as well as the input values for training. Here is a toy example that uses a placeholder, a Variable, and the Session to showcase their basic use: import tensorflow as tf # Create a placeholder for inputting floating point data later a = tf.placeholder(tf.float32) # Make a base Variable object with the starting value of 0 start = tf.Variable(0.0) # Create a node that is the value of incrementing the 'start' Variable by the value of 'a' y = start.assign(start + a) # Open up a TensorFlow Session and assign it to the handle 'sess' sess = tf.Session() # Important: initialize the Variable, or else we won't be able to run our computation init = tf.initialize_all_variables() # 'init' is an Op: must be run by sess sess.run(init) # Now the Variable is initialized! # Get the value of 'y', feeding in different values for 'a', and print the result # Because we are using a Variable, the value should change each time print(sess.run(y, feed_dict={a:1})) # Prints 1.0 print(sess.run(y, feed_dict={a:0.5})) # Prints 1.5 print(sess.run(y, feed_dict={a:2.2})) # Prints 3.7 # Close the Session sess.close() Check out the documentation for TensorFlow's Session object here. Finishing Up Alright! This primer should give you a head start on understanding more of the resources out there for TensorFlow. The less you have to think about how TensorFlow works, the more time you can spend working out how to set up the best neural network you can! Good luck, and happy flowing! About the author Sam Abrahams is a freelance data engineer and animator in Los Angeles, CA, USA. He specializes in real-world applications of machine learning and is a contributor to TensorFlow. Sam runs a small tech blog, Memdump, and is an active member of the local hacker scene in West LA.

0
0
3326

article-image-use-sqlite-ionic-store-data

Oli Huggins

13 Jun 2016

10 min read

How to use SQLite with Ionic to store data?

Oli Huggins

13 Jun 2016

10 min read

Hybrid Mobile apps have a challenging task of being as performant as native apps, but I always tell other developers that it depends on not the technology but how we code. The Ionic Framework is a popular hybrid app development library, which uses optimal design patterns to create awe-inspiring mobile experiences. We cannot exactly use web design patterns to create hybrid mobile apps. The task of storing data locally on a device is one such capability, which can make or break the performance of your app. In a web app, we may use localStorage to store data but mobile apps require much more data to be stored and swift access. Localstorage is synchronous, so it acts slow in accessing the data. Also, web developers who have experience of coding in a backend language such as C#, PHP, or Java would find it more convenient to access data using SQL queries than using object-based DB. SQLite is a lightweight embedded relational DBMS used in web browsers and web views for hybrid mobile apps. It is similar to the HTML5 WebSQL API and is asynchronous in nature, so it does not block the DOM or any other JS code. Ionic apps can leverage this tool using an open source Cordova Plugin by Chris Brody (@brodybits). We can use this plugin directly or use it with the ngCordova library by the Ionic team, which abstracts Cordova plugin calls into AngularJS-based services. We will create an Ionic app in this blog post to create Trackers to track any information by storing it at any point in time. We can use this data to analyze the information and draw it on charts. We will be using the ‘cordova-sqlite-ext’ plugin and the ngCordova library. We will start by creating a new Ionic app with a blank starter template using the Ionic CLI command, ‘$ ionic start sqlite-sample blank’. We should also add appropriate platforms for which we want to build our app. The command to add a specific platform is ‘$ ionic platform add <platform_name>’. Since we will be using ngCordova to manage SQLite plugin from the Ionic app, we have to now install ngCordova to our app. Run the following bower command to download ngCordova dependencies to the local bower ‘lib’ folder: bower install ngCordova We need to inject the JS file using a script tag in our index.html: <script src=“lib/ngCordova/dist/ng-cordova.js"></script> Also, we need to include the ngCordova module as a dependency in our app.js main module declaration: angular.module('starter', [‘ionic’,’ngCordova']) Now, we need to add the Cordova plugin for SQLite using the CLI command: cordova plugin add https://github.com/litehelpers/Cordova-sqlite-storage.git Since we will be using the $cordovaSQLite service of ngCordova only to access this plugin from our Ionic app, we need not inject any other plugin. We will have the following two views in our Ionic app: Trackers list: This list shows all the trackers we add to DB Tracker details: This is a view to show list of data entries we make for a specific tracker We would need to create the routes by registering the states for the two views we want to create. We need to add the following config block code for our ‘starter’ module in the app.js file only: .config(function($stateProvider,$urlRouterProvider){ $urlRouterProvider.otherwise('/') $stateProvider.state('home', { url: '/', controller:'TrackersListCtrl', templateUrl: 'js/trackers-list/template.html' }); $stateProvider.state('tracker', { url: '/tracker/:id', controller:'TrackerDetailsCtrl', templateUrl: 'js/tracker-details/template.html' }) }); Both views would have similar functionality, but will display different entities. Our view will display a list of trackers from the SQLite DB table and also provide a feature to add a new tracker or delete an existing one. Create a new folder named trackers-list where we can store our controller and template for the view. We will also abstract our code to access the SQLite DB into an Ionic factory. We will implement the following methods: initDB: This will initialize or create a table for this entity if it does not exist getAllTrackers: This will get all trackers list rows from the created table addNewTracker - This is a method to insert a new row for a new tracker into the table deleteTracker - This is a method to delete a specific tracker using its ID getTracker - This will get a specific Tracker from the cached list using an ID to display anywhere We will be injecting the $cordovaSQLite service into our factory to interact with our SQLite DB. We can open an existing DB or create a new DB using the command $cordovaSQLite.openDB(“myApp.db”). We have to store the object reference returned from this method call, so we will store it in a module-level variable called db. We have to pass this object reference to our future $cordovaSQLite service calls. $cordovaSQLite has a handful of methods to provide varying features: openDB: This is a method to establish a connection to the existing DB or create a new DB execute: This is a method to execute a single SQL command query insertCollection: This is a method to insert bulk values nestedExecute: This is a method to run nested queries deleted: This is a method to delete a particular DB We see the usage of openDB and execute the command in this post. In our factory, we will create a standard method runQuery to adhere to DRY(Don’t Repeat Yourself) principles. The code for the runQuery function is as follows: function runQuery(query,dataParams,successCb,errorCb) { $ionicPlatform.ready(function() { $cordovaSQLite.execute(db, query,dataParams).then(function(res) { successCb(res); }, function (err) { errorCb(err); }); }.bind(this)); } In the preceding code, we pass the query as a string, dataParams (dynamic query parameters) as an array, and successCB/errorCB as callback functions. We should always ensure that any Cordova plugin code should be called when the Cordova ready event is already fired, which is ensured by the $ionicPlatform.ready() method. We will then call the execute method of the $cordovaSQLite service passing the ‘db’ object reference, query, and dataParams as arguments. The method returns a promise to which we register callbacks using the ‘.then’ method. We pass the results or error using the success callback or error callback. Now, we will write code for each of the methods to initialize DB, insert a new row, fetch all rows, and then delete a row. initDB Method: function initDB() { db = $cordovaSQLite.openDB("myapp.db"); var query = "CREATE TABLE IF NOT EXISTS trackers_list (id in-teger autoincrement primary key, name string)"; runQuery(query,[],function(res) { console.log("table created "); }, function (err) { console.log(err); }); } In the preceding code, the openDB method is used to establish a connection with an existing DB or create a new DB. Then, we run the query to create a new table called ‘trackers_list’ if it does not exist. We define the columns ID with integer autoincrement primary key properties with the name string. addNewTracker Method: function addNewTracker(name) { var deferred = $q.defer(); var query = "INSERT INTO trackers_list (name) VALUES (?)"; runQuery(query,[name],function(response){ //Success Callback console.log(response); deferred.resolve(response); },function(error){ //Error Callback console.log(error); deferred.reject(error); }); return deferred.promise; } In the preceding code, we take ‘name’ as an argument, which will be passed into the insert query. We write the insert query and add a new row to the trackers_list table where ID will be auto-generated. We pass dynamic query parameters using the ‘?’ character in our query string, which will be replaced by elements in the dataParams array passed as the second argument to the runQuery method. We also use a $q library to return a promise to our factory methods so that controllers can manage asynchronous calls. getAllTrackers Method: This method is the same as the addNewTracker method, only without the name parameter, and it has the following query: var query = "SELECT * from trackers_list”; This method will return a promise, which when resolved will give the response from the $cordovaSQLite method. The response object will have the following structure: { insertId: <specific_id>, rows: {item: function, length: <total_no_of_rows>} rowsAffected: 0 } The response object has properties insertId representing the new ID generated for the row, rowsAffected giving the number of rows affected by the query and rows object with item method property, to which we can pass the index of the row to retrieve it. We will write the following code in the controller to convert the response.rows object into an utterable array of rows to be displayed using the ng-repeat directive: for(var i=0;i<response.rows.length;i++) { $scope.trackersList.push({ id:response.rows.item(i).id, name:response.rows.item(i).name }); } The code in the template to display the list of Trackers would be as follows: <ion-item ui-sref="tracker({id:tracker.id})" class="item-icon-right" ng-repeat="tracker in trackersList track by $index"> {{tracker.name}} <ion-delete-button class="ion-minus-circled" ng-click=“deleteTracker($index,tracker.id)"> </ion-delete-button> <i class="icon ion-chevron-right”></i> </ion-item> deleteTracker Method: function deleteTracker(id) { var deferred = $q.defer(); var query = "DELETE FROM trackers_list WHERE id = ?"; runQuery(query,[id],function(response){ … [Same Code as addNewTrackerMethod] } The delete tracker method has the same code as the addNewTracker method, where the only change is in the query and the argument passed. We pass ‘id’ as the argument to be used in the WHERE clause of delete query to delete the row with that specific ID. Rest of the Ionic App Code: The rest of the app code has not been discussed in this post because we have already discussed the code that is intended for integration with SQLite. You can implement your own version of this app or even use this sample code for any other use case. The trackers details view will be implemented in the same way to store data into the tracker_entries table with a foreign key, tracker_id, used for this table. It will also use this ID in the SELECT query to fetch entries for a specific tracker on its detail view. The GitHub link for the exact functioning code for complete app developed during this tutorial. About the author Rahat Khanna is a techno nerd experienced in developing web and mobile apps for many international MNCs and start-ups. He has completed his bachelors in technology with computer science and engineering as specialization. During the last 7 years, he has worked for a multinational IT service company and ran his own entrepreneurial venture in his early twenties. He has worked on projects ranging from static HTML websites to scalable web applications and engaging mobile apps. Along with his current job as a senior UI developer at Flipkart, a billion dollar e-commerce firm, he now blogs on the latest technology frameworks on sites such as www.airpair.com, appsonmob.com, and so on, and delivers talks at community events. He has been helping individual developers and start-ups in their Ionic projects to deliver amazing mobile apps.

0
0
25068

Packt

10 Jun 2016

14 min read

Setting up Spark

Packt

10 Jun 2016

14 min read

0
0
2471

Packt

09 Jun 2016

18 min read

Clustering Methods

Packt

09 Jun 2016

18 min read

In this article by Magnus Vilhelm Persson, author of the book Mastering Python Data Analysis, we will see that with data comprising of several separated distributions, how do we find and characterize them? In this article, we will look at some ways to identify clusters in data. Groups of points with similar characteristics form clusters. There are many different algorithms and methods to achieve this, with good and bad points. We want to detect multiple separate distributions in the data; for each point, we determine the degree of association (or similarity) with another point or cluster. The degree of association needs to be high if they belong in a cluster together or low if they do not. This can, of course, just as previously, be a one-dimensional problem or a multidimensional problem. One of the inherent difficulties of cluster finding is determining how many clusters there are in the data. Various approaches to define this exist—some where the user needs to input the number of clusters and then the algorithm finds which points belong to which cluster, and some where the starting assumption is that every point is a cluster and then two nearby clusters are combined iteratively on trial basis to see if they belong together. In this article, we will cover the following topics: A short introduction to cluster finding, reminding you of the general problem and an algorithm to solve it Analysis of a dataset in the context of cluster finding, the Cholera outbreak in central London 1854 By Simple zeroth order analysis, calculating the centroid of the whole dataset By finding the closest water pump for each recorded Cholera-related death Applying the K-means nearest neighbor algorithm for cluster finding to the data and identifying two separate distributions (For more resources related to this topic, see here.) The algorithms and methods covered here are focused on those available in SciPy. Start a new Notebook, and put in the default imports. Perhaps you want to change to interactive Notebook plotting to try it out a bit more. For this article, we are adding the following specific imports. The ones related to clustering are from SciPy, while later on we will need some packages to transform astronomical coordinates. These packages are all preinstalled in the Anaconda Python 3 distribution and have been tested there: import scipy.cluster.hierarchy as hac import scipy.cluster.vq as vq Introduction to cluster finding There are many different algorithms for cluster identification. Many of them try to solve a specific problem in the best way. Therefore, the specific algorithm that you want to use might depend on the problem you are trying to solve and also on what algorithms are available in the specific package that you are using. Some of the first clustering algorithms consisted of simply finding the centroid positions that minimize the distances to all the points in each cluster. The points in each cluster are closer to that centroid than other cluster centroids. As might be obvious at this point, the hardest part with this is figuring out how many clusters there are. If we can determine this, it is fairly straightforward to try various ways of moving the cluster centroid around, calculate the distance to each point, and then figure out where the cluster centroids are. There are also obvious situations where this might not be the best solution, for example, if you have two very elongated clusters next to each other. Commonly, the distance is the Euclidean distance: Here, p is a vector with all the points' positions,that is,{p1,p2,...,pN–1,pN} in cluster Ck, that is P E Ck , the distances are calculated from the cluster centroid,Ui . We have to find the cluster centroids that minimize the sum of the absolute distances to the points: In this first example, we shall first work with fixed cluster centroids. Starting out simple – John Snow on Cholera In 1854, there was an outbreak of cholera in North-western London, in the neighborhood around Broad Street. The leading theories at the time claimed that cholera spread, just like it was believed that the plague spread: through foul, bad air. John Snow, a physician at the time, hypothesized that cholera spread through drinking water. During the outbreak, John tracked the deaths and drew them on a map of the area. Through his analysis, he concluded that most of the cases were centered on the Broad Street water pump. Rumors say that he then removed the handle of the water pump, thus stopping an epidemic. Today, we know that cholera is usually transmitted through contaminated food or water, thus confirming John's hypothesis. We will do a short but instructive reanalysis of John Snow's data. The data comes from the public data archives of The National Center for Geographic Information and Analysis (http://www.ncgia.ucsb.edu/ and http://www.ncgia.ucsb.edu/pubs/data.php). A cleaned up map and copy of the data files along with an example of Geospatial information analysis of the data can also be found at https://www.udel.edu/johnmack/frec682/cholera/cholera2.html.A wealth of information about physician and scientist John Snow's life and works can be found at http://johnsnow.matrix.msu.edu. To start the analysis, we read the data into a Pandas DataFrame; the data is already formatted in CSV-files readable by Pandas: deaths = pd.read_csv('data/cholera_deaths.txt') pumps = pd.read_csv('data/cholera_pumps.txt') Each file contains two columns, one for X coordinates and one for Y coordinates. Let's check what it looks like: deaths.head() pumps.head() With this information, we can now plot all the pumps and deaths to visualize the data: plt.figure(figsize=(4,3.5)) plt.plot(deaths['X'], deaths['Y'], marker='o', lw=0, mew=1, mec='0.9', ms=6) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6) plt.axis('equal') plt.xlim((4.0,22.0)); plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title('John Snow's Cholera') It is fairly easy to see that the pump in the middle is important. As a first data exploration, we will simply calculate the mean centroid of the distribution and plot that in the figure as an ellipse. We will calculate the mean and standard deviation along the x and y axis as the centroid position: fig = plt.figure(figsize=(4,3.5)) ax = fig.add_subplot(111) plt.plot(deaths['X'], deaths['Y'], marker='o', lw=0, mew=1, mec='0.9', ms=6) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6) from matplotlib.patches import Ellipse ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), width=deaths['X'].std(), height=deaths['Y'].std(), zorder=32, fc='None', ec='IndianRed', lw=2) ax.add_artist(ellipse) plt.plot(deaths['X'].mean(), deaths['Y'].mean(), '.', ms=10, mec='IndianRed', zorder=32) for i in pumps.index: plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), xytext=(-15,6), textcoords='offset points') plt.axis('equal') plt.xlim((4.0,22.5)) plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title('John Snow's Cholera') Here, we also plotted the pump index, which we can get from DataFrame with the pumps.index method. The next step in the analysis is to see which pump is the closest to each point. We do this by calculating the distance from all pumps to all points. Then we want to figure out which pump is the closest for each point. We save the closest pump to each point in a separate column of the deaths' DataFrame. With this dataset, the for-loop runs fairly quickly. However, the DataFrame subtract method chained with sum() and idxmin() methods takes a few seconds. I strongly encourage you to play around with various ways to speed this up. We also use the .apply() method of DataFrame to square and square root the values. The simple brute force first attempt of this took over a minute to run. The built-in functions and methods help a lot: deaths_tmp = deaths[['X','Y']].as_matrix() idx_arr = np.array([], dtype='int') for i in range(len(deaths)): idx_arr = np.append(idx_arr, (pumps.subtract(deaths_tmp[i])).apply(lambda x:x**2).sum(axis=1).apply(lambda x:x**0.5).idxmin()) deaths['C'] = idx_arr Quickly check whether everything seems fine by printing out the top rows of the table: deaths.head() Now we want to visualize what we have. With colors, we can show which water pump we associate each death with. To do this, we use a colormap, in this case, the jet colormap. By calling the colormap with a value between 0 and 1, it returns a color; thus we give it the pump indexes and then divide it with the total number of pumps, 12 in our case: fig = plt.figure(figsize=(4,3.5)) ax = fig.add_subplot(111) np.unique(deaths['C'].values) plt.scatter(deaths['X'].as_matrix(), deaths['Y'].as_matrix(), color=plt.cm.jet(deaths['C']/12.), marker='o', lw=0.5, edgecolors='0.5', s=20) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='0.3', ms=6) for i in pumps.index: plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), xytext=(-15,6), textcoords='offset points', ha='right') ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), width=deaths['X'].std(), height=deaths['Y'].std(), zorder=32, fc='None', ec='IndianRed', lw=2) ax.add_artist(ellipse) plt.axis('equal') plt.xlim((4.0,22.5)) plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title('John Snow's Cholera') The majority of deaths are dominated by the proximity of the pump in the center. This pump is located on Broad Street. Now, remember that we have used fixed positions for the cluster centroids. In this case, we are basically working on the assumption that the water pumps are related to the cholera cases. Furthermore, the Euclidean distance is not really the real-life distance. People go along roads to get water and the road there is not necessarily straight. Thus, one would have to map out the streets and calculate the distance to each pump from that. Even so, already at this level, it is clear that there is something with the center pump related to the cholera cases. How would you account for the different distance? To calculate the distance, you would do what is called cost-analysis (c.f. when you hit directions on your sat-nav to go to a place). There are many different ways of doing cost analysis, and it also relates to the problem of finding the correct way through a maze. In addition to these things, we do not have any data in the time domain, that is, the cholera would possibly spread to other pumps with time and thus the outbreak might have started at the Broad Street pump and spread to other nearby pumps. Without time data, it is difficult to figure out what happened. This is the general approach to cluster finding. The coordinates might be attributes instead, length and weight of dogs for example, and the location of the cluster centroid something that we would iteratively move around until we find the best position. K-means clustering The K-means algorithm is also referred to as vector quantization. What the algorithm does is find the cluster (centroid) positions that minimize the distances to all points in the cluster. This is done iteratively; the problem with the algorithm is that it can be a bit greedy, meaning that it will find the nearest minima quickly. This is generally solved with some kind of basin-hopping approach where the nearest minima found is randomly perturbed and the algorithm restarted. Due to this fact, the algorithm is dependent on good initial guesses as input. Suicide rate versus GDP versus absolute lattitude We will analyze the data of suicide rates versus GDP versus absolute lattitude or Degrees From Equator (DFE) for clusters. Our hypothesis from the visual inspection was that there were at least two distinct clusters, one with higher suicide rate, GDP, and absolute lattitude and one with lower. Here, the Hierarchical Data Format (HDF) file is now read in as a DataFrame. This time, we want to discard all the rows where one or more column entries are NaN or empty. Thus, we use the appropriate DataFrame method for this: TABLE_FILE = 'data/data_ch4.h5' d2 = pd.read_hdf(TABLE_FILE) d2 = d2.dropna() Next, while the DataFrame is a very handy format, which we will utilize later on, the input to the cluster algorithms in SciPy do not handle Pandas data types natively. Thus, we transfer the data to a NumPy array: rates = d2[['DFE','GDP_CD','Both']].as_matrix().astype('float') Next, to recap, we visualise the data by one histogram of the GDP and one scatter plot for all the data. We do this to aid us in the initial guesses of the cluster centroid positions: plt.subplots(12, figsize=(8,3.5)) plt.subplot(121) plt.hist(rates.T[1], bins=20,color='SteelBlue') plt.xticks(rotation=45, ha='right') plt.yscale('log') plt.xlabel('GDP') plt.ylabel('Counts') plt.subplot(122) plt.scatter(rates.T[0], rates.T[2], s=2e5*rates.T[1]/rates.T[1].max(), color='SteelBlue', edgecolors='0.3'); plt.xlabel('Absolute Latitude (Degrees, 'DFE')') plt.ylabel('Suicide Rate (per 100')') plt.subplots_adjust(wspace=0.25); The scatter plots shows the GDP as size. The function to run the clustering k-means takes a special kind of normalized input. The data arrays (columns) have to be normalized by the standard deviation of the array. Although this is straightforward, there is a function included in the module called whiten. It will scale the data with the standard deviation: w = vq.whiten(rates) To show what it does to the data, we plot the same plots as we did previously, but with the output from the whiten function: plt.subplots(12, figsize=(8,3.5)) plt.subplot(121) plt.hist(w[:,1], bins=20, color='SteelBlue') plt.yscale('log') plt.subplot(122) plt.scatter(w.T[0], w.T[2], s=2e5*w.T[1]/w.T[1].max(), color='SteelBlue', edgecolors='0.3') plt.xticks(rotation=45, ha='right'); As you can see, all the data is scaled from the previous figure. However, as mentioned, the scaling is just the standard deviation. So let's calculate the scaling and save it to the sc variable: sc = rates.std(axis=0) Now we are ready to estimate the initial guesses for the cluster centroids. Reading off the first plot of the data, we guess the centroids to be at 20 DFE, 200,000 GDP, and 10 suicides and the second at 45 DFE, 100,000 GDP, and 15 suicides. We put this in an array and scale it with our scale parameter to the same scale as the output from the whiten function. This is then sent to the kmeans2 function of SciPy: init_guess = np.array([[20,20E3,10],[45,100E3,15]]) init_guess /= sc z2_cb, z2_lbl = vq.kmeans2(w, init_guess, minit='matrix', iter=500) There is another function, kmeans (without the 2), which is a less complex version and does not stop iterating when it reaches a local minima. It stops when the changes between two iterations go below some level. Thus, the standard K-means algorithm is represented in SciPy by the kmeans2 function. The function outputs the centroids' scaled positions (here z2_cb) and a lookup table (z2_lbl) telling us which row belongs to which centroid. To get the centroid positions in units we understand, we simply multiply with our scaling value: z2_cb_sc = z2_cb * sc At this point, we can plot the results. The following section is rather long and contains many different parts so we will go through them section by section. However, the code should be run in one cell of the Notebook: # K-means clustering figure START plt.figure(figsize=(6,4)) plt.scatter(z2_cb_sc[0,0], z2_cb_sc[0,2], s=5e2*z2_cb_sc[0,1]/rates.T[1].max(), marker='+', color='k', edgecolors='k', lw=2, zorder=10, alpha=0.7); plt.scatter(z2_cb_sc[1,0], z2_cb_sc[1,2], s=5e2*z2_cb_sc[1,1]/rates.T[1].max(), marker='+', color='k', edgecolors='k', lw=3, zorder=10, alpha=0.7); The first steps are quite simple—we set up the figure size and plot the points of the cluster centroids. We hypothesized about two clusters; thus, we plot them with two different calls to plt.scatter. Here, z2_cb_sc[1,0] gets the second cluster x-coordinate (DFE); then switching 0 for 1 gives us the y coordinate (rate). We set the size of the marker s-parameter to scale with the value of the third data axis, the GDP. We also do this further down for the data, just as in previous plots, so that it is easier to compare and differentiate the clusters. The zorder keyword gives the order in depth of the elements that are plotted; a high zorder will put them on top of everything else and a negative zorder will send them to the back. s0 = abs(z2_lbl==0).astype('bool') s1 = abs(z2_lbl==1).astype('bool') pattern1 = 5*'x' pattern2 = 4*'/' plt.scatter(w.T[0][s0]*sc[0], w.T[2][s0]*sc[2], s=5e2*rates.T[1][s0]/rates.T[1].max(), lw=1, hatch=pattern1, edgecolors='0.3', color=plt.cm.Blues_r( rates.T[1][s0]/rates.T[1].max())); plt.scatter(rates.T[0][s1], rates.T[2][s1], s=5e2*rates.T[1][s1]/rates.T[1].max(), lw=1, hatch=pattern2, edgecolors='0.4', marker='s', color=plt.cm.Reds_r( rates.T[1][s1]/rates.T[1].max()+0.4)) In this section, we plot the points of the clusters. First, we get the selection (Boolean) arrays. They are simply found by setting all indexes that refer to cluster 0 to True and all else to False; this gives us the Boolean array for cluster 0 (the first cluster). The second Boolean array is matched for the second cluster (cluster 1). Next, we define the hatch pattern for the scatter plot markers, which we later give as input to the plotting function. The multiplier for the hatch pattern gives the density of the pattern. The scatter plots for the points are created in a similar fashion as the centroids, except that the markers are a bit more complex. They are both color-coded, like in the previous example with Cholera deaths, but in a gradient instead of the exact same colors for all points. The gradient is defined by the GDP, which also defines the size of the points. The x and y data sent to the plot is different between the clusters, but they access the same data in the end because we multiply with our scaling factor. p1 = plt.scatter([],[], hatch='None', s=20E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p2 = plt.scatter([],[], hatch='None', s=40E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p3 = plt.scatter([],[], hatch='None', s=60E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p4 = plt.scatter([],[], hatch='None', s=80E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) labels = ["20'", "40'", "60'", ">80'"] plt.legend([p1, p2, p3, p4], labels, ncol=1, frameon=True, #fontsize=12, handlelength=1, loc=1, borderpad=0.75,labelspacing=0.75, handletextpad=0.75, title='GDP', scatterpoints=1.5) plt.ylim((-4,40)) plt.xlim((-4,80)) plt.title('K-means clustering') plt.xlabel('Absolute Lattitude (Degrees, 'DFE')') plt.ylabel('Suicide Rate (per 100 000)'); The last tweak to the plot is made by creating a custom legend. We want to show different sizes of the points and what GDP they correspond to. As there is a continuous gradient from low to high, we cannot use the plotted points. Thus we create our own, but leave the x and y input coordinates as empty lists. This will not show anything in the plot but we can use them to register in the legend. The various tweaks to the legend function controls different aspects of the legend layout. I encourage you to experiment with it to see what happens: As for the final analysis, two different clusters are identified. Just as our previous hypothesis, there is a cluster with a clear linear trend with relatively higher GDP, which is also located at higher absolute latitude. Although the identification is rather weak, it is clear that the two groups are separated. Countries with low GDP are clustered closer to the equator. What happens when you add more clusters? Try to add a cluster for the low DFE, high rate countries, visualize it, and think about what this could mean for the conclusion(s). Summary In this article, we identified clusters using methods such as finding the centroid positions and K-means clustering. For more information about Python Data Analysis, refer to the following books by Packt Publishing: Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/python-data-analysis) Getting Started with Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/getting-started-python-data-analysis) Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Basics of Jupyter Notebook and Python [article] Scientific Computing APIs for Python [article]

0
0
2818

Packt

01 Jun 2016

8 min read

Classifier Construction

Packt

01 Jun 2016

8 min read

In this article by Pratik Joshi, author of the book Python Machine Learning Cookbook, we will build a simple classifier using supervised learning, and then go onto build a logistic-regression classifier. Building a simple classifier In the field of machine learning, classification refers to the process of using the characteristics of data to separate it into a certain number of classes. A supervised learning classifier builds a model using labeled training data, and then uses this model to classify unknown data. Let's take a look at how to build a simple classifier. (For more resources related to this topic, see here.) How to do it… Before we begin, make sure thatyou have imported thenumpy and matplotlib.pyplot packages. After this, let's create some sample data: X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]]) Let's assign some labels to these points: y = [0, 1, 1, 0, 0, 1, 1, 0] As we have only two classes, the list y contains 0s and 1s. In general, if you have N classes, then the values in y will range from 0 to N-1. Let's separate the data into classes that are based on the labels: class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0]) class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1]) To get an idea about our data, let's plot this, as follows: plt.figure() plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s') plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x') This is a scatterplot where we use squares and crosses to plot the points. In this context,the marker parameter specifies the shape that you want to use. We usesquares to denote points in class_0 and crosses to denote points in class_1. If you run this code, you will see the following figure: In the preceding two lines, we just use the mapping between X and y to create two lists. If you were asked to inspect the datapoints visually and draw a separating line, what would you do? You would simply draw a line in between them. Let's go ahead and do this: line_x = range(10) line_y = line_x We just created a line with the mathematical equation,y = x. Let's plot this, as follows: plt.figure() plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s') plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x') plt.plot(line_x, line_y, color='black', linewidth=3) plt.show() If you run this code, you should see the following figure: There's more… We built a really simple classifier using the following rule: the input point (a, b) belongs to class_0 if a is greater than or equal tob;otherwise, it belongs to class_1. If you inspect the points one by one, you will see that this is true. This is it! You just built a linear classifier that can classify unknown data. It's a linear classifier because the separating line is a straight line. If it's a curve, then it becomes a nonlinear classifier. This formation worked fine because there were a limited number of points, and we could visually inspect them. What if there are thousands of points? How do we generalize this process? Let's discuss this in the next section. Building a logistic regression classifier Despite the word regression being present in the name, logistic regression is actually used for classification purposes. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data. Let's see how to do that in Python: We will use the logistic_regression.pyfile that is already provided to you as a reference. Assuming that you have imported the necessary packages, let's create some sample data along with training labels: X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]]) y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]) Here, we assume that we have three classes. Let's initialize the logistic regression classifier: classifier = linear_model.LogisticRegression(solver='liblinear', C=100) There are a number of input parameters that can be specified for the preceding function, but a couple of important ones are solver and C. The solverparameter specifies the type of solver that the algorithm will use to solve the system of equations. The C parameter controls the regularization strength. A lower value indicates higher regularization strength. Let's train the classifier: classifier.fit(X, y) Let's draw datapoints and boundaries: plot_classifier(classifier, X, y) We need to define this function: def plot_classifier(classifier, X, y): # define ranges to plot the figure x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0 y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0 The preceding values indicate the range of values that we want to use in our figure. These values usually range from the minimum value to the maximum value present in our data. We add some buffers, such as 1.0 in the preceding lines, for clarity. In order to plot the boundaries, we need to evaluate the function across a grid of points and plot it. Let's go ahead and define the grid: # denotes the step size that will be used in the mesh grid step_size = 0.01 # define the mesh grid x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size)) The x_values and y_valuesvariables contain the grid of points where the function will be evaluated. Let's compute the output of the classifier for all these points: # compute the classifier output mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()]) # reshape the array mesh_output = mesh_output.reshape(x_values.shape) Let's plot the boundaries using colored regions: # Plot the output using a colored plot plt.figure() # choose a color scheme plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1) This is basically a 3D plotter that takes the 2D points and the associated values to draw different regions using a color scheme. You can find all the color scheme options athttp://matplotlib.org/examples/color/colormaps_reference.html. Let's overlay the training points on the plot: plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', linewidth=2, cmap=plt.cm.Paired) # specify the boundaries of the figure plt.xlim(x_values.min(), x_values.max()) plt.ylim(y_values.min(), y_values.max()) # specify the ticks on the X and Y axes plt.xticks((np.arange(int(min(X[:, 0])-1), int(max(X[:, 0])+1), 1.0))) plt.yticks((np.arange(int(min(X[:, 1])-1), int(max(X[:, 1])+1), 1.0))) plt.show() Here, plt.scatter plots the points on the 2D graph. TheX[:, 0] specifies that we should take all the values along axis 0 (X-axis in our case), and X[:, 1] specifies axis 1 (Y-axis). The c=y parameter indicates the color sequence. We use the target labels to map to colors using cmap. We basically want different colors based on the target labels; hence, we use y as the mapping. The limits of the display figure are set using plt.xlim and plt.ylim. In order to mark the axes with values, we need to use plt.xticks and plt.yticks. These functions mark the axes with values so that it's easier for us to see where the points are located. In the preceding code, we want the ticks to lie between the minimum and maximum values with a buffer of 1 unit. We also want these ticks to be integers. So, we use theint() function to round off the values. If you run this code, you should see the following output: Let's see how the Cparameter affects our model. The C parameter indicates the penalty for misclassification. If we set this to 1.0, we will get the following figure: If we set C to 10000, we get the following figure: As we increase C, there is a higher penalty for misclassification. Hence, the boundaries get more optimal. Summary We successfully employed supervised learning to build a simple classifier. We subsequently went on to construct a logistic-regression classifier and saw different results of tweaking C—the regularization strength parameter. Resources for Article: Further resources on this subject: Python Scripting Essentials [article] Web scraping with Python (Part 2) [article] Web Server Development [article]

0
0
2323

Packt

01 Jun 2016

7 min read

Linking Data to Shapes

Packt

01 Jun 2016

7 min read

0
0
7293

Packt

31 May 2016

19 min read

Holistic View on Spark

Packt

31 May 2016

19 min read

0
0
2732

article-image-splunks-input-methods-and-data-feeds

Packt

30 May 2016

13 min read

Splunk's Input Methods and Data Feeds

Packt

30 May 2016

13 min read

0
0
16491

article-image-security-considerations-multitenant-environment

Packt

24 May 2016

8 min read

Security Considerations in Multitenant Environment

Packt

24 May 2016

8 min read

In this article by Zoran Pavlović and Maja Veselica, authors of the book, Oracle Database 12c Security Cookbook, we will be introduced to common privileges and learn how to grant privileges and roles commonly. We'll also study the effects of plugging and unplugging operations on users, roles, and privileges. (For more resources related to this topic, see here.) Granting privileges and roles commonly Common privilege is a privilege that can be exercised across all containers in a container database. Depending only on the way it is granted, a privilege becomes common or local. When you grant privilege commonly (across all containers) it becomes common privilege. Only common users or roles can have common privileges. Only common role can be granted commonly. Getting ready For this recipe, you will need to connect to the root container as an existing common user who is able to grant a specific privilege or existing role (in our case – create session, select any table, c##role1, c##role2) to another existing common user (c##john). If you want to try out examples in the How it works section given ahead, you should open pdb1 and pdb2. You will use: Common users c##maja and c##zoran with dba role granted commonly Common user c##john Common roles c##role1 and c##role2 How to do it... You should connect to the root container as a common user who can grant these privileges and roles (for example, c##maja or system user). SQL> connect c##maja@cdb1 Grant a privilege (for example, create session) to a common user (for example, c##john) commonly c##maja@CDB1> grant create session to c##john container=all; Grant a privilege (for example, select any table) to a common role (for example, c##role1) commonly c##maja@CDB1> grant select any table to c##role1 container=all; Grant a common role (for example, c##role1) to a common role (for example, c##role2) commonly c##maja@CDB1> grant c##role1 to c##role2 container=all; Grant a common role (for example, c##role2) to a common user (for example, c##john) commonly c##maja@CDB1> grant c##role2 to c##john container=all; How it works... Figure 16 You can grant privileges or common roles commonly only to a common user. You need to connect to the root container as a common user who is able to grant a specific privilege or role. In step 2, system privilege, create session, is granted to common user c##john commonly, by adding a container=all clause to the grant statement. This means that user c##john can connect (create session) to root or any pluggable database in this container database (including all pluggable databases that will be plugged-in in the future). N.B. container = all clause is NOT optional, even though you are connected to the root. Unlike during creation of common users and roles (if you omit container=all, user or role will be created in all containers – commonly), If you omit this clause during privilege or role grant, privilege or role will be granted locally and it can be exercised only in root container. SQL> connect c##john/oracle@cdb1 c##john@CDB1> connect c##john/oracle@pdb1 c##john@PDB1> connect c##john/oracle@pdb2 c##john@PDB2> In the step 3, system privilege, select any table, is granted to common role c##role1 commonly. This means that role c##role1 contains select any table privilege in all containers (root and pluggable databases). c##zoran@CDB1> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM ------------- ----------------- --- --- C##ROLE1 SELECT ANY TABLE NO YES c##zoran@CDB1> connect c##zoran/oracle@pdb1 c##zoran@PDB1> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM -------------- ------------------ --- --- C##ROLE1 SELECT ANY TABLE NO YES c##zoran@PDB1> connect c##zoran/oracle@pdb2 c##zoran@PDB2> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM -------------- ---------------- --- --- C##ROLE1 SELECT ANY TABLE NO YES In step 4, common role c##role1, is granted to another common role c##role2 commonly. This means that role c##role2 has granted role c##role1 in all containers. c##zoran@CDB1> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED ROLE ADM COM --------------- --------------- --- --- C##ROLE2 C##ROLE1 NO YES c##zoran@CDB1> connect c##zoran/oracle@pdb1 c##zoran@PDB1> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED_ROLE ADM COM ------------- ----------------- --- --- C##ROLE2 C##ROLE1 NO YES c##zoran@PDB1> connect c##zoran/oracle@pdb2 c##zoran@PDB2> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED_ROLE ADM COM ------------- ------------- --- --- C##ROLE2 C##ROLE1 NO YES In step 5, common role c##role2, is granted to common user c##john commonly. This means that user c##john has c##role2 in all containers. Consequently, user c##john can use select any table privilege in all containers in this container database. c##john@CDB1> select count(*) from c##zoran.t1; COUNT(*) ---------- 4 c##john@CDB1> connect c##john/oracle@pdb1 c##john@PDB1> select count(*) from hr.employees; COUNT(*) ---------- 107 c##john@PDB1> connect c##john/oracle@pdb2 c##john@PDB2> select count(*) from sh.sales; COUNT(*) ---------- 918843 Effects of plugging/unplugging operations on users, roles, and privileges Purpose of this recipe is to show what is going to happen to users, roles, and privileges when you unplug a pluggable database from one container database (cdb1) and plug it into some other container database (cdb2). Getting ready To complete this recipe, you will need: Two container databases (cdb1 and cdb2) One pluggable database (pdb1) in container database cdb1 Local user mike in pluggable database pdb1 with local create session privilege Common user c##john with create session common privilege and create synonym local privilege on pluggable database pdb1 How to do it... Connect to the root container of cdb1 as user sys: SQL> connect sys@cdb1 as sysdba Unplug pdb1 by creating XML metadata file: SQL> alter pluggable database pdb1 unplug into '/u02/oradata/pdb1.xml'; Drop pdb1 and keep datafiles: SQL> drop pluggable database pdb1 keep datafiles; Connect to the root container of cdb2 as user sys: SQL> connect sys@cdb2 as sysdba Create (plug) pdb1 to cdb2 by using previously created metadata file: SQL> create pluggable database pdb1 using '/u02/oradata/pdb1.xml' nocopy; How it works... By completing previous steps, you unplugged pdb1 from cdb1 and plugged it into cdb2. After this operation, all local users and roles (in pdb1) are migrated with pdb1 database. If you try to connect to pdb1 as a local user: SQL> connect mike@pdb1 It will succeed. All local privileges are migrated, even if they are granted to common users/roles. However, if you try to connect to pdb1 as a previously created common user c##john, you'll get an error SQL> connect c##john@pdb1 ERROR: ORA-28000: the account is locked Warning: You are no longer connected to ORACLE. This happened because after migration, common users are migrated in a pluggable database as locked accounts. You can continue to use objects in these users' schemas, or you can create these users in root container of a new CDB. To do this, we first need to close pdb1: sys@CDB2> alter pluggable database pdb1 close; Pluggable database altered. sys@CDB2> create user c##john identified by oracle container=all; User created. sys@CDB2> alter pluggable database pdb1 open; Pluggable database altered. If we try to connect to pdb1 as user c##john, we will get an error: SQL> conn c##john/oracle@pdb1 ERROR: ORA-01045: user C##JOHN lacks CREATE SESSION privilege; logon denied Warning: You are no longer connected to ORACLE. Even though c##john had create session common privilege in cdb1, he cannot connect to the migrated PDB. This is because common privileges are not migrated! So we need to give create session privilege (either common or local) to user c##john. sys@CDB2> grant create session to c##john container=all; Grant succeeded. Let's try granting a create synonym local privilege to the migrated pdb2: c##john@PDB1> create synonym emp for hr.employees; Synonym created. This proves that local privileges are always migrated. Summary In this article, we learned about common privileges and the methods to grant common privileges and roles to users. We also studied what happens to users, roles, and privileges when you unplug a pluggable database from one container database and plug it into some other container database. Resources for Article: Further resources on this subject: Oracle 12c SQL and PL/SQL New Features[article] Oracle GoldenGate 12c — An Overview[article] Backup and Recovery for Oracle SOA Suite 12C[article]

0
0
1671

Packt

20 May 2016

28 min read

Visualizations Using CCC

Packt

20 May 2016

28 min read

0
0
6801

Packt

26 Apr 2016

10 min read

Advanced Shell Topics

Packt

26 Apr 2016

10 min read

In this article by Thomas Bitterman, the author of the book Mastering IPython 4.0, we will look at the tools the IPython interactive shell provides. With the split of the Jupyter and IPython projects, the command line provided by IPython will gain importance. This article covers the following topics: What is IPython? Installing IPython Starting out with the terminal IPython beyond Python Magic commands (For more resources related to this topic, see here.) What is IPython? IPython is an open source platform for interactive and parallel computing. It started with the realization that the standard Python interpreter was too limited for sustained interactive use, especially in the areas of scientific and parallel computing. Overcoming these limitations resulted in a three-part architecture: An enhanced, interactive shell Separation of the shell from the computational kernel A new architecture for parallel computing This article will provide a brief overview of the architecture before introducing some basic shell commands. Before proceeding further, however, IPython needs to be installed. Those readers with experience in parallel and high-performance computing but new to IPython will find the following sections useful in quickly getting up to speed. Those experienced with IPython may skim the next few sections, noting where things have changed now that the notebook is no longer an integral part of development. Installing IPython The first step in installing IPython is to install Python. Instructions for the various platforms differ, but the instructions for each can be found on the Python home page at http://www.python.org. IPython requires Python 2.7 or ≥ 3.3. This article will use 3.5. Both Python and IPython are open source software, so downloading and installation are free. A standard Python installation includes the pip package manager. pip is a handy command-line tool that can be used to download and install various Python libraries. Once Python is installed, IPython can be installed with this command: pip install ipython IPython comes with a test suite called iptest. To run it, simply issue the following command: iptest A series of tests will be run. It is possible (and likely on Windows) that some libraries will be missing, causing the associated tests to fail. Simply use pip to install those libraries and rerun the test until everything passes. It is also possible that all tests pass without an important library being installed. This is the readline library (also known as PyReadline). IPython will work without it but will be missing some features that are useful for the IPython terminal, such as command completion and history navigation. To install readline, use pip: pip install readline pip install gnureadline At this point, issuing the ipython command will start up an IPython interpreter: ipython IPython beyond Python No one would use IPython if it were not more powerful than the standard terminal. Much of IPython's power comes from two features: Shell integration Magic commands Shell integration Any command starting with ! is passed directly to the operating system to be executed, and the result is returned. By default, the output is then printed out to the terminal. If desired, the result of the system command can be assigned to a variable. The result is treated as a multiline string, and the variable is a list containing one string element per line of output. For example: In [22]: myDir = !dir In [23]: myDir Out[23]: [' Volume in drive C has no label.', ' Volume Serial Number is 1E95-5694', '', ' Directory of C:\Program Files\Python 3.5', '', '10/04/2015 08:43 AM <DIR> .', '10/04/2015 08:43 AM <DIR> ..',] While this functionality is not entirely absent in straight Python (the OS and subprocess libraries provide similar abilities), the IPython syntax is much cleaner. Additional functionalities such as input and output caching, directory history, and automatic parentheses are also included. History The previous examples have had lines that were prefixed by elements such as In[23] and Out[15]. In and Out are arrays of strings, where each element is either an input command or the resulting output. They can be referred to using the arrays notation, or "magic" commands can accept the subscript alone. Magic commands IPython also accepts commands that control IPython itself. These are called "magic" commands, and they start with % or %%. A complete list of magic commands can be found by typing %lsmagic in the terminal. Magics that start with a single % sign are called "line" magics. They accept the rest of the current line for arguments. Magics that start with %% are called "cell" magics. They accept not only the rest of the current line but also the following lines. There are too many magic commands to go over in detail, but there are some related families to be aware of: OS equivalents: %cd, %env, %pwd Working with code: %run, %edit, %save, %load, %load_ext, %%capture Logging: %logstart, %logstop, %logon, %logoff, %logstate Debugging: %debug, %pdb, %run, %tb Documentation: %pdef, %pdoc, %pfile, %pprint, %psource, %pycat, %%writefile Profiling: %prun, %time, %run, %time, %timeit Working with other languages: %%script, %%html, %%javascript, %%latex, %%perl, %%ruby With magic commands, IPython becomes a more full-featured development environment. A development session might include the following steps: Set up the OS-level environment with the %cd, %env, and ! commands. Set up the Python environment with %load and %load_ext. Create a program using %edit. Run the program using %run. Log the input/output with %logstart, %logstop, %logon, and %logoff. Debug with %pdb. Create documentation with %pdoc and %pdef. This is not a tenable workflow for a large project, but for exploratory coding of smaller modules, magic commands provide a lightweight support structure. Creating custom magic commands IPython supports the creation of custom magic commands through function decorators. Luckily, one does not have to know how decorators work in order to use them. An example will explain. First, grab the required decorator from the appropriate library: In [1]: from IPython.core.magic import(register_line_magic) Then, prepend the decorator to a standard IPython function definition: In [2]: @register_line_magic ...: def getBootDevice(line): ...: sysinfo = !systeminfo ...: for ln in sysinfo: ...: if ln.startswith("Boot Device"): ...: return(ln.split()[2]) ...: Your new magic is ready to go: In [3]: %getBootDevice Out[3]: '\Device\HarddiskVolume1' Some observations are in order: Note that the function is, for the most part, standard Python. Also note the use of the !systeminfo shell command. One can freely mix both standard Python and IPython in IPython. The name of the function will be the name of the line magic. The parameter, "line," contains the rest of the line (in case any parameters are passed). A parameter is required, although it need not be used. The Out associated with calling this line magic is the return value of the magic. Any print statements executed as part of the magic are displayed on the terminal but are not part of Out (or _). Cython We are not limited to writing custom magic commands in Python. Several languages are supported, including R and Octave. We will look at one in particular, Cython. Cython is a language that can be used to write C extensions for Python. The goal for Cython is to be a superset of Python, with support for optional static type declarations. The driving force behind Cython is efficiency. As a compiled language, there are performance gains to be had from running C code. The downside is that Python is much more productive in terms of programmer hours. Cython can translate Python code into compiled C code, achieving more efficient execution at runtime while retaining the programmer-friendliness of Python. The idea of turning Python into C is not new to Cython. The default and most widely used interpreter (CPython) for Python is written in C. In some sense then, running Python code means running C code, just through an interpreter. There are other Python interpreter implementations as well, including those in Java (Jython) and C# (IronPython). CPython has a foreign function interface to C. That is, it is possible to write C language functions that interface with CPython in such a way that data can be exchanged and functions invoked from one to the other. The primary use is to call C code from Python. There are, however, two primary drawbacks: writing code that works with the CPython foreign function interface is difficult in its own right; and doing so requires knowledge of Python, C, and CPython. Cython aims to remedy this problem by doing all the work of turning Python into C and interfacing with CPython internally to Cython. The programmer writes Cython code and leaves the rest to the Cython compiler. Cython is very close to Python. The primary difference is the ability to specify C types for variables using the cdef keyword. Cython then handles type checking and conversion between Python values and C values, scoping issues, marshalling and unmarshalling of Python objects into C structures, and other cross-language issues. Cython is enabled in IPython by loading an extension. In order to use the Cython extension, do this: In [1]: %load_ext Cython At this point, the cython cell magic can be invoked: In [2]: %%cython ...: def sum(int a, int b): ...: cdef int s = a+b ...: return s And the Cython function can now be called just as if it were a standard Python function: In [3]: sum(1, 1) Out[3]: 2 While this may seem like a lot of work for something that could have been written more easily in Python in the first place, that is the price to be paid for efficiency. If, instead of simply summing two numbers, a function is expensive to execute and is called multiple times (perhaps in a tight loop), it can be worth it to use Cython for a reduction in runtime. There are other languages that have merited the same treatment, GNU Octave and R among them. Summary In this article, we covered many of the basics of using IPython for development. We started out by just getting an instance of IPython running. The intrepid developer can perform all the steps by hand, but there are also various all-in-one distributions available that will include popular modules upon installation. By default, IPython will use the pip package managers. Again, the all-in-one distributions provide added value, this time in the form of advanced package management capability. At that point, all that is obviously available is a terminal, much like the standard Python terminal. IPython offers two additional sources of functionality, however: configuration and magic commands. Magic commands fall into several categories: OS equivalents, working with code, logging, debugging, documentation, profiling, and working with other languages among others. Add to this the ability to create custom magic commands (in IPython or another language) and the IPython terminal becomes a much more powerful alternative to the standard Python terminal. Also included in IPython is the debugger—ipdb. It is very similar to the Python pdb debugger, so it should be familiar to Python developers. All this is supported by the IPython architecture. The basic idea is that of a Read-Eval-Print loop in which the Eval section has been separated out into its own process. This decoupling allows different user interface components and kernels to communicate with each other, making for a flexible system. This flexibility extends to the development environment. There are IDEs devoted to IPython (for example, Spyder and Canopy) and others that originally targeted Python but also work with IPython (for example, Eclipse). There are too many Python IDEs to list, and many should work with an IPython kernel "dropped in" as a superior replacement to a Python interpreter. Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Scientific Computing APIs for Python [article] Overview of Process Management in Microsoft Visio 2013 [article]

0
0
1964

article-image-getting-started-apache-hadoop-and-apache-spark

Packt

22 Apr 2016

12 min read

Getting Started with Apache Hadoop and Apache Spark

Packt

22 Apr 2016

12 min read

0
0
4265

Packt

15 Apr 2016

17 min read

Finding Patterns in the Noise – Clustering and Unsupervised Learning

Packt

15 Apr 2016

17 min read

In this article by, Joseph J, author of Mastering Predictive Analytics with Python, we will cover one of the natural questions to ask about a dataset is if it contains groups. For example, if we examine financial markets as a time series of prices over time, are there groups of stocks that behave similarly over time? Likewise, in a set of customer financial transactions from an e-commerce business, are there user accounts distinguished by patterns of similar purchasing activity? By identifying groups using the methods described in this article, we can understand the data as a set of larger patterns rather than just individual points. These patterns can help in making high-level summaries at the outset of a predictive modeling project, or as an ongoing way to report on the shape of the data we are modeling. Likewise, the groupings produced can serve as insights themselves, or they can provide starting points for the models. For example, the group to which a datapoint is assigned can become a feature of this observation, adding additional information beyond its individual values. Additionally, we can potentially calculate statistics (such as mean and standard deviation) for other features within these groups, which may be more robust as model features than individual entries. (For more resources related to this topic, see here.) In contrast to the methods, grouping or clustering algorithms are known as unsupervised learning, meaning we have no response, such as a sale price or click-through rate, which is used to determine the optimal parameters of the algorithm. Rather, we identify similar datapoints, and as a secondary analysis might ask whether the clusters we identify share a common pattern in their responses (and thus suggest the cluster is useful in finding groups associated with the outcome we are interested in). The task of finding these groups, or clusters, has a few common ingredients that vary between algorithms. One is a notion of distance or similarity between items in the dataset, which will allow us to compare them. A second is the number of groups we wish to identify; this can be specified initially using domain knowledge, or determined by running an algorithm with different choices of initial groups to identify the best number of groups that describes a dataset, as judged by numerical variance within the groups. Finally, we need a way to measure the quality of the groups we've identified; this can be done either visually or through the statistics that we will cover. In this article we will dive into: How to normalize data for use in a clustering algorithm and to compute similarity measurements for both categorical and numerical data How to use k-means to identify an optimal number of clusters by examining the loss function How to use agglomerative clustering to identify clusters at different scales Using affinity propagation to automatically identify the number of clusters in a dataset How to use spectral methods to cluster data with nonlinear boundaries Similarity and distance The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by what kinds of similarity we are trying to measure, in others it is restricted by the properties of the dataset. In the following we illustrate several kinds of distance for numerical, categorical, time series, and set-based data—while this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms. Numerical distances Let's begin by looking at an example contained in the wine.data file. It contains a set of chemical measurements that describe the properties of different kinds of wines, and the class of quality (I-III) to which the wine is assigned (Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.). Open the file in an iPython notebook and look at the first few rows: Notice that in this dataset we have no column descriptions. We need to parse these from the dataset description file wine.data. With the following code, we generate a regular expression that will match a header name (we match a pattern where a number followed by a parenthesis has a column name after it, as you can see in the list of column names listed in the file), and add these to an array of column names along with the first column, which is the class label of the wine (whether it belongs to category I-III). We then assign this list to the dataframe column names: Now that we have appended the column names, we can look at a summary of the dataset: How can we calculate a similarity between wines based on this data? One option would be to consider each of the wines as a point in a thirteen-dimensional space specified by its dimensions (for example, each of the properties other than the class). Since the resulting space has thirteen dimensions, we can't directly visualize the datapoints using a scatterplot to see if they are nearby, but we can calculate distances just the same as with a more familiar 2- or 3-dimensional space using the Euclidean distance formula, which is simply the length of the straight line between two points. This formula for this length can be used whether the points are in a 2-dimensional plot or a more complex space such as this example, and is given by: Here aand bare rows of the dataset and nis the number of columns. One feature of the Euclidean distance is that columns whose scale is much different from others can distort it. In our example, the values describing the magnesium content of each wine are ~100 times greater than the magnitude of features describing the alcohol content or ash percentage. If we were to calculate the distance between these datapoints, it would largely be determined by the magnesium concentration (as even small differences on this scale overwhelmingly determine the value of the distance calculation), rather than any of its other properties. While this might sometimes be desirable, in most applications we do not favour one feature over another and want to give equal weight to all columns. To get a fair distance comparison between these points, we need to normalize the columns so that they fall into the same numerical range (have similar maxima and minima values). We can do so using the scale()function in scikit-learn: This function will subtract the mean value of a column from each element and then divide each point by the standard deviation of the column. This normalization centers each column at 0 with variance 1, and in the case of normally distributed data this would make a standard normal distribution. Also note that the scale() function returns a numpy dataframe, which is why we must call dataframe on the output to use the pandas function describe(). Now that we've scaled the data, we can calculate Euclidean distances between the points: We've now converted our dataset of 178 rows and 13 columns into a square matrix, giving the distance between each of these rows. In other words, row I, column j in this matrix represents the Euclidean distance between rows I and j in our dataset. This 'distance matrix' is the input we will use for clustering inputs in the following section. If we just want to get a visual sense of how the points compare to each other, we could use multidimensional scaling (MDS)—Modern Multidimensional Scaling - Theory and Applications Borg, I., Groenen P., Springer Series in Statistics (1997), Nonmetric multidimensional scaling: a numerical method, Kruskal, J. Psychometrika, 29 (1964), and Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Kruskal, J. Psychometrika, 29, (1964)—to create a visualization. Multidimensional scaling attempts to find the set of lower dimensional coordinates (here, two dimensions) that best represents the distances in the higher dimensions of a dataset (here, the pairwise Euclidean distances we calculated from the 13 dimensions). It does this by minimizing the coordinates (x, y) according to the strain function: Strain(x1…..xn) = (1 – Sum(ijdij*<xi,xj>)2/Sum(ij(dij**2)Sumij<xi,x,j>**2))1/2 Where d are the distances we've calculated between points. In other words, we find coordinates that best capture the variation in the distance through the variation in dot product the coordinates. We can then plot the resulting coordinates, using the wine class to label points in the diagram. Note that the coordinates themselves have no interpretation (in fact, they could change each time we run the algorithm). Rather, it is the relative position of points that we are interested in: Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see there is separation between the classes based on the features we've used to calculate distance, so conceptually it appears that this is a reasonable choice in this case. However, the decision also depends on what we are trying to compare; if we are interested in detecting wines with similar attributes in absolute values, then it is a good metric. However, what if we're not interested so much in the absolute composition of the wine, but whether its variables follow similar trends among wines with different alcohol contents? In this case, we wouldn't be interested in the absolute difference in values, but rather the correlationbetween the columns. This sort of comparison is common for time series, which we turn to next. Correlations and time series For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease. Let's look at an example of the Dow Jones industrial average over time (Brown, M. S., Pelosi, M., and Dirska, H. (2013). Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks and Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41.): This data contains the daily stock price (for 6 months) for a set of 30 stocks. Because all of the numerical values (the prices) are on the same scale, we won't normalize this data as with the wine dimensions. We notice two things about this data. First, the closing price per week (the variable we will use to calculate correlation) is presented as a string. Second, the date is not in the current format for plotting. We will process both columns to fix this, converting the columns to a float and datetime object, respectively: With this transformation, we can now make a pivot table to place the closing prices for week as columns and individual stocks as rows: As we can see, we only need columns 2 and onwards to calculate correlations between rows. Let's calculate the correlation between these time series of stock prices by selecting the second column to end columns of the data frame, calculating the pairwise correlations distance metric, and visualizing it using MDS, as before: It is important to note that the Pearson coefficient, which we've calculated here, is a measure of linearcorrelation between these time series. In other words, it captures the linear increase (or decrease) of the trend in one price relative to another, but won't necessarily capture nonlinear trends. We can see this by looking at the formula for the Pearson correlation, which is given by: P(a,b) = cov(a,b)/sd(a)/sd(b) = Sum(a-mean(b))*Sum(b-mean(b))/Sqrt(Sum(a-mean(a))2* Sqrt(Sum(b-mean(b)) This value varies from 1 (highly correlated) to -1 (inversely correlated), with 0 representing no correlation (such as a cloud of points). You might recognize the numerator of this equation as the covariance, which is a measure of how much two datasets, a and b, vary with one another. You can understand this by considering that the numerator is maximized when corresponding points in both datasets are above or below their mean value. However, whether this accurately captures the similarity in the data depends upon the scale. In data that is distributed in regular intervals between a maximum and minimum, with roughly the same difference between consecutive values (which is essentially how a trend line appears), it captures this pattern well. However, consider a case in which the data is exponentially distributed, with orders of magnitude differences between the minimum and maximum, and the difference between consecutive datapoints also varyies widely. Here, the Pearson correlation would be numerically dominated by only the largest terms, which might or might not represent the overall similarity in the data. This numerical sensitivity also occurs in the numerator, which represents the product of the standard deviations of both datasets. Thus, the value of the correlation is maximized when the variation in the two datasets is roughly explained by the product of their individual variations; there is no 'left over' variation between the datasets that is not explained by their respective standard deviations. Looking at the first two stocks in this dataset, this assumption of linearity appears to be a valid one for comparing datapoints: In addition to verifying that these stocks have a roughly linear correlation, this command introduces some new functions in pandas you may find useful. The first is iloc, which allows you to select indexed rows from a dataframe. The second is transpose, which inverts the rows and columns. Here, we select the first two rows, transpose, and then select all rows (prices) after the first (since the first is the Ticker symbol) Despite the trend we see in this example, we could imagine a nonlinear trend between prices. In these cases, it might be better to measure, not the linear correlation of the prices themselves, but whether the high prices for one stock coincide with another. In other words, the rank of market days by price should be the same, even if the prices are nonlinearly related. We can also calculate this rank correlation, also known as the Spearman's Rho, using scipy, with the following formula: Rho(a,b) = 6 * sum(d^2) / n (n2-1) Where n is the number of datapoints in each of two sets a and b, and d is the difference in ranks between each pair of datapoints ai and bi. Because we only compare the ranks of the data, not their actual values, this measure can capture variations up and down between two datasets, even if they vary over wide numerical ranges. Let's see if plotting the results using the Spearman correlation metric generates any differences in the pairwise distance of the stocks: The Spearman correlation distances, based on the x and y axes, appear closer to each other, suggesting from the perspective of rank correlation that the time series appear more similar. Though they differ in their assumptions about how the two compared datasets are distributed numerically, Pearson and Spearman correlations share the requirement that the two sets are of the same length. This is usually a reasonable assumption, and will be true of most of the examples we consider in this book. However, for cases where we wish to compare time series of unequal lengths, we can use Dynamic Time Warping (DTW). Conceptually, the idea of DTW is to warp one time series to align with a second, by allowing us to open gaps in either dataset so that it becomes the same size as the second. What the algorithm needs to resolve is where the most similar areas of the two series are, so that gaps can be places in the appropriate locations. In the simplest implementation, DTW consists of the following steps: For a dataset a of length n and a dataset n of length m, construct a matrix m by n. Set the top row and the leftmost column of this matrix to both be infinity. For each point i in set a, and point j in set b, compare their similarity using a cost function. To this cost function, add the minimum of the element (i-1, j-1), (i-1, j), and (j-1, i)—moving up and left, left, or up). These conceptually represent the costs of opening a gap in one of the series, versus aligning the same element in both. At the end of step 3, we will have traced the minimum cost path to align the two series, and the DTW distance will be represented by the bottommost corner of the matrix, (n.m). A negative aspect of this algorithm is that step 3 involves computing a value for every element of series a and b. For large time series or large datasets, this can be computationally prohibitive. While a full discussion of algorithmic improvements is beyond the scope of our present examples, we refer interested readers to FastDTW (which we will use in our example) and SparseDTW as examples of improvements that can be evaluated using many fewer calculations (Al-Naymat, G., Chawla, S., & Taheri, J. (2012), SparseDTW: A Novel Approach to Speed up Dynamic Time Warping and Stan Salvador and Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space. KDD Workshop on Mining Temporal and Sequential Data, pages 70-80, 20043). We can use the FastDTW algorithm to compare the stocks data as well, and to plot the resulting coordinates. First we will compare pairwise each pair of stocks and record their DTW distance in a matrix: For computational efficiency (because the distance between i and j equals the distance between stocks j and i), we calculate only the upper triangle of this matrix. We then add the transpose (for example, the lower triangle) to this result to get the full distance matrix. Finally, we can use MDS again to plot the results: Compared to the distribution of coordinates along the x and y axis for Pearson correlation and rank correlation, the DTW distances appear to span a wider range, picking up more nuanced differences between the time series of stock prices. Now that we've looked at numerical and time series data, as a last example let's examine calculating similarity in categorical datasets. Summary In this section, we learned how to identify groups of similar items in a dataset, an exploratory analysis that we might frequently use as a first step in deciphering new datasets. We explored different ways of calculating the similarity between datapoints and described what kinds of data these metrics might best apply to. We examined both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, where every datapoint starts as its own cluster. Using a number of datasets, we showed examples where these algorithms will perform better or worse, and some ways to optimize them. We also saw our first (small) data pipeline, a clustering application in PySpark using streaming data. Resources for Article: Further resources on this subject: Python Data Structures[article] Big Data Analytics[article] Data Analytics[article]

0
0
12166

How-To Tutorials - Data

Recommendation Systems

Data Science with R

Getting Started with TensorFlow: an API Primer

How to use SQLite with Ionic to store data?

Setting up Spark

Clustering Methods

Classifier Construction

Linking Data to Shapes

Holistic View on Spark

Splunk's Input Methods and Data Feeds

Trending Topics

Security Considerations in Multitenant Environment

Visualizations Using CCC

Advanced Shell Topics

Getting Started with Apache Hadoop and Apache Spark

Finding Patterns in the Noise – Clustering and Unsupervised Learning

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading