How-To Tutorials

How to perform Data exploration with RethinkDB

Vijin Boricha
14 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box] In this article, we will learn to do data exploration in RethinkDB with the help of few use case. Executing data exploration use cases We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string. In order to do this, we need to perform the following operation: Traverse through each document in the database Split the ctc string into two parts, containing $ and the other value Update the ctc value in the document with a new data type and value Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows: var rethinkdb = require('rethinkdb'); var connection = null; rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) { if(err) { throw new Error('Connection error'); } connection = conn; rethinkdb.db("company").table("employees") .run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.each(function(err,data) { data.ctc = parseInt(data.ctc.split("$")[1]); rethinkdb.db("company").table("employees") .get(data.id) .update({ctc : data.ctc}) .run(connection,function(err,response) { if(err) { throw new Error(err); } console.log(response); }); }); }); }); As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document: After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure: This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record. Finding duplicate elements We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature). Here is the query for the same: r.db("company").table('employees').without('id').distinct().count() As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates: This implies that our records contain no duplicate documents. Finding the list of countries We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. 
Finding the list of countries

We can write a query to find all the countries present in our records, again using distinct(), this time selecting just the country field. Here is the query:

r.db("company").table('employees')("country").distinct()

In our records, this returns 124 countries.

Finding the top 10 employees with the highest salary

In this use case, we need to evaluate all the records and find the 10 employees with the highest pay, ordered from highest to lowest. Here is the query:

r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)

Here we are using orderBy, which by default sorts records in ascending order. To get the highest pay in the first document, we need descending order, which we get with the desc() ReQL command. The query returns 10 rows. You can modify the same query by limiting the number of results to one to get the highest-paid employee.

Displaying employee records with a specific name and location

To extract such records from our table, we again perform a filter, this time on the first_name and country fields. Here is the query to return those records:

r.db("company").table('employees').filter({"first_name" : "John", "country" : "Sweden"})

We are performing a basic filter that compares both fields. ReQL makes such queries easy to express because of its chaining feature.

To summarize, we looked at a few use cases where we had to alter and filter records to meet an exploration task, such as stripping the $ sign from ctc or converting base-256 IP addresses into base-10 values before querying them. We also covered some general use cases to get a practical feel for ReQL. If you are interested in learning about the RethinkDB query language, extending RethinkDB, and more, you may check out the book Mastering RethinkDB.

How to convert Java code into Kotlin

Aanand Shekhar
14 Feb 2018
2 min read
This Kotlin programming tutorial has been taken from Kotlin Programming Cookbook.

One of the best things about Kotlin is its interoperability with Java. If you're a Java programmer, that alone is a good reason to start learning it. If you're using an IntelliJ-based IDE, it's actually incredibly easy to convert Java code to Kotlin. In this step-by-step recipe, you'll find out how to do it.

What you need to convert your Java code to Kotlin

All you need to follow this recipe is an IntelliJ-based IDE that compiles and runs Kotlin and Java.

How to do it...

Here are the steps you need to follow to convert a Java file to a Kotlin file:

1. In your IntelliJ IDE, open the Java file that you want to convert to Kotlin. Note that it has a .java extension.
2. In the main menu, click on the Code menu and choose the Convert Java File to Kotlin File option. Your Java file will be converted into Kotlin, and the extension will now be .kt. The original article illustrated this with a Java file and its converted Kotlin counterpart (see the illustrative sketch at the end of this article).

A Kotlin file can also be converted into Java, but it's better if you can avoid it or find an alternative way to do what you need. If you absolutely have to convert your Kotlin code to Java, click on Tools | Kotlin | Show Kotlin Bytecode in the menu. A window will open with the title Kotlin Bytecode. Click on Decompile and a .java file will be generated, containing Java decompiled from the Kotlin bytecode. Yes, it has a lot of unnecessary code that was not present in the original Java code, but that is the nature of decompiled bytecode. At the moment, this is the only way to convert Kotlin code to Java. Copy the decompiled output into a .java file and remove the unnecessary code.

How it works...

Kotlin is a statically typed programming language that runs on the Java Virtual Machine and compiles into JVM-compatible bytecode. This is the reason we can convert Java code to Kotlin and mix Java and Kotlin code together. It is also the reason why you can, in a way, get Java code back from Kotlin (although the output is not exactly what you would write by hand).
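The screenshots from the original article are not reproduced here. As a stand-in, here is a small, hypothetical example (not the book's code) of a plain Java class and the kind of Kotlin the IntelliJ converter typically produces for it:

// Java: a simple immutable value class
public class Person {
    private final String name;
    private final int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() { return name; }
    public int getAge() { return age; }
}

// Kotlin: fields plus getters collapse into constructor properties
class Person(val name: String, val age: Int)

The converter's literal output may differ slightly (for example, it sometimes keeps explicit properties or adds nullability annotations); treat it as a starting point to clean up by hand.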

How to deploy RethinkDB using Docker

Vijin Boricha
14 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. 
# - 8080: web UI
# - 28015: client driver
# - 29015: cluster
EXPOSE 8080
EXPOSE 28015
EXPOSE 29015

The first line is our entry point to the Ubuntu operating system. We then update the system and use the installation commands recommended by RethinkDB at https://www.rethinkdb.com/docs/install/ubuntu/. Once the installation is complete, we install the RethinkDB Python driver, which we will need for import/export operations. The next two commands mount a new volume in the container and tell RethinkDB to use that volume as its working directory. The default command runs rethinkdb binding all network interfaces, and the EXPOSE instructions expose the ports used by the client drivers and the web console.

To turn this into a Docker image, save the file and run the following command from within the project directory:

docker build -t docker-rethinkdb .

Here we are building our Docker image and naming it docker-rethinkdb; when you run this command, Docker executes the Dockerfile step by step. Once everything works, and I am sure it will, you will see a success message in the console. Congratulations! You have successfully created a Docker image for RethinkDB. If you want to see your image and its properties, run the following command:

docker images

This lists all the Docker images on the machine. Awesome! Now let's run it. To access the web portal, we need to run our Docker image and bind port 8080 of the container to a port on our machine; here is the command to do so:

docker run -p 3000:8080 -d docker-rethinkdb

In this command, -p specifies the port binding: the first number (3000) is the port on the host machine and the second (8080) is the port inside the container. The -d switch runs the container in the background, as a daemon (see the end of this article for a variant that also publishes the driver port and persists data on the host). To get more information about the running container, run the following command:

docker ps

This lists all running images, called containers, along with their details. You can also check the logs of a specific container using the following command:

docker logs <container id>

Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, run the following command:

docker-machine ip default

This prints the IP. Copy the IP and open IP:3000 in a browser to view the RethinkDB web console. So we have Docker running and accessible from the browser. In order to import and export data, we need to log in to our running container. To do that, run the following command:

docker exec -i -t <container-id> /bin/bash

This opens a shell inside the container running Ubuntu. You can now run the rethinkdb command to import data into the existing RethinkDB cluster.

Deploying the Docker image

Almost every PaaS service we covered in earlier sections provides support for Docker. You can commit your Dockerfile to Git and clone it anywhere you want to build the Docker image. You can also push the whole Docker image (not just the Dockerfile) to Docker Hub and pull it directly with the docker pull command, which is no doubt the easier route, because you will be working directly with the image that runs on the server.

We have covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image.
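As a side note to the docker run step above, here is a hedged sketch (not from the book) of how you might also publish the client driver port and keep the data directory on the host, assuming the docker-rethinkdb image name used in this article:

docker run -d \
  --name rethinkdb \
  -p 3000:8080 \
  -p 28015:28015 \
  -v "$PWD/rethinkdb-data:/data" \
  docker-rethinkdb

With -v, the /data volume declared in the Dockerfile is backed by a directory on the host, so the database survives container restarts, and publishing 28015 lets client drivers on the host connect directly.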
You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.  

Getting Inside a C++ Multithreaded Application

Maya Posch
13 Feb 2018
8 min read
This C++ programming tutorial is taken from Maya Posch's Mastering C++ Multithreading.

In its most basic form, a multithreaded application consists of a single process with two or more threads. These threads can be used in a variety of ways; for example, to allow the process to respond to events in an asynchronous manner by using one thread per incoming event or type of event, or to speed up the processing of data by splitting the work across multiple threads. Examples of asynchronous responses to events include processing user interface (GUI) and network events on separate threads, so that neither type of event has to wait on the other or can block events from being responded to in time. Generally, a single thread performs a single task, such as the processing of GUI or network events, or the processing of data.

For this basic example, the application will start with a single thread, which will then launch a number of threads and wait for them to finish. Each of these new threads will perform its own task before finishing. Let's start with the includes and global variables for our application:

#include <iostream>
#include <thread>
#include <mutex>
#include <vector>
#include <random>

using namespace std;

// --- Globals
mutex values_mtx;
mutex cout_mtx;
vector<int> values;   // shared result vector, protected by values_mtx

Both the I/O stream and vector headers should be familiar to anyone who has used C++: the former is used here for the standard output (cout), and vector for storing a sequence of values. The random header is new in C++11 and, as the name suggests, offers classes and methods for generating random sequences. We use it here to make our threads do something interesting. Finally, the thread and mutex includes are the core of our multithreaded application; they provide the basic means for creating threads and allow for thread-safe interactions between them.

Moving on, we create two mutexes: one for the global vector and one for cout, since the latter is not thread-safe. Next we create the main function as follows:

int main() {
    values.push_back(42);

We push a fixed value onto the vector instance; this one will be used by the threads we create in a moment:

    thread tr1(threadFnc, 1);
    thread tr2(threadFnc, 2);
    thread tr3(threadFnc, 3);
    thread tr4(threadFnc, 4);

We create new threads, providing each with the name of the function to use and passing along any parameters (in this case, just a single integer):

    tr1.join();
    tr2.join();
    tr3.join();
    tr4.join();

Next, we wait for each thread to finish before we continue, by calling join() on each thread instance:

    cout << "Input: " << values[0] << ", Result 1: " << values[1]
         << ", Result 2: " << values[2] << ", Result 3: " << values[3]
         << ", Result 4: " << values[4] << "\n";
    return 1;
}

At this point, we expect that each thread has done whatever it is supposed to do and added its result to the vector, which we then read out and show to the user. Of course, this shows almost nothing of what really happens in the application, mostly just the essential simplicity of using threads.
Next, let's see what happens inside the function that we pass to each thread instance:

void threadFnc(int tid) {
    cout_mtx.lock();
    cout << "Starting thread " << tid << ".\n";
    cout_mtx.unlock();

    values_mtx.lock();
    int val = values[0];
    values_mtx.unlock();

When we obtain the initial value set in the vector, we copy it to a local variable so that we can immediately release the mutex for the vector, enabling other threads to use it:

    int rval = randGen(0, 10);
    val += rval;

These last two lines contain the essence of what the created threads do: they take the initial value and add a randomly generated value to it. The randGen() method takes two parameters defining the range of the returned value:

    cout_mtx.lock();
    cout << "Thread " << tid << " adding " << rval << ". New value: " << val << ".\n";
    cout_mtx.unlock();

    values_mtx.lock();
    values.push_back(val);
    values_mtx.unlock();
}

Finally, we (safely) log a message informing the user of the result of this action before adding the new value to the vector. In both cases, we use the respective mutex to ensure that there can be no overlap with any of the other threads. Once the function reaches this point, the thread containing it terminates, and the main thread has one fewer thread to wait for.

Lastly, we'll take a look at the randGen() method. Here we can see some multithreading-specific additions as well:

int randGen(const int& min, const int& max) {
    static thread_local mt19937 generator(hash<thread::id>()(this_thread::get_id()));
    uniform_int_distribution<int> distribution(min, max);
    return distribution(generator);
}

This method takes a minimum and maximum value, as explained earlier, which limit the range of the random numbers it can return. At its core, it uses an mt19937-based generator, which employs a 32-bit Mersenne Twister algorithm with a state size of 19937 bits. This is a common and appropriate choice for most applications.

Of note here is the use of the thread_local keyword. What this means is that even though the generator is defined as a static variable, its scope is limited to the thread using it. Every thread thus creates its own generator instance, which is important when using the random number API in the STL. A hash of the internal thread identifier (not our own) is used as the seed for the generator. This ensures that each thread gets a fairly unique seed for its generator instance, allowing for better random number sequences. Finally, we create a new uniform_int_distribution instance using the provided minimum and maximum limits, and use it together with the generator instance to generate the random number, which we return.

Makefile

In order to compile the code described earlier, one could use an IDE or type the command on the command line. As mentioned in the beginning of this chapter, we'll be using makefiles for the examples in this book. The big advantages of this are that one does not have to repeatedly type in the same extensive command, and it is portable to any system which supports make. The makefile for this example is rather basic:

GCC := g++
OUTPUT := ch01_mt_example
SOURCES := $(wildcard *.cpp)
CCFLAGS := -std=c++11

all: $(OUTPUT)

$(OUTPUT):
	$(GCC) -o $(OUTPUT) $(SOURCES) $(CCFLAGS)

clean:
	rm $(OUTPUT)

.PHONY: all

From the top down, we first define the compiler that we'll use (g++) and set the name of the output binary (the .exe extension on Windows will be appended automatically), followed by the gathering of the sources and any important compiler flags.
The wildcard feature allows one to collect the names of all files matching the pattern following it in one go, without having to list each source file in the folder individually. For the compiler flags, we're only really interested in enabling the C++11 features, for which GCC still requires this compiler flag. For the all target, we just tell make to run g++ with the supplied information. Next we define a simple clean target, which just removes the produced binary, and finally we tell make not to interpret any folder or file named all in the folder, but to use the internal target, via the .PHONY section.

We run this makefile with the following command:

$ make

Afterwards, we find an executable file called ch01_mt_example (with the .exe extension attached on Windows) in the same folder. Executing this binary results in command-line output akin to the following:

$ ./ch01_mt_example.exe
Starting thread 1.
Thread 1 adding 8. New value: 50.
Starting thread 2.
Thread 2 adding 2. New value: 44.
Starting thread 3.
Starting thread 4.
Thread 3 adding 0. New value: 42.
Thread 4 adding 8. New value: 50.
Input: 42, Result 1: 50, Result 2: 44, Result 3: 42, Result 4: 50

What one can see here already is the somewhat asynchronous nature of threads and their output. While threads 1 and 2 appear to run synchronously, threads 3 and 4 clearly run asynchronously. For this reason, and especially with longer-running threads, it's virtually impossible to say in which order the log output and results will be returned. While we use a simple vector to collect the results of the threads, there is no saying whether Result 1 truly originates from the thread we assigned ID 1 in the beginning. If we need this information, we need to extend the data we return by using an information structure with details on the processing thread, or similar. One could, for example, use a struct like this (a brief sketch of this approach appears at the end of this article):

struct result {
    int tid;
    int value;
};

The vector would then be changed to contain result instances rather than integer instances. One could pass the initial integer value directly to the thread as part of its parameters, or pass it via some other way.

Want to learn C++ multithreading in detail? You can find Mastering C++ Multithreading here, or explore all our latest C++ eBooks and videos here.
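Building on the result struct suggested above, here is a minimal sketch (not from the book; names are illustrative) of how the shared vector and thread function could be adapted so that each entry records which thread produced it:

#include <vector>
#include <mutex>

struct result {
    int tid;
    int value;
};

std::mutex results_mtx;
std::vector<result> results;   // replaces the plain integer vector

void storeResult(int tid, int value) {
    // lock the shared vector while appending, exactly as with values_mtx above
    std::lock_guard<std::mutex> lock(results_mtx);
    results.push_back({tid, value});
}

Each thread would call storeResult(tid, val) instead of pushing a bare integer, so the main thread can later match every result to the thread ID that produced it.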

How to Build a music recommendation system with PageRank Algorithm

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box] In today’s tutorial, we will learn to build a recommender with PageRank algorithm. The PageRank algorithm Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm. For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f . The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality. Building a Graph of Frequency Co-occurrence We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors: val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes") val songHashRDD = hashSongsRDD flatMap { hash => hash.songs map { song => ((hash, song), 1) } } val songTfRDD = songHashRDD map { case ((hash,songId),count) => (songId, count) } reduceByKey(_+_) val songTf = sc.broadcast(songTfRDD.collectAsMap()) Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX: implicit class Crossable[X](xs: Traversable[X]) { def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y) val crossSongRDD = songHashRDD.keys .groupByKey() .values .flatMap { songIds => songIds cross songIds filter { case (from, to) => from != to }.map(_ -> 1) }.reduceByKey(_+_) .map { case ((from, to), count) => val weight = count.toDouble / songTfB.value.getOrElse(from, 1) Edge(from, to, weight) }.filter { edge => edge.attr > minSimilarityB.value } val graph = Graph.fromEdges(crossSongRDD, 0L) We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph. Running PageRank Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. 
A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to the newest songs, giving directionality to our edges. In the following pageRank call, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can obviously be tuned for different needs:

val prGraph = graph.pageRank(0.001, 0.15)

Finally, we extract the page-ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class:

case class Song(id: Long, name: String, commonality: Double)

val vertices = prGraph
  .vertices
  .mapPartitions { vertices =>
    val songIds = songIdsB.value
    vertices map { case (songId, pr) =>
      val songName = songIds.get(songId).get
      Song(songId, songName, pr)
    }
  }

vertices.saveAsCassandraTable("gzet", "playlist")

The reader may be wondering what the exact purpose of PageRank is here, and how it can be used as a recommender. In fact, our use of PageRank means that the highest-ranking songs will be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody, or perhaps because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact that they occur more often), meaning that they are more likely to have mass appeal.

On the other end of the spectrum, low-ranking songs are ones for which we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or they are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is pure coincidence or whether this assumption really makes sense, the lowest-ranked song in this particular audio library is Daft Punk's Motherboard; it is quite an original title (a brilliant one, though) with a definitively unique sound.

To summarize, we have learned how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to dive deeper into Spark and deliver other production-grade data science solutions. Read our post on how deep learning is revolutionizing the music industry, and here is how you can analyze big data using the PageRank algorithm.
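As a quick usage sketch (not from the book), once the playlist table has been saved you could pull the most "common" songs back out of Cassandra with the same connector; the keyspace and table names are the ones used above, and the spark-cassandra-connector implicits (com.datastax.spark.connector._) are assumed to be in scope:

// Read the saved playlist and show the ten songs with the highest commonality.
val topSongs = sc.cassandraTable[Song]("gzet", "playlist")
  .sortBy(song => -song.commonality)
  .take(10)

topSongs.foreach(song => println(s"${song.name} -> ${song.commonality}"))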

What Tableau Data Handling Engine has to offer

Amarabha Banerjee
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is taken from the book Mastering Tableau, written by David Baldwin. This book will equip you with all the information needed to create effective dashboards and data visualization solutions using Tableau.[/box] In today’s tutorial, we shall explore the Tableau data handling engine and a real world example of how to use it. Tableau's data-handling engine is usually not well comprehended by even advanced authors because it's not an overt part of day-to-day activities; however, for the author who wants to truly grasp how to ready data for Tableau, this understanding is indispensable. In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in the enterprise. To begin, let's clarify a term. The phrase Data-Handling Engine (DHE) in this context references how Tableau interfaces with and processes data. This interfacing and processing is comprised of three major parts: Connection, Metadata, and VizQL. Each part is described in detail in the following section. In other publications, Tableau's DHE may be referred to as a metadata model or the Tableau infrastructure. I've elected not to use either term because each is frequently defined differently in different contexts, which can be quite confusing. Tableau's DHE (that is, the engine for interfacing with and processing data) differs from other broadly considered solutions in the marketplace. Legacy business intelligence solutions often start with structuring the data for an entire enterprise. Data sources are identified, connections are established, metadata is defined, a model is created, and more. The upfront challenges this approach presents are obvious: highly skilled professionals, time-intensive rollout, and associated high startup costs. The payoff is a scalable, structured solution with detailed documentation and process control. Many next generation business intelligence platforms claim to minimize or completely do away with the need for structuring data. The upfront challenges are minimized: specialized skillsets are not required and the rollout time and associated startup costs are low. However, the initial honeymoon is short-lived, since the total cost of ownership advances significantly when difficulties are encountered trying to maintain and scale the solution. Tableau's infrastructure represents a hybrid approach, which attempts to combine the advantages of legacy business intelligence solutions with those of next-generation platforms, while minimizing the shortcomings of both. The philosophical underpinnings of Tableau's hybrid approach include the following: Infrastructure present in current systems should be utilized when advantageous Data models should be accessible by Tableau but not required DHE components as represented in Tableau should be easy to modify DHE components should be adjustable by business users The Tableau Data-Handling Engine The preceding diagram shows that the DHE consists of a run time module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection. The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views as applicable. The Connection may also include joins, custom SQL, data-source filters, and more. 
In keeping with Tableau's philosophy of "easy to modify" and "adjustable by business users" (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to add an additional table to a join or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed by the Connection, that extract is separate from the Connection.

The next layer of abstraction is the Metadata. The most fundamental aspect of the Metadata layer is the determination of each field as a measure or a dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be used with other Connection layers; that is, the same metadata definitions can be used with different data sources.

VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:

VizQL allows the author to change field attributions on the fly
VizQL enables table calculations

Let's consider the first of these aspects of VizQL via an example.

Changing field attribution example

An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create a worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case. Next, they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension. Studying the SQL generated by VizQL to create this second visualization is particularly insightful:

SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
  AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]

The GROUP BY clause clearly communicates that Infant Mortality Rate is treated as a dimension (for contrast, the SQL you would expect when it is still a measure is sketched at the end of this article). The takeaway is that VizQL enables the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data that is not possible with other tools, and it avoids lengthy exercises attempting to define all possible uses for each field.

If you liked our article, be sure to check out Mastering Tableau, which contains many more useful data visualization and data analysis techniques.
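For contrast, a query of roughly the following shape is what you would expect VizQL to emit for the first worksheet, where Infant Mortality Rate is still treated as a measure and Country is the dimension. This is an illustrative reconstruction, not SQL captured from Tableau:

SELECT ['World Indicators$'].[Country] AS [Country],
  AVG(['World Indicators$'].[Infant Mortality Rate]) AS [avg:Infant Mortality Rate:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Country]

Here the measure is aggregated with AVG() and only the dimension appears in the GROUP BY clause, which is exactly the distinction the Metadata layer and VizQL negotiate on the fly.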

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. 
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On a master, this file should look like this:

enabled = True
host = localhost
port = 5050

On an agent, the port could be different (5051), as follows:

enabled = True
host = localhost
port = 5051

How it works...

Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case Graphite. The default implementation of the Mesos collector does not store all the available metrics, so it's recommended to write a custom handler that collects all the interesting information.

See also...

Metrics can be read from the following endpoints:

http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/
http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/
http://mesos.apache.org/documentation/latest/endpoints/slave/state/

Upgrading Mesos

In this recipe, you will learn how to upgrade your Mesos cluster.

How to do it...

The Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there can be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following command:

rm -rv {MESOS_DIR}/metadata

Here, {MESOS_DIR} should be replaced with the configured Mesos directory. A rolling upgrade is the preferred method of upgrading a cluster, starting with the masters and then the agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, it should be switched to maintenance mode.

How it works...

Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to processes that are already running without those isolators. The Mesos architecture guarantees that executors that are not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout).

Maintenance mode allows you to declare the time window during which an agent will be inaccessible. When this occurs, Mesos sends an inverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down their tasks and spawning them on another agent. Maintenance mode is applied even if a framework does not implement the HTTP API or explicitly declines the inverse offer. Using maintenance mode can prevent tasks from being restarted multiple times. Consider the following example with five agents and one task, X. We schedule a rolling upgrade of all the agents. Task X is deployed on agent 1. When agent 1 goes down, the task is moved to agent 2, then to agent 3, and so on. This approach is extremely inefficient because the task is restarted five times, even though it only needs to be restarted twice. Maintenance mode enables the framework to schedule the task optimally: it can run on agent 5 when agent 1 goes down, and then return to agent 1 when agent 5 goes down.

Figure: worst-case rolling upgrade without maintenance mode versus the optimal rolling upgrade with maintenance mode.

We have learned about running and maintaining Mesos.
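Returning to the monitoring endpoints listed above, a quick way to see the raw JSON that Diamond (or a custom collector) consumes is to pull a snapshot by hand. This sketch assumes a master on localhost:5050 and an agent on localhost:5051, as in the recipes:

# Metrics snapshot from the master
curl -s localhost:5050/metrics/snapshot | python -m json.tool

# Per-executor resource usage statistics from an agent
curl -s localhost:5051/monitor/statistics | python -m json.tool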
To learn more about managing containers and understanding the scheduler API, you may check out the book Apache Mesos Cookbook.

Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box] Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today’s tutorial we will discuss how to implement the various scenarios of hypothesis testing in R. Lower tail test of population mean with known variance The null hypothesis is given by where is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of 30 days' daily return sample is $9.9. Assume the population standard deviation is 0.011. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z which can be computed by the following code in R: > xbar= 9.9 > mu0 = 10 > sig = 1.1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics This gives the value of z the test statistics: [1] -0.4979296 Now let us find out the critical value at 0.05 significance level. It can be computed by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > -z.alpha This gives the following output: [1] -1.644854 Since the value of the test statistics is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10. In place of using the critical value test, we can use the pnorm function to compute the lower tail of Pvalue test statistics. This can be computed by the following code: > pnorm(z) This gives the following output: [1] 0.3092668 Since the Pvalue is greater than 0.05, we fail to reject the null hypothesis. Upper tail test of population mean with known variance The null hypothesis is given by  where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of 30 days' daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 5.1 > mu0 = 5 > sig = .25 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics It gives 2.19089 as the value of test statistics. Now let us calculate the critical value at .05 significance level, which is given by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > z.alpha This gives 1.644854, which is less than the value computed for the test statistics. Hence we reject the null hypothesis claim. Also, the Pvalue of the test statistics is given as follows: >pnorm(z, lower.tail=FALSE) This gives 0.01422987, which is less than 0.05 and hence we reject the null hypothesis. Two-tailed test of population mean with known variance The null hypothesis is given by  where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.5 this year. 
Assume the population standard deviation is 0.1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's at the .05 significance level?

Let us calculate the test statistic z, which can be computed by the following code in R:

> xbar = 1.5
> mu0 = 2
> sig = .1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

This gives the value of the test statistic as -27.38613. Now let us find the critical value range for comparing the test statistic at the .05 significance level. This is given by the following code:

> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)

This gives the values -1.959964 and 1.959964. Since the value of the test statistic is not within the range (-1.959964, 1.959964), we reject the null hypothesis claim that there is no significant difference between this year's returns and last year's at the .05 significance level. The two-tailed p-value is given as follows:

> 2*pnorm(z)

This gives a value less than .05, so we reject the null hypothesis.

In all the preceding scenarios, the variance of the population is known, so we use the normal distribution for hypothesis testing. In the next scenarios, we will not be given the variance of the population, so we will use the t distribution for testing the hypothesis.

Lower tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $1. The average of a 30-day sample of daily returns is $0.9. Assume the sample standard deviation is 0.1. Can we reject the null hypothesis at the .05 significance level?

In this scenario, we can compute the test statistic by executing the following code:

> xbar = .9
> mu0 = 1
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:

xbar: sample mean
mu0: hypothesized value
sig: standard deviation of the sample
n: sample size
t: test statistic

This gives the value of the test statistic as -5.477226. Now let us compute the critical value at the .05 significance level. This is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha

We get the value -1.699127. Since the value of the test statistic is less than the critical value, we reject the null hypothesis claim. Instead of the value of the test statistic, we can also use the p-value associated with it, which is given as follows:

> pt(t, df=n-1)

This results in a value less than .05, so we can reject the null hypothesis claim.

Upper tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $3. The average of a 30-day sample of daily returns is $3.1. Assume the sample standard deviation is 0.2. Can we reject the null hypothesis at the .05 significance level?

Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 3.1
> mu0 = 3
> sig = .2
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

This gives 2.738613 as the value of the test statistic. Now let us find the critical value associated with the .05 significance level for the test statistic.
It is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha

Since the critical value 1.699127 is less than the value of the test statistic, we reject the null hypothesis claim. The p-value associated with the test statistic is given as follows:

> pt(t, df=n-1, lower.tail=FALSE)

This is less than .05, so the null hypothesis claim is rejected.

Two-tailed test of population mean with unknown variance

The null hypothesis is given by H0: μ = μ0, where μ0 is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year was $2, and the average of a 30-day sample of daily returns this year is $1.9. Assume the sample standard deviation is 0.1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's at the .05 significance level?

Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar = 1.9
> mu0 = 2
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

This gives -5.477226 as the value of the test statistic. Now let us find the critical value range for comparison, which is given by the following code:

> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)

This gives the range (-2.04523, 2.04523). Since the value of the test statistic lies outside this range, we reject the claim of the null hypothesis.

We have learned how to perform one-tailed and two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to explore different methods to manage risk and trading using machine learning with R.
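As a side note (not from the book): all of the examples above work from summary statistics. When you have the raw vector of daily returns in R, the built-in t.test() function performs the same unknown-variance tests directly. The returns vector and hypothesized mean below are illustrative:

# Illustrative data: 30 daily returns with mean near 1.9
set.seed(42)
returns <- rnorm(30, mean = 1.9, sd = 0.1)

# Two-tailed test of H0: mu = 2
t.test(returns, mu = 2, alternative = "two.sided")

# Upper tail test of H0: mu <= 2 (alternative: mu > 2)
t.test(returns, mu = 2, alternative = "greater")

The output reports the t statistic, degrees of freedom, and p-value, matching the quantities computed by hand above.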

Getting started with Data Visualization in Tableau

Amarabha Banerjee
13 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an book extract from Mastering Tableau, written by David Baldwin. Tableau has emerged as one of the most popular Business Intelligence solutions in recent times, thanks to its powerful and interactive data visualization capabilities. This book will empower you to become a master in Tableau by exploiting the many new features introduced in Tableau 10.0.[/box] In today’s post, we shall explore data visualization basics with Tableau and explore a real world example using these techniques. Tableau Software has a focused vision resulting in a small product line. The main product (and hence the center of the Tableau universe) is Tableau Desktop. Assuming you are a Tableau author, that's where almost all your time will be spent when working with Tableau. But of course you must be able to connect to data and output the results. Thus, as shown in the following figure, the Tableau universe encompasses data sources, Tableau Desktop, and output channels, which include the Tableau Server family and Tableau Reader: Worksheet and dashboard creation At the heart of Tableau are worksheets and dashboards. Worksheets contain individual visualizations and dashboards contain one or more worksheets. Additionally, worksheets and dashboards may be combined into stories to communicate specific insights to the end user via a presentation environment. Lastly, all worksheets, dashboards, and stories are organized in workbooks that can be accessed via Tableau Desktop, Server, or Reader. In this section, we will look at worksheet and dashboard creation with the intent of not only communicating the basics, but also providing some insight that may prove helpful to even more seasoned Tableau authors. Worksheet creation At the most fundamental level, a visualization in Tableau is created by placing one or more fields on one or more shelves. To state this as a pseudo-equation: Field(s) + shelf(s) = Viz As an example, note that the visualization created in the following screenshot is generated by placing the Sales field on the Text shelf. Although the results are quite simple – a single number – this does qualify as a view. In other words, a Field (Sales) placed on a shelf (Text) has generated a Viz: Exercise – fundamentals of visualizations Let's explore the basics of creating a visualization via an exercise: Navigate to h t t p s ://p u b l i c . t a b l e a u . c o m /p r o f i l e /d a v i d 1. . b a l d w i n #!/ to locate and download the workbook associated with this chapter. In the workbook, find the tab labeled Fundamentals of Visualizations: Locate Region within the Dimensions portion of the Data pane: Drag Region to the Color shelf; that is, Region + Color shelf = what is shown in the following screenshot: Click on the Color shelf and then on Edit Colors… to adjust colors as desired: Next, move Region to the Size, Label/Text, Detail, Columns, and Rows shelves. After placing Region on each shelf, click on the shelf to access additional options. Lastly, choose other fields to drop on various shelves to continue exploring Tableau's behavior. As you continue exploring Tableau's behavior by dragging and dropping different fields onto different shelves, you will notice that Tableau responds with default behaviors. These defaults, however, can be overridden, which we will explore in the following section. Dashboard Creation Although, as stated previously, a dashboard contains one or more worksheets, dashboards are much more than static presentations. 
They are an essential part of Tableau's interactivity. In this section, we will populate a dashboard with worksheets and then deploy actions for interactivity.

Exercise – building a dashboard

1. In the workbook associated with this chapter, navigate to the tab entitled Building a Dashboard.
2. Within the Dashboard pane located on the left-hand portion of the screen, double-click on each of the following worksheets (in the order in which they are listed) to add them to the dashboard:
US Sales
Customer Segment
Scatter Plot
Customers
3. In the lower right-hand corner of the dashboard, click in the blank area below Profit Ratio to select the vertical container. After clicking in the blank area, you should see a blue border around the filter and the legends. This indicates that the vertical container is selected.
4. As shown in the following screenshot, select the vertical container handle and drag it to the left-hand side of the Customers worksheet. Note the gray shading, which communicates where the container will be placed. The gray shading (provided by Tableau when dragging elements such as worksheets and containers onto a dashboard) helpfully communicates where the element will be placed. Take your time and observe carefully when placing an element on a dashboard, or the results may be unexpected.
5. Format the dashboard as desired. The following tips may prove helpful:
1. Adjust the sizes of the elements on the screen by hovering over the edges between each element and then clicking and dragging as desired.
2. Note that the Sales and Profit legends in the following screenshot are floating elements. Make an element float by right-clicking on the element handle and selecting Floating. (See the previous screenshot and note that the handle is located immediately above Region, in the upper-right-hand corner.)
3. Create Horizontal and Vertical containers by dragging those objects from the bottom portion of the Dashboard pane.
4. Drag the edges of containers to adjust the size of each worksheet.
5. Display the dashboard title via Dashboard | Show Title….

If you enjoyed our post, be sure to check out Mastering Tableau, which covers many useful data visualization and data analysis techniques.

Neural Network Architectures 101: Understanding Perceptrons

Kunal Chaudhari
12 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Neural Network Programming with Java Second Edition written by Fabio M. Soares and Alan M. F. Souza. This book is for Java developers who want to master developing smarter applications like weather forecasting, pattern recognition etc using neural networks. [/box] In this article we will discuss about perceptrons along with their features, applications and limitations. Perceptrons are a very popular neural network architecture that implements supervised learning. Projected by Frank Rosenblatt in 1957, it has just one layer of neurons, receiving a set of inputs and producing another set of outputs. This was one of the first representations of neural networks to gain attention, especially because of their simplicity. In our Java implementation, this is illustrated with one neural layer (the output layer). The following code creates a perceptron with three inputs and two outputs, having the linear function at the output layer: int numberOfInputs=3; int numberOfOutputs=2; Linear outputAcFnc = new Linear(1.0); NeuralNet perceptron = new NeuralNet(numberOfInputs,numberOfOutputs,             outputAcFnc); Applications and limitations However, scientists did not take long to conclude that a perceptron neural network could only be applied to simple tasks, according to that simplicity. At that time, neural networks were being used for simple classification problems, but perceptrons usually failed when faced with more complex datasets. Let's illustrate this with a very basic example (an AND function) to understand better this issue. Linear separation The example consists of an AND function that takes two inputs, x1 and x2. That function can be plotted in a two-dimensional chart as follows: And now let's examine how the neural network evolves the training using the perceptron rule, considering a pair of two weights, w1 and w2, initially 0.5, and bias valued 0.5 as well. Assume learning rate η equals 0.2: Epoch x1 x2 w1 w2 b y t E Δw1 Δw2 Δb 1 0 0 0.5 0.5 0.5 0.5 0 -0.5 0 0 -0.1 1 0 1 0.5 0.5 0.4 0.9 0 -0.9 0 -0.18 -0.18 1 1 0 0.5 0.32 0.22 0.72 0 -0.72 -0.144 0 -0.144 1 1 1 0.356 0.32 0.076 0.752 1 0.248 0.0496 0.0496 0.0496 2 0 0 0.406 0.370 0.126 0.126 0 -0.126 0.000 0.000 -0.025 2 0 1 0.406 0.370 0.100 0.470 0 -0.470 0.000 -0.094 -0.094 2 1 0 0.406 0.276 0.006 0.412 0 -0.412 -0.082 0.000 -0.082 2 1 1 0.323 0.276 -0.076 0.523 1 0.477 0.095 0.095 0.095 … … 89 0 0 0.625 0.562 -0.312 -0.312 0 0.312 0 0 0.062 89 0 1 0.625 0.562 -0.25 0.313 0 -0.313 0 -0.063 -0.063 89 1 0 0.625 0.500 -0.312 0.313 0 -0.313 -0.063 0 -0.063 89 1 1 0.562 0.500 -0.375 0.687 1 0.313 0.063 0.063 0.063 After 89 epochs, we find the network to produce values near to the desired output. Since in this example the outputs are binary (zero or one), we can assume that any value produced by the network that is below 0.5 is considered to be 0 and any value above 0.5 is considered to be 1. So, we can draw a function , with the final weights and bias found by the learning algorithm w1=0.562, w2=0.5 and b=-0.375, defining the linear boundary in the chart: This boundary is a definition of all classifications given by the network. You can see that the boundary is linear, given that the function is also linear. Thus, the perceptron network is really suitable for problems whose patterns are linearly separable. The XOR case Now let's analyze the XOR case: We see that in two dimensions, it is impossible to draw a line to separate the two patterns. 
What would happen if we tried to train a single-layer perceptron to learn this function? Suppose we did; let's see what happens in the following table:

Epoch x1 x2 w1 w2 b y t E Δw1 Δw2 Δb
1 0 0 0.5 0.5 0.5 0.5 0 -0.5 0 0 -0.1
1 0 1 0.5 0.5 0.4 0.9 1 0.1 0 0.02 0.02
1 1 0 0.5 0.52 0.42 0.92 1 0.08 0.016 0 0.016
1 1 1 0.516 0.52 0.436 1.472 0 -1.472 -0.294 -0.294 -0.294
2 0 0 0.222 0.226 0.142 0.142 0 -0.142 0.000 0.000 -0.028
2 0 1 0.222 0.226 0.113 0.339 1 0.661 0.000 0.132 0.132
2 1 0 0.222 0.358 0.246 0.467 1 0.533 0.107 0.000 0.107
2 1 1 0.328 0.358 0.352 1.038 0 -1.038 -0.208 -0.208 -0.208
… …
127 0 0 -0.250 -0.125 0.625 0.625 0 -0.625 0.000 0.000 -0.125
127 0 1 -0.250 -0.125 0.500 0.375 1 0.625 0.000 0.125 0.125
127 1 0 -0.250 0.000 0.625 0.375 1 0.625 0.125 0.000 0.125
127 1 1 -0.125 0.000 0.750 0.625 0 -0.625 -0.125 -0.125 -0.125

The perceptron just could not find any pair of weights that would drive the error below 0.625. This can be explained mathematically, and as we already perceived from the chart, this function is not linearly separable in two dimensions. So what if we add another dimension? Let's see the chart in three dimensions:

In three dimensions, it is possible to draw a plane that would separate the patterns, provided that this additional dimension could properly transform the input data. Okay, but now there is an additional problem: how could we derive this additional dimension, since we have only two input variables? One obvious, though workaround-like, answer would be to add a third variable as a derivation from the two original ones. With this third (derived) variable as an input, our neural network would probably take the following shape:

Okay, now the perceptron has three inputs, one of them being a composition of the other two. This also leads to a new question: how should that composition be processed? We can see that this component could act as a neuron, thus giving the neural network a nested architecture. If so, there would be another new question: how would the weights of this new neuron be trained, since the error is on the output neuron?

Multi-layer perceptrons

As we can see, one simple example in which the patterns are not linearly separable has led us to more and more issues with the perceptron architecture. That need led to the application of multi-layer perceptrons. It is already established that natural neural networks are structured in layers as well, and that each layer captures pieces of information from a specific environment. In artificial neural networks, layers of neurons act in this way, by extracting and abstracting information from data, transforming it into another dimension or shape. In the XOR example, we found the solution to be the addition of a third component that would make a linear separation possible. But there remained a few questions regarding how that third component would be computed. Now let's consider the same solution as a two-layer perceptron:

Now we have three neurons instead of just one, but in the output, the information transferred by the previous layer is transformed into another dimension or shape, whereby it would be theoretically possible to establish a linear boundary on those data points. However, the question of finding the weights for the first layer remains unanswered; can we apply the same training rule to neurons other than the output? We are going to deal with this issue in the Generalized delta rule section.
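Before moving on to MLP properties, it may help to see the single-layer training rule from the tables above in executable form. The following is a minimal, illustrative Python sketch (not the book's Java library) of the same delta-style update used in the tables: a linear output, error = target - output, and weight and bias changes of learning rate × error × input. Trained on the AND targets the error shrinks toward zero, on the XOR targets it stays stuck, and adding a derived third input x1*x2 makes XOR learnable by the very same rule:

def train_perceptron(samples, targets, eta=0.2, epochs=200):
    # samples: list of input tuples; targets: desired outputs (0 or 1)
    n_inputs = len(samples[0])
    weights = [0.5] * n_inputs     # same starting point as the tables above
    bias = 0.5
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = sum(w * xi for w, xi in zip(weights, x)) + bias   # linear output
            error = t - y
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias = bias + eta * error
    return weights, bias

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]
xor_targets = [0, 1, 1, 0]

print(train_perceptron(inputs, and_targets))   # settles close to the w1, w2, b values cycled in the AND table
print(train_perceptron(inputs, xor_targets))   # stays stuck: XOR is not linearly separable

# a derived third input (x1*x2) gives the rule the extra dimension it needs
inputs_3d = [(x1, x2, x1 * x2) for x1, x2 in inputs]
print(train_perceptron(inputs_3d, xor_targets))  # the error can now approach zero

This mirrors the conclusion of this section: the limitation is not in the learning rule itself, but in the dimensionality of the input representation.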
MLP properties

Multi-layer perceptrons can have any number of layers and also any number of neurons in each layer. The activation functions may be different on any layer. An MLP network is usually composed of at least two layers: one for the output and one hidden layer. There are also some references that consider the input layer to be the nodes that collect the input data; in those cases, the MLP is considered to have at least three layers. For the purpose of this article, let's consider the input layer as a special type of layer which has no weights, and as the effective layers, that is, those enabled to be trained, we'll consider the hidden and output layers. A hidden layer is called that because it actually hides its outputs from the external world. Hidden layers can be connected in series in any number, thus forming a deep neural network. However, the more layers a neural network has, the slower both training and running will be, and according to mathematical foundations, a neural network with one or two hidden layers at most may learn as well as deep neural networks with dozens of hidden layers, although this depends on several factors.

MLP weights

In an MLP feedforward network, one particular neuron i receives data from a neuron j of the previous layer and forwards its output to a neuron k of the next layer. The mathematical description of a neural network is recursive, taking the general form

yo = fo( Σ i=1..nhl  wi · fi( Σ j  wij · fj( … ) + bi ) )

Here, yo is the network output (should we have multiple outputs, we can replace yo with Y, representing a vector); fo is the activation function of the output; l is the number of hidden layers; nhi is the number of neurons in the hidden layer i; wi is the weight connecting the ith neuron of the last hidden layer to the output; fi is the activation function of the neuron i; and bi is the bias of the neuron i. It can be seen that this equation gets larger as the number of layers increases. In the last summing operation, there will be the inputs xi.

Recurrent MLP

The neurons of an MLP may feed signals not only to neurons in the next layers (a feedforward network), but also to neurons in the same or previous layers (feedback, or recurrent). This behavior allows the neural network to maintain state on some data sequence, and this feature is especially exploited when dealing with time series or handwriting recognition. Recurrent networks are usually harder to train, and eventually the computer may run out of memory while executing them. In addition, there are recurrent network architectures better suited to these tasks than MLPs, such as Elman, Hopfield, Echo State, and bidirectional RNNs (recurrent neural networks), but we are not going to dive deep into these architectures here.

Coding an MLP

Bringing these concepts to the OOP point of view, we can review the classes already designed so far: one can see that the neural network structure is hierarchical. A neural network is composed of layers that are composed of neurons. In the MLP architecture, there are three types of layers: input, hidden, and output. So suppose that, in Java, we would like to define a neural network consisting of three inputs, one output (linear activation function) and one hidden layer (sigmoid function) containing five neurons.
The resulting code would be as follows:

int numberOfInputs=3;
int numberOfOutputs=1;
int[] numberOfHiddenNeurons={5};

Linear outputAcFnc = new Linear(1.0);
Sigmoid hiddenAcFnc = new Sigmoid(1.0);
NeuralNet neuralnet = new NeuralNet(numberOfInputs, numberOfOutputs,
    numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);

To summarize, we saw how perceptrons can be applied to solve linear separation problems, their limitations in classifying nonlinear data, and how to overcome those limitations with multi-layer perceptrons (MLPs). If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition for a better understanding of neural networks and how they fit in different real-world projects.
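To make the idea concrete outside Java, here is a minimal, illustrative NumPy sketch of the forward pass computed by a network with the same 3-5-1 topology (sigmoid hidden layer, linear output). It is not the book's NeuralNet implementation, and the weights below are random placeholders rather than trained values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(5, 3))   # 5 hidden neurons, each with 3 input weights
b_hidden = rng.normal(size=5)
W_output = rng.normal(size=(1, 5))   # 1 output neuron fed by the 5 hidden activations
b_output = rng.normal(size=1)

def forward(x):
    hidden = sigmoid(W_hidden @ x + b_hidden)   # hidden layer: sigmoid activation
    return W_output @ hidden + b_output         # output layer: linear activation

print(forward(np.array([0.5, -1.0, 2.0])))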

Estimating population statistics with Point Estimation

Aaron Lazar
12 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Principles of Data Science, written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.[/box] In this extract, we’ll learn how to estimate population means, variances and other statistics using the Point Estimation method. For the code samples, we’ve used Python 2.7. A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data. For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate. The following code is broken into three parts: We will use the probability distribution, known as the Poisson distribution, to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our "population". We will take a sample of 100 employees (using the Python random sample method) and find a point estimate of a mean (called a sample mean). Compare our sample mean (the mean of the sample of 100 employees) to our population mean. Let's take a look at the following code: np.random.seed(1234) long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000) # represents 3000 people who take about a 60 minute break The long_breaks variable represents 3000 answers to the question: how many minutes on an average do you take breaks for?, and these answers will be on the longer side. Let's see a visualization of this distribution, shown as follows: pd.Series(long_breaks).hist() We see that our average of 60 minutes is to the left of the distribution. Also, because we only sampled 3000 people, our bins are at their highest around 700-800 people. Now, let's model 6000 people who take, on an average, about 15 minutes' worth of breaks. Let's again use the Poisson distribution to simulate 6000 people, as shown: short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000) # represents 6000 people who take about a 15 minute break pd.Series(short_breaks).hist() Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our average break length of 15 minutes falls to the left-hand side of the distribution, and note that the tallest bar is about 1600 people. breaks = np.concatenate((long_breaks, short_breaks)) # put the two arrays together to get our "population" of 9000 people The breaks variable is the amalgamation of all the 9000 employees, both long and short break takers. Let's see the entire distribution of people in a single visualization: pd.Series(breaks).hist() We see how we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further. We can find the total average break length by running the following code: breaks.mean() # 39.99 minutes is our parameter Our average company break length is about 40 minutes. 
Remember that our population is the entire company's employee size of 9,000 people, and our parameter is 40 minutes. In the real world, our goal would be to estimate the population parameter, because we would not have the resources to ask every single employee their average break length in a survey, for many reasons. Instead, we will use a point estimate. So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let's take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

sample_breaks = np.random.choice(a = breaks, size=100)
# taking a sample of 100 employees

Now, let's take the mean of the sample and subtract it from the population mean and see how far off we were:

breaks.mean() - sample_breaks.mean()
# difference between means is 4.09 minutes, not bad!

This is extremely interesting, because with only about 1% of our population (100 out of 9,000), we were able to get within 4 minutes of our population parameter and get a very accurate estimate of our population mean. Not bad!

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values. Let's suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identify as other. We will take a sample of 1,000 employees and see if their race proportions are similar.

employee_races = (["white"]*2000) + (["black"]*1000) + \
    (["hispanic"]*1000) + (["asian"]*3000) + \
    (["other"]*3000)

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%). Let's take a random sample of 1,000 people, as shown (this uses Python's built-in random module, so remember to import random first):

demo_sample = random.sample(employee_races, 1000) # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )

The output obtained would be as follows:

hispanic proportion estimate:
0.103
white proportion estimate:
0.192
other proportion estimate:
0.288
black proportion estimate:
0.1
asian proportion estimate:
0.317

We can see that the race proportion estimates are very close to the underlying population's proportions. For example, we got 10.3% for Hispanic in our sample, and the population proportion for Hispanic was 10%.

To summarize, you should now be familiar with the point estimation method for estimating population means, variances, and other statistics, and with how to implement it in Python. If you found our post useful, you can check out Principles of Data Science for more interesting data science tips and techniques.
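As a small aside that is not part of the book extract, the same proportion estimates can be computed a little more compactly with collections.Counter, which also makes it easy to line each estimate up against the known population proportion:

from collections import Counter

population_proportions = {"white": 0.20, "black": 0.10, "hispanic": 0.10,
                          "asian": 0.30, "other": 0.30}

sample_counts = Counter(demo_sample)          # counts per race in the sample of 1,000
for race, count in sample_counts.items():
    estimate = count / 1000.                  # point estimate of the proportion
    print("{0}: {1} (population: {2})".format(race, estimate, population_proportions[race]))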

Let's build applications for Wear 2.0

Packt
09 Feb 2018
13 min read
In this article, Ashok Kumar S, the author of the book Android Wear Projects, will get you started on writing Android Wear applications. As the title suggests, we will be building wear applications, but you can also expect a little background story on every project and a comprehensive explanation of the components and structure of each application. We will be covering most of the Wear 2.0 standards and development practices in all the projects we build.

Why build wear applications?

The culture of wearing a utility that helps us perform certain actions has always been part of modern civilization. Wrist watches have become an augmented helping tool for checking the time and date: wearing a watch lets you check the time with just a glance. Technology has taken this experience to the next level; the first modern wearable watch, a combination of calculator and watch, was introduced to the world in 1970. Decades of advancement in microprocessors and wireless technology led to the introduction of a concept called "ubiquitous computing", and during this time most of the leading electronics companies and start-ups began working on their own ideas, which made wearable devices very popular. Going forward, we will be building the five projects listed below:

Note taking application
Fitness application
Wear Maps application
Chat messenger
Watch face

(For more resources related to this topic, see here.)

This article will also introduce you to setting up your wear application development environment, best practices for wear application development, and new user interface components, and we will explore Firebase technologies for chatting and notifications in one of the projects. Publishing a wear application to the Play Store follows much the same procedure as publishing a mobile application, with a few small changes. Whether you are writing a wear app for the first time, or you already have a fair bit of knowledge of wear application development and are struggling with how to get started, this article should be a helpful resource.

Note taking application

There are numerous ways to take notes. You could carry a notebook and pen in your pocket, or scribble thoughts on a piece of paper. Or, better yet, you could use your Android Wear device to take notes, so you always have a way to store thoughts even when there is no pen and paper nearby. The note taking app provides convenient access to store and retrieve notes on an Android Wear device. There are many Android smartphone note taking apps that are popular for their simplicity and elegant functionality; for the scope of a wear device, it is necessary to keep the design simple and glanceable. As software developers, we need to understand how important it is to target different device sizes and reach various types of devices. To solve this, the Android Wearable Support Library has a component called BoxInsetLayout, and animated feedback through DelayedConfirmationView is implemented to give the user feedback on task completion. When thinking about good wear application design, Google recommends using dark colors in wear applications for the best battery efficiency; the light color schemes used in typical Material Design mobile applications are not energy efficient on wear devices.
Light colors are less energy efficient on OLED displays: they need to light up the pixels at brighter intensity, and pure white lights up the RGB diodes in a pixel at 100%, so the more white and light color an application uses, the less battery efficient it will be. Custom fonts matter too; in the world of digital design, making your application's visuals easy on users' eyes is important. The Lora font from the Google font collection is a well-balanced contemporary serif with roots in calligraphy. It is a text typeface with moderate contrast, well suited for body text. A paragraph set in Lora makes a memorable appearance because of its brushed curves in contrast with driving serifs, and its overall typographic voice perfectly conveys the mood of a modern-day story or an art essay. Technically, Lora is optimized for screen appearance. We will also give the application list item animations so that users will love to use it often.

Fitness application

We are living in the realm of technology, but we are also living in the realm of intricate lifestyles that drive everyone's health into some sort of illness. Our existence traces back to water; our body composition is roughly sixty percent water. When we talk about taking care of our health, we miss the simple things: taking care of and nurturing ourselves should start with drinking sufficient water, since adequate, regular water consumption helps ensure a good metabolism and healthy, functional organs. Android Wear integrates numerous sensors that can help users measure their heart rate, step counts, and more. Having said that, how about writing an application that reminds us to drink water every thirty minutes, measures our heart rate and step count, and offers a few health tips? Material design will make the application stand out; in this project we will learn the wear navigation drawer and other Material Design components. The application will track step counts through the step counter sensor, check the heart pulse rate with an animated heartbeat projection, and remind the user with hydration alarms, which in turn encourage the user to drink water often.

Wear Maps application

In this project we will build a map application with a quick note taking ability on the layers of the map. We humans travel to different cities, domestic or international, so how about keeping track of the places we have visited? We all use maps for different reasons, but in most cases we use them to plan a particular activity, such as outdoor tours or cycling, and to find the fastest route from a source location to a destination. Fetching the address from latitude and longitude using the Geocoder class is comprehensively explained, and the visual polish a map application needs is also carried out in this project, with the story explained step by step.

Chatting application

We could say that social media has advanced communication and wiped out many of its difficulties.
Just a couple of decades back, the communication medium was letters, and a couple of centuries back, trained birds; looking further back yields plenty more stories about how people used to communicate. Now we are in the generation of IoT, wearable smart devices, and smartphones, where communication happens across the planet in a fraction of a second. We will build a mobile and wear application that exhibits the power of the Google wear messaging APIs to assist us in building a chat application, with a wear companion application to read and respond to the messages being received. To support the chatting process, the article introduces Firebase technologies. We will build the wear and mobile apps together; the message typed on the wear device will be received on the mobile and written to Firebase. The article comprehensively explains the Firebase Realtime Database and, for notifications, introduces Firebase Functions as well. The application will have a user login page, a list of users to chat with, and a chat screen. The project is conceptualized so that readers can leverage all of these techniques and apply the same skills in production applications. The Data Layer establishes the communication channel between two Android nodes, and the article talks about the process in detail: in this project the reader will find out whether the device has Google Play services and, if not, how to install it before using the application. There is also a brief explanation of the Capability API and some of its best use cases in the context of a chatting application. Notifications have always been an important component of a chatting application, letting users know who texted them; here the reader will learn how Firebase Functions can send push notifications. Firebase Functions offers triggers for all the Firebase technologies, and we will explore Realtime Database triggers. The reader will also learn how to work with the input method framework and voice input on the wear device. By the end, the reader will understand the essentials of writing a chat application with a wear companion app.

Watch face

A watch face, also known as the dial, is the part of the clock that displays the time through fixed numbers and moving hands. This expression of checking time can be designed with various artistic approaches and creativity. In this project the reader will start writing their own watch face. The article comprehensively explains CanvasWatchFaceService, with all the callbacks needed to construct a digital watch face, and also covers registering the watch face in the manifest, similar to a WallpaperService class. The watch face is written keeping in mind that it is going to be used on wear devices with different form factors. Wear 2.0 offers a watch face picker for choosing a watch face from the available list, and the reader will understand the elements of both analog and digital watch faces. There are certain common concerns when we talk about watch faces: how the watch face is going to get its data if it has any complications; battery efficiency, which is one of the major concerns when choosing to write a watch face; how network-related operations are handled in the watch face; and which sensors, if any, the watch face uses and how often it accesses them.
Other considerations include custom assets in the watch face, such as complex SVG and graphical animations, and how many CPU and GPU cycles they use. Users like visually attractive watch faces more than simple analog and digital ones, and Android Wear 2.0 allows complications straight from the Wear 2.0 SDK, so developers need not write the logic for fetching the data themselves. The article also talks about interactive watch faces. Trends change all the time, and in Wear 2.0 the new interactive watch faces, which can have unique interactions and style expressions, are a great update; all wear watch face developers might have to start thinking about interactive watch faces. The idea is to make users like and love the watch face by giving them delightful and useful information on a timely basis, which changes the user experience of the watch face. Making a data-integrated watch face is an excellent piece of artistic engineering: deciding what data to express in the watch face and how the time and date are displayed.

More about Wear 2.0

In this article the reader will explore the features that Wear 2.0 offers and will see that Wear 2.0 is a prominent update with plenty of new features bundled in, including Google Assistant, stand-alone applications, new watch faces, and support for third-party complications. Google is working with partner companies to build a powerful ecosystem for wear. Stand-alone applications are a brilliant feature that will create a lot of buzz among wear developers and users; after all, who wants to always carry a phone, or need a paired device, to do some simple tasks? How cool will it be to use wear apps without your phone nearby? There are various scenarios in which wear devices used to be phone dependent: for example, to receive a new email notification, the wear device needed to be connected to the phone for internet access. Now the wear device can independently connect to Wi-Fi and sync all its apps for new updates, so users can complete more tasks with wear apps without a paired phone. The article explains how to identify whether an application is stand-alone or dependent on a companion app, how to install stand-alone applications from the Google Play Store, and other new Wear 2.0 changes, such as watch face complications and the watch face picker, along with a brief look at the storage mechanism of stand-alone wear applications. The wear device still needs to talk to the phone in many use cases, and the article covers advertising the capabilities of a device and retrieving the capable nodes for a requested capability. If the wear app has a companion app, we need to detect the companion app on the phone or on the wear device, and if neither is installed, we can guide the user to the Play Store to install it. Wear 2.0 supports cloud messaging and cloud-based push notifications, and the article explains these notifications comprehensively. Android Wear is evolving in every way: in Wear 1.0, switching between screens used to be tedious and confusing for wear users, but now Google has introduced Material Design and interactive drawers, including the single and multi-page navigation drawer, the action drawer, and more.
Typing on a tiny screen of about one and a half inches is a pretty challenging task for wear users, so Wear 2.0 introduces the input method framework, quick types, and swipes for entering input directly on the wear device. This article is a resourceful journey for those who are planning to take up wear development and the Wear 2.0 standards; everything in it reflects the usual tasks that most developers are trying to accomplish.

Further resources on this subject:
Getting started with Android Development [article]
Building your first Android Wear Application [article]
The Art of Android Development Using Android Studio [article]

Implementing a simple Time Series Data Analysis in R

Amarabha Banerjee
09 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box] In this article, we will explore the popular time series analysis method and its practical implementation using R. Introduction When we think about time, we think about years, days, months, hours, minutes, and seconds. Think of any datasets and you will find some attributes which will be in the form of time, especially data related to stock, sales, purchase, profit, and loss. All these have time associated with them. For example, the price of stock in the stock exchange at different points on a given day or month or year. Think of any industry domain, and sales are an important factor; you can see time series in sales, discounts, customers, and so on. Other domains include but are not limited to statistics, economics and budgets, processes and quality control, finance, weather forecasting, or any kind of forecasting, transport, logistics, astronomy, patient study, census analysis, and the list goes on. In simple words, it contains data or observations in time order, spaced at equal intervals. Time series analysis means finding the meaning in the time-related data to predict what will happen next or forecast trends on the basis of observed values. There are many methods to fit the time series, smooth the random variation, and get some insights from the dataset. When you look at time series data you can see the following: Trend: Long term increase or decrease in the observations or data. Pattern: Sudden spike in sales due to christmas or some other festivals, drug consumption increases due to some condition; this type of data has a fixed time duration and can be predicted for future time also. Cycle: Can be thought of as a pattern that is not fixed; it rises and falls without any pattern. Such time series involve a great fluctuation in data. How to do There are many datasets available with R that are of the time series types. Using the command class, one can know if the dataset is time series or not. We will look into the AirPassengers dataset that shows monthly air passengers in thousands from 1949 to 1960. We will also create new time series to represent the data. Perform the following commands in RStudio or R Console: > class(AirPassengers) Output: [1] "ts" > start(AirPassengers) Output: [1] 1949 1 > end(AirPassengers) Output: [1] 1960 12 > summary(AirPassengers) Output: Min. 1st Qu. Median Mean 3rd Qu. Max. 104.0 180.0 265.5 280.3 360.5 622.0 Analyzing Time Series Data [ 89 ] In the next recipe, we will create the time series and print it out. Let's think of the share price of some company in the range of 2,500 to 4,000 from 2011 to be recorded monthly. 
Perform the following coding in R:

> my_vector = sample(2500:4000, 72, replace=T)
> my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12)
> my_series
Output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147
2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895
2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324
2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732
2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805
2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563

How it works

In the first recipe, we used the AirPassengers dataset with the class function and saw that it is of type ts (ts stands for time series). The start and end functions give the starting and ending year of the dataset, along with the period within the year. The frequency value tells us the interval of observations: 1 means annual, 4 means quarterly, 12 means monthly, and so on.

In the next recipe, we want to generate samples between 2,500 and 4,000 to represent the price of a share. Using the sample function, we can create a sample; it takes the range as the first argument and the number of samples required as the second argument. The last argument decides whether duplication is to be allowed in the sample or not. We stored the sample in my_vector. Then we create a time series using the ts function. The ts function takes the vector as an argument, followed by start and end to show the period for which the time series is being constructed. The frequency argument specifies the number of observations per unit of time to be recorded between the start and the end; here it is 12, one per month.

To summarize, we talked about how R can be utilized to perform time series analysis in different ways. If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.

Explaining Data Exploration in under a minute

Amarabha Banerjee
08 Feb 2018
5 min read
[box type="note" align="" class="" width=""]Below given article is taken from the book Machine Learning with R written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.[/box] Today, we shall explore different data exploration techniques and a real world example of using these techniques. Introduction Data Exploration is a term used for finding insightful information from data. To find insights from data various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, commonly six steps are involved in the exploration process. They are as follows: Asking the right questions: Asking the right questions will help in understanding the objective and target information sought from the data. Questions can be asked such as What are my expected findings after the exploration is finished?, or What kind of information can I extract through the exploration? Data collection: Once the right questions have been asked the target of exploration is cleared. Data collected from various sources is in unorganized and diverse format. Data may come from various sources such as files, databases, internet, and so on. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most of the analysis and visualizing tools or applications expect data to be in a certain format to generate results and hence the raw data is of no use for them. Data munging: Raw data collected needs to be converted into the desired format of the tools to be used. In this phase, raw data is passed through various processes such as parsing the data, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform raw data in the format that the analyzing and visualizing tools understand. Once the data is compatible with the tools, analysis and visualizing tools are used to generate the different results. Basic exploratory data analysis: Once the data munging is done and data is formating for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques to do the same. Most analyzing tools allow statistical functions to be performed on the data. Visualizing tools help in visualizing the data in different ways. Using basic statistical operations and visualizing the same data can be understood in better way. Advanced exploratory data analysis: Once the basic analysis is done it's time to look at an advanced stage of analysis. In this stage, various prediction models are formed on basis of requirement. Machine learning algorithms are utilized to train the model and generate the inferences. Various tuning on the model is also done to ensure correctness and effectiveness of the model. Model assessment: When the models are mare, they are evaluated to find the best model from the given different models. The major factor to decide the best model is to see how perfect or closely it can predict the values. Models are tuned here also for increasing the accuracy and effectiveness. Various plots and graphs are used to see the model’s prediction. Real world example - using Air Quality Dataset Air quality datasets come bundled with R. They contain data about the New York Air Quality Measurements of 1973 for five months from May to September recorded daily. To view all the available datasets use the data() function, it will display all the datasets available with R installation. 
How to do it

Perform the following steps to see all the datasets in R and to use airquality:

> data()
> str(airquality)
Output
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> head(airquality)
Output
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

How it works

The str command is used to display the structure of the dataset; as you can see, it contains observations of the ozone, solar radiation, wind, and temperature attributes recorded each day for five months. Using the head function, you can see the first few lines of the actual data. The dataset is very basic, but it is enough to start processing and analyzing data at an introductory level.

Another source of datasets is the Kaggle website, which hosts various diverse kinds of datasets. Apart from datasets, it also holds many data science competitions for solving real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www.kaggle.com/. Many competitions are organized by large corporations, government agencies, or academia, and many of them have prize money associated with them. The following screenshot shows competitions and prize money. You can simply create an account and start participating in competitions by submitting code and output, which will then be assessed; the assessment or evaluation criteria are available on the detail page of each competition. By participating on Kaggle, one gains experience in solving real-world problems and gets a taste of what data scientists do. On the jobs page, various jobs for data scientists and analysts are listed, and you can apply if a profile suits or matches your interests.

If you liked our post, be sure to check out Machine Learning with R, which consists of more useful machine learning techniques with R.

Building a Linear Regression Model in Python for developers

Pravin Dhandre
07 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.[/box] Let’s start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let's start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style: import numpy as np from sklearn import datasets import seaborn.apionly as sns %matplotlib inline import matplotlib.pyplot as plt sns.set(style='whitegrid', context='notebook') The Iris Dataset It’s time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in a total over five fields. Each line is a measurement of the following: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm The final field is the type of flower (setosa, versicolor, or virginica). Let’s use the load_dataset method to create a matrix of values from the dataset: iris2 = sns.load_dataset('iris') In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value: def covariance (X, Y): xhat=np.mean(X) yhat=np.mean(Y) epsilon=0 for x,y in zip (X,Y): epsilon=epsilon+(x-xhat)*(y-yhat) return epsilon/(len(X)-1) Let's try the implemented function and compare it with the NumPy function. Note that we calculated cov (a,b), and NumPy generated a matrix of all the combinations cov(a,a), cov(a,b), so our result should be equal to the values (1,0) and (0,1) of that matrix: print (covariance ([1,3,4], [1,0,2])) print (np.cov([1,3,4], [1,0,2])) 0.5 [[ 2.33333333   0.5              ] [ 0.5                   1.                ]] Having done a minimal amount of testing of the correlation function as defined earlier, receive two arrays, such as covariance, and use them to get the final value: def correlation (X, Y): return (covariance(X,Y)/(np.std(X,    ddof=1)*np.std(Y,   ddof=1))) ##We have to indicate ddof=1 the unbiased std Let’s test this function with two sample arrays, and compare this with the (0,1) and (1,0) values of the correlation matrix from NumPy: print (correlation ([1,1,4,3], [1,0,2,2])) print (np.corrcoef ([1,1,4,3], [1,0,2,2])) 0.870388279778 [[ 1.                     0.87038828] [ 0.87038828   1.                ]] Getting an intuitive idea with Seaborn pairplot A very good idea when starting worke on a problem is to get a graphical representation of all the possible variable combinations. Seaborn’s pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal. 
Let's look at how this plot type shows all the variables' dependencies, and try to look for a linear relationship as a base to test our regression methods:

sns.pairplot(iris2, size=3.0)
<seaborn.axisgrid.PairGrid at 0x7f8a2a30e828>

Pairplot of all the variables in the dataset.

Let's select two variables that, from our initial analysis, have the property of being linearly dependent. They are petal_width and petal_length:

X=iris2['petal_width']
Y=iris2['petal_length']

Let's now take a look at this variable combination, which shows a clear linear tendency:

plt.scatter(X,Y)

This is the representation of the chosen variables in a scatter type graph: this is the current distribution of data that we will try to model with our linear prediction function.

Creating the prediction function

First, let's define the function that will abstractly represent the modeled data, in the form of a linear function, with the form y=beta*x+alpha:

def predict(alpha, beta, x_i):
    return beta * x_i + alpha

Defining the error function

It's now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or L1), or measuring a variant of the square of the difference (or L2). Let's define both versions, including the first formulation inside the second:

def error(alpha, beta, x_i, y_i): #L1
    return y_i - predict(alpha, beta, x_i)

def sum_sq_e(alpha, beta, x, y): #L2
    return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y))

Correlation fit

Now, we will define a function implementing the correlation method to find the parameters for our regression:

def correlation_fit(x, y):
    beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x, ddof=1)
    alpha = np.mean(y) - beta * np.mean(x)
    return alpha, beta

Let's then run the fitting function and print the guessed parameters:

alpha, beta = correlation_fit(X, Y)
print(alpha)
print(beta)

1.08355803285
2.22994049512

Let's now graph the regressed line with the data in order to intuitively show the appropriateness of the solution:

plt.scatter(X,Y)
xr=np.arange(0,3.5)
plt.plot(xr,(xr*beta)+alpha)

This is the final plot we will get with our recently calculated slope and intercept:

Final regressed line

Polynomial regression and an introduction to underfitting and overfitting

When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it's possible that we are building a model that is good for the training data, but too optimized for that particular subset of data. Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model. In the following example, we will work on the same problem as before, using the scikit-learn library to search for higher-order polynomials to fit the incoming data with increasingly complex degrees.
Going beyond the normal threshold of a quadratic function, we will see how the function tries to fit every wrinkle in the data, but when we extrapolate, the values outside the training range clearly diverge:

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

ix=iris2['petal_width']
iy=iris2['petal_length']

# generate points used to represent the fitted function
x_plot = np.linspace(0, 2.6, 100)

# create matrix versions of these arrays
X = ix[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

plt.scatter(ix, iy, s=30, marker='o', label="training points")
for count, degree in enumerate([3, 6, 20]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, iy)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

plt.legend(loc='upper left')
plt.show()

The combined graph shows how the different polynomials' coefficients describe the data population in different ways. The degree-20 polynomial clearly adjusts perfectly to the trained dataset, and beyond the known values it diverges almost spectacularly, going against the goal of generalizing for future data.

Curve fitting of the initial dataset, with polynomials of increasing degree

With this, we successfully explored how to develop an efficient linear regression model in Python and how you can make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and definite functions. If you enjoyed our post, you must check out Machine Learning for Developers to uncover advanced tools for building machine learning applications at your fingertips.
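As a quick, optional sanity check that is not part of the book's code, SciPy's linregress function should recover essentially the same slope and intercept as the correlation_fit function defined above:

from scipy.stats import linregress

# linregress returns the slope, intercept, correlation coefficient, p-value and standard error
slope, intercept, r_value, p_value, std_err = linregress(X, Y)
print(slope, intercept)    # expected to be very close to beta and alpha from correlation_fit
print(r_value ** 2)        # coefficient of determination of the linear fit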