Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-getting-started-with-linear-and-logistic-regression

03 Jan 2018

7 min read

Getting started with Linear and logistic regression

03 Jan 2018

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials - Second Edition. This book provides the fundamentals of data science with Python by leveraging the latest tools and libraries such as Jupyter notebooks, NumPy, pandas and scikit-learn.[/box] In this article, we will learn about two easy and effective classifiers known as linear and logistic regressors. Linear and logistic regressions are the two methods that can be used to linearly predict a target value or a target class, respectively. Let's start with an example of linear regression predicting a target value. In this article, we will again use the Boston dataset, which contains 506 samples, 13 features (all real numbers), and a (real) numerical target (which renders it ideal for regression problems). We will divide our dataset into two sections by using a train/test split cross- validation to test our methodology (in the example, 80 percent of our dataset goes in training and 20 percent in test): In: from sklearn.datasets import load_boston boston = load_boston() from sklearn.cross_validation import train_test_split X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=0) The dataset is now loaded and the train/test pairs have been created. In the next few steps, we're going to train and fit the regressor in the training set and predict the target variable in the test dataset. We are then going to measure the accuracy of the regression task by using the MAE score. As for the scoring function, we decided on the mean absolute error in order to penalize errors just proportionally to the size of the error itself (using the more common mean squared error would have emphasized larger errors more, since errors are squared): In: from sklearn.linear_model import LinearRegression regr = LinearRegression() regr.fit(X_train, Y_train) Y_pred = regr.predict(X_test) from sklearn.metrics import mean_absolute_error print ("MAE", mean_absolute_error(Y_test, Y_pred)) Out: MAE 3.84281058945 Great! We achieved our goal in the simplest possible way. Now, let's take a look at the time needed to train the system: In: %timeit regr.fit(X_train, y_train) Out: 1000 loops, best of 3: 381 µs per loop That was really quick! The results, of course, are not all that great. However, linear regression offers a very good trade-off between performance and speed of training and simplicity. Now, let's take a look under the hood of the algorithm. Why is it so fast but not that accurate? The answer is somewhat expected-this is so because it's a very simple linear method. Let's briefly dig into a mathematical explanation of this technique. Let's name X(i) the ith sample (it is actually a row vector of numerical features) and Y(i) its target. The goal of linear regression is to find a good weight (column) vector W, which is best suited for approximating the target value when multiplied by the observation vector, that is, X(i) * W ≈ Y(i) (note that this is a dot product). W should be the same, and the best for every observation. Thus, solving the following equation becomes easy: W can be found easily with the help of a matrix inversion (or, more likely, a pseudo- inversion, which is a computationally efficient way) and a dot product. Here's the reason linear regression is so fast. Note that this is a simplistic explanation—the real method adds another virtual feature to compensate for the bias of the process. However, this does not change the complexity of the regression algorithm much. We progress now to logistic regression. In spite of what the name suggests, it is a classifier and not a regressor. It must be used in classification problems where you are dealing with only two classes (binary classification). Typically, target labels are Boolean; that is, they have values as either True/False or 0/1 (indicating the presence or absence of the expected outcome). In our example, we keep on using the same dataset. The target is to guess whether a house value is over or under the average of a threshold value we are interested in. In essence, we moved from a regression problem to a binary classification one because now our target is to guess how likely an example is to be a part of a group. We start preparing the dataset by using the following commands: In: import numpy as np avg_price_house = np.average(boston.target) high_priced_idx = (Y_train >= avg_price_house) Y_train[high_priced_idx] = 1 Y_train[np.logical_not(high_priced_idx)] = 0 Y_train = Y_train.astype(np.int8) high_priced_idx = (Y_test >= avg_price_house) Y_test[high_priced_idx] = 1 Y_test[np.logical_not(high_priced_idx)] = 0 Y_test = Y_test.astype(np.int8) Now, we will train and apply the classifier. To measure its performance, we will simply print the classification report: In: from sklearn.linear_model import LogisticRegression clf = LogisticRegression() clf.fit(X_train, Y_train) Y_pred = clf.predict(X_test) from sklearn.metrics import classification_report print (classification_report(Y_test, Y_pred)) Out: precision recall f1-score support 0 0.81 0.90 0.85 61 1 0.82 0.68 0.75 41 avg / total 0.83 0.81 0.81 102 The output of this command can change on your machine depending on the optimization process of the LogisticRegression classifier (no seed has been set for replicability of the results). The precision and recall values are over 80 percent. This is already a good result for a very simple method. The training speed is impressive, too. Thanks to Jupyter Notebook, we can have a comparison of the algorithm with a more advanced classifier in terms of performance and speed: In: %timeit clf.fit(X_train, y_train) 100 loops, best of 3: 2.54 ms per loop What's under the hood of a logistic regression? The simplest classifier a person could imagine (apart from a mean) is a linear regressor followed by a hard threshold: Here, sign(a) = +1 if a is greater or equal than zero, and 0 otherwise. To smooth down the hardness of the threshold and predict the probability of belonging to a class, logistic regression resorts to the logit function. Its output is a (0 to 1] real number (0.0 and 1.0 are attainable only via rounding, otherwise the logit function just tends toward them), which indicates the probability that the observation belongs to class 1. Using a formula, that becomes: Here Why the logistic function instead of some other function? Well, because it just works pretty well in most real cases. In the remaining cases, if you're not completely satisfied with its results, you may want to try some other nonlinear functions instead (there is limited variety of suitable ones, though). To summarize, we learned about two classic algorithms used in machine learning namely linear and logistic regression. With the help of an example, we put the theory into practice by predicting a target value which helped us understand the trade-offs and benefits. If you enjoyed this excerpt, check out the book Python Data Science Essentials - Second Edition to know more about other popular machine learning algorithms such as Naive Bayes, k-Nearest Neighbors (kNN), Support Vector Machines (SVM) etc.

0
0
14524

article-image-2018-new-year-resolutions-algorithmic-world-part-1-of-3

Sugandha Lahoti

03 Jan 2018

6 min read

2018 new year resolutions to thrive in an Algorithmic World - Part 1 of 3

Sugandha Lahoti

03 Jan 2018

6 min read

We often think of Data science and machine learning as skills essential to a niche group of researchers, data scientists, and developers. But the world as we know today revolves around data and algorithms, just as it used to revolve around programming a decade back. As data science and algorithms get integrated into all aspects of businesses across industries, data science like Microsoft Excel will become ubiquitous and will serve as a handy tool which makes you better at your job no matter what your job is. Knowing data science is key to having a bright career in this algoconomy (algorithm driven economy). If you are big on new year resolutions, make yourself a promise to carve your place in the algorithm-powered world by becoming data science savvy. Follow these three resolutions to set yourself up for a bright data-driven career. Get the foundations right: Start with the building blocks of data science, i.e. developing your technical skills. Stay relevant: Keep yourself updated on the latest developments in your field and periodically invest in reskilling and upskilling. Be mindful of your impact: Finally, always remember that your work has real-world implications. Choose your projects wisely and your project goals, hypotheses, and contributors with even more care. In this three-part series, we expand on how data professionals could go about achieving these three resolutions. But the principles behind the ideas are easily transferable to anyone in any job. Think of them as algorithms that can help you achieve your desired professional outcome! You simply need to engineer the features and fine-tune the hyperparameters specific to your industry and job role. 1st Resolution: Learn the building blocks of data science If you are interested in starting a career in data science or in one that involves data, here is a simple learning roadmap for you to develop your technical skills. Start off with learning a data-friendly programming language, one that you find easy and interesting. Next, brush up your statistics skills. Nothing fancy, just your high school math and stats would do nicely. Next, learn about algorithms - what they do, what questions they answer, how many types are there and how to write one. Finally, you can put all that learning to practice by building models on top of your choice of Machine Learning framework. Now let’s see, how you can accomplish each of these tasks 1. Learn Python or any another popular data friendly programming language you find interesting (Learning period: 1 week - 2 months) If you see yourself as a data scientist in the near future, knowing a programming language is one of the first things to check off your list. We suggest you learn a data-friendly programming language like Python or R. Python is a popular choice because of its strong, fast, and easy computational capabilities for the Data Science workflow. Moreover, because of a large and active community, the likelihood of finding someone in your team or your organization who knows Python is quite high, which is an added advantage. “Python has become the most popular programming language for data science because it allows us to forget about the tedious parts of programming and offers us an environment where we can quickly jot down our ideas and put concepts directly into action.” - Sebastian Raschka We suggest learning the basics from the book Learn Python in 7 days by Mohit, Bhaskar N. Das. Then you can move on to learning Python specifically for data science with Python Data Science Essentials by Alberto Boschetti. Additionally, you can learn R, which is a highly useful language when it comes to statistics and data. For learning R, we recommend R Data science Essentials by Raja B. Koushik. You can learn more about how Python and R stand against each other in the data science domain here. Although R and Python are the most popular choices for new developers and aspiring data scientists, you can also use Java for data science, if that is your cup of tea. Scala is another alternative. 2. Brush up on Statistics (Learning period: 1 week - 3 weeks) While you are training your programming muscle, we recommend that you brush through basic mathematics (probability and statistics). Remember, you already know everything to get started with data science from your high school days. You just need to refresh your memory with a little practice. A good place to start is to understand concepts like standard deviation, probability, mean, mode, variance, kurtosis among others. Now, your normal high-school books should be enough to get started, however, an in-depth understanding is required to leverage the power of data science. We recommend the book Statistics for Data Science by James D. Miller for this. 3. Learn what machine learning algorithms do and which ones to learn (Learning period: 1 month - 3 months) Machine Learning is a powerful tool to make predictions based on huge amounts of data. According to a recent study, in the next ten years, ML algorithms are expected to replace a quarter of the jobs across the world, in fields like transport, manufacturing, architecture, healthcare and many others. So the next step in your data science journey is learning about machine learning algorithms. There are new algorithms popping up almost every day. We’ve collated a list of top ten algorithms that you should learn to effectively design reliable and robust ML systems. But fear not, you don’t need to know all of them to get started. Start with some basic algorithms that are majorly used in the real world applications like linear regression, naive bayes, and decision trees. 4. Learn TensorFlow, Keras, or any other popular machine learning framework (Learning period: 1 month - 3 months) After you have familiarized yourself with some of the machine learning algorithms, it is time you put that learning to practice by building models based on those algorithms. While there are many cloud-based machine learning options that have click-based model building features available, the best way to learn a skill is to get your hands dirty. There is a growing range of frameworks that make it easy to build complex models while allowing for high degrees of customization. Here is a list of top 10 deep learning frameworks at your disposal to choose from. Our favorite pick is TensorFlow. It’s Python-based, backed by Google, has a very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can find a comprehensive list of books for learning Tensorflow here. We also recommend learning Keras, which is a good option if you have some knowledge of Python programming and want to get started with deep learning. Try the book Deep Learning with Keras, by Antonio Gulli and Sujit Pal, to get you started. If you find learning from multiple sources daunting, just learn from Sebastian Raschka’s Python machine learning book. Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. Check out part 2 where we explore how you could about doing this in a systematic and time efficient manner. In part 3, we look at ways you can own your work and become aware of its outcome.

0
0
10111

article-image-popular-data-sources-and-models-in-sap-analytics-cloud

Kunal Chaudhari

03 Jan 2018

12 min read

Popular Data sources and models in SAP Analytics Cloud

Kunal Chaudhari

03 Jan 2018

12 min read

0
0
17659

article-image-write-first-blockchain-learning-solidity-programming-15-minutes

Aaron Lazar

03 Jan 2018

15 min read

Write your first Blockchain: Learning Solidity Programming in 15 minutes

Aaron Lazar

03 Jan 2018

15 min read

[box type="note" align="" class="" width=""]This post is a book extract from the title Mastering Blockchain, authored by Imran Bashir. The book begins with the technical foundations of blockchain, teaching you the fundamentals of cryptography and how it keeps data secure.[/box] Our article aims to quickly get you up to speed with Blockchain development using the Solidity Programming language. Introducing solidity Solidity is a domain-specific language of choice for programming contracts in Ethereum. There are, however, other languages, such as serpent, Mutan, and LLL but solidity is the most popular at the time of writing this. Its syntax is closer to JavaScript and C. Solidity has evolved into a mature language over the last few years and is quite easy to use, but it still has a long way to go before it can become advanced and feature-rich like other well established languages. Nevertheless, this is the most widely used language available for programming contracts currently. It is a statically typed language, which means that variable type checking in solidity is carried out at compile time. Each variable, either state or local, must be specified with a type at compile time. This is beneficial in the sense that any validation and checking is completed at compile time and certain types of bugs, such as interpretation of data types, can be caught earlier in the development cycle instead of at run time, which could be costly, especially in the case of the blockchain/smart contracts paradigm. Other features of the language include inheritance, libraries, and the ability to define composite data types. Solidity is also a called contract-oriented language. In solidity, contracts are equivalent to the concept of classes in other object-oriented programming languages. Types Solidity has two categories of data types: value types and reference types. Value types These are explained in detail here. Boolean This data type has two possible values, true or false, for example: bool v = true; This statement assigns the value true to v. Integers This data type represents integers. A table is shown here, which shows various keywords used to declare integer data types. For example, in this code, note that uint is an alias for uint256: uint256 x; uint y; int256 z; These types can also be declared with the constant keyword, which means that no storage slot will be reserved by the compiler for these variables. In this case, each occurrence will be replaced with the actual value: uint constant z=10+10; State variables are declared outside the body of a function, and they remain available throughout the contract depending on the accessibility assigned to them and as long as the contract persists. Address This data type holds a 160-bit long (20 byte) value. This type has several members that can be used to interact with and query the contracts. These members are described here: Balance The balance member returns the balance of the address in Wei. Send This member is used to send an amount of ether to an address (Ethereum's 160-bit address) and returns true or false depending on the result of the transaction, for example, the following: address to = 0x6414cc08d148dce9ebf5a2d0b7c220ed2d3203da; address from = this; if (to.balance < 10 && from.balance > 50) to.send(20); Call functions The call, callcode, and delegatecall are provided in order to interact with functions that do not have Application Binary Interface (ABI). These functions should be used with caution as they are not safe to use due to the impact on type safety and security of the contracts. Array value types (fixed size and dynamically sized byte arrays) Solidity has fixed size and dynamically sized byte arrays. Fixed size keywords range from bytes1 to bytes32, whereas dynamically sized keywords include bytes and strings. bytes are used for raw byte data and string is used for strings encoded in UTF-8. As these arrays are returned by the value, calling them will incur gas cost. length is a member of array value types and returns the length of the byte array. An example of a static (fixed size) array is as follows: bytes32[10] bankAccounts; An example of a dynamically sized array is as follows: bytes32[] trades; Get length of trades: trades.length; Literals These are used to represent a fixed value. Integer literals Integer literals are a sequence of decimal numbers in the range of 0-9. An example is shown as follows: uint8 x = 2; String literals String literals specify a set of characters written with double or single quotes. An example is shown as follows: 'packt' "packt” Hexadecimal literals Hexadecimal literals are prefixed with the keyword hex and specified within double or single quotation marks. An example is shown as follows: (hex'AABBCC'); Enums This allows the creation of user-defined types. An example is shown as follows: enum Order{Filled, Placed, Expired }; Order private ord; ord=Order.Filled; Explicit conversion to and from all integer types is allowed with enums. Function types There are two function types: internal and external functions. Internal functions These can be used only within the context of the current contract. External functions External functions can be called via external function calls. A function in solidity can be marked as a constant. Constant functions cannot change anything in the contract; they only return values when they are invoked and do not cost any gas. This is the practical implementation of the concept of call as discussed in the previous chapter. The syntax to declare a function is shown as follows: function <nameofthefunction> (<parameter types> <name of the variable>) {internal|external} [constant] [payable] [returns (<return types> <name of the variable>)] Reference types As the name suggests, these types are passed by reference and are discussed in the following section. Arrays Arrays represent a contiguous set of elements of the same size and type laid out at a memory location. The concept is the same as any other programming language. Arrays have two members named length and push: uint[] OrderIds; Structs These constructs can be used to group a set of dissimilar data types under a logical group. These can be used to define new types, as shown in the following example: Struct Trade { uint tradeid; uint quantity; uint price; string trader; } Data location Data location specifies where a particular complex data type will be stored. Depending on the default or annotation specified, the location can be storage or memory. This is applicable to arrays and structs and can be specified using the storage or memory keywords. As copying between memory and storage can be quite expensive, specifying a location can be helpful to control the gas expenditure at times. Calldata is another memory location that is used to store function arguments. Parameters of external functions use calldata memory. By default, parameters of functions are stored in memory, whereas all other local variables make use of storage. State variables, on the other hand, are required to use storage. Mappings Mappings are used for a key to value mapping. This is a way to associate a value with a key. All values in this map are already initialized with all zeroes, for example, the following: mapping (address => uint) offers; This example shows that offers is declared as a mapping. Another example makes this clearer: mapping (string => uint) bids; bids["packt"] = 10; This is basically a dictionary or a hash table where string values are mapped to integer values. The mapping named bids has a packt string value mapped to value 10. Global variables Solidity provides a number of global variables that are always available in the global namespace. These variables provide information about blocks and transactions. Additionally, cryptographic functions and address-related variables are available as well. A subset of available functions and variables is shown as follows: keccak256(...) returns (bytes32) This function is used to compute the keccak256 hash of the argument provided to the Function: ecrecover(bytes32 hash, uint8 v, bytes32 r, bytes32 s) returns (address) This function returns the associated address of the public key from the elliptic curve signature: block.number This returns the current block number. Control structures Control structures available in solidity are if - else, do, while, for, break, continue, return. They work in a manner similar to how they work in C-language or JavaScript. Events Events in solidity can be used to log certain events in EVM logs. These are quite useful when external interfaces are required to be notified of any change or event in the contract. These logs are stored on the blockchain in transaction logs. Logs cannot be accessed from the contracts but are used as a mechanism to notify change of state or the occurrence of an event (meeting a condition) in the contract. In a simple example here, the valueEvent event will return true if the x parameter passed to function Matcher is equal to or greater than 10: contract valueChecker { uint8 price=10; event valueEvent(bool returnValue); function Matcher(uint8 x) returns (bool) { if (x>=price) { valueEvent(true); return true; } } } Inheritance Inheritance is supported in solidity. The is keyword is used to derive a contract from another contract. In the following example, valueChecker2 is derived from the valueChecker contract. The derived contract has access to all nonprivate members of the parent contract: contract valueChecker { uint8 price=10; event valueEvent(bool returnValue); function Matcher(uint8 x) returns (bool) { if (x>=price) { valueEvent(true); return true; } } } contract valueChecker2 is valueChecker { function Matcher2() returns (uint) { return price + 10; } } In the preceding example, if uint8 price = 10 is changed to uint8 private price = 10, then it will not be accessible by the valuechecker2 contract. This is because now the member is declared as private, it is not allowed to be accessed by any other contract. Libraries Libraries are deployed only once at a specific address and their code is called via CALLCODE/DELEGATECALL Opcode of the EVM. The key idea behind libraries is code reusability. They are similar to contracts and act as base contracts to the calling contracts. A library can be declared as shown in the following example: library Addition { function Add(uint x,uint y) returns (uint z) { return x + y; } } This library can then be called in the contract, as shown here. First, it needs to be imported and it can be used anywhere in the code. A simple example is shown as follows: Import "Addition.sol" function Addtwovalues() returns(uint) { return Addition.Add(100,100); } There are a few limitations with libraries; for example, they cannot have state variables and cannot inherit or be inherited. Moreover, they cannot receive Ether either; this is in contrast to contracts that can receive Ether. Functions Functions in solidity are modules of code that are associated with a contract. Functions are declared with a name, optional parameters, access modifier, optional constant keyword, and optional return type. This is shown in the following example: function orderMatcher(uint x) private constant returns(bool returnvalue) In the preceding example, function is the keyword used to declare the function. orderMatcher is the function name, uint x is an optional parameter, private is the access modifier/specifier that controls access to the function from external contracts, constant is an optional keyword used to specify that this function does not change anything in the contract but is used only to retrieve values from the contract instead, and returns (bool returnvalue) is the optional return type of the function. How to define a function: The syntax of defining a function is shown as follows: function <name of the function>(<parameters>) <visibility specifier> returns (<return data type> <name of the variable>) { <function body> } Function signature: Functions in solidity are identified by its signature, which is the first four bytes of the keccak-256 hash of its full signature string. This is also visible in browser solidity, as shown in the following screenshot. D99c89cb is the first four bytes of 32 byte keccak-256 hash of the function named Matcher. In this example function, Matcher has the signature hash of d99c89cb. This information is useful in order to build interfaces. Input parameters of a function: Input parameters of a function are declared in the form of <data type> <parameter name>. This example clarifies the concept where uint x and uint y are input parameters of the checkValues function: contract myContract { function checkValues(uint x, uint y) { } } Output parameters of a function: Output parameters of a function are declared in the form of <data type> <parameter name>. This example shows a simple function returning a uint value: contract myContract { Function getValue() returns (uint z) { z=x+y; } } A function can return multiple values. In the preceding example function, getValue only returns one value, but a function can return up to 14 values of different data types. The names of the unused return parameters can be omitted optionally. Internal function calls: Functions within the context of the current contract can be called internally in a direct manner. These calls are made to call the functions that exist within the same contract. These calls result in simple JUMP calls at the EVM byte code level. External function calls: External function calls are made via message calls from a contract to another contract. In this case, all function parameters are copied to the memory. If a call to an internal function is made using the this keyword, it is also considered an external call. The this variable is a pointer that refers to the current contract. It is explicitly convertible to an address and all members for a contract are inherited from the address. Fall back functions: This is an unnamed function in a contract with no arguments and return data. This function executes every time ether is received. It is required to be implemented within a contract if the contract is intended to receive ether; otherwise, an exception will be thrown and ether will be returned. This function also executes if no other function signatures match in the contract. If the contract is expected to receive ether, then the fall back function should be declared with the payable modifier. The payable is required; otherwise, this function will not be able to receive any ether. This function can be called using the address.call() method as, for example, in the following: function () { throw; } In this case, if the fallback function is called according to the conditions described earlier; it will call throw, which will roll back the state to what it was before making the call. It can also be some other construct than throw; for example, it can log an event that can be used as an alert to feed back the outcome of the call to the calling application. Modifier functions: These functions are used to change the behavior of a function and can be called before other functions. Usually, they are used to check some conditions or verification before executing the function. _(underscore) is used in the modifier functions that will be replaced with the actual body of the function when the modifier is called. Basically, it symbolizes the function that needs to be guarded. This concept is similar to guard functions in other languages. Constructor function: This is an optional function that has the same name as the contract and is executed once a contract is created. Constructor functions cannot be called later on by users, and there is only one constructor allowed in a contract. This implies that no overloading functionality is available. Function visibility specifiers (access modifiers): Functions can be defined with four access specifiers as follows: External: These functions are accessible from other contracts and transactions. They cannot be called internally unless the this keyword is used. Public: By default, functions are public. They can be called either internally or using messages. Internal: Internal functions are visible to other derived contracts from the parent contract. Private: Private functions are only visible to the same contract they are declared in. Other important keywords/functions throw: throw is used to stop execution. As a result, all state changes are reverted. In this case, no gas is returned to the transaction originator because all the remaining gas is consumed. Layout of a solidity source code file Version pragma In order to address compatibility issues that may arise from future versions of the solidity compiler version, pragma can be used to specify the version of the compatible compiler as, for example, in the following: pragma solidity ^0.5.0 This will ensure that the source file does not compile with versions smaller than 0.5.0 and versions starting from 0.6.0. Import Import in solidity allows the importing of symbols from the existing solidity files into the current global scope. This is similar to import statements available in JavaScript, as for example, in the following: Import "module-name"; Comments Comments can be added in the solidity source code file in a manner similar to C-language. Multiple line comments are enclosed in /* and */, whereas single line comments start with //. An example solidity program is as follows, showing the use of pragma, import, and comments: To summarize, we went through a brief introduction to the solidity language. Detailed documentation and coding guidelines are available online. If you found this article useful, and would like to learn more about building blockchains, go ahead and grab the book Mastering Blockchain, authored by Imran Bashir.

0
0
16842

article-image-getting-started-with-web-scraping-with-beautiful-soup

Richa Tripathi

02 Jan 2018

5 min read

Getting started with Web Scraping with Beautiful Soup

Richa Tripathi

02 Jan 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials - Second Edition. This book will take you beyond the fundamentals of data science by diving into the world of data visualizations, web development, and deep learning.[/box] In this article, we will learn to scrape a web page and learn to download it manually. This process happens more often than you might expect; and it's a very popular topic of interest in data science. For example: Financial institutions scrape the Web to extract fresh details and information about the companies in their portfolio. Newspapers, social networks, blogs, forums, and corporate websites are the ideal targets for these analyses. Advertisement and media companies analyze sentiment and the popularity of many pieces of the Web to understand people's reactions. Companies specialized in insight analysis and recommendation scrape the Web to understand patterns and model user behaviors. Comparison websites use the web to compare prices, products, and services, offering the user an updated synoptic table of the current situation. Unfortunately, understanding websites is a very hard work since each website is built and maintained by different people, with different infrastructures, locations, languages, and structures. The only common aspect among them is represented by the standard exposed language, which, most of the time, is HTML. That's why the vast majority of the web scrapers, available as of today, are only able to understand and navigate HTML pages in a general-purpose way. One of the most used web parsers is named BeautifulSoup. It's written in Python, and it's very stable and simple to use. Moreover, it's able to detect errors and pieces of malformed code in the HTML page (always remember that web pages are often human-made products and prone to errors). A complete description of Beautiful Soup would require an entire book; here we will see just a few bits. First at all, Beautiful Soup is not a crawler. In order to download a web page, we should use the urllib library, for example. Let's now download the code behind the William Shakespeare page on Wikipedia: In: import urllib.request url = 'https://en.wikipedia.org/wiki/William_Shakespeare' request = urllib.request.Request(url) response = urllib.request.urlopen(request) It's time to instruct Beautiful Soup to read the resource and parse it using the HTML parser: In: from bs4 import BeautifulSoup soup = BeautifulSoup(response, 'html.parser') Now the soup is ready, and can be queried. To extract the title, we can simply ask for the title attribute: In: soup.title Out: <title>William Shakespeare - Wikipedia, the free encyclopedia</title> As you can see, the whole title tag is returned, allowing a deeper investigation of the nested HTML structure. What if we want to know the categories associated with the Wikipedia page of William Shakespeare? It can be very useful to create a graph of the entry, simply recurrently downloading and parsing adjacent pages. We should first manually analyze the HTML page itself to figure out what's the best HTML tag containing the information we're looking for. Remember here the “no free lunch” theorem in data science: there are no auto- discovery functions, and furthermore, things can change if Wikipedia modifies its format. After a manual analysis, we discover that categories are inside a div named “mw-normal- catlinks”; excluding the first link, all the others are okay. Now it's time to program. Let's put into code what we've observed, printing for each category, the title of the linked page and the relative link to it: In: section = soup.find_all(id='mw-normal-catlinks')[0] for catlink in section.find_all("a")[1:]: print(catlink.get("title"), "->", catlink.get("href")) Out: Category:William Shakespeare -> /wiki/Category:William_Shakespeare Category:1564 births -> /wiki/Category:1564_births Category:1616 deaths -> /wiki/Category:1616_deaths Category:16th-century English male actors -> /wiki/Category:16th- century_English_male_actors Category:English male stage actors -> /wiki/Category:English_male_stage_actors Category:16th-century English writers -> /wiki/Category:16th- century_English_writers We've used the find_all method twice to find all the HTML tags with the text contained in the argument. In the first case, we were specifically looking for an ID; in the second case, we were looking for all the "a" tags. Given the output then, and using the same code with the new URLs, it's possible to download recursively the Wikipedia category pages, arriving at this point at the ancestor categories. A final note about scraping: always remember that this practice is not always allowed, and when so, remember to tune down the rate of the download (at high rates, the website's server may think you're doing a small-scale DoS attack and will probably blacklist/ban your IP address). For more information, you can read the terms and conditions of the website, or simply contact the administrators. Downloading data from various sites where there are copyright laws in place is bound to get you into real legal trouble. That’s also why most companies that employ web scraping use external vendors for this task, or have a special arrangement with the site owners. We learned the process of data scraping, its practical use case in the industries and finally we put the theory to practice by learning how to scrape data from web with the help of Beautiful Soup. If you enjoyed this excerpt, check out the book Python Data Science Essentials - Second Edition to know more about different libraries such as pandas and NumPy, that can provide you with all the tools to load and effectively manage your data

0
0
15617

article-image-create-treemap-packed-bubble-chart-tableau

Sugandha Lahoti

02 Jan 2018

4 min read

How to create a Treemap and Packed Bubble Chart in Tableau

Sugandha Lahoti

02 Jan 2018

4 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shweta Sankhe-Savale titled Tableau Cookbook – Recipes for Data Visualization. This cookbook has simple recipes for creating visualizations in Tableau. It covers the fundamentals of data visualization such as getting familiarized with Tableau Desktop and also goes to more complex problems like creating dynamic analytics with parameters, and advanced calculations.[/box] In today’s tutorial, we will learn how to create a Treemap and a packed Bubble chart in Tableau. Treemaps Treemaps are useful for representing hierarchical (tree-structured) data as a part-to-whole relationship. It shows data as a set of nested rectangles, and each branch of the tree is given a rectangle, which represents the amount of data it comprises. These can then be further divided into smaller rectangles that represent sub branches, based on its proportion to the whole. We can show information via the color and size of the rectangles and find out patterns that would be difficult to spot in other ways. They make efficient use of the space and hence can display a lot of items in a single visualization simultaneously. Getting Ready We will create a Treemap to show the sales and profit across various product subcategories. Let's see how to create a Treemap. How to Do it We will first create a new sheet and rename it as Treemap. Next, we will drag Sales from the Measures pane and drop it into the Size shelf. We will then drag Profit from Measures pane and drop it into the Color shelf. Our Mark type will automatically change to show squares. Refer to the following image: 5. Next, we will drop Sub-Category into the Label shelf in the Marks card, and we will get the output as shown in the following image: How it Works In the preceding image, since we have placed Sales in the Size shelf, we are inferring this: the greater the size, the higher the sales value; the smaller the size, the smaller the sales value. Since the Treemap is sorted in descending order of Size, we will see the biggest block in the top left-hand side corner and the smaller block in the bottom right-hand side corner. Further, we placed Profit in the Color shelf. There are some subcategories where the profit is negative and hence Tableau selects the orange/blue diverging color. Thus, when the color blue is the darkest, it indicates the Most profit. However, the orange color indicates that a particular subcategory is in a loss scenario. So, in the preceding chart, Phones has the maximum number of sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable. Packed Bubble Charts A Packed bubble chart is a cluster of circles where we use dimensions to define individual bubbles, and the size and/or color of the individual circles represent measures. Bubble charts have many benefits and one of them is to let us spot categories easily and compare them to the rest of the data by looking at the size of the bubble. This simple data visualization technique can provide insight in a visually attractive format. The Packed Bubble chart in Tableau uses the Circle mark type. Getting Ready To create a packed bubble chart, we will continue with the same example that we saw in the Treemap recipe. In the following section, we will see how we can convert the Treemap we created earlier into a Packed Bubble chart. How to Do it Let us duplicate the Tree Map sheet name and rename it to Packed Bubble chart. Next, change the marks from Square to Circle from the Marks dropdown in the Marks card. The output will be as shown in the following image: How it works In the Packed Bubble chart, there is no specific sort of order for Bubbles. The size and/or color are what defines the chart; the bigger or darker the circle, the greater the value. So, in the preceding example, we have Sales in the Size shelf, Profit in the Color shelf, and Sub-Category in the Label shelf. Thus, when we look at it, we understand that Phones has the most sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable even though the size indicates that the sales are fairly good. We saw two ways to visualize data by using Treemap and Packed Bubble chart types in Tableau. If you found this post is useful, do check out the book Tableau Cookbook – Recipes for Data Visualization to create more such charts, interactive dashboards and other beautiful data visualizations with Tableau.

0
0
10426

article-image-6-popular-regression-techniques-need-know

Amey Varangaonkar

02 Jan 2018

8 min read

6 Popular Regression Techniques You Need To Know

Amey Varangaonkar

02 Jan 2018

8 min read

0
0
5480

article-image-create-box-whisker-plot-tableau

Sugandha Lahoti

30 Dec 2017

5 min read

How to create a Box and Whisker Plot in Tableau

Sugandha Lahoti

30 Dec 2017

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shweta Sankhe-Savale, titled Tableau Cookbook – Recipes for Data Visualization. With the recipes in this book, learn to create beautiful data visualizations in no time on Tableau.[/box] In today’s tutorial, we will learn how to create a Box and Whisker plot in Tableau. The Box plot, or Box and Whisker plot as it is popularly known, is a convenient statistical representation of the variation in a statistical population. It is a great way of showing a number of data points as well as showing the outliers and the central tendencies of data. This visual representation of the distribution within a dataset was first introduced by American mathematician John W. Tukey in 1969. A box plot is significantly easier to plot than say a histogram and it does not require the user to make assumptions regarding the bin sizes and number of bins; and yet it gives significant insight into the distribution of the dataset. The box plot primarily consists of four parts: The median provides the central tendency of our dataset. It is the value that divides our dataset into two parts, values that are either higher or lower than the median. The position of the median within the box indicates the skewness in the data as it shifts either towards the upper or lower quartile. The upper and lower quartiles, which form the box, represent the degree of dispersion or spread of the data between them. The difference between the upper and lower quartile is called the Interquartile Range (IQR) and it indicates the mid-spread within which 50 percentage of the points in our dataset lie. The upper and lower whiskers in a box plot can either be plotted at the maximum and minimum value in the dataset, or 1.5 times the IQR on the upper and lower side. Plotting the whiskers at the maximum and minimum values includes 100 percentage of all values in the dataset including all the outliers. Whereas plotting the whiskers at 1.5 times the IQR on the upper and lower side represents outliers in the data beyond the whiskers. The points lying between the lower whisker and the lower quartile are the lower 25 percent of values in the dataset, whereas the points lying between the upper whisker and the upper quartile are the upper 25 percent of values in the dataset. In a typical normal distribution, each part of the box plot will be equally spaced. However, in most cases, the box plot will quickly show the underlying variations and trends in data and allows for easy comparison between datasets: Getting Ready Create a Box and Whisker plot in a new sheet in a workbook. For this purpose, we will connect to an Excel file named Data for Box plot & Gantt chart, which has been uploaded on https://1drv.ms/f/ s!Av5QCoyLTBpnhkGyrRrZQWPHWpcY. Let us save this Excel file in Documents | My Tableau Repository | Datasources | Tableau Cookbook data folder. The data contains information about customers in terms of their gender and recorded weight. The data contains 100 records, one record per customer. Using this data, let us look at how we can create a Box and Whisker plot. How to do it Once we have downloaded and saved the data from the link provided in the Getting ready section, we will create a new worksheet in our existing workbook and rename it to Box and Whisker plot. Since we haven't connected to the new dataset yet, establish a new data connection by pressing Ctrl + D on our keyboard. Select the Excel option and connect to the Data for Box plot & Gantt chart file, which is saved in our Documents | My Tableau Repository | Datasources | Tableau Cookbook data folder. Next let us select the table named Box and Whisker plot data by doubleclicking on it. Let us go ahead with the Live option to connect to this data. Next let us multi-select the Customer and Gender field from the Dimensions pane and the Weight from the Measures pane by doing a Ctrl + Select. Refer to the following image: 6. Next let us click on the Show Me! button and select the box-and-whisker plot. Refer to the highlighted section in the following image: 7. Once we click on the box-and-whisker plot option, we will see the following view: How it works In the preceding chart, we get two box and whisker plots: one for each gender. The whiskers are the maximum and minimum extent of the data. Furthermore, in each category we can see some circles, which are essentially representing a customer. Thus, within each gender category, the graph is showing the distribution of customers by their respective weights. When we hover over any of these circles, we can see details of the customer in terms of name, gender, and recorded weight in the tooltip. Refer to the following image: However, when we hover over the box (gray section), we will see the details in terms of median, lower quartiles, upper quartiles, and so on. Refer to the following image: Thus, a summary of the box plot that we created is as follows: In more simple terms, for the female category, the majority of the population lies between the weight range of 44 to 75, whereas for the male category, the majority of the population lies between the weight range of 44 to 82. Please note that in our visualization, even though the Row shelf displays SUM(Weight), since we have Customer in the Detail shelf, there's only one entry per customer, so SUM(Weight) is actually the same as MIN(Weight), MAX(Weight), or AVG(Weight). We learnt the basics of Box and Whisker plot and how to create them using Tableau. If you had fun with this recipe, do check out our book Tableau Cookbook – Recipes for Data Visualization to create interactive dashboards and beautiful data visualizations with Tableau.

0
0
7888

article-image-2017-generative-adversarial-networks-gans-research-milestones

Savia Lobo

30 Dec 2017

9 min read

2017 Generative Adversarial Networks (GANs) Research Milestones

Savia Lobo

30 Dec 2017

9 min read

Generative Adversarial Models, introduced by Ian Goodfellow, are the next big revolution in the field of deep learning. Why? Because of their ability to perform semi-supervised learning where there is a vast majority of data is unlabelled. Here, GANs can efficiently carry out image generation tasks and other tasks such as converting sketches to an image, conversion of satellite images to a map, and many other tasks. GANs are capable of generating realistic images in any circumstances, for instance, giving some text written in a particular handwriting as an input to the generative model in order to generate more texts in the similar handwriting. The speciality of these GANs is that as compared to discriminative models, these generative models make use of a joint distribution probability to generate more likely samples. In short, these generative models or GANs are an improvisation to the discriminative models. Let’s explore some of the research papers that are contributing to further advancements in GANs. CycleGAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks This paper talks about CycleGANs, a class of generative Adversarial networks that carry out Image-to-Image translation. This means, capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples. CycleGANs method can also be applied in variety of applications such as Collection Style Transfer, Object Transfiguration, season transfer and photo enhancement. Cycle GAN architecture Source: GitHub CycleGANs are built upon the advantages of PIX2PIX architecture. The key advantage of CycleGANs model is, it allows to point the model at two discrete, unpaired collection of images.For example, one image collection say Group A, would consist photos of landscapes in summer whereas Group B would include photos of landscapes in winter. The CycleGAN model can learn to translate the images between these two aesthetics without the need to merge tightly correlated matches together into a single X/Y training image. Source: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks The way CycleGANs are able to learn such great translations without having explicit X/Y training images involves introducing the idea of a full translation cycle to determine how good the entire translation system is, thus improving both generators at the same time. Source: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks Currently, the applications of CycleGANs can be seen in Image-to-Image translation and video translations. For example they can be seen used in Animal Transfiguration, Turning portrait faces into doll faces, and so on. Further ahead, we could potentially see its implementations in audio, text, etc., would help us in generating new data for training. Although this method has compelling results, it also has some limitations The geometric changes within an image are not fully successful (for instance, the cat to dog transformation showed minute success). This could be caused by the generator architecture choices, which are tailored for good performance on the appearance changes. Thus, handling more varied and extreme transformations, especially geometric changes, is an important problem. Failure caused by the distribution characteristics of the training datasets. For instance, in the horse to zebra transfiguration, the model got confused as it was trained on the wild horse and zebra synsets of ImageNet, which does not contain images of a person riding a horse or zebra. These and some other limitations are described in the research paper. To read more about CycleGANs in detail visit the link here. Wasserstein GAN In this paper, we get an exposure to Wasserstein GANs and how they overcomes the drawbacks in original GANs. Although GANs have shown a drastic success in realistic image generation, the training however is not that easy as the process is slow and unstable. In the paper proposed for WGANs, it is empirically shown that WGANs cure the training problem. Wasserstein distance, also known as Earth Mover’s (EM) distance, is a measure of distance between two probability distributions. The basic idea in WGAN is to replace the loss function so that there always exists a non-zero gradient. This can be done using Wasserstein distance between the generator distribution and the data distribution. Training these WGANs does not require keeping a balance in training of the discriminator and the generator. It also doesn’t require a design of the network architecture too. One of the most fascinating practical benefits of WGANs is the ability to continuously estimate the EM distance by training the discriminator to an optimal level. The learning curves when used for plotting are useful for debugging and hyperparameter searches. These curves also correlate well with the observed sample quality and improved stability of the optimization process. Thus, Wasserstein GANs are an alternative to traditional GAN training with features such as: Improvement in the stability of learning Elimination of problems like mode collapse Provide meaningful learning curves useful for debugging and hyperparameter searches Furthermore, the paper also showcases that the corresponding optimization problem is sound, and provides extensive theoretical work highlighting the deep connections to other distances between distributions. The Wasserstein GAN has been utilized to train a language translation machine. The condition here is that there is no parallel data between the word embeddings between the two languages. Wasserstein GANs have been used to perform English-Russian and English-Chinese language mappings. Limitations of WGANs: WGANs suffer from unstable training at times, when one uses a momentum based optimizer or when one uses high learning rates. Includes slow convergence after weight clipping, especially when clipping window is too large. It also suffers from the vanishing gradient problem when the clipping window is too small. To have a detailed understanding of WGANs have a look at the research paper here. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets This paper describes InfoGAN (Information-theoretic extension to the Generative Adversarial Network). It can learn disentangled representations in a completely unsupervised manner. In traditional GANs, learned dataset is entangled i.e. encoded in a complex manner within the data space. However, if the representation is disentangled, it would be easy to implement and easy to apply tasks on it. InfoGAN solves the entangled data problem in GANs. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, extracts poses of objects correctly irrespective of the lighting conditions within the 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hairstyles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. InfoGAN does not require any kind of supervision. In comparison to InfoGAN, the only other unsupervised method that learns disentangled representations is hossRBM, a higher-order extension of the spike-and-slab restricted Boltzmann machine which disentangles emotion from identity on the Toronto Face Dataset. However, hossRBM can only disentangle discrete latent factors, and its computation cost grows exponentially in the number of factors. Whereas, InfoGAN can disentangle both discrete and continuous latent factors, scale to complicated datasets, and typically requires no more training time than regular GAN. In the experiments given in the paper, firstly the comparison of InfoGAN with prior approaches on relatively clean datasets is shown. Another experiment shown is, where InfoGAN can learn interpretable representations on complex datasets (here no previous unsupervised approach is known to learn representations of comparable quality.) Thus, InfoGAN is completely unsupervised and learns interpretable and disentangled representations on challenging datasets. Additionally, InfoGAN adds only negligible computation cost on top of GAN and is easy to train. The core idea of using mutual information to induce representation can be applied to other methods like VAE (Variational AutoEncoder) in future. The other possibilities with InfoGAN in future could be,learning hierarchical latent representations, improving semi-supervised learning with better codes, and using InfoGAN as a high-dimensional data discovery tool. To know more about this research paper in detail, visit the link given here. Progressive growing of GANs for improved Quality, Stability, and Variation This paper describes a brand new method for training your Generative Adversarial Networks. The basic idea here is to train both the generator and the discriminator progressively. This means, starting from a low resolution and adding new layers so that the model increases in providing images with finer details as training progresses. Such a method speeds up the training and also stabilizes it to a greater extent, which in turn produces images of unprecedented quality. For instance, a higher quality version of the CELEBA images dataset that provides output resolutions up to 10242 pixels. Source: https://arxiv.org/pdf/1710.10196.pdf When new layers are added to the networks, they fade in smoothly. This helps in avoiding the sudden shocks to the already well-trained, smaller resolution layers. Also, the progressive training has various other benefits. The generation of smaller images is substantially more stable because there is less class information and fewer modes By increasing the resolution little by little, we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. 10242 images Progressive growing of GANs also reduces the training time. In addition to this, most of the iterations are done at lower resolutions, and the quality of the result obtained is upto 2-6 times faster, depending on the resolution of the final output. Thus, by progressively training GANs results into better quality, stability, and variation in images. This may also lead to true photorealism in near future. The paper concludes with the fact that, though there are certain limitations with this training method, which include semantic sensibility and understanding dataset-dependent constraints(such as certain objects being straight rather than curved). This leaves a lot to be desired from GANs and there is also room for improvement in the micro-structure of the images. To have a thorough understanding of this research paper, read the paper here.

0
0
8824

article-image-format-publish-code-using-r-markdown

Savia Lobo

29 Dec 2017

6 min read

How to format and publish code using R Markdown

Savia Lobo

29 Dec 2017

6 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ahmed Sherif titled Practical Business Intelligence. This book is a complete guide for implementing Business intelligence with the help of powerful tools like D3.js, R, Tableau, Qlikview and Python that are available on the market. It starts off by preparing you for data analytics and then moves on to teach you a range of techniques to fetch important information from various databases.[/box] Today you will explore how to use R Markdown, which is a format that allows reproducible reports with embedded R code that can be published into slideshows, Word documents, PDF files, and HTML web pages. Getting started with R Markdown R Markdown documents have the .RMD extension and are created by selecting R Markdown from the menu bar of RStudio, as seen here: If this is the first time you are creating an R Markdown report, you may be prompted to install some additional packages for R Markdown to work, as seen in the following screenshot: Once the packages are installed, we can define a title, author, and default output format, as follows: For our purposes we will use HTML output. The default output of the R Markdown document will appear like this: R Markdown features and components We can go ahead and delete everything below line 7 in the previous screenshot as we will create our own template with our embedded code and formatting. Header levels can be generated using # in front of a title. The largest font size will have a single # and each subsequent # added will decrease the header level font. Whenever we wish to embed actual R code into the report, we can include it inside of a code chunk by clicking on the icon shown here: Once that icon is selected, a shaded region is created between two ``` characters where R code can be generated identical to that used in RStudio. The first header generated will be for the results, and then the subsequent header will indicate the libraries used to generate the report. This can be generated using the following script: # Results ###### Libraries used are RODBC, plotly, and forecast Executing R code inside of R Markdown The next step is to run the actual R code inside of the chunk snippet that calls the required libraries needed to generate the report. This can be generated using the following script: ```{r} # We will not see the actual libraries loaded # as it is not necessary for the end user library('RODBC') library('plotly') library('forecast') ``` We can then click on the Knit HTML icon on the menu bar to generate a preview of our code results in R Markdown. Unfortunately, this output of library information is not useful to the end user. Exporting tips for R Markdown The report output includes all the messages and potential warnings that are the result of calling a package. This is not information that is useful to the report consumer. Fortunately for R developers, these types of messages can be concealed by tweaking the R chunk snippets to include the following logic in their script: ```{r echo = FALSE, results = 'hide', message = FALSE} ``` We can continue embedding R code into our report to run queries against the SQL Server database and produce summary data of the dataframe as well as the three main plots for the time series plot, observed versus fitted smoothing, and Holt-Winters forecasting: ###### Connectivity to Data Source is through ODBC ```{r echo = FALSE, results = 'hide', message = FALSE} connection_SQLBI<-odbcConnect("SQLBI") #Get Connection Details connection_SQLBI ##query fetching begin## SQL_Query_1<-sqlQuery(connection_SQLBI, 'SELECT [WeekInYear] ,[DiscountCode] FROM [AdventureWorks2014].[dbo].[DiscountCodebyWeek]' ) ##query fetching end## #begin table manipulation colnames(SQL_Query_1)<- c("Week", "Discount") SQL_Query_1$Weeks <- as.numeric(SQL_Query_1$Week) SQL_Query_1<-SQL_Query_1[,-1] #removes first column SQL_Query_1<-SQL_Query_1[c(2,1)] #reverses columns 1 and 2 #end table manipulation ``` ### Preview of First 6 rows of data ```{r echo = FALSE, message= FALSE} head(SQL_Query_1) ``` ### Summary of Table Observations ```{r echo = FALSE, message= FALSE} str(SQL_Query_1) ``` ### Time Series and Forecast Plots ```{r echo = FALSE, message= FALSE} Query1_TS<-ts(SQL_Query_1$Discount) par(mfrow=c(3,1)) plot.ts(Query1_TS, xlab = 'Week (1-52)', ylab = 'Discount', main = 'Time Series of Discount Code by Week') discountforecasts <- HoltWinters(Query1_TS, beta=FALSE, gamma=FALSE) plot(discountforecasts) discountforecasts_8periods <- forecast.HoltWinters(discountforecasts, h=8) plot.forecast(discountforecasts_8periods, ylab='Discount', xlab = 'Weeks (1-60)', main = 'Forecasting 8 periods') ``` The final output Before publishing the output with the results, R Markdown offers the developer opportunities to prettify the end product. One effect I like to add to a report is a logo of some kind. This can be done by applying the following code to any line in R Markdown: ![](http://website.com/logo.jpg) # image is on a website ![](images/logo.jpg) # image is locally on your machine The first option adds an image from a website, and the second option adds an image locally. For my purposes, I will add a PacktPub logo right above the Results section in the R Markdown, as seen in the following screenshot: To learn more about customizing an R Markdown document, visit the following website: http://rmarkdown.rstudio.com/authoring_basics.html. Once we are ready to preview the results of the R Markdown output, we can once again select the Knit to HTML button on the menu. The new report can be seen in this screenshot: As can be seen in the final output, even if the R code is embedded within the R Markdown document, we can suppress the unnecessary technical output and reveal the relevant tables, fields, and charts that will provide the most benefit to end users and report consumers. If you have enjoyed reading this article and want to develop the ability to think along the right lines and use more than one tool to perform analysis depending on the needs of your business, do check out Practical Business Intelligence.

0
0
5215

article-image-saving-backups-on-cloud-services-with-elasticsearch-plugins

Savia Lobo

29 Dec 2017

9 min read

Saving backups on cloud services with ElasticSearch plugins

Savia Lobo

29 Dec 2017

9 min read

[box type="note" align="" class="" width=""]Our article is a book excerpt taken from Mastering Elasticsearch 5.x. written by Bharvi Dixit. This book guides you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. In other words, you will gain all the knowledge necessary to master Elasticsearch and put into efficient use.[/box] This article will explain how Elasticsearch, with the help of additional plugins, allows us to push our data outside of the cluster to the cloud. There are three possibilities where our repository can be located, at least using officially supported plugins: The S3 repository: AWS The HDFS repository: Hadoop clusters The GCS repository: Google cloud services The Azure repository: Microsoft's cloud platform Let's go through these repositories to see how we can push our backup data on the cloud services. The S3 repository The S3 repository is a part of the Elasticsearch AWS plugin, so to use S3 as the repository for snapshotting, we need to install the plugin first on every node of the cluster and each node must be restarted after the plugin installation: sudo bin/elasticsearch-plugin install repository-s3 After installing the plugin on every Elasticsearch node in the cluster, we need to alter their configuration (the elasticsearch.yml file) so that the AWS access information is available. The example configuration can look like this: cloud: aws: access_key: YOUR_ACCESS_KEY secret_key: YOUT_SECRET_KEY To create the S3 repository that Elasticsearch will use for snapshotting, we need to run a command similar to the following one: curl -XPUT 'http://localhost:9200/_snapshot/my_s3_repository' -d '{ "type": "s3", "settings": { "bucket": "bucket_name" } }' The following settings are supported when defining an S3-based repository: bucket: This is the required parameter describing the Amazon S3 bucket to which the Elasticsearch data will be written and from which Elasticsearch will read the data. region: This is the name of the AWS region where the bucket resides. By default, the US Standard region is used. base_path: By default, Elasticsearch puts the data in the root directory. This parameter allows you to change it and alter the place where the data is placed in the repository. server_side_encryption: By default, encryption is turned off. You can set this parameter to true in order to use the AES256 algorithm to store data. chunk_size: By default, this is set to 1GB and specifies the size of the data chunk that will be sent. If the snapshot size is larger than the chunk_size, Elasticsearch will split the data into smaller chunks that are not larger than the size specified in the chunk_size. The chunk size can be specified in size notations such as 1GB, 100mb, and 1024kB. buffer_size: The size of this buffer is set to 100mb by default. When the chunk size is greater than the value of the buffer_size, Elasticsearch will split it into buffer_size fragments and use the AWS multipart API to send it. The buffer size cannot be set lower than 5 MB because it disallows the use of the multipart API. endpoint: This defaults to AWS's default S3 endpoint. Setting a region overrides the endpoint setting. protocol: Specifies whether to use http or https. It default to cloud.aws.protocol or cloud.aws.s3.protocol. compress: Defaults to false and when set to true. This option allows snapshot metadata files to be stored in a compressed format. Please note that index files are already compressed by default. read_only: Makes a repository to be read only. It defaults to false. max_retries: This specifies the number of retries Elasticsearch will take before giving up on storing or retrieving the snapshot. By default, it is set to 3. In addition to the preceding properties, we are allowed to set two additional properties that can overwrite the credentials stored in elasticsearch.yml, which will be used to connect to S3. This is especially handy when you want to use several S3 repositories, each with its own security settings: access_key: This overwrites cloud.aws.access_key from elasticsearch.yml secret_key: This overwrites cloud.aws.secret_key from elasticsearch.yml Note: AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch instances reside in a private subnet in an AWS VPC then all traffic to S3 will go through that VPC's NAT instance. If your VPC's NAT instance is a smaller instance size (for example, a t1.micro) or is handling a high volume of network traffic, your bandwidth to S3 may be limited by that NAT instance's networking bandwidth limitations. So, if you running your Elasticsearch cluster inside a VPC then make sure that you are using instances with a high networking bandwidth and there is no network congestion. Note: Instances residing in a public subnet in an AWS VPC will connect to S3 via the VPC's Internet gateway and not be bandwidth limited by the VPC's NAT instance. The HDFS repository If you use Hadoop and its HDFS (http://wiki.apache.org/hadoop/HDFS) filesystem, a good alternative to back up the Elasticsearch data is to store it in your Hadoop cluster. As with the case of S3, there is a dedicated plugin for this. To install it, we can use the following command: sudo bin/elasticsearch-plugin install repository-hdfs Note : The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If your Hadoop distribution is not protocol compatible with Apache Hadoop, you can replace the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required). Note: Even if Hadoop is already installed on the Elasticsearch nodes, for security reasons, the required libraries need to be placed under the plugin folder. Note that in most cases, if the distribution is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files. After installing the plugin on each node in the cluster and restarting every node, we can use the following command to create a repository in our Hadoop cluster: curl -XPUT 'http://localhost:9200/_snapshot/es_hdfs_repository' -d '{ "type": "hdfs" "settings": { "uri": "hdfs://namenode:8020/", "path": "elasticsearch_snapshots/es_hdfs_repository" } }' The available settings that we can use are as follows: uri: This is a required parameter that tells Elasticsearch where HDFS resides. It should have a format like hdfs://HOST:PORT/. path: This is the information about the path where snapshot files should be stored. It is a required parameter. load_default: This specifies whether the default parameters from the Hadoop configuration should be loaded and set to false if the reading of the settings should be disabled. This setting is enabled by default. chunk_size: This specifies the size of the chunk that Elasticsearch will use to split the snapshot data. If you want the snapshotting to be faster, you can use smaller chunks and more streams to push the data to HDFS. By default, it is disabled. conf.<key>: This is an optional parameter and tells where a key is in any Hadoop argument. The value provided using this property will be merged with the configuration. As an alternative, you can define your HDFS repository and its settings inside the elasticsearch.yml file of each node as follows: repositories: hdfs: uri: "hdfs://<host>:<port>/" path: "some/path" load_defaults: "true" conf.<key> : "<value>" compress: "false" chunk_size: "10mb" The Azure repository Just like Amazon S3, we are able to use a dedicated plugin to push our indices and metadata to Microsoft cloud services. To do this, we need to install a plugin on every node of the cluster, which we can do by running the following command: sudo bin/elasticsearch-plugin install repository-azure The configuration is also similar to the Amazon S3 plugin configuration. Our elasticsearch.yml file should contain the following section: cloud: azure: storage: my_account: account: your_azure_storage_account key: your_azure_storage_key Do not forget to restart all the nodes after installing the plugin. After Elasticsearch is configured, we need to create the actual repository, which we do by running the following command: curl -XPUT 'http://localhost:9200/_snapshot/azure_repository' -d '{ "type": "azure" }' The following settings are supported by the Elasticsearch Azure plugin: account: Microsoft Azure account settings to be used. container: As with the bucket in Amazon S3, every piece of information must reside in the container. This setting defines the name of the container in the Microsoft Azure space. The default value is elasticsearch-snapshots. base_path: This allows us to change the place where Elasticsearch will put the data. By default, the value for this setting is empty which causes Elasticsearch to put the data in the root directory. compress: This defaults to false and when enabled it allows us to compress the metadata files during the snapshot creation. chunk_size: This is the maximum chunk size used by Elasticsearch (set to 64m by default, and this is also the maximum value allowed). You can change it to change the size when the data should be split into smaller chunks. You can change the chunk size using size value notations such as, 1g, 100m, or 5k. An example of creating a repository using the settings follows: curl -XPUT "http://localhost:9205/_snapshot/azure_repository" -d' { "type": "azure", "settings": { "container": "es-backup-container", "base_path": "backups", "chunk_size": "100m", "compress": true } }' The Google cloud storage repository Similar to Amazon S3 and Microsoft Azure, we can use a GCS repository plugin for snapshotting and restoring of our indices. The settings for this plugin are almost similar to other cloud plugins. To know how to work with the Google cloud repository plugin please refer to the following URL: https://www.elastic.co/guide/en/elasticsearch/plugins/5.0/repository-gcs.htm Thus, in the article we learn how to carry out backup of your data from Elasticsearch clusters to the cloud, i.e. within different cloud repositories by making use of the additional plugin options with Elasticsearch. If you found our excerpt useful, you may explore other interesting features and advanced concepts of Elasticsearch 5.x like aggregation, index control, sharding, replication, and clustering in the book Mastering Elasticsearch 5.x.

0
1
29343

article-image-getting-to-know-and-manipulate-tensors-in-tensorflow

Sunith Shetty

29 Dec 2017

5 min read

Getting to know and manipulate Tensors in TensorFlow

Sunith Shetty

29 Dec 2017

5 min read

[box type="note" align="" class="" width=""]This article is a book excerpt written by Rodolfo Bonnin, titled Building Machine Learning Projects with TensorFlow. In this book, you will learn to build powerful machine learning projects to tackle complex data for gaining valuable insights. [/box] Today, you will learn everything about Tensors, their properties and how they are used to represent data. What are tensors? TensorFlow bases its data management on tensors. Tensors are concepts from the field of mathematics, and are developed as a generalization of the linear algebra terms of vectors and matrices. Talking specifically about TensorFlow, a tensor is just a typed, multidimensional array, with additional operations, modeled in the tensor object. Tensor properties - ranks, shapes, and types TensorFlow uses tensor data structure to represent all data. Any tensor has a static type and dynamic dimensions, so you can change a tensor's internal organization in real-time. Another property of tensors, is that only objects of the tensor type can be passed between nodes in the computation graph. Let's now see what the properties of tensors are (from now on, every time we use the word tensor, we'll be referring to TensorFlow's tensor objects). Tensor rank Tensor ranks represent the dimensional aspect of a tensor, but is not the same as a matrix rank. It represents the quantity of dimensions in which the tensor lives, and is not a precise measure of the extension of the tensor in rows/columns or spatial equivalents. A rank one tensor is the equivalent of a vector, and a rank one tensor is a matrix. For a rank two tensor you can access any element with the syntax t[i, j]. For a rank three tensor you would need to address an element with t[i, j, k], and so on. In the following example, we will create a tensor, and access one of its components: import tensorflow as tf sess = tf.Session() tens1 = tf.constant([[[1,2],[2,3]],[[3,4],[5,6]]]) print sess.run(tens1)[1,1,0] Output: 5 This is a tensor of rank three, because in each element of the containing matrix, there is a vector element: Rank Math entity Code definition example 0 Scalar scalar = 1000 1 Vector Vector vector = [2, 8, 3] 2 Matrix matrix = [[4, 2, 1], [5, 3, 2], [5, 5, 6]] 3 3-tensor tensor = [[[4], [3], [2]], [[6], [100], [4]], [[5], [1], [4]]] n n-tensor … Tensor shape The TensorFlow documentation uses three notational conventions to describe tensor dimensionality: rank, shape, and dimension number. The following table shows how these relate to one another: Rank Shape Dimension number Example 0 [] 0 4 1 [D0] 1 [2] 2 [D0, D1] 2 [6, 2] 3 [D0, D1, D2] 3 [7, 3, 2] n [D0, D1, … Dn-1] n-D A tensor with shape [D0, D1, … Dn-1] In the following example, we create a sample rank three tensor, and print the shape of it: Tensor data types In addition to dimensionality, tensors have a fixed data type. You can assign any one of the following data types to a tensor: Data type Python type Description DT_FLOAT tf.float32 32 bits floating point. DT_DOUBLE tf.float64 64 bits floating point. DT_INT8 tf.int8 8 bits signed integer. DT_INT16 tf.int16 16 bits signed integer. DT_INT32 tf.int32 32 bits signed integer. DT_INT64 tf.int64 64 bits signed integer. DT_UINT8 tf.uint8 8 bits unsigned integer. DT_STRING tf.string Variable length byte arrays. Each element of a tensor is a byte array. DT_BOOL tf.bool Boolean. Creating new tensors We can either create our own tensors, or derivate them from the well-known numpy library. In the following example, we create some numpy arrays, and do some basic math with them: import tensorflow as tf import numpy as np x = tf.constant(np.random.rand(32).astype(np.float32)) y= tf.constant ([1,2,3]) x Y Output: <tf.Tensor 'Const_2:0' shape=(3,) dtype=int32> From numpy to tensors and vice versa TensorFlow is interoperable with numpy, and normally the eval() function calls will return a numpy object, ready to be worked with the standard numerical tools. We must note that the tensor object is a symbolic handle for the result of an operation, so it doesn't hold the resulting values of the structures it contains. For this reason, we must run the eval() method to get the actual values, which is the equivalent to Session.run(tensor_to_eval). In this example, we build two numpy arrays, and convert them to tensors: import tensorflow as tf #we import tensorflow import numpy as np #we import numpy sess = tf.Session() #start a new Session Object x_data = np.array([[1.,2.,3.],[3.,2.,6.]]) # 2x3 matrix x = tf.convert_to_tensor(x_data, dtype=tf.float32) print (x) Output: Tensor("Const_3:0", shape=(2, 3), dtype=float32) Useful method: tf.convert_to_tensor: This function converts Python objects of various types to tensor objects. It accepts tensorobjects, numpy arrays, Python lists, and Python scalars. Getting things done - interacting with TensorFlow As with the majority of Python's modules, TensorFlow allows the use of Python's interactive console: In the previous figure, we call the Python interpreter (by simply calling Python) and create a tensor of constant type. Then we invoke it again, and the Python interpreter shows the shape and type of the tensor. We can also use the IPython interpreter, which will allow us to employ a format more compatible with notebook-style tools, such as Jupyter: When talking about running TensorFlow Sessions in an interactive manner, it's better to employ the InteractiveSession object. Unlike the normal tf.Session class, the tf.InteractiveSession class installs itself as the default session on construction. So when you try to eval a tensor, or run an operation, it will not be necessary to pass a Session object to indicate which session it refers to. To summarize, we have learned about tensors, the key data structure in TensorFlow and simple operations we can apply to the data. To know more about different machine learning techniques and algorithms that can be used to build efficient and powerful projects, you can refer to the book Building Machine Learning Projects with TensorFlow.

0
0
35803

article-image-25-startups-machine-learning-differently-2018

Fatema Patrawala

29 Dec 2017

14 min read

25 Startups using machine learning differently in 2018: From farming to brewing beer to elder care

Fatema Patrawala

29 Dec 2017

14 min read

What really excites me about data science and by extension machine learning is the sheer number of possibilities! You can think of so many applications off the top of your head: robo-advisors, computerized lawyers, digital medicine, even automating VC decisions when they invest in startups. You can even venture into automation of art and music, algorithms writing papers which are indistinguishable from human-written papers. It's like solving a puzzle, but a puzzle that's meaningful and that has real world implications. The things that we can do today weren’t possible 5 years ago, and this is largely thanks to growth in computational power, data availability, and the adoption of the cloud that made accessing these resources economical for everyone, all key enabling factors for the advancement of Machine learning and AI. Having witnessed the growth of data science as discipline, industries like finance, health-care, education, media & entertainment, insurance, retail as well as energy has left no stone unturned to harness this opportunity. Data science has the capability to offer even more; and we will see the wide range of applications in the future in places haven’t even been explored. In the years to come, we will increasingly see data powered/AI enabled products and services take on roles traditionally handled by humans as they required innately human qualities to successfully perform. In this article we have covered some use cases of Data Science being used differently and start-ups who have practically implemented it: The Nurturer: For elder care The world is aging rather rapidly. According to the World Health Organization, nearly two billion people across the world are expected to be over 60 years old by 2050, a figure that’s more than triple what it was in 2000. In order to adapt to their increasingly aging population, many countries have raised the retirement age, reducing pension benefits, and have started spending more on elderly care. Research institutions in countries like Japan, home to a large elderly population, are focusing their R&D efforts on robots that can perform tasks like lifting and moving chronically ill patients, many startups are working on automating hospital logistics and bringing in virtual assistance. They also offer AI-based virtual assistants to serve as middlemen between nurses and patients, reducing the need for frequent in-hospital visits. Dr Ben Maruthappu, a practising doctor, has brought a change to the world of geriatric care with an AI based app Cera. It is an on-demand platform to aid the elderly in need. The Cera app firmly puts itself in the category of Uber & Amazon, whereby it connects elderly people in need of care with a caregiver in a matter of few hours. The team behind this innovation also plans to use AI to track patients’ health conditions and reduce the number of emergency patients admitted in hospitals. A social companion technology - Elliq created by Intuition Robotics helps older adults stay active and engaged with a proactive social robot that overcomes the digital divide. AliveCor, a leading FDA-cleared mobile heart solution helps save lives, money, and has brought modern healthcare alive into the 21st century. The Teacher: Personalized education platform for lifelong learning With children increasingly using smartphones and tablets and coding becoming a part of national curricula around the world, technology has become an integral part of classrooms. We have already witnessed the rise and impact of education technology especially through a multitude of adaptive learning platforms that allow learners to strengthen their skills and knowledge - CBTs, LMSes, MOOCs and more. And now virtual reality (VR) and artificial intelligence (AI) are gaining traction to provide us with lifelong learning companion that can accompany and support individuals throughout their studies - in and beyond school . An AI based educational platform learns the amount of potential held by each particular student. Based on this data, tailored guidance is provided to fix mistakes and improvise on the weaker areas. A detailed report can be generated by the teachers to help them customise lesson plans to best suit the needs of the student. Take Gruff Davies’ Kwiziq for example. Gruff with his team leverage AI to provide a personalised learning experience for students based on their individual needs. Students registered on the platform get an advantage of an AI based language coach which asks them to solve various micro quizzes. Quiz solutions provided by students are then turned into detailed “brain maps”. These brain maps are further used to provide tailored instructions and feedback for improvement. Other startup firms like Blippar specialize in Augmented reality for visual and experiential learning. Unelma Platforms, a software platform development company provides state-of-the-art software for higher-education, healthcare and business markets. The Provider: Farming to be more productive, sustainable and advanced Though farming is considered the backbone of many national economies especially in the developing world, there is often an outdated view of it involving a small, family-owned lands where crops are hand harvested. The reality of modern-day farms have had to overhaul operations to meet demand and remain competitively priced while adapting to the ever-changing ways technology is infiltrating all parts of life. Climate change is a serious environmental threat farmers must deal with every season: Strong storms and severe droughts have made farming even more challenging. Additionally lack of agricultural input, water scarcity, over-chemicalization in fertilizers, water & soil pollution or shortage of storage systems has made survival for farmers all the more difficult. To overcome these challenges, smart farming techniques are the need of an hour for farmers in order to manage resources and sustain in the market. For instance, in a paper published by arXiv, the team explains how they used a technique known as transfer learning to teach the AI how to recognize crop diseases and pest damage.They utilized TensorFlow, to build and train a neural network of their own, which involved showing the AI 2,756 images of cassava leaves from plants in Tanzania. Their efforts were a success, as the AI was able to correctly identify brown leaf spot disease with 98 percent accuracy. WeFarm, SaaS based agritech firm, headquartered in London, aims to bridge the connectivity gap amongst the farmer community. It allows them to send queries related to farming via text message which is then shared online into several languages. The farmer then receives a crowdsourced response from other farmers around the world. In this way, a particular farmer in Kenya can get a solution from someone sitting in Uganda, without having to leave his farm, spend additional money or without accessing the internet. Benson Hill Bio-systems, by Matthew B. Crisp, former President of Agricultural Biotechnology Division, has differentiated itself by bringing the power of Cloud Biology™ to agriculture. It combines cloud computing, big data analytics, and plant biology to inspire innovation in agriculture. At the heart of Benson Hill is CropOS™, a cognitive engine that integrates crop data and analytics with the biological expertise and experience of the Benson Hill scientists. CropOS™ continuously advances and improves with every new dataset, resulting in the strengthening of the system’s predictive power. Firms like Plenty Inc and Bowery Farming Inc are nowhere behind in offering smart farming solutions. Plenty Inc is an agriculture technology company that develops plant sciences for crops to flourish in a pesticide and GMO-free environment. While Bowery Farming uses high-tech approaches such as robotics, LED lighting and data analytics to grow leafy greens indoors. The Saviour: For sustainability and waste management The global energy landscape continues to evolve, sometimes by the nanosecond, sometimes by the day. The sector finds itself pulled to economize and pushed to innovate due to a surge in demand for new power and utilities offerings. Innovations in power-sector technology, such as new storage battery options and smartphone-based thermostat apps, AI enabled sensors etc; are advancing at a pace that has surprised developers and adopters alike. Consumer’s demands for such products have increased. To meet this, industry leaders are integrating those innovations into their operations and infrastructure as rapidly as they can. On the other hand, companies pursuing energy efficiency have two long-standing goals — gaining a competitive advantage and boosting the bottom line — and a relatively new one: environmental sustainability. Realising the importance of such impending situations in the industry, we have startups like SmartTrace offering an innovative cloud-based platform to quickly manage waste at multiple levels. This includes bridging rough data from waste contractors, extrapolating to volume, EWC, finance and Co2 statistics. Data extracted acts as a guide to improve methodology, educate, strengthen oversight and direct improvements to the bottom line, as well as environmental outcomes. One Concern provides damage estimates using artificial intelligence on natural phenomena sciences. Autogrid organizes energy data and employs big data analytics to generate real-time predictions to create actionable data. The Dreamer: For lifestyle and creative product development and design Consumers in our modern world continually make multiple decisions with regard to product choice due to many competing products in the market.Often those choices boil down to whether it provides better value than others either in terms of product quality, price or by aligning with their personal beliefs and values.Lifestyle products and brands operate off ideologies, hoping to attract a relatively high number of people and ultimately becoming a recognized social phenomenon. While ecommerce has leveraged data science to master the price dimension, here are some examples of startups trying to deconstruct the other two dimensions: product development and branding. I wonder if you have ever imagined your beer to be brewed by AI? Well, now you can with IntelligentX. The Intelligent X team claim to have invented the world's first beer brewed by Artificial intelligence. They also plan to craft a premium beer using complex machine learning algorithms which can improve itself from the feedback given by its customers. Customers are given to try one of their four bottled conditioned beers, after the trial they are asked by their AI what they think of the beer, via an online feedback messenger. The data then collected is used by an algorithm to brew the next batch. Because their AI is constantly reacting to user feedback, they can brew beer that matches what customers want, more quickly than anyone else can. What this actually means that the company gets more data and customers get a customized fresh beer! In the lifestyle domain, we have Stitch Fix which has brought a personal touch to the online shopping journey. They are no regular other apparel e-commerce company. They have created a perfect formula for blending human expertise with the right amount of Data Science to serve their customers. According to Katrina Lake, Founder, and CEO, "You can look at every product on the planet, but trying to figure out which one is best for you is really the challenge” and that’s where Stitch Fix has come into the picture. The company is disrupting traditional retail by bridging the gap of personalized shopping, that the former could not achieve. To know how StitchFix uses Full Stack Data Science read our detailed article. The Writer: From content creation to curation to promotion In the publishing industry, we have seen a digital revolution coming in too. Echobox are one of the pioneers in building AI for the publishing industry. Antoine Amann, founder of Echobox, wrote in a blog post that they have "developed an AI platform that takes large quantity of variables into account and analyses them in real time to determine optimum post performance". Echobox pride itself to currently work with Facebook and Twitter for optimizing social media content, perform advanced analytics with A/B testing and also curate content for desired CTRs. With global client base like The Le Monde, The Telegraph, The Guardian etc. they have conveniently ripped social media editors. New York-based startup Agolo uses AI to create real-time summaries of information. It initially use to curate Twitter feeds in order to focus on conversations, tweets and hashtags that were most relevant to its user's preferences. Using natural language processing, Agolo scans content, identifies relationships among the information sources, and picks out the most relevant information, all leading to a comprehensive summary of the original piece of information. Other websites like Grammarly, offers AI-powered solutions to help people write, edit and formulate mistake-free content. Textio came up with augmented writing which means every time you wrote something and you would come to know ahead of time exactly who is going to respond. It basically means writing which is supported by outcomes in real time. Automated Insights, Creator of Wordsmith, the natural language generation platform enables you to produce human-sounding narratives from data. The Matchmaker: Connecting people, skills and other entities AI will make networking at B2B events more fun and highly productive for business professionals. Grip, a London based startup, formerly known as Network, rebranded itself in the month of April, 2016. Grip is using AI as a platform to make networking at events more constructive and fruitful. It acts as a B2B matchmaking engine that accumulates data from social accounts (LinkedIn, Twitter) and smartly matches the event registration data. Synonymous to Tinder for networking, Grip uses advanced algorithms to recommend the right people and presents them with an easy to use swiping interface feature. It also delivers a detailed report to the event organizer on the success of the event for every user or a social Segment. We are well aware of the data scientist being the sexiest job of the 21st century. JamieAi harnessing this fact connects technical talent with data-oriented jobs organizations of all types and sizes. The start-up firm has combined AI insights and human oversight to reduce hiring costs and eliminate bias. Also, third party recruitment agencies are removed from the process to boost transparency and efficiency in the path to employment. Another example is Woo.io, a marketplace for matching tech professionals and companies. The Manager: Virtual assistants of a different kind Artificial Intelligence can also predict how much your household appliance will cost on your electricity bill. Verv, a producer of clever home energy assistance provides intelligent information on your household appliances. It helps its customers with a significant reduction on their electricity bills and carbon footprints. The technology uses machine learning algorithms to provide real-time information by learning how much power and money each device is using. Not only this, it can also suggest eco-friendly alternatives, alert homeowners of appliances in use for a longer duration and warn them of any dangerous activity when they aren’t present at home. Other examples include firms like Maana which manages machines and improves operational efficiencies in order to make fast data driven decisions. Gong.io, acts as a sales representative’s assistant to understand sales conversations resulting into actionable insights. ObEN, creates complete virtual identities for consumers and celebrities in the emerging digital world. The Motivator: For personal and business productivity and growth A super cross-functional company Perkbox, came up with an employee engagement platform. Saurav Chopra founder of Perkbox believes teams perform their best when they are happy and engaged! Hence, Perkbox helps companies boost employee motivation and create a more inspirational atmosphere to work. The platform offers gym services, dental discounts and rewards for top achievers in the team to firms in UK. Perkbox offers a wide range of perks, discounts and tools to help organizations retain and motivate their employees. Technologies like AWS and Kubernetes allow to closely knit themselves with their development team. In order to build, scale and support Perkbox application for the growing number of user base. So, these are some use cases where we found startups using data science and machine learning differently. Do you know of others? Please share them in the comments below.

0
0
16848

article-image-18-striking-ai-trends-to-watch-in-2018-part-2

Sugandha Lahoti

28 Dec 2017

12 min read

18 striking AI Trends to watch in 2018 - Part 2

Sugandha Lahoti

28 Dec 2017

12 min read

0
0
13736

article-image-decrypting-bitcoin-2017-journey

Ashwin Nair

28 Dec 2017

7 min read

There and back again: Decrypting Bitcoin`s 2017 journey from $1000 to $20000

Ashwin Nair

28 Dec 2017

7 min read

Lately, Bitcoin has emerged as the most popular topic of discussion amongst colleagues, friends and family. The conversations more or less have a similar theme - filled with inane theories around what Bitcoin is and and the growing fear of missing out on Bitcoin mania. Well to be fair, who would want to miss out on an opportunity to grow their money by more than 1000%. That`s the return posted by Bitcoin since the start of the year. Bitcoin at the time of writing this article is at $15,000 with a marketcap over $250 billion. To put the hype in context, Bitcoin is now valued higher than 90% of companies in S&P 500 list. Supposedly, invented by an anonymous group or an individual under the alias Satoshi Nakamoto in 2009, Bitcoin has been seen as a digital currency that the internet might not need but one that it deserves. Satoshi`s vision was to create a robust electronic payment system that functions smoothly without a need for a trusted third party. This was achievable with the help of Blockchain, a digital ledger which records transactions in an indelible fashion and is distributed across multiple nodes. This ensured that no transaction is altered or deleted, thus completely eliminating the need for a trusted third party. For all the excitement around Bitcoin and the increase in interest level towards owning one, the thought of dissecting the roller coaster journey from sub $1000 level to $15,000 this year seemed pretty exciting. Started year with a bang / Global uncertainties leading to Bitcoin rally - Oct 2016 to Jan 2017 Global uncertainties played a big role in driving Bitcoin`s price and boy 2016 was full of it! From Brexit to Trump winning the US elections and major economic commotion in terms of Devaluation of China`s Yuan and India`s Demonetization drive all leading to investors seeking shelter in Bitcoin. The first blip - Jan 2017 to March 2017 China as a country has had a major impact in determining Bitcoin`s fate. Early 2017, China contributed to over 80% of Bitcoin transactions indicating the amount of power Chinese traders and investors had in controlling Bitcoin`s price. However, People's Bank of China`s closer inspection of the exchanges revealed several irregularities in the transactions and business practices. Which eventually led to officials halting withdrawals using Bitcoin transactions and taking stringent steps towards cryptocurrency exchanges. Source - Bloomberg On the path tobecoming mainstream and gaining support from Ethereum - Mar 2017 to June 2017 During this phase, we saw the rise of another cryptocurrency, Ethereum, a close digital cousin of Bitcoin. Similar to Bitcoin, Ethereum is also built on top of Blockchain technology allowing it to build a decentralized public network. However, Ethereum’s capabilities extend beyond being a cryptocurrency and help developers to build and deploy any kind of decentralized applications. Ethereum`s valuation in this period rose from $20 to $375 which was quite beneficial for Bitcoin. As every reference of Ethereum had Bitcoin mentioned as well, whether it was to explain what Ethereum is or how it can take the number 1 cryptocurrency spot in the future. This, coupled with the rise in Blockchain`s popularity, increased Bitcoin`s visibility within USA. The media started observing politicians, celebrities and other prominent personalities speaking on Bitcoin as well. Bitcoin also received a major boost from Chinese exchanges, wherein withdrawals of the cryptocurrency resumed after nearly a four-month freeze. All these factors led to Bitcoin crossing an all time high of $2500, up by more than 150% since the start of the year. The Curious case of Fork - June 2017 to September 2017 The month of July saw the Cryptocurrencies market cap witnessing a sharp decline, with questions being raised on the price volatility and whether Bitcoin`s rally for the year was over? We can of-course now confidently debunk that question. Though, there hasn’t been any proven rationale behind the decline, one of the reasons seems to be profit booking following months of steep rally in valuations witnessed by Bitcoin. Another major factor, which might have driven the price collapse may be an unresolved dispute among leading members of the bitcoin community over how to overcome the growing problem of Bitcoin being slow and expensive. With growing usage of Bitcoin, its performance in terms of transaction time has slowed down. Bitcoin`s network due to its limitation in terms of block size could only execute around 7 transactions per second compared to the VISA Network which could do over 1600 transactions. This also led to transaction fees being increased substantially to $5.00 per transaction and settlement time often taking hours and even days. This eventually put Bitcoin`s flaw in the spotlight when compared with services offered by competitors such as Paypal in terms of cost and transaction time. Source - Coinmarketcap The poor performance of Bitcoin led to investors opting for other cryptocurrencies. The above graph, shows how Bitcoin`s dominance fell substantially compared to other cryptocurrencies such as Ethereum and Litecoin during this time. With the core-community still unable to come to a consensus on how to improve the performance and update the software, the prospect of a “fork” was raised. Fork highlights change in underlying software protocol of Bitcoin to make previous rules valid or invalid. There are two types of blockchain forks - soft fork and hard fork. Around August, the community announced to go ahead with a hard fork in the form of Bitcoin Cash. This news was surprisingly taken in a positive manner leading to Bitcoin rebounding strongly and reaching new heights of around $4000 in price. Once bitten, Twice Shy (China) September 2017 to October 2017 Source - Zerohedge The month of September saw another setback for Bitcoin due to measures taken from People`s Bank of China. This time, PBoC banned initial coin offerings (ICO), thus prohibiting the practice of building and selling cryptocurrency to any investors or to finance any startup projects within China. Based on a report by National Committee of Experts on Internet Financial Security Technology, Chinese Investors were involved in 2.6 Billion Yuan worth of ICOs in January-June, 2017 reflecting China`s exposure towards Bitcoin. My Precious (October 2017 to December 2017) Source - Manmanroe During the last quarter Bitcoin`s surge has shocked even hardcore Bitcoin fanatics. Everything seems to be going right for Bitcoin at the moment. While at the start of the year China was the major contributor towards the hike in Bitcoin`s valuation, now the momentum seemed to have shifted to a much sensible and responsible market in terms of Japan who have embraced Bitcoin in quite a positive manner. As you can see from the below graph, Japan now holds more than 50% of transaction compared to USA which is much lesser in size. Besides Japan, we are also seeing positive developments in country such as Russia and India, who are looking to legalize cryptocurrency usage. Moreover, the level of interests towards Bitcoin from institutional investors is at its peak. All these factors have resulted in Bitcoin to cross the 5 digit mark for the first time in Nov, 2017 and touching an all time high figure of close to $20,000 in December, 2017. Post the record high, Bitcoin has been witnessing a crash and rebound phenomenon in the last two weeks of December. From a record high of $20,000 to $11,000 and now at $15,000, Bitcoin is still a volatile digital currency if one is looking for a quick price appreciation. Despite the valuation dilemma and the price volatility, one thing is sure: the world is warming up to the idea of cryptocurrencies and even owning one. There are already several predictions being made on how Bitcoin`s astronomical growth is going to continue in 2018. However, Bitcoin needs to overcome several challenges before it can replace the traditional currency and and be widely accepting in banking practices. Besides, the rise of other cryptocurriencies such as Ethereum, LiteCoin or bitcoin cash who are looking to dethrone Bitcoin from the #1 spot, there are broader issues at hand which the Bitcoin community should prioritize such as how to curb the effect of Bitcoin`s mining activities on the environment and on having smoother reforms as well as building regulatory roadmap from countries before people actually start using instead of just looking it as a tool for making a quick buck.

0
33
37275

How-To Tutorials - Data

Getting started with Linear and logistic regression

2018 new year resolutions to thrive in an Algorithmic World - Part 1 of 3

Popular Data sources and models in SAP Analytics Cloud

Write your first Blockchain: Learning Solidity Programming in 15 minutes

Getting started with Web Scraping with Beautiful Soup

How to create a Treemap and Packed Bubble Chart in Tableau

6 Popular Regression Techniques You Need To Know

How to create a Box and Whisker Plot in Tableau

2017 Generative Adversarial Networks (GANs) Research Milestones

How to format and publish code using R Markdown

Trending Topics

Saving backups on cloud services with ElasticSearch plugins

Getting to know and manipulate Tensors in TensorFlow

25 Startups using machine learning differently in 2018: From farming to brewing beer to elder care

18 striking AI Trends to watch in 2018 - Part 2

There and back again: Decrypting Bitcoin`s 2017 journey from $1000 to $20000

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading