Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-basic-operations-elasticsearch
Packt
16 Jan 2017
10 min read
Save for later

Basic Operations of Elasticsearch

Packt
16 Jan 2017
10 min read
In this article by Alberto Maria Angelo Paro, the author of the book ElasticSearch 5.0 Cookbook - Third Edition, you will learn the following recipes: Creating an index Deleting an index Opening/closing an index Putting a mapping in an index Getting a mapping (For more resources related to this topic, see here.) Creating an index The first operation to do before starting indexing data in Elasticsearch is to create an index--the main container of our data. An index is similar to the concept of database in SQL, a container for types (tables in SQL) and documents (records in SQL). Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... The HTTP method to create an index is PUT (but also POST works); the REST URL contains the index name: http://<server>/<index_name> For creating an index, we will perform the following steps: From the command line, we can execute a PUT call: curl -XPUT http://127.0.0.1:9200/myindex -d '{ "settings" : { "index" : { "number_of_shards" : 2, "number_of_replicas" : 1 } } }' The result returned by Elasticsearch should be: {"acknowledged":true,"shards_acknowledged":true} If the index already exists, a 400 error is returned: { "error" : { "root_cause" : [ { "type" : "index_already_exists_exception", "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists", "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g", "index" : "myindex" } ], "type" : "index_already_exists_exception", "reason" : "index [myindex/YJRxuqvkQWOe3VuTaTbu7g] already exists", "index_uuid" : "YJRxuqvkQWOe3VuTaTbu7g", "index" : "myindex" }, "status" : 400 } How it works... Because the index name will be mapped to a directory on your storage, there are some limitations to the index name, and the only accepted characters are: ASCII letters [a-z] Numbers [0-9] point ".", minus "-", "&" and "_" During index creation, the replication can be set with two parameters in the settings/index object: number_of_shards, which controls the number of shards that compose the index (every shard can store up to 2^32 documents) number_of_replicas, which controls the number of replica (how many times your data is replicated in the cluster for high availability)A good practice is to set this value at least to 1. The API call initializes a new index, which means: The index is created in a primary node first and then its status is propagated to all nodes of the cluster level A default mapping (empty) is created All the shards required by the index are initialized and ready to accept data The index creation API allows defining the mapping during creation time. The parameter required to define a mapping is mapping and accepts multi mappings. So in a single call it is possible to create an index and put the required mappings. There's more... The create index command allows passing also the mappings section, which contains the mapping definitions. It is a shortcut to create an index with mappings, without executing an extra PUT mapping call: curl -XPOST localhost:9200/myindex -d '{ "settings" : { "number_of_shards" : 2, "number_of_replicas" : 1 }, "mappings" : { "order" : { "properties" : { "id" : {"type" : "keyword", "store" : "yes"}, "date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"}, "customer_id" : {"type" : "keyword", "store" : "yes"}, "sent" : {"type" : "boolea+n", "index":"not_analyzed"}, "name" : {"type" : "text", "index":"analyzed"}, "quantity" : {"type" : "integer", "index":"not_analyzed"}, "vat" : {"type" : "double", "index":"no"} } } } }' Deleting an index The counterpart of creating an index is deleting one. Deleting an index means deleting its shards, mappings, and data. There are many common scenarios when we need to delete an index, such as: Removing the index to clean unwanted/obsolete data (for example, old Logstash indices). Resetting an index for a scratch restart. Deleting an index that has some missing shard, mainly due to some failures, to bring back the cluster in a valid state (if a node dies and it's storing a single replica shard of an index, this index is missing a shard so the cluster state becomes red. In this case, you'll bring back the cluster to a green status, but you lose the data contained in the deleted index). Getting ready To execute curl via command line you need to install curl for your operative system. The index created is required to be deleted. How to do it... The HTTP method used to delete an index is DELETE. The following URL contains only the index name: http://<server>/<index_name> For deleting an index, we will perform the steps given as follows: Execute a DELETE call, by writing the following command: curl -XDELETE http://127.0.0.1:9200/myindex We check the result returned by Elasticsearch. If everything is all right, it should be: {"acknowledged":true} If the index doesn't exist, a 404 error is returned: { "error" : { "root_cause" : [ { "type" : "index_not_found_exception", "reason" : "no such index", "resource.type" : "index_or_alias", "resource.id" : "myindex", "index_uuid" : "_na_", "index" : "myindex" } ], "type" : "index_not_found_exception", "reason" : "no such index", "resource.type" : "index_or_alias", "resource.id" : "myindex", "index_uuid" : "_na_", "index" : "myindex" }, "status" : 404 } How it works... When an index is deleted, all the data related to the index is removed from disk and is lost. During the delete processing, first the cluster is updated, and then the shards are deleted from the storage. This operation is very fast; in a traditional filesystem it is implemented as a recursive delete. It's not possible restore a deleted index, if there is no backup. Also calling using the special _all index_name can be used to remove all the indices. In production it is good practice to disable the all indices deletion by adding the following line to Elasticsearch.yml: action.destructive_requires_name:true Opening/closing an index If you want to keep your data, but save resources (memory/CPU), a good alternative to delete indexes is to close them. Elasticsearch allows you to open/close an index to put it into online/offline mode. Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... For opening/closing an index, we will perform the following steps: From the command line, we can execute a POST call to close an index using: curl -XPOST http://127.0.0.1:9200/myindex/_close If the call is successful, the result returned by Elasticsearch should be: {,"acknowledged":true} To open an index, from the command line, type the following command: curl -XPOST http://127.0.0.1:9200/myindex/_open If the call is successful, the result returned by Elasticsearch should be: {"acknowledged":true} How it works... When an index is closed, there is no overhead on the cluster (except for metadata state): the index shards are switched off and they don't use file descriptors, memory, and threads. There are many use cases when closing an index: Disabling date-based indices (indices that store their records by date), for example, when you keep an index for a week, month, or day and you want to keep online a fixed number of old indices (that is, two months) and some offline (that is, from two months to six months). When you do searches on all the active indices of a cluster and don't want search in some indices (in this case, using alias is the best solution, but you can achieve the same concept of alias with closed indices). An alias cannot have the same name as an index When an index is closed, calling the open restores its state. Putting a mapping in an index We saw how to build mapping by indexing documents. This recipe shows how to put a type mapping in an index. This kind of operation can be considered as the Elasticsearch version of an SQL created table. Getting ready To execute curl via the command line you need to install curl for your operative system. How to do it... The HTTP method to put a mapping is PUT (also POST works). The URL format for putting a mapping is: http://<server>/<index_name>/<type_name>/_mapping For putting a mapping in an index, we will perform the steps given as follows: If we consider the type order, the call will be: curl -XPUT 'http://localhost:9200/myindex/order/_mapping' -d '{ "order" : { "properties" : { "id" : {"type" : "keyword", "store" : "yes"}, "date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"}, "customer_id" : {"type" : "keyword", "store" : "yes"}, "sent" : {"type" : "boolean", "index":"not_analyzed"}, "name" : {"type" : "text", "index":"analyzed"}, "quantity" : {"type" : "integer", "index":"not_analyzed"}, "vat" : {"type" : "double", "index":"no"} } } }' In case of success, the result returned by Elasticsearch should be: {"acknowledged":true} How it works... This call checks if the index exists and then it creates one or more type mapping as described in the definition. During mapping insert if there is an existing mapping for this type, it is merged with the new one. If there is a field with a different type and the type could not be updated, an exception expanding fields property is raised. To prevent an exception during the merging mapping phase, it's possible to specify the ignore_conflicts parameter to true (default is false). The put mapping call allows you to set the type for several indices in one shot; list the indices separated by commas or to apply all indexes using the _all alias. There's more… There is not a delete operation for mapping. It's not possible to delete a single mapping from an index. To remove or change a mapping you need to manage the following steps: Create a new index with the new/modified mapping Reindex all the records Delete the old index with incorrect mapping Getting a mapping After having set our mappings for processing types, we sometimes need to control or analyze the mapping to prevent issues. The action to get the mapping for a type helps us to understand structure or its evolution due to some merge and implicit type guessing. Getting ready To execute curl via command-line you need to install curl for your operative system. How to do it… The HTTP method to get a mapping is GET. The URL formats for getting mappings are: http://<server>/_mapping http://<server>/<index_name>/_mapping http://<server>/<index_name>/<type_name>/_mapping To get a mapping from the type of an index, we will perform the following steps: If we consider the type order of the previous chapter, the call will be: curl -XGET 'http://localhost:9200/myindex/order/_mapping?pretty=true' The pretty argument in the URL is optional, but very handy to pretty print the response output. The result returned by Elasticsearch should be: { "myindex" : { "mappings" : { "order" : { "properties" : { "customer_id" : { "type" : "keyword", "store" : true }, … truncated } } } } } How it works... The mapping is stored at the cluster level in Elasticsearch. The call checks both index and type existence and then it returns the stored mapping. The returned mapping is in a reduced form, which means that the default values for a field are not returned. Elasticsearch stores only not default field values to reduce network and memory consumption. Retrieving a mapping is very useful for several purposes: Debugging template level mapping Checking if implicit mapping was derivated correctly by guessing fields Retrieving the mapping metadata, which can be used to store type-related information Simply checking if the mapping is correct If you need to fetch several mappings, it is better to do it at index level or cluster level to reduce the numbers of API calls. Summary We learned how to manage indices and perform operations on documents. We'll discuss different operations on indices such as create, delete, update, open, and close. These operations are very important because they allow better define the container (index) that will store your documents. The index create/delete actions are similar to the SQL create/delete database commands. Resources for Article: Further resources on this subject: Elastic Stack Overview [article] Elasticsearch – Spicing Up a Search Using Geo [article] Downloading and Setting Up ElasticSearch [article]
Read more
  • 0
  • 0
  • 5843

article-image-metric-analytics-metricbeat
Packt
11 Jan 2017
5 min read
Save for later

Metric Analytics with Metricbeat

Packt
11 Jan 2017
5 min read
In this article by Bahaaldine Azarmi, the author of the book Learning Kibana 5.0, we will learn about metric analytics, which is fundamentally different in terms of data structure. (For more resources related to this topic, see here.) Author would like to spend a few lines on the following question: What is a metric? A metric is an event that contains a timestamp and usually one or more numeric values. It is appended to a metric file sequentially, where all lines of metrics are ordered based on the timestamp. As an example, here are a few system metrics: 02:30:00 AM    all    2.58    0.00    0.70    1.12    0.05     95.5502:40:00 AM    all    2.56    0.00    0.69    1.05    0.04     95.6602:50:00 AM    all    2.64    0.00    0.65    1.15    0.05     95.50 Unlike logs, metrics are sent periodically, for example, every 10 minutes (as the preceding example illustrates) whereas logs are usually appended to the log file when something happens. Metrics are often used in the context of software or hardware health monitoring, such as resource utilization monitoring, database execution metrics monitoring, and so on. Since version 5.0, Elastic had, at all layers of the solutions, new features to enhance the user experience of metrics management and analytics. Metricbeat is one of the new features in 5.0. It allows the user to ship metrics data, whether from the machine or from applications, to Elasticsearch, and comes with out-of-the-box dashboards for Kibana. Kibana also integrates Timelion with its core, a plugin which has been made for manipulating numeric data, such as metrics. In this article, we'll start by working with Metricbeat. Metricbeat in Kibana The procedure to import the dashboard has been laid out in the subsequent section. Importing the dashboard Before importing the dashboard, let's have a look at the actual metric data that Metricbeat ships. As I have Chrome opened while typing this article, I'm going to filter the data by process name, here chrome: Discover tab filtered by process name   Here is an example of one of the documents I have: { "_index": "metricbeat-2016.09.06", "_type": "metricsets", "_id": "AVcBFstEVDHwfzZYZHB8", "_score": 4.29527, "_source": { "@timestamp": "2016-09-06T20:00:53.545Z", "beat": { "hostname": "MacBook-Pro-de-Bahaaldine.local", "name": "MacBook-Pro-de-Bahaaldine.local" }, "metricset": { "module": "system", "name": "process", "rtt": 5916 }, "system": { "process": { "cmdline": "/Applications/Google Chrome.app/Contents/Versions/52.0.2743.116/Google Chrome Helper.app/Contents/MacOS/Google Chrome Helper --type=ppapi --channel=55142.2188.1032368744 --ppapi-flash-args --lang=fr", "cpu": { "start_time": "09:52", "total": { "pct": 0.0035 } }, "memory": { "rss": { "bytes": 67813376, "pct": 0.0039 }, "share": 0, "size": 3355303936 }, "name": "Google Chrome H", "pid": 76273, "ppid": 55142, "state": "running", "username": "bahaaldine" } }, "type": "metricsets" }, "fields": { "@timestamp": [ 1473192053545 ] } } Metricbeat document example The preceding document breaks down the utilization of resources for the chrome process. We can see, for example, the usage of CPU and memory, as well as the state of the process as a whole. Now how about visualizing the data in an actual dashboard? To do so, go into the Kibana folder located in the Metricbeat installation directory: MacBook-Pro-de-Bahaaldine:kibana bahaaldine$ pwd /elastic/metricbeat-5.0.0/kibana MacBook-Pro-de-Bahaaldine:kibana bahaaldine$ ls dashboard import_dashboards.ps1 import_dashboards.sh index-pattern search visualization import_dashboards.sh is the file we will use to import the dashboards in Kibana. Execute the file script like the following: ./import_dashboards.sh –h This should print out the help, which, essentially, will give you the list of arguments you can pass to the script. Here, we need to specify a username and a password as we are using the X-Pack security plugin, which secures our cluster: ./import_dashboards.sh –u elastic:changeme You should normally get a bunch of logs stating that dashboards have been imported, as shown in the following example: Import visualization Servers-overview: {"_index":".kibana","_type":"visualization","_id":"Servers-overview","_version":4,"forced_refresh":false,"_shards":{"total":2,"successful":1,"failed":0},"created":false} Now, at this point, you have metric data in Elasticsearch and dashboards created in Kibana, so you can now visualize the data. Visualizing metrics If you go back into the Kibana/dashboard section and try to open the Metricbeat System Statistics dashboard, you should get something similar to the following: Metricbeat Kibana dashboard You should see in your own dashboard the metric based on the processes that are running on your computer. In my case, I have a bunch of them for which I can visualize the CPU and memory utilization, for example: RAM and CPU utilization As an example, what can be important here is to be sure that Metricbeat has a very low footprint on the overall system in terms of CPU or RAM, as shown here: Metricbeat resource utilization As we can see in the preceding diagram, Metricbeat only uses about 0.4% of the CPU and less than 0.1% of the memory on my Macbook Pro. On the other hand, if I want to get the most resource-consuming processes, I can check in the Top processes data table, which gives the following information: Top processes Besides Google Chrome H, which uses a lot of CPU, zoom.us, a conferencing application, seems to bring a lot of stress to my laptop. Rather than using the Kibana standard visualization to manipulate our metrics, we'll use Timelion instead, and focus on this heavy CPU consuming processes use case. Summary In this article, we have seen how we can use Kibana in the context of technical metric analytics. We relied on the data that Metricbeat is able to ship from a machine and visualized the result both in Kibana dashboard and in Kibana Timelion. Resources for Article: Further resources on this subject: An Introduction to Kibana [article] Big Data Analytics [article] Static Data Management [article]
Read more
  • 0
  • 0
  • 3342

article-image-ml-package
Packt
11 Jan 2017
18 min read
Save for later

ML Package

Packt
11 Jan 2017
18 min read
In this article by Denny Lee, the author of the book Learning PySpark, has provided a brief implementation and theory on ML packages. So, let's get to it! In this article, we will reuse a portion of the dataset. The data can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz. (For more resources related to this topic, see here.) Overview of the package At the top level, the package exposes three main abstract classes: a Transformer, an Estimator, and a Pipeline. We will shortly explain each with some short examples. Transformer The Transformer class, like the name suggests, transforms your data by (normally) appending a new column to your DataFrame. At the high level, when deriving from the Transformer abstract class, each and every new Transformer needs to implement a .transform(...) method. The method, as a first and normally the only obligatory parameter, requires passing a DataFrame to be transformed. This, of course, varies method-by-method in the ML package: other popular parameters are inputCol and outputCol; these, however, frequently default to some predefined values, such as 'features' for the inputCol parameter. There are many Transformers offered in the spark.ml.feature and we will briefly describe them here: Binarizer: Given a threshold, the method takes a continuous variable and transforms it into a binary one. Bucketizer: Similar to the Binarizer, this method takes a list of thresholds (the splits parameter) and transforms a continuous variable into a multinomial one. ChiSqSelector: For the categorical target variables (think, classification models), the feature allows you to select a predefined number of features (parameterized by the numTopFeatures parameter) that explain the variance in the target the best. The selection is done, as the name of the method suggest using a Chi-Square test. It is one of the two-step methods: first, you need to .fit(...) your data (so the method can calculate the Chi-square tests). Calling the .fit(...) method (you pass your DataFrame as a parameter) returns a ChiSqSelectorModel object that you can then use to transform your DataFrame using the .transform(...) method. More information on Chi-square can be found here: http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html. CountVectorizer: Useful for a tokenized text (such as [['Learning', 'PySpark', 'with', 'us'],['us', 'us', 'us']]). It is of two-step methods: first, you need to .fit(...), that is, learn the patterns from your dataset, before you can .transform(...) with the CountVectorizerModel returned by the .fit(...) method. The output from this transformer, for the tokenized text presented previously, would look similar to this: [(4, [0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0]),(4, [3], [3.0])]. DCT: The Discrete Cosine Transform takes a vector of real values and returns a vector of the same length, but with the sum of cosine functions oscillating at different frequencies. Such transformations are useful to extract some underlying frequencies in your data or in data compression. ElementwiseProduct: A method that returns a vector with elements that are products of the vector passed to the method, and a vector passed as the scalingVec parameter. For example, if you had a [10.0, 3.0, 15.0] vector and your scalingVec was [0.99, 3.30, 0.66], then the vector you would get would look as follows: [9.9, 9.9, 9.9]. HashingTF: A hashing trick transformer that takes a list of tokenized text and returns a vector (of predefined length) with counts. From PySpark's documentation: Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns. IDF: The method computes an Inverse Document Frequency for a list of documents. Note that the documents need to already be represented as a vector (for example, using either the HashingTF or CountVectorizer). IndexToString: A complement to the StringIndexer method. It uses the encoding from the StringIndexerModel object to reverse the string index to original values. MaxAbsScaler: Rescales the data to be within the [-1, 1] range (thus, does not shift the center of the data). MinMaxScaler: Similar to the MaxAbsScaler with the difference that it scales the data to be in the [0.0, 1.0] range. NGram: The method that takes a list of tokenized text and returns n-grams: pairs, triples, or n-mores of subsequent words. For example, if you had a ['good', 'morning', 'Robin', 'Williams'] vector you would get the following output: ['good morning', 'morning Robin', 'Robin Williams']. Normalizer: A method that scales the data to be of unit norm using the p-norm value (by default, it is L2). OneHotEncoder: A method that encodes a categorical column to a column of binary vectors. PCA: Performs the data reduction using principal component analysis. PolynomialExpansion: Performs a polynomial expansion of a vector. For example, if you had a vector symbolically written as [x, y, z], the method would produce the following expansion: [x, x*x, y, x*y, y*y, z, x*z, y*z, z*z]. QuantileDiscretizer: Similar to the Bucketizer method, but instead of passing the splits parameter you pass the numBuckets one. The method then decides, by calculating approximate quantiles over your data, what the splits should be. RegexTokenizer: String tokenizer using regular expressions. RFormula: For those of you who are avid R users - you can pass a formula such as vec ~ alpha * 3 + beta (assuming your DataFrame has the alpha and beta columns) and it will produce the vec column given the expression. SQLTransformer: Similar to the previous, but instead of R-like formulas you can use SQL syntax. The FROM statement should be selecting from __THIS__ indicating you are accessing the DataFrame. For example: SELECT alpha * 3 + beta AS vec FROM __THIS__. StandardScaler: Standardizes the column to have 0 mean and standard deviation equal to 1. StopWordsRemover: Removes stop words (such as 'the' or 'a') from a tokenized text. StringIndexer: Given a list of all the words in a column, it will produce a vector of indices. Tokenizer: Default tokenizer that converts the string to lower case and then splits on space(s). VectorAssembler: A highly useful transformer that collates multiple numeric (vectors included) columns into a single column with a vector representation. For example, if you had three columns in your DataFrame: df = spark.createDataFrame( [(12, 10, 3), (1, 4, 2)], ['a', 'b', 'c']) The output of calling: ft.VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features') .transform(df) .select('features') .collect() It would look as follows: [Row(features=DenseVector([12.0, 10.0, 3.0])), Row(features=DenseVector([1.0, 4.0, 2.0]))] VectorIndexer: A method for indexing categorical column into a vector of indices. It works in a column-by-column fashion, selecting distinct values from the column, sorting and returning an index of the value from the map instead of the original value. VectorSlicer: Works on a feature vector, either dense or sparse: given a list of indices it extracts the values from the feature vector. Word2Vec: The method takes a sentence (string) as an input and transforms it into a map of {string, vector} format, a representation that is useful in natural language processing. Note that there are many methods in the ML package that have an E letter next to it; this means the method is currently in beta (or Experimental) and it sometimes might fail or produce erroneous results. Beware. Estimators Estimators can be thought of as statistical models that need to be estimated to make predictions or classify your observations. If deriving from the abstract Estimator class, the new model has to implement the .fit(...) method that fits the model given the data found in a DataFrame and some default or user-specified parameters. There are a lot of estimators available in PySpark and we will now shortly describe the models available in Spark 2.0. Classification The ML package provides a data scientist with seven classification models to choose from. These range from the simplest ones (such as Logistic Regression) to more sophisticated ones. We will provide short descriptions of each of them in the following section: LogisticRegression: The benchmark model for classification. The logistic regression uses logit function to calculate the probability of an observation belonging to a particular class. At the time of writing, the PySpark ML supports only binary classification problems. DecisionTreeClassifier: A classifier that builds a decision tree to predict a class for an observation. Specifying the maxDepth parameter limits the depth the tree grows, the minInstancePerNode determines the minimum number of observations in the tree node required to further split, the maxBins parameter specifies the maximum number of bins the continuous variables will be split into, and the impurity specifies the metric to measure and calculate the information gain from the split. GBTClassifier: A Gradient Boosted Trees classification model for classification. The model belongs to the family of ensemble models: models that combine multiple weak predictive models to form a strong one. At the moment the GBTClassifier model supports binary labels, and continuous and categorical features. RandomForestClassifier: The models produce multiple decision trees (hence the name - forest) and use the mode output of those decision trees to classify observations. The RandomForestClassifier supports both binary and multinomial labels. NaiveBayes: Based on the Bayes' theorem, this model uses conditional probability theory to classify observations. The NaiveBayes model in PySpark ML supports both binary and multinomial labels. MultilayerPerceptronClassifier: A classifier that mimics the nature of a human brain. Deeply rooted in the Artificial Neural Networks theory, the model is a black-box, that is, it is not easy to interpret the internal parameters of the model. The model consists, at a minimum, of three, fully connected layers (a parameter that needs to be specified when creating the model object) of artificial neurons: the input layer (that needs to be equal to the number of features in your dataset), a number of hidden layers (at least one), and an output layer with the number of neurons equal to the number of categories in your label. All the neurons in the input and hidden layers have a sigmoid activation function, whereas the activation function of the neurons in the output layer is softmax. OneVsRest: A reduction of a multiclass classification to a binary one. For example, in the case of a multinomial label, the model can train multiple binary logistic regression models. For example, if label == 2 the model will build a logistic regression where it will convert the label == 2 to 1 (or else label values would be set to 0) and then train a binary model. All the models are then scored and the model with the highest probability wins. Regression There are seven models available for regression tasks in the PySpark ML package. As with classification, these range from some basic ones (such as obligatory Linear Regression) to more complex ones: AFTSurvivalRegression: Fits an Accelerated Failure Time regression model; It is a parametric model that assumes that a marginal effect of one of the features accelerates or decelerates a life expectancy (or process failure). It is highly applicable for the processes with well-defined stages. DecisionTreeRegressor: Similar to the model for classification with an obvious distinction that the label is continuous instead of binary (or multinomial). GBTRegressor: As with the DecisionTreeRegressor, the difference is the data type of the label. GeneralizedLinearRegression: A family of linear models with differing kernel functions (link functions). In contrast to the linear regression that assumes normality of error terms, the GLM allows the label to have different error term distributions: the GeneralizedLinearRegression model from the PySpark ML package supports gaussian, binomial, gamma, and poisson families of error distributions with a host of different link functions. IsotonicRegression: A type of regression that fits a free-form, non-decreasing line to your data. It is useful to fit the datasets with ordered and increasing observations. LinearRegression: The most simple of regression models, assumes linear relationship between features and a continuous label, and normality of error terms. RandomForestRegressor: Similar to either DecisionTreeRegressor or GBTRegressor, the RandomForestRegressor fits a continuous label instead of a discrete one. Clustering Clustering is a family of unsupervised models that is used to find underlying patterns in your data. The PySpark ML package provides four most popular models at the moment: BisectingKMeans: A combination of k-means clustering method and hierarchical clustering. The algorithm begins with all observations in a single cluster and iteratively splits the data into k clusters. Check out this website for more information on pseudo-algorithm: http://minethedata.blogspot.com/2012/08/bisecting-k-means.html. KMeans: It is the famous k-mean algorithm that separates data into k clusters, iteratively searching for centroids that minimize the sum of square distances between each observation and the centroid of the cluster it belongs to. GaussianMixture: This method uses k Gaussian distributions with unknown parameters to dissect the dataset. Using the Expectation-Maximization algorithm, the parameters for the Gaussians are found by maximizing the log-likelihood function. Beware that for datasets with many features this model might perform poorly due to the curse of dimensionality and numerical issues with Gaussian distributions. LDA: This model is used for topic modeling in natural language processing applications. There is also one recommendation model available in PySpark ML, but I will refrain from describing it here. Pipeline A Pipeline in PySpark ML is a concept of an end-to-end transformation-estimation process (with distinct stages) that ingests some raw data (in a DataFrame form), performs necessary data carpentry (transformations), and finally estimates a statistical model (estimator). A Pipeline can be purely transformative, that is, consisting of Transformers only. A Pipeline can be thought of as a chain of multiple discrete stages. When a .fit(...) method is executed on a Pipeline object, all the stages are executed in the order they were specified in the stages parameter; the stages parameter is a list of Transformer and Estimator objects. The .fit(...) method of the Pipeline object executes the .transform(...) method for the Transformers and the .fit(...) method for the Estimators. Normally, the output of a preceding stage becomes the input for the following stage: when deriving from either the Transformer or Estimator abstract classes, one needs to implement the .getOutputCol() method that returns the value of the outputCol parameter specified when creating an object. Predicting chances of infant survival with ML In this section, we will use the portion of the dataset to present the ideas of PySpark ML. If you have not yet downloaded the data, it can be accessed here: http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz. In this section, we will, once again, attempt to predict the chances of the survival of an infant. Loading the data First, we load the data with the help of the following code: import pyspark.sql.types as typ labels = [ ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()), ('BIRTH_PLACE', typ.StringType()), ('MOTHER_AGE_YEARS', typ.IntegerType()), ('FATHER_COMBINED_AGE', typ.IntegerType()), ('CIG_BEFORE', typ.IntegerType()), ('CIG_1_TRI', typ.IntegerType()), ('CIG_2_TRI', typ.IntegerType()), ('CIG_3_TRI', typ.IntegerType()), ('MOTHER_HEIGHT_IN', typ.IntegerType()), ('MOTHER_PRE_WEIGHT', typ.IntegerType()), ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()), ('MOTHER_WEIGHT_GAIN', typ.IntegerType()), ('DIABETES_PRE', typ.IntegerType()), ('DIABETES_GEST', typ.IntegerType()), ('HYP_TENS_PRE', typ.IntegerType()), ('HYP_TENS_GEST', typ.IntegerType()), ('PREV_BIRTH_PRETERM', typ.IntegerType()) ] schema = typ.StructType([ typ.StructField(e[0], e[1], False) for e in labels ]) births = spark.read.csv('births_transformed.csv.gz', header=True, schema=schema) We specify the schema of the DataFrame; our severely limited dataset now only has 17 columns. Creating transformers Before we can use the dataset to estimate a model, we need to do some transformations. Since statistical models can only operate on numeric data, we will have to encode the BIRTH_PLACE variable. Before we do any of this, since we will use a number of different feature transformations. Let's import them all: import pyspark.ml.feature as ft To encode the BIRTH_PLACE column, we will use the OneHotEncoder method. However, the method cannot accept StringType columns - it can only deal with numeric types so first we will cast the column to an IntegerType: births = births .withColumn( 'BIRTH_PLACE_INT', births['BIRTH_PLACE'] .cast(typ.IntegerType())) Having done this, we can now create our first Transformer: encoder = ft.OneHotEncoder( inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC') Let's now create a single column with all the features collated together. We will use the VectorAssembler method: featuresCreator = ft.VectorAssembler( inputCols=[ col[0] for col in labels[2:]] + [encoder.getOutputCol()], outputCol='features' ) The inputCols parameter passed to the VectorAssembler object is a list of all the columns to be combined together to form the outputCol - the 'features'. Note that we use the output of the encoder object (by calling the .getOutputCol() method), so we do not have to remember to change this parameter's value should we change the name of the output column in the encoder object at any point. It's now time to create our first estimator. Creating an estimator In this example, we will (once again) use the Logistic Regression model. However, we will showcase some more complex models from the .classification set of PySpark ML models, so we load the whole section: import pyspark.ml.classification as cl Once loaded, let's create the model by using the following code: logistic = cl.LogisticRegression( maxIter=10, regParam=0.01, labelCol='INFANT_ALIVE_AT_REPORT') We would not have to specify the labelCol parameter if our target column had the name 'label'. Also, if the output of our featuresCreator would not be called 'features' we would have to specify the featuresCol by (most conveniently) calling the getOutputCol() method on the featuresCreator object. Creating a pipeline All that is left now is to create a Pipeline and fit the model. First, let's load the Pipeline from the ML package: from pyspark.ml import Pipeline Creating Pipeline is really easy. Here's how our pipeline should look like conceptually: Converting this structure into a Pipeline is a walk in the park: pipeline = Pipeline(stages=[ encoder, featuresCreator, logistic ]) That's it! Our pipeline is now created so we can (finally!) estimate the model. Fitting the model Before you fit the model we need to split our dataset into training and testing datasets. Conveniently, the DataFrame API has the .randomSplit(...) method: births_train, births_test = births .randomSplit([0.7, 0.3], seed=666) The first parameter is a list of dataset proportions that should end up in, respectively, births_train and births_test subsets. The seed parameter provides a seed to the randomizer. You can also split the dataset into more than two subsets as long as the elements of the list sum up to 1, and you unpack the output into as many subsets. For example, we could split the births dataset into three subsets like this: train, test, val = births. randomSplit([0.7, 0.2, 0.1], seed=666) The preceding code would put a random 70% of the births dataset into the train object, 20% would go to the test, and the val DataFrame would hold the remaining 10%. Now it is about time to finally run our pipeline and estimate our model: model = pipeline.fit(births_train) test_model = model.transform(births_test) The .fit(...) method of the pipeline object takes our training dataset as an input. Under the hood, the births_train dataset is passed first to the encoder object. The DataFrame that is created at the encoder stage then gets passed to the featuresCreator that creates the 'features' column. Finally, the output from this stage is passed to the logistic object that estimates the final model. The .fit(...) method returns the PipelineModel object (the model object in the preceding snippet) that can then be used for prediction; we attain this by calling the .transform(...) method and passing the testing dataset created earlier. Here's what the test_model looks like in the following command: test_model.take(1) It generates the following output: As you can see, we get all the columns from the Transfomers and Estimators. The logistic regression model outputs several columns: the rawPrediction is the value of the linear combination of features and the β coefficients, probability is the calculated probability for each of the classes, and finally, the prediction, which is our final class assignment. Evaluating the performance of the model Obviously, we would like to now test how well our model did. PySpark exposes a number of evaluation methods for classification and regression in the .evaluation section of the package: import pyspark.ml.evaluation as ev We will use the BinaryClassficationEvaluator to test how well our model performed: evaluator = ev.BinaryClassificationEvaluator( rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT') The rawPredictionCol can either be the rawPrediction column produced by the estimator or the probability. Let's see how well our model performed: print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'})) print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'})) The preceding code produces the following result: The area under the ROC of 74% and area under PR 71% shows a well-defined model, but nothing out of extraordinary; if we had other features, we could drive this up. Saving the model PySpark allows you to save the Pipeline definition for later use. It not only saves the pipeline structure, but also all the definitions of all the Transformers and Estimators: pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline' pipeline.write().overwrite().save(pipelinePath) So, you can load it up later and use it straight away to .fit(...) and predict: loadedPipeline = Pipeline.load(pipelinePath) loadedPipeline .fit(births_train) .transform(births_test) .take(1) The preceding code produces the same result (as expected): Summary Hence we studied ML package. We explained what Transformer and Estimator are, and showed their role in another concept introduced in the ML library: the Pipeline. Subsequently, we also presented how to use some of the methods to fine-tune the hyper parameters of models. Finally, we gave some examples of how to use some of the feature extractors and models from the library. Resources for Article: Further resources on this subject: Package Management [article] Everything in a Package with concrete5 [article] Writing a Package in Python [article]
Read more
  • 0
  • 0
  • 2786

article-image-visualization-dashboard-design
Packt
10 Jan 2017
18 min read
Save for later

Visualization Dashboard Design

Packt
10 Jan 2017
18 min read
In this article by David Baldwin, the author of the book Mastering Tableau, we will cover how you need to create some effective dashboards. (For more resources related to this topic, see here.) Since that fateful week in Manhattan, I've read Edward Tufte, Stephen Few, and other thought leaders in the data visualization space. This knowledge has been very fruitful. For instance, quite recently a colleague told me that one of his clients thought a particular dashboard had too many bar charts and he wanted some variation. I shared the following two quotes: Show data variation, not design variation. –Edward Tufte in The Visual Display of Quantitative Information Variety might be the spice of life, but, if it is introduced on a dashboard for its own sake, the display suffers. –Stephen Few in Information Dashboard Design Those quotes proved helpful for my colleague. Hopefully the following information will prove helpful to you. Additionally I would also like to draw attention to Alberto Cairo—a relatively new voice providing new insight. Each of these authors should be considered a must-read for anyone working in data visualization. Visualization design theory Dashboard design Sheet selection Visualization design theory Any discussion on designing dashboards should begin with information about constructing well-designed content. The quality of the dashboard layout and the utilization of technical tips and tricks do not matter if the content is subpar. In other words we should consider the worksheets displayed on dashboards and ensure that those worksheets are well-designed. Therefore, our discussion will begin with a consideration of visualization design principles. Regarding these principles, it's tempting to declare a set of rules such as: To plot change over time, use a line graph To show breakdowns of the whole, use a treemap To compare discrete elements, use a bar chart To visualize correlation, use a scatter plot But of course even a cursory review of the preceding list brings to mind many variations and alternatives! Thus, we will consider various rules while always keeping in mind that rules (at least rules such as these) are meant to be broken. Formatting rules The following formatting rules encompass fonts, lines, and bands. Fonts are, of course, an obvious formatting consideration. Lines and bands, however, may not be something you typically think of when formatting, especially when considering formatting from the perspective of Microsoft Word. But if we broaden formatting considerations to think of Adobe Illustrator, InDesign, and other graphic design tools, then lines and bands are certainly considered. This illustrates that data visualization is closely related to graphic design and that formatting considers much more than just textual layout. Rule – keep the font choice simple Typically using one or two fonts on a dashboard is advisable. More fonts can create a confusing environment and interfere with readability. Fonts chosen for titles should be thick and solid while the body fonts should be easy to read. As of Tableau 10.0 choosing appropriate fonts is simple because of the new Tableau Font Family. Go to Format | Font to display the Format Font window to see and choose these new fonts: Assuming your dashboard is primarily intended for the screen, sans serif fonts are best. On the rare occasions a dashboard is primarily intended for print, you may consider serif fonts; particularly if the print resolution is high. Rule – Trend line > Fever line > Reference line > Drop line > Zero line > Grid line The preceding pseudo formula is intended to communicate line visibility. For example, trend line visibility should be greater than fever line visibility. Visibility is usually enhanced by increasing line thickness but may be enhanced via color saturation or by choosing a dotted or dashed line over a solid line. The trend line, if present, is usually the most visible line on the graph. Trend lines are displayed via the Analytics pane and can be adjusted via Format à Lines. The fever line (for example, the line used on a time-series chart) should not be so heavy as to obscure twists and turns in the data. Although a fever line may be displayed as dotted or dashed by utilizing the Pages shelf, this is usually not advisable because it may obscure visibility. The thickness of a fever line can be adjusted by clicking on the Size shelf in the Marks View card. Reference lines are usually less prevalent than either fever or trend lines and can be formatted by going to Format | Reference lines. Drop lines are not frequently used. To deploy drop lines, right-click in a blank portion of the view and go to Drop lines | Show drop lines. Next, click on a point in the view to display a drop line. To format droplines, go to Format | Droplines. Drop lines are relevant only if at least one axis is utilized in the visualization. Zero lines (sometimes referred to as base lines) display only if zero or negative values are included in the view or positive numerical values are relatively close to zero. Format zero lines by going to Format | Lines. Grid lines should be the most muted lines on the view and may be dispensed with altogether. Format grid lines by going to Format | Lines. Rule – band in groups of three to five Visualizations comprised of a tall table of text or horizontal bars should segment dimension members in groups of three to five. Exercise – banding Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook. Navigate to the worksheet titled Banding. Select the Superstore data source and place Product Name on the Rows shelf. Double-click on Discount, Profit, Quantity, and Sales. Navigate to Format | Shading and set Band Size under Row Banding so that three to five lines of text are encompassed by each band. Be sure to set an appropriate color for both Pane and Header: Note that after completing the preceding five steps, Tableau defaulted to banding every other row. This default formatting is fine for a short table but is quite busy for a tall table. The band in groups of three to five rule is influenced by Dona W. Wong, who, in her book The Wall Street Journal Guide to Information Graphics, recommends separating long tables or bar charts with thin rules to separate the bars in groups of three to five to help the readers read across. Color rules It seems slightly ironic to discuss color rules in a black-and-white publication such as Mastering Tableau. Nonetheless, even in a monochromatic setting, a discussion of color is relevant. For example, exclusive use of black text communicates differently than using variations of gray. The following survey of color rules should be helpful to ensure that you use colors effectively in a variety of settings. Rule – keep colors simple and limited Stick to the basic hues and provide only a few (perhaps three to five) hue variations. Alberto Cairo, in his book The Functional Art: An Introduction to Information Graphics and Visualization, provides insights into why this is important. The limited capacity of our visual working memory helps explain why it's not advisable to use more than four or five colors or pictograms to identify different phenomena on maps and charts. Rule – respect the psychological implication of colors In Western society, there is a color vocabulary so pervasive, it's second nature. Exit signs marking stairwell locations are red. Traffic cones are orange. Baby boys are traditionally dressed in blue while baby girls wear pink. Similarly, in Tableau reds and oranges should usually be associated with negative performance while blues and greens should be associated with positive performance. Using colors counterintuitively can cause confusion. Rule – be colorblind-friendly Colorblindness is usually manifested as an inability to distinguish red and green or blue and yellow. Red/green and blue/yellow are on opposite sides of the color wheel. Consequently, the challenges these color combinations present for colorblind individuals can be easily recreated with image editing software such as Photoshop. If you are not colorblind, convert an image with these color combinations to grayscale and observe. The challenge presented to the 8.0% of the males and 0.5% of the females who are color blind becomes immediately obvious! Rule – use pure colors sparingly The resulting colors from the following exercise should be a very vibrant red, green, and blue. Depending on the monitor, you may even find it difficult to stare directly at the colors. These are known as pure colors and should be used sparingly; perhaps only to highlight particularly important items. Exercise – using pure colors Open the workbook and navigate to the worksheet entitled Pure Colors. Select the Superstore data source and place Category on both the Rows shelf and the Color shelf. Set the Fit to Entire View. Click on the Color shelf and choose Edit Colors…. In the Edit Colors dialog box, double-click on the color icons to the left of each dimension member; that is, Furniture, Office Supplies, and Technology: Within the resulting dialog box, set furniture to an HTML value of #0000ff, Office Supplies to #ff0000, and Technology to #00ff00. Rule – color variations over symbol variation Deciphering different symbols takes more mental energy for the end user than distinguishing color. Therefore color variation should be used over symbol variation. This rule can actually be observed in Tableau defaults. Create a scatter plot and place a dimension with many members on the Color shelf and Shape shelf respectively. Note that by default, the view will display 20 unique colors but only 10 unique shapes. Older versions of Tableau (such as Tableau 9.0) display warnings that include text such as “…the recommended maximum for this shelf is 10”: Visualization type rules We won't spend time here to delve into a lengthy list of visualization type rules. However, it does seem appropriate to review at least a couple of rules. In the following exercise, we will consider keeping shapes simple and effectively using pie charts. Rule – keep shapes simple Too many shape details impede comprehension. This is because shape details draw the user's focus away from the data. Consider the following exercise on using two different shopping cart images. Exercise – shapes Open the workbook associated and navigate to the worksheet entitled Simple Shopping Cart. Note that the visualization is a scatterplot showing the top 10 selling Sub-Categories in terms of total sales and profits. On your computer, navigate to the Shapes directory located in the My Tableau Repository. On my computer, the path is C:UsersDavid BaldwinDocumentsMy Tableau RepositoryShapes. Within the Shapes directory, create a folder named My Shapes. Reference the link included in the comment section of the worksheet to download the assets. In the downloaded material, find the images titled Shopping_Cart and Shopping_Cart_3D and copy those images into the My Shapes directory created previously. Within Tableau, access the Simple Shopping Cart worksheet. Click on the Shape shelf and then select More Shapes. Within the Edit Shape dialog box, click on the Reload Shapes button. Select the My Shapes palette and set the shape to the simple shopping cart. After closing the dialog box, click on the Size shelf and adjust as desired. Also adjust other aspects of the visualization as desired. Navigate to the 3D Shopping Cart worksheet and then repeat steps 8 to 11. Instead of using the simple shopping cart, use the 3D shopping cart: Compare the two visualizations. Which version of the shopping cart is more attractive? Likely the cart with the 3D look was your choice. Why not choose the more attractive image? Making visualizations attractive is only of secondary concern. The primary goal is to display the data as clearly and efficiently as possible. A simple shape is grasped more quickly and intuitively than a complex shape. Besides, the cuteness of the 3D image will quickly wear off. Rule – use pie charts sparingly Edward Tufte makes an acrid (and somewhat humorous) comment against the use of pie charts in his book The Visual Display of Quantitative Information. A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them. Given their low density and failure to order numbers along a visual dimension, pie charts should never be used. The present sentiment in data visualization circles is largely sympathetic to Tufte's criticism. There may, however, be some exceptions; that is, some circumstances where a pie chart is optimal. Consider the following visualization: Which of the four visualizations best demonstrates that A accounts for 25% of the whole? Clearly it is the pie chart! Therefore, perhaps it is fairer to refer to pie charts as limited and to use them sparingly as opposed to considering them inherently evil. Compromises In this section, we will transition from more or less strict rules to compromises. Often, building visualizations is a balancing act. It's common to encounter contradictory directions from books, blogs, consultants, and within organizations. One person may insist on utilizing every pixel of space while another urges simplicity and whitespace. One counsels a guided approach while another recommends building wide open dashboards that allow end users to discover their own path. Avant gardes may crave esoteric visualizations while those of a more conservative bent prefer to stay with the conventional. We now explore a few of the more common competing requests and suggests compromises. Make the dashboard simple versus make the dashboard robust Recently a colleague showed me a complex dashboard he had just completed. Although he was pleased that he had managed to get it working well, he felt the need to apologize by saying, “I know it's dense and complex but it's what the client wanted.” Occam's Razor encourages the simplest possible solution for any problem. For my colleague's dashboard, the simplest solution was rather complex. This is OK! Complexity in Tableau dashboarding need not be shunned. But a clear understanding of some basic guidelines can help the author intelligently determine how to compromise between demands for simplicity and demands for robustness. More frequent data updates necessitate simpler design. Some Tableau dashboards may be near-real-time. Third-party technology may be utilized to force a browser displaying a dashboard via Tableau Server to refresh every few minutes to ensure the absolute latest data displays. In such cases, the design should be quite simple. The end user must be able to see at a glance all pertinent data and should not use that dashboard for extensive analysis. Conversely, a dashboard that is refreshed monthly can support high complexity and thus may be used for deep exploration. Greater end user expertise supports greater dashboard complexity. Know thy users. If they want easy, at-a-glance visualizations, keep the dashboards simple. If they like deep dives, design accordingly. Smaller audiences require more precise design. If only a few people monitor a given dashboard, it may require a highly customized approach. In such cases, specifications may be detailed, complex, and difficult to execute and maintain because the small user base has expectations that may not be natively easy to produce in Tableau. Screen resolution and visualization complexity are proportional. Users with low-resolution devices will need to interact fairly simply with a dashboard. Thus the design of such a dashboard will likely be correspondingly uncomplicated. Conversely, high-resolution devices support greater complexity. Greater distance from the screen requires larger dashboard elements. If the dashboard is designed for conference room viewing, the elements on the dashboard may need to be fairly large to meet the viewing needs of those far from the screen. Thus the dashboard will likely be relatively simple. Conversely, a dashboard to be viewed primarily on end users desktops can be more complex. Although these points are all about simple versus complex, do not equate simple with easy. A simple and elegantly designed dashboard can be more difficult to create than a complex dashboard. In the words of Steve Jobs: Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it's worth it in the end because once you get there, you can move mountains. Present dense information versus present sparse information Normally, a line graph should have a maximum of four to five lines. However, there are times when you may wish to display many lines. A compromise can be achieved by presenting many lines and empowering the end user to highlight as desired. The following line graph displays the percentage of Internet usage by country from 2000 to 2012. Those countries with the largest increases have been highlighted. Assuming that Highlight Selected Items has been activated within the Color legend, the end user can select items (countries in this case) from the legend to highlight as desired. Or, even better, a worksheet can be created listing all countries and used in conjunction with a highlight action on a dashboard to focus attention on selected items on the line graph: Tell a story versus allow a story to be discovered Albert Cairo, in his excellent book The Functional Art: An Introduction to Information Graphics and Visualization, includes a section where he interviews prominent data visualization and information graphics professionals. Two of these interviews are remarkable for their opposing views. I… feel that many visualization designers try to transform the user into an editor.  They create these amazing interactive tools with tons of bubbles, lines, bars, filters, and scrubber bars, and expect readers to figure the story out by themselves, and draw conclusions from the data. That's not an approach to information graphics I like. – Jim Grimwade The most fascinating thing about the rise of data visualization is exactly that anyone can explore all those large data sets without anyone telling us what the key insight is. – Moritz Stefaner Fortunately, the compromise position can be found in the Jim Grimwade interview: [The New York Times presents] complex sets of data, and they let you go really deep into the figures and their connections. But beforehand, they give you some context, some pointers as to what you can do with those data. If you don't do this… you will end up with a visualization that may look really beautiful and intricate, but that will leave readers wondering, What has this thing really told me? What is this useful for? – Jim Grimwade Although the case scenarios considered in the preceding quotes are likely quite different from the Tableau work you are involved in, the underlying principles remain the same. You can choose to tell a story or build a platform that allows the discovery of numerous stories. Your choice will differ depending on the given dataset and audience. If you choose to create a platform for story discovery, be sure to take the New York Times approach suggested by Grimwade. Provide hints, pointers, and good documentation to lead your end user to successfully interact with the story you wish to tell or successfully discover their own story. Document, Document, Document! But don't use any space! Immediately above we considered the suggestion Provide hints, pointers, and good documentation… but there's an issue. These things take space. Dashboard space is precious. Often Tableau authors are asked to squeeze more and more stuff on a dashboard and are hence looking for ways to conserve space. Here are some suggestions for maximizing documentation on a dashboard while minimally impacting screen real estate. Craft titles for clear communication Titles are expected. Not just a title for a dashboard and worksheets on the dashboard, but also titles for legends, filters and other objects. These titles can be used for effective and efficient documentation. For instance a filter should not just read Market. Instead it should say something like Select a Market. Notice the imperative statement. The user is being told to do something and this is a helpful hint. Adding a couple of words to a title will usually not impact dashboard space. Use subtitles to relay instructions A subtitle will take some extra space but it does not have to be much. A small, italicized font immediately underneath a title is an obvious place a user will look at for guidance. Consider an example: red represents loss. This short sentence could be used as a subtitle that may eliminate the need for a legend and thus actually save space. Use intuitive icons Consider a use case of navigating from one dashboard to another. Of course you could associate an action with some hyperlinked text stating Click here to navigate to another dashboard. But this seems quite unnecessary when an action can be associated with a small, innocuous arrow, such as is natively used in PowerPoint, to communicate the same thing. Store more extensive documentation in a tooltip associated with a help icon. A small question mark in the top-right corner of an application is common. This clearly communicates where to go if additional help is required. As shown in the following exercise, it's easy to create a similar feature on a Tableau dashboard. Summary Hence from this article we studied to create some effective dashboards that are very beneficial in corporate world as a statistical tool to calculate average growth in terms of revenue. Resources for Article: Further resources on this subject: Say Hi to Tableau [article] Tableau Data Extract Best Practices [article] Getting Started with Tableau Public [article]
Read more
  • 0
  • 0
  • 3264

article-image-elastic-stack-overview
Packt
10 Jan 2017
9 min read
Save for later

Elastic Stack Overview

Packt
10 Jan 2017
9 min read
In this article by Ravi Kumar Gupta and Yuvraj Gupta, from the book, Mastering Elastic Stack, we will have an overview of Elastic Stack, it's very easy to read a log file of a few MBs or hundreds, so is it to keep data of this size in databases or files and still get sense out of it. But then a day comes when this data takes terabytes and petabytes, and even notepad++ would refuse to open a data file of a few hundred MBs. Then we start to find something for huge log management, or something that can index the data properly and make sense out of it. If you Google this, you would stumble upon ELK Stack. Elasticsearch manages your data, Logstash reads the data from different sources, and Kibana makes a fine visualization of it. Recently, ELK Stack has evolved as Elastic Stack. We will get to know about it in this article. The following are the points that will be covered in this article: Introduction to ELK Stack The birth of Elastic Stack Who uses the Stack (For more resources related to this topic, see here.) Introduction to ELK Stack It all began with Shay Banon, who started it as an open source project, Elasticsearch, successor of Compass, which gained popularity to become one of the top open source database engines. Later, based on the distributed model of working, Kibana was introduced to visualize the data present in Elasticsearch. Earlier, to put data into Elasticsearch we had rivers, which provided us with a specific input via which we inserted data into Elasticsearch. However, with growing popularity this setup required a tool via which we can insert data into Elasticsearch and have flexibility to perform various transformations on data to make unstructured data structured to have full control on how to process the data. Based on this premise, Logstash was born, which was then incorporated into the Stack, and together these three tools, Elasticsearch, Logstash, and Kibana were named ELK Stack. The following diagram is a simple data pipeline using ELK Stack: As we can see from the preceding figure, data is read using Logstash and indexed to Elasticsearch. Later we can use Kibana to read the indices from Elasticsearch and visualize it using charts and lists. Let's understand these components separately and the role they play in the making of the Stack. Logstash As we got to know that rivers were used initially to put data into Elasticsearch before ELK Stack. For ELK Stack, Logstash is the entry point for all types of data. Logstash has so many plugins to read data from a number of sources and so many output plugins to submit data to a variety of destinations and one of those is the Elasticsearch plugin, which helps to send data to Elasticsearch. After Logstash became popular, eventually rivers got deprecated as they made the cluster unstable and also performance issues were observed. Logstash does not ship data from one end to another; it helps us with collecting raw data and modifying/filtering it to convert it to something meaningful, formatted, and organized. The updated data is then sent to Elasticsearch. If there is no plugin available to support reading data from a specific source, or writing the data to a location, or modifying it in your way, Logstash is flexible enough to allow you to write your own plugins. Simply put, Logstash is open source, highly flexible, rich with plugins, can read your data from your choice of location, normalizes it as per your defined configurations, and sends it to a particular destination as per the requirements. Elasticsearch All of the data read by Logstash is sent to Elasticsearch for indexing. There is a lot more than just indexing. Elasticsearch is not only used to index data, but it is a full-text search engine, highly scalable, distributed, and offers many more things. Elasticsearch manages and maintains your data in the form of indices, offers you to query, access, and aggregate the data using its APIs. Elasticsearch is based on Lucene, thus providing you all of the features that Lucene does. Kibana Kibana uses Elasticsearch APIs to read/query data from Elasticsearch indices to visualize and analyze in the form of charts, graphs and tables. Kibana is in the form of a web application, providing you a highly configurable user interface that lets you query the data, create a number of charts to visualize, and make actual sense out of the data stored. After a robust ELK Stack, as time passed, a few important and complex demands took place, such as authentication, security, notifications, and so on. This demand led for few other tools such as Watcher (providing alerting and notification based on changes in data), Shield (authentication and authorization for securing clusters), Marvel (monitoring statistics of the cluster), ES-Hadoop, Curator, and Graph as requirement arose. The birth of Elastic Stack All the jobs of reading data were done using Logstash, but that's resource consuming. Since Logstash runs on JVM, it consumes a good amount of memory. The community realized the need of improvement and to make the pipelining process resource friendly and lightweight. Back in 2015, Packetbeat was born, a project which was an effort to make a network packet analyzer that could read from different protocols, parse the data, and ship to Elasticsearch. Being lightweight in nature did the trick and a new concept of Beats was formed. Beats are written in Go programming language. The project evolved a lot, and now ELK stack was no more than just Elasticsearch, Logstash, and Kibana, but Beats also became a significant component. The pipeline now looked as follows: Beat A Beat reads data, parses it, and can ship to either Elasticsearch or Logstash. The difference is that they are lightweight, serve a specific purpose, and are installed as agents. There are few beats available such as Topbeat, Filebeat, Packetbeat, and so on, which are supported and provided by the Elastic.co and a good number of Beats already written by the community. If you have a specific requirement, you can write your own Beat using the libbeat library. In simple words, Beats can be treated as very light weight agents to ship data to either Logstash or Elasticsearch and offer you an infrastructure using the libbeat library to create your own Beats. Together Elasticsearch, Logstash, Kibana, and Beats become Elastic Stack, formally known as ELK Stack. Elastic Stack did not just add Beats to its team, but they will be using the same version always. The starting version of the Elastic Stack will be 5.0.0 and the same version will apply to all the components. This version and release method is not only for Elastic Stack, but for other tools of the Elastic family as well. Due to so many tools, there was a problem of unification, wherein each tool had their own version and every version was not compatible with each other, hence leading to a problem. To solve this, now all of the tools will be built, tested, and released together. All of these components play a significant role in creating a pipeline. While Beats and Logstash are used to collect the data, parse it, and ship it, Elasticsearch creates indices, which is finally used by Kibana to make visualizations. While Elastic Stack helps with a pipeline, other tools add security, notifications, monitoring, and other such capabilities to the setup. Who uses Elastic Stack? In the past few years, implementations of Elastic Stack have been increasing very rapidly. In this section, we will consider a few case studies to understand how Elastic Stack has helped this development. Salesforce Salesforce developed a new plugin named ELF (Event Log Files) to collect Salesforce logged data to enable auditing of user activities. The purpose was to analyze the data to understand user behavior and trends in Salesforce. The plugin is available on GitHub at https://github.com/developerforce/elf_elk_docker. This plugin simplifies the Stack configuration and allows us to download ELF to get indexed and finally sensible data that can be visualized using Kibana. This implementation utilizes Elasticsearch, Logstash, and Kibana. CERN There is not just one use case that Elastic Stack helped CERN (European Organization for Nuclear Research), but five. At CERN, Elastic Stack is used for the following: Messaging Data monitoring Cloud benchmarking Infrastructure monitoring Job monitoring Multiple Kibana dashboards are used by CERN for a number of visualizations. Green Man Gaming This is an online gaming platform where game providers publish their games. The website wanted to make a difference by proving better gameplay. They started using Elastic Stack to do log analysis, search, and analysis of gameplay data. They began with setting up Kibana dashboards to gain insights about the counts of gamers by the country and currency used by gamers. That helped them to understand and streamline the support and help in order to provide an improved response. Apart from these case studies, Elastic Stack is used by a number of other companies to gain insights of the data they own. Sometimes, not all of the components are used, that is, not all of the times a Beat would be used and Logstash would be configured. Sometimes, only an Elasticsearch and Kibana combination is used. If we look at the users within the organization, all of the titles who are expected to do big data analysis, business intelligence, data visualizations, log analysis, and so on, can utilize Elastic Stack for their technical forte. A few of these titles are data scientists, DevOps, and so on. Stack competitors Well, it would be wrong to call for Elastic Stack competitors because Elastic Stack has been emerged as a strong competitor to many other tools in the market in recent years and is growing rapidly. A few of these are: Open source: Graylog: Visit https://www.graylog.org/ for more information InfluxDB: Visit https://influxdata.com/ for more information Others: Logscape: Visit http://logscape.com/ for more information Logscene: Visit http://sematext.com/logsene/ for more information Splunk: Visit http://www.splunk.com/ for more information Sumo Logic: Visit https://www.sumologic.com/ for more information Kibana competitors: Grafana: Visit http://grafana.org/ for more information Graphite: Visit https://graphiteapp.org/ for more information Elasticsearch competitors: Lucene/Solr: Visit http://lucene.apache.org/solr/ or https://lucene.apache.org/ for more information Sphinx: Visit http://sphinxsearch.com/ for more information Most of these compare with respect to log management, while Elastic Stack is much more than that. It offers you the ability to analyze any type of data and not just logs. Resources for Article: Further resources on this subject: AIO setup of OpenStack – preparing the infrastructure code environment [article] Our App and Tool Stack [article] Using the OpenStack Dashboard [article]
Read more
  • 0
  • 0
  • 3325

article-image-deep-learning-and-regression-analysis
Packt
09 Jan 2017
6 min read
Save for later

Deep learning and regression analysis

Packt
09 Jan 2017
6 min read
In this article by Richard M. Reese and Jennifer L. Reese, authors of the book, Java for Data Science, We will discuss neural networks can be used to perform regression analysis. However, other techniques may offer a more effective solution. With regression analysis, we want to predict a result based on several input variables (For more resources related to this topic, see here.) We can perform regression analysis using an output layer that consists of a single neuron that sums the weighted input plus bias of the previous hidden layer. Thus, the result is a single value representing the regression. Preparing the data We will use a car evaluation database to demonstrate how to predict the acceptability of a car based on a series of attributes. The file containing the data we will be using can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data. It consists of car data such as price, number of passengers, and safety information, and an assessment of its overall quality. It is this latter element that we will try to predict. The comma-delimited values in each attribute are shown next, along with substitutions. The substitutions are needed because the model expects numeric data: Attribute Original value Substituted value Buying price vhigh, high, med, low 3,2,1,0 Maintenance price vhigh, high, med, low 3,2,1,0 Number of doors 2, 3, 4, 5-more 2,3,4,5 Seating 2, 4, more 2,4,5 Cargo space small, med, big 0,1,2 Safety low, med, high 0,1,2 There are 1,728 instances in the file. The cars are marked with four classes: Class Number of instances Percentage of instances Original value Substituted value Unacceptable 1210 70.023% unacc 0 Acceptable 384 22.222% acc 1 Good 69 3.99% good 2 Very good 65 3.76% v-good 3 Setting up the class We start with the definition of a CarRegressionExample class, as shown next: public class CarRegressionExample { public CarRegressionExample() { try { ... } catch (IOException | InterruptedException ex) { // Handle exceptions } } public static void main(String[] args) { new CarRegressionExample(); } } Reading and preparing the data The first task is to read in the data. We will use the CSVRecordReader class to get the data: RecordReader recordReader = new CSVRecordReader(0, ","); recordReader.initialize(new FileSplit(new File("car.txt"))); DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 1728, 6, 4); With this dataset, we will split the data into two sets. Sixty five percent of the data is used for training and the rest for testing: DataSet dataset = iterator.next(); dataset.shuffle(); SplitTestAndTrain testAndTrain = dataset.splitTestAndTrain(0.65); DataSet trainingData = testAndTrain.getTrain(); DataSet testData = testAndTrain.getTest(); The data now needs to be normalized: DataNormalization normalizer = new NormalizerStandardize(); normalizer.fit(trainingData); normalizer.transform(trainingData); normalizer.transform(testData); We are now ready to build the model. Building the model A MultiLayerConfiguration instance is created using a series of NeuralNetConfiguration.Builder methods. The following is the dice used. We will discuss the individual methods following the code. Note that this configuration uses two layers. The last layer uses the softmax activation function, which is used for regression analysis: MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder() .iterations(1000) .activation("relu") .weightInit(WeightInit.XAVIER) .learningRate(0.4) .list() .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) .backprop(true).pretrain(false) .build(); Two layers are created. The first is the input layer. The DenseLayer.Builder class is used to create this layer. The DenseLayer class is a feed-forward and fully connected layer. The created layer uses the six car attributes as input. The output consists of three neurons that are fed into the output layer and is duplicated here for your convenience: .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) The second layer is the output layer created with the OutputLayer.Builder class. It uses a loss function as the argument of its constructor. The softmax activation function is used since we are performing regression as shown here: .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) Next, a MultiLayerNetwork instance is created using the configuration. The model is initialized, its listeners are set, and then the fit method is invoked to perform the actual training. The ScoreIterationListener instance will display information as the model trains which we will see shortly in the output of this example. Its constructor argument specifies the frequency that information is displayed: MultiLayerNetwork model = new MultiLayerNetwork(conf); model.init(); model.setListeners(new ScoreIterationListener(100)); model.fit(trainingData); We are now ready to evaluate the model. Evaluating the model In the next sequence of code, we evaluate the model against the training dataset. An Evaluation instance is created using an argument specifying that there are four classes. The test data is fed into the model using the output method. The eval method takes the output of the model and compares it against the test data classes to generate statistics. The getLabels method returns the expected values: Evaluation evaluation = new Evaluation(4); INDArray output = model.output(testData.getFeatureMatrix()); evaluation.eval(testData.getLabels(), output); out.println(evaluation.stats()); The output of the training follows, which is produced by the ScoreIterationListener class. However, the values you get may differ due to how the data is selected and analyzed. Notice that the score improves with the iterations but levels out after about 500 iterations: 12:43:35.685 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 0 is 1.443480901811554 12:43:36.094 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 100 is 0.3259061845624861 12:43:36.390 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 200 is 0.2630572026049783 12:43:36.676 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 300 is 0.24061281470878784 12:43:36.977 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 400 is 0.22955121170274934 12:43:37.292 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 500 is 0.22249920540161677 12:43:37.575 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 600 is 0.2169898450109222 12:43:37.872 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 700 is 0.21271599814600958 12:43:38.161 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 800 is 0.2075677126088741 12:43:38.451 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 900 is 0.20047317735870715 This is followed by the results of the stats method as shown next. The first part reports on how examples are classified and the second part displays various statistics: Examples labeled as 0 classified by model as 0: 397 times Examples labeled as 0 classified by model as 1: 10 times Examples labeled as 0 classified by model as 2: 1 times Examples labeled as 1 classified by model as 0: 8 times Examples labeled as 1 classified by model as 1: 113 times Examples labeled as 1 classified by model as 2: 1 times Examples labeled as 1 classified by model as 3: 1 times Examples labeled as 2 classified by model as 1: 7 times Examples labeled as 2 classified by model as 2: 21 times Examples labeled as 2 classified by model as 3: 14 times Examples labeled as 3 classified by model as 1: 2 times Examples labeled as 3 classified by model as 3: 30 times ==========================Scores======================================== Accuracy: 0.9273 Precision: 0.854 Recall: 0.8323 F1 Score: 0.843 ======================================================================== The regression model does a reasonable job with this dataset. Summary In this article, we examined deep learning and regression analysis. We showed how to prepare the data and class, build the model, and evaluate the model. We used sample data and displayed output statistics to demonstrate the relative effectiveness of our model. Resources for Article: Further resources on this subject: KnockoutJS Templates [article] The Heart of It All [article] Bringing DevOps to Network Operations [article]
Read more
  • 0
  • 0
  • 5104
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at ₹800/month. Cancel anytime
article-image-exploring-structure-motion-using-opencv
Packt
09 Jan 2017
20 min read
Save for later

Exploring Structure from Motion Using OpenCV

Packt
09 Jan 2017
20 min read
In this article by Roy Shilkrot, coauthor of the book Mastering OpenCV 3, we will discuss the notion of Structure from Motion (SfM), or better put, extracting geometric structures from images taken with a camera under motion, using OpenCV's API to help us. First, let's constrain the otherwise very broad approach to SfM using a single camera, usually called a monocular approach, and a discrete and sparse set of frames rather than a continuous video stream. These two constrains will greatly simplify the system we will sketch out in the coming pages and help us understand the fundamentals of any SfM method. In this article, we will cover the following: Structure from Motion concepts Estimating the camera motion from a pair of images (For more resources related to this topic, see here.) Throughout the article, we assume the use of a calibrated camera—one that was calibrated beforehand. Calibration is a ubiquitous operation in computer vision, fully supported in OpenCV using command-line tools. We, therefore, assume the existence of the camera's intrinsic parameters embodied in the K matrix and the distortion coefficients vector—the outputs from the calibration process. To make things clear in terms of language, from this point on, we will refer to a camera as a single view of the scene rather than to the optics and hardware taking the image. A camera has a position in space and a direction of view. Between two cameras, there is a translation element (movement through space) and a rotation of the direction of view. We will also unify the terms for the point in the scene, world, real, or 3D to be the same thing, a point that exists in our real world. The same goes for points in the image or 2D, which are points in the image coordinates, of some real 3D point that was projected on the camera sensor at that location and time. Structure from Motion concepts The first discrimination we should make is the difference between stereo (or indeed any multiview), 3D reconstruction using calibrated rigs, and SfM. A rig of two or more cameras assumes we already know what the "motion" between the cameras is, while in SfM, we don't know what this motion is and we wish to find it. Calibrated rigs, from a simplistic point of view, allow a much more accurate reconstruction of 3D geometry because there is no error in estimating the distance and rotation between the cameras—it is already known. The first step in implementing an SfM system is finding the motion between the cameras. OpenCV may help us in a number of ways to obtain this motion, specifically using the findFundamentalMat and findEssentialMat functions. Let's think for one moment of the goal behind choosing an SfM algorithm. In most cases, we wish to obtain the geometry of the scene, for example, where objects are in relation to the camera and what their form is. Having found the motion between the cameras picturing the same scene, from a reasonably similar point of view, we would now like to reconstruct the geometry. In computer vision jargon, this is known as triangulation, and there are plenty of ways to go about it. It may be done by way of ray intersection, where we construct two rays: one from each camera's center of projection and a point on each of the image planes. The intersection of these rays in space will, ideally, intersect at one 3D point in the real world that was imaged in each camera, as shown in the following diagram: In reality, ray intersection is highly unreliable. This is because the rays usually do not intersect, making us fall back to using the middle point on the shortest segment connecting the two rays. OpenCV contains a simple API for a more accurate form of triangulation, the triangulatePoints function, so this part we do not need to code on our own. After you have learned how to recover 3D geometry from two views, we will see how you can incorporate more views of the same scene to get an even richer reconstruction. At that point, most SfM methods try to optimize the bundle of estimated positions of our cameras and 3D points by means of Bundle Adjustment. OpenCV contains means for Bundle Adjustment in its new Image Stitching Toolbox. However, the beauty of working with OpenCV and C++ is the abundance of external tools that can be easily integrated into the pipeline. We will, therefore, see how to integrate an external bundle adjuster, the Ceres non-linear optimization package. Now that we have sketched an outline of our approach to SfM using OpenCV, we will see how each element can be implemented. Estimating the camera motion from a pair of images Before we set out to actually find the motion between two cameras, let's examine the inputs and the tools we have at hand to perform this operation. First, we have two images of the same scene from (hopefully not extremely) different positions in space. This is a powerful asset, and we will make sure that we use it. As for tools, we should take a look at mathematical objects that impose constraints over our images, cameras, and the scene. Two very useful mathematical objects are the fundamental matrix (denoted by F) and the essential matrix (denoted by E). They are mostly similar, except that the essential matrix is assuming usage of calibrated cameras; this is the case for us, so we will choose it. OpenCV allows us to find the fundamental matrix via the findFundamentalMat function and the essential matrix via the findEssentialMatrix function. Finding the essential matrix can be done as follows: Mat E = findEssentialMat(leftPoints, rightPoints, focal, pp); This function makes use of matching points in the "left" image, leftPoints, and "right" image, rightPoints, which we will discuss shortly, as well as two additional pieces of information from the camera's calibration: the focal length, focal, and principal point, pp. The essential matrix, E, is a 3 x 3 matrix, which imposes the following constraint on a point in one image and a point in the other image: x'K­TEKx = 0, where x is a point in the first image one, x' is the corresponding point in the second image, and K is the calibration matrix. This is extremely useful, as we are about to see. Another important fact we use is that the essential matrix is all we need in order to recover the two cameras' positions from our images, although only up to an arbitrary unit of scale. So, if we obtain the essential matrix, we know where each camera is positioned in space and where it is looking. We can easily calculate the matrix if we have enough of those constraint equations, simply because each equation can be used to solve for a small part of the matrix. In fact, OpenCV internally calculates it using just five point-pairs, but through the Random Sample Consensus algorithm (RANSAC), many more pairs can be used and make for a more robust solution. Point matching using rich feature descriptors Now we will make use of our constraint equations to calculate the essential matrix. To get our constraints, remember that for each point in image A, we must find a corresponding point in image B. We can achieve such a matching using OpenCV's extensive 2D feature-matching framework, which has greatly matured in the past few years. Feature extraction and descriptor matching is an essential process in computer vision and is used in many methods to perform all sorts of operations, for example, detecting the position and orientation of an object in the image or searching a big database of images for similar images through a given query. In essence, feature extraction means selecting points in the image that would make for good features and computing a descriptor for them. A descriptor is a vector of numbers that describes the surrounding environment around a feature point in an image. Different methods have different lengths and data types for their descriptor vectors. Descriptor Matching is the process of finding a corresponding feature from one set in another using its descriptor. OpenCV provides very easy and powerful methods to support feature extraction and matching. Let's examine a very simple feature extraction and matching scheme: vector<KeyPoint> keypts1, keypts2; Mat desc1, desc2; // detect keypoints and extract ORB descriptors Ptr<Feature2D> orb = ORB::create(2000); orb->detectAndCompute(img1, noArray(), keypts1, desc1); orb->detectAndCompute(img2, noArray(), keypts2, desc2); // matching descriptors Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming"); vector<DMatch> matches; matcher->match(desc1, desc2, matches); You may have already seen similar OpenCV code, but let's review it quickly. Our goal is to obtain three elements: feature points for two images, descriptors for them, and a matching between the two sets of features. OpenCV provides a range of feature detectors, descriptor extractors, and matchers. In this simple example, we use the ORB class to get both the 2D location of Oriented BRIEF (ORB) (where Binary Robust Independent Elementary Features (BRIEF)) feature points and their respective descriptors. We use a brute-force binary matcher to get the matching, which is the most straightforward way to match two feature sets by comparing each feature in the first set to each feature in the second set (hence the phrasing "brute-force"). In the following image, we will see a matching of feature points on two images from the Fountain-P11 sequence found at http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html: Practically, raw matching like we just performed is good only up to a certain level, and many matches are probably erroneous. For that reason, most SfM methods perform some form of filtering on the matches to ensure correctness and reduce errors. One form of filtering, which is built into OpenCV's brute-force matcher, is cross-check filtering. That is, a match is considered true if a feature of the first image matches a feature of the second image, and the reverse check also matches the feature of the second image with the feature of the first image. Another common filtering mechanism, used in the provided code, is to filter based on the fact that the two images are of the same scene and have a certain stereo-view relationship between them. In practice, the filter tries to robustly calculate the fundamental or essential matrix and retain those feature pairs that correspond to this calculation with small errors. An alternative to using rich features, such as ORB, is to use optical flow. The following information box provides a short overview of optical flow. It is possible to use optical flow instead of descriptor matching to find the required point matching between two images, while the rest of the SfM pipeline remains the same. OpenCV recently extended its API for getting the flow field from two images, and now it is faster and more powerful. Optical flow It is the process of matching selected points from one image to another, assuming that both images are part of a sequence and relatively close to one another. Most optical flow methods compare a small region, known as the search window or patch, around each point from image A to the same area in image B. Following a very common rule in computer vision, called the brightness constancy constraint (and other names), the small patches of the image will not change drastically from one image to other, and therefore the magnitude of their subtraction should be close to zero. In addition to matching patches, newer methods of optical flow use a number of additional methods to get better results. One is using image pyramids, which are smaller and smaller resized versions of the image, which allow for working from coarse to-fine—a very well-used trick in computer vision. Another method is to define global constraints on the flow field, assuming that the points close to each other move together in the same direction. Finding camera matrices Now that we have obtained matches between keypoints, we can calculate the essential matrix. However, we must first align our matching points into two arrays, where an index in one array corresponds to the same index in the other. This is required by the findEssentialMat function as we've seen in the Estimating Camera Motion section. We would also need to convert the KeyPoint structure to a Point2f structure. We must pay special attention to the queryIdx and trainIdx member variables of DMatch, the OpenCV struct that holds a match between two keypoints, as they must align with the way we used the DescriptorMatcher::match() function. The following code section shows how to align a matching into two corresponding sets of 2D points, and how these can be used to find the essential matrix: vector<KeyPoint> leftKpts, rightKpts; // ... obtain keypoints using a feature extractor vector<DMatch> matches; // ... obtain matches using a descriptor matcher //align left and right point sets vector<Point2f> leftPts, rightPts; for (size_t i = 0; i < matches.size(); i++) { // queryIdx is the "left" image leftPts.push_back(leftKpts[matches[i].queryIdx].pt); // trainIdx is the "right" image rightPts.push_back(rightKpts[matches[i].trainIdx].pt); } //robustly find the Essential Matrix Mat status; Mat E = findEssentialMat( leftPts, //points from left image rightPts, //points from right image focal, //camera focal length factor pp, //camera principal point cv::RANSAC, //use RANSAC for a robust solution 0.999, //desired solution confidence level 1.0, //point-to-epipolar-line threshold status); //binary vector for inliers We may later use the status binary vector to prune those points that align with the recovered essential matrix. Refer to the following image for an illustration of point matching after pruning. The red arrows mark feature matches that were removed in the process of finding the matrix, and the green arrows are feature matches that were kept: Now we are ready to find the camera matrices; however, the new OpenCV 3 API makes things very easy for us by introducing the recoverPose function. First, we will briefly examine the structure of the camera matrix we will use: This is the model for our camera; it consists of two elements, rotation (denoted as R) and translation (denoted as t). The interesting thing about it is that it holds a very essential equation: x = PX, where x is a 2D point on the image and X is a 3D point in space. There is more to it, but this matrix gives us a very important relationship between the image points and the scene points. So, now that we have a motivation for finding the camera matrices, we will see how it can be done. The following code section shows how to decompose the essential matrix into the rotation and translation elements: Mat E; // ... find the essential matrix Mat R, t; //placeholders for rotation and translation //Find Pright camera matrix from the essential matrix //Cheirality check is performed internally. recoverPose(E, leftPts, rightPts, R, t, focal, pp, mask); Very simple. Without going too deep into mathematical interpretation, this conversion of the essential matrix to rotation and translation is possible because the essential matrix was originally composed by these two elements. Strictly for satisfying our curiosity, we can look at the following equation for the essential matrix, which appears in the literature:. We see that it is composed of (some form of) a translation element, t, and a rotational element, R. Note that a cheirality check is internally performed in the recoverPose function. The cheirality check makes sure that all triangulated 3D points are in front of the reconstructed camera. Camera matrix recovery from the essential matrix has in fact four possible solutions, but the only correct solution is the one that will produce triangulated points in front of the camera, hence the need for a cheirality check. Note that what we just did only gives us one camera matrix, and for triangulation, we require two camera matrices. This operation assumes that one camera matrix is fixed and canonical (no rotation and no translation): The other camera that we recovered from the essential matrix has moved and rotated in relation to the fixed one. This also means that any of the 3D points that we recover from these two camera matrices will have the first camera at the world origin point (0, 0, 0). One more thing we can think of adding to our method is error checking. Many times, the calculation of an essential matrix from point matching is erroneous, and this affects the resulting camera matrices. Continuing to triangulate with faulty camera matrices is pointless. We can install a check to see if the rotation element is a valid rotation matrix. Keeping in mind that rotation matrices must have a determinant of 1 (or -1), we can simply do the following: bool CheckCoherentRotation(const cv::Mat_<double>& R) { if (fabsf(determinant(R)) - 1.0 > 1e-07) { cerr << "rotation matrix is invalid" << endl; return false; } return true; } We can now see how all these elements combine into a function that recovers the P matrices. First, we will introduce some convenience data structures and type short hands: typedef std::vector<cv::KeyPoint> Keypoints; typedef std::vector<cv::Point2f> Points2f; typedef std::vector<cv::Point3f> Points3f; typedef std::vector<cv::DMatch> Matching; struct Features { //2D features Keypoints keyPoints; Points2f points; cv::Mat descriptors; }; struct Intrinsics { //camera intrinsic parameters cv::Mat K; cv::Mat Kinv; cv::Mat distortion; }; Now, we can write the camera matrix finding function: void findCameraMatricesFromMatch( constIntrinsics& intrin, constMatching& matches, constFeatures& featuresLeft, constFeatures& featuresRight, cv::Matx34f& Pleft, cv::Matx34f& Pright) { { //Note: assuming fx = fy const double focal = intrin.K.at<float>(0, 0); const cv::Point2d pp(intrin.K.at<float>(0, 2), intrin.K.at<float>(1, 2)); //align left and right point sets using the matching Features left; Features right; GetAlignedPointsFromMatch( featuresLeft, featuresRight, matches, left, right); //find essential matrix Mat E, mask; E = findEssentialMat( left.points, right.points, focal, pp, RANSAC, 0.999, 1.0, mask); Mat_<double> R, t; //Find Pright camera matrix from the essential matrix recoverPose(E, left.points, right.points, R, t, focal, pp, mask); Pleft = Matx34f::eye(); Pright = Matx34f(R(0,0), R(0,1), R(0,2), t(0), R(1,0), R(1,1), R(1,2), t(1), R(2,0), R(2,1), R(2,2), t(2)); } At this point, we have the two cameras that we need in order to reconstruct the scene. The canonical first camera, in the Pleft variable, and the second camera we calculated, form the essential matrix in the Pright variable. Choosing the image pair to use first Given we have more than just two image views of the scene, we must choose which two views we will start the reconstruction from. In their paper, Snavely et al. suggest that we pick the two views that have the least number of homography inliers. A homography is a relationship between two images or sets of points that lie on a plane; the homography matrix defines the transformation from one plane to another. In case of an image or a set of 2D points, the homography matrix is of size 3 x 3. When Snavely et al. look for the lowest inlier ratio, they essentially suggest to calculate the homography matrix between all pairs of images and pick the pair whose points mostly do not correspond with the homography matrix. This means the geometry of the scene in these two views is not planar or at least not the same plane in both views, which helps when doing 3D reconstruction. For reconstruction, it is best to look at a complex scene with non-planar geometry, with things closer and farther away from the camera. The following code snippet shows how to use OpenCV's findHomography function to count the number of inliers between two views whose features were already extracted and matched: int findHomographyInliers( const Features& left, const Features& right, const Matching& matches) { //Get aligned feature vectors Features alignedLeft; Features alignedRight; GetAlignedPointsFromMatch(left, right, matches, alignedLeft, alignedRight); //Calculate homography with at least 4 points Mat inlierMask; Mat homography; if(matches.size() >= 4) { homography = findHomography(alignedLeft.points, alignedRight.points, cv::RANSAC, RANSAC_THRESHOLD, inlierMask); } if(matches.size() < 4 or homography.empty()) { return 0; } return countNonZero(inlierMask); } The next step is to perform this operation on all pairs of image views in our bundle and sort them based on the ratio of homography inliers to outliers: //sort pairwise matches to find the lowest Homography inliers map<float, ImagePair> pairInliersCt; const size_t numImages = mImages.size(); //scan all possible image pairs (symmetric) for (size_t i = 0; i < numImages - 1; i++) { for (size_t j = i + 1; j < numImages; j++) { if (mFeatureMatchMatrix[i][j].size() < MIN_POINT_CT) { //Not enough points in matching pairInliersCt[1.0] = {i, j}; continue; } //Find number of homography inliers const int numInliers = findHomographyInliers( mImageFeatures[i], mImageFeatures[j], mFeatureMatchMatrix[i][j]); const float inliersRatio = (float)numInliers / (float)(mFeatureMatchMatrix[i][j].size()); pairInliersCt[inliersRatio] = {i, j}; } } Note that the std::map<float, ImagePair> will internally sort the pairs based on the map's key: the inliers ratio. We then simply need to traverse this map from the beginning to find the image pair with least inlier ratio, and if that pair cannot be used, we can easily skip ahead to the next pair. Summary In this article, we saw how OpenCV v3 can help us approach Structure from Motion in a manner that is both simple to code and to understand. OpenCV v3's new API contains a number of useful functions and data structures that make our lives easier and also assist in a cleaner implementation. However, the state-of-the-art SfM methods are far more complex. There are many issues we choose to disregard in favor of simplicity, and plenty more error examinations that are usually in place. Our chosen methods for the different elements of SfM can also be revisited. Some methods even use the N-view triangulation once they understand the relationship between the features in multiple images. If we would like to extend and deepen our familiarity with SfM, we will certainly benefit from looking at other open source SfM libraries. One particularly interesting project is libMV, which implements a vast array of SfM elements that may be interchanged to get the best results. There is a great body of work from University of Washington that provides tools for many flavors of SfM (Bundler and VisualSfM). This work inspired an online product from Microsoft, called PhotoSynth, and 123D Catch from Adobe. There are many more implementations of SfM readily available online, and one must only search to find quite a lot of them. Resources for Article: Further resources on this subject: Basics of Image Histograms in OpenCV [article] OpenCV: Image Processing using Morphological Filters [article] Face Detection and Tracking Using ROS, Open-CV and Dynamixel Servos [article]
Read more
  • 0
  • 1
  • 57530

article-image-tensorflow
Packt
04 Jan 2017
17 min read
Save for later

TensorFlow

Packt
04 Jan 2017
17 min read
In this article by Nicholas McClure, the author of the book TensorFlow Machine Learning Cookbook, we will cover basic recipes in order to understand how TensorFlow works and how to access data for this book and additional resources: How TensorFlow works Declaring tensors Using placeholders and variables Working with matrices Declaring operations (For more resources related to this topic, see here.) Introduction Google's TensorFlow engine has a unique way of solving problems. This unique way allows us to solve machine learning problems very efficiently. We will cover the basic steps to understand how TensorFlow operates. This understanding is essential in understanding recipes for the rest of this book. How TensorFlow works At first, computation in TensorFlow may seem needlessly complicated. But there is a reason for it: because of how TensorFlow treats computation, developing more complicated algorithms is relatively easy. This recipe will talk you through the pseudo code of how a TensorFlow algorithm usually works. Getting ready Currently, TensorFlow is only supported on Mac and Linux distributions. Using TensorFlow on Windows requires the usage of a virtual machine. Throughout this book we will only concern ourselves with the Python library wrapper of TensorFlow. This book will use Python 3.4+ (https://www.python.org) and TensorFlow 0.7 (https://www.tensorflow.org). While TensorFlow can run on the CPU, it runs faster if it runs on the GPU, and it is supported on graphics cards with NVidia Compute Capability 3.0+. To run on a GPU, you will also need to download and install the NVidia Cuda Toolkit (https://developer.nvidia.com/cuda-downloads). Some of the recipes will rely on a current installation of the Python packages Scipy, Numpy, and Scikit-Learn as well. How to do it… Here we will introduce the general flow of TensorFlow algorithms. Most recipes will follow this outline: Import or generate data: All of our machine-learning algorithms will depend on data. In this book we will either generate data or use an outside source of data. Sometimes it is better to rely on generated data because we will want to know the expected outcome. Transform and normalize data: The data is usually not in the correct dimension or type that our TensorFlow algorithms expect. We will have to transform our data before we can use it. Most algorithms also expect normalized data and we will do this here as well. TensorFlow has built in functions that can normalize the data for you as follows: data = tf.nn.batch_norm_with_global_normalization(...) Set algorithm parameters: Our algorithms usually have a set of parameters that we hold constant throughout the procedure. For example, this can be the number of iterations, the learning rate, or other fixed parameters of our choosing. It is considered good form to initialize these together so the reader or user can easily find them, as follows: learning_rate = 0.01 iterations = 1000 Initialize variables and placeholders: TensorFlow depends on us telling it what it can and cannot modify. TensorFlow will modify the variables during optimization to minimize a loss function. To accomplish this, we feed in data through placeholders. We need to initialize both of these, variables and placeholders with size and type, so that TensorFlow knows what to expect. See the following: code: a_var = tf.constant(42) x_input = tf.placeholder(tf.float32, [None, input_size]) y_input = tf.placeholder(tf.fload32, [None, num_classes]) Define the model structure: After we have the data, and have initialized our variables and placeholders, we have to define the model. This is done by building a computational graph. We tell TensorFlow what operations must be done on the variables and placeholders to arrive at our model predictions: y_pred = tf.add(tf.mul(x_input, weight_matrix), b_matrix) Declare the loss functions: After defining the model, we must be able to evaluate the output. This is where we declare the loss function. The loss function is very important as it tells us how far off our predictions are from the actual values: loss = tf.reduce_mean(tf.square(y_actual – y_pred)) Initialize and train the model: Now that we have everything in place, we need to create an instance for our graph, feed in the data through the placeholders and let TensorFlow change the variables to better predict our training data. Here is one way to initialize the computational graph: with tf.Session(graph=graph) as session: ... session.run(...) ... Note that we can also initiate our graph with: session = tf.Session(graph=graph) session.run(…) (Optional) Evaluate the model: Once we have built and trained the model, we should evaluate the model by looking at how well it does with new data through some specified criteria. (Optional) Predict new outcomes: It is also important to know how to make predictions on new, unseen, data. We can do this with all of our models, once we have them trained. How it works… In TensorFlow, we have to setup the data, variables, placeholders, and model before we tell the program to train and change the variables to improve the predictions. TensorFlow accomplishes this through the computational graph. We tell it to minimize a loss function and TensorFlow does this by modifying the variables in the model. TensorFlow knows how to modify the variables because it keeps track of the computations in the model and automatically computes the gradients for every variable. Because of this, we can see how easy it can be to make changes and try different data sources. See also A great place to start is the official Python API Tensorflow documentation: https://www.tensorflow.org/versions/r0.7/api_docs/python/index.html There are also tutorials available: https://www.tensorflow.org/versions/r0.7/tutorials/index.html Declaring tensors Getting ready Tensors are the data structure that TensorFlow operates on in the computational graph. We can declare these tensors as variables or feed them in as placeholders. First we must know how to create tensors. When we create a tensor and declare it to be a variable, TensorFlow creates several graph structures in our computation graph. It is also important to point out that just by creating a tensor, TensorFlow is not adding anything to the computational graph. TensorFlow does this only after creating a variable out of the tensor. See the next section on variables and placeholders for more information. How to do it… Here we will cover the main ways to create tensors in TensorFlow. Fixed tensors: Creating a zero filled tensor. Use the following: zero_tsr = tf.zeros([row_dim, col_dim]) Creating a one filled tensor. Use the following: ones_tsr = tf.ones([row_dim, col_dim]) Creating a constant filled tensor. Use the following: filled_tsr = tf.fill([row_dim, col_dim], 42) Creating a tensor out of an existing constant.Use the following: constant_tsr = tf.constant([1,2,3]) Note that the tf.constant() function can be used to broadcast a value into an array, mimicking the behavior of tf.fill() by writing tf.constant(42, [row_dim, col_dim]) Tensors of similar shape: We can also initialize variables based on the shape of other tensors, as follows: zeros_similar = tf.zeros_like(constant_tsr) ones_similar = tf.ones_like(constant_tsr) Note, that since these tensors depend on prior tensors, we must initialize them in order. Attempting to initialize all the tensors all at once will result in an error. Sequence tensors: Tensorflow allows us to specify tensors that contain defined intervals. The following functions behave very similarly to the range() outputs and numpy's linspace() outputs. See the following function: linear_tsr = tf.linspace(start=0, stop=1, start=3) The resulting tensor is the sequence [0.0, 0.5, 1.0]. Note that this function includes the specified stop value. See the following function integer_seq_tsr = tf.range(start=6, limit=15, delta=3) The result is the sequence [6, 9, 12]. Note that this function does not include the limit value. Random tensors: The following generated random numbers are from a uniform distribution: randunif_tsr = tf.random_uniform([row_dim, col_dim], minval=0, maxval=1) Know that this random uniform distribution draws from the interval that includes the minval but not the maxval ( minval<=x<maxval ). To get a tensor with random draws from a normal distribution, as follows: randnorm_tsr = tf.random_normal([row_dim, col_dim], mean=0.0, stddev=1.0) There are also times when we wish to generate normal random values that are assured within certain bounds. The truncated_normal() function always picks normal values within two standard deviations of the specified mean. See the following: runcnorm_tsr = tf.truncated_normal([row_dim, col_dim], mean=0.0, stddev=1.0) We might also be interested in randomizing entries of arrays. To accomplish this there are two functions that help us, random_shuffle() and random_crop(). See the following: shuffled_output = tf.random_shuffle(input_tensor) cropped_output = tf.random_crop(input_tensor, crop_size) Later on in this book, we will be interested in randomly cropping an image of size (height, width, 3) where there are three color spectrums. To fix a dimension in the cropped_output, you must give it the maximum size in that dimension: cropped_image = tf.random_crop(my_image, [height/2, width/2, 3]) How it works… Once we have decided on how to create the tensors, then we may also create the corresponding variables by wrapping the tensor in the Variable() function, as follows. More on this in the next section: my_var = tf.Variable(tf.zeros([row_dim, col_dim])) There's more… We are not limited to the built in functions, we can convert any numpy array, Python list, or constant to a tensor using the function convert_to_tensor(). Know that this function also accepts tensors as an input in case we wish to generalize a computation inside a function. Using placeholders and variables Getting ready One of the most important distinctions to make with data is whether it is a placeholder or variable. Variables are the parameters of the algorithm and TensorFlow keeps track of how to change these to optimize the algorithm. Placeholders are objects that allow you to feed in data of a specific type and shape or that depend on the results of the computational graph, like the expected outcome of a computation. How to do it… The main way to create a variable is by using the Variable() function, which takes a tensor as an input and outputs a variable. This is the declaration and we still need to initialize the variable. Initializing is what puts the variable with the corresponding methods on the computational graph. Here is an example of creating and initializing a variable: my_var = tf.Variable(tf.zeros([2,3])) sess = tf.Session() initialize_op = tf.initialize_all_variables() sess.run(initialize_op) To see whatthe computational graph looks like after creating and initializing a variable, see the next part in this section, How it works…, Figure 1. Placeholders are just holding the position for data to be fed into the graph. Placeholders get data from a feed_dict argument in the session. To put a placeholder in the graph, we must perform at least one operation on the placeholder. We initialize the graph, declare x to be a placeholder, and define y as the identity operation on x, which just returns x. We then create data to feed into the x placeholder and run the identity operation. It is worth noting that Tensorflow will not return a self-referenced placeholder in the feed dictionary. The code is shown below and the resulting graph is in the next section, How it works…: sess = tf.Session() x = tf.placeholder(tf.float32, shape=[2,2]) y = tf.identity(x) x_vals = np.random.rand(2,2) sess.run(y, feed_dict={x: x_vals}) # Note that sess.run(x, feed_dict={x: x_vals}) will result in a self-referencing error. How it works… The computational graph of initializing a variable as a tensor of zeros is seen in Figure 1to follow: Figure 1: Variable Figure 1: Here we can see what the computational graph looks like in detail with just one variable, initialized to all zeros. The grey shaded region is a very detailed view of the operations and constants involved. The main computational graph with less detail is the smaller graph outside of the grey region in the upper right. For more details on creating and visualizing graphs. Similarly, the computational graph of feeding a numpy array into a placeholder can be seen to follow, in Figure 2: Figure 2: Computational graph of an initialized placeholder Figure 2: Here is the computational graph of a placeholder initialized. The grey shaded region is a very detailed view of the operations and constants involved. The main computational graph with less detail is the smaller graph outside of the grey region in the upper right. There's more… During the run of the computational graph, we have to tell TensorFlow when to initialize the variables we have created. While each variable has an initializer method, the most common way to do this is with the helper function initialize_all_variables(). This function creates an operation in the graph that initializes all the variables we have created, as follows: initializer_op = tf.initialize_all_variables() But if we want to initialize a variable based on the results of initializing another variable, we have to initialize variables in the order we want, as follows: sess = tf.Session() first_var = tf.Variable(tf.zeros([2,3])) sess.run(first_var.initializer) second_var = tf.Variable(tf.zeros_like(first_var)) # Depends on first_var sess.run(second_var.initializer) Working with matrices Getting ready Many algorithms depend on matrix operations. TensorFlow gives us easy-to-use operations to perform such matrix calculations. For all of the following examples, we can create a graph session by running the following code: import tensorflow as tf sess = tf.Session() How to do it… Creating matrices: We can create two-dimensional matrices from numpy arrays or nested lists, as we described in the earlier section on tensors. We can also use the tensor creation functions and specify a two-dimensional shape for functions like zeros(), ones(), truncated_normal(), and so on: Tensorflow also allows us to create a diagonal matrix from a one dimensional array or list with the function diag(), as follows: identity_matrix = tf.diag([1.0, 1.0, 1.0]) # Identity matrix A = tf.truncated_normal([2, 3]) # 2x3 random normal matrix B = tf.fill([2,3], 5.0) # 2x3 constant matrix of 5's C = tf.random_uniform([3,2]) # 3x2 random uniform matrix D = tf.convert_to_tensor(np.array([[1., 2., 3.],[-3., -7., -1.],[0., 5., -2.]])) print(sess.run(identity_matrix)) [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] print(sess.run(A)) [[ 0.96751703 0.11397751 -0.3438891 ] [-0.10132604 -0.8432678 0.29810596]] print(sess.run(B)) [[ 5. 5. 5.] [ 5. 5. 5.]] print(sess.run(C)) [[ 0.33184157 0.08907614] [ 0.53189191 0.67605299] [ 0.95889051 0.67061249]] print(sess.run(D)) [[ 1. 2. 3.] [-3. -7. -1.] [ 0. 5. -2.]] Note that if we were to run sess.run(C) again, we would reinitialize the random variables and end up with different random values. Addition and subtraction uses the following function: print(sess.run(A+B)) [[ 4.61596632 5.39771316 4.4325695 ] [ 3.26702736 5.14477345 4.98265553]] print(sess.run(B-B)) [[ 0. 0. 0.] [ 0. 0. 0.]] Multiplication print(sess.run(tf.matmul(B, identity_matrix))) [[ 5. 5. 5.] [ 5. 5. 5.]] Also, the function matmul() has arguments that specify whether or not to transpose the arguments before multiplication or whether each matrix is sparse. Transpose the arguments as follows: print(sess.run(tf.transpose(C))) [[ 0.67124544 0.26766731 0.99068872] [ 0.25006068 0.86560275 0.58411312]] Again, it is worth mentioning the reinitializing that gives us different values than before. Determinant, use the following: print(sess.run(tf.matrix_determinant(D))) -38.0 Inverse: print(sess.run(tf.matrix_inverse(D))) [[-0.5 -0.5 -0.5 ] [ 0.15789474 0.05263158 0.21052632] [ 0.39473684 0.13157895 0.02631579]] Note that the inverse method is based on the Cholesky decomposition if the matrix is symmetric positive definite or the LU decomposition otherwise. Decompositions: Cholesky decomposition, use the following: print(sess.run(tf.cholesky(identity_matrix))) [[ 1. 0. 1.] [ 0. 1. 0.] [ 0. 0. 1.]] Eigenvalues and Eigenvectors, use the following code: print(sess.run(tf.self_adjoint_eig(D)) [[-10.65907521 -0.22750691 2.88658212] [ 0.21749542 0.63250104 -0.74339638] [ 0.84526515 0.2587998 0.46749277] [ -0.4880805 0.73004459 0.47834331]] Note that the function self_adjoint_eig() outputs the eigen values in the first row and the subsequent vectors in the remaining vectors. In mathematics, this is called the eigen decomposition of a matrix. How it works… TensorFlow provides all the tools for us to get started with numerical computations and add such computations to our graphs. This notation might seem quite heavy for simple matrix operations. Remember that we are adding these operations to the graph and telling TensorFlow what tensors to run through those operations. Declaring operations Getting ready Besides the standard arithmetic operations, TensorFlow provides us more operations that we should be aware of and how to use them before proceeding. Again, we can create a graph session by running the following code: import tensorflow as tf sess = tf.Session() How to do it… TensorFlow has the standard operations on tensors, add(), sub(), mul(), and div(). Note that all of these operations in this section will evaluate the inputs element-wise unless specified otherwise. TensorFlow provides some variations of div() and relevant functions. It is worth mentioning that div() returns the same type as the inputs. This means it really returns the floor of the division (akin to Python 2) if the inputs are integers. To return the Python 3 version, which casts integers into floats before dividing and always returns a float, TensorFlow provides the function truediv()shown as follows: print(sess.run(tf.div(3,4))) 0 print(sess.run(tf.truediv(3,4))) 0.75 If we have floats and want integer division, we can use the function floordiv(). Note that this will still return a float, but rounded down to the nearest integer. The function is shown as follows: print(sess.run(tf.floordiv(3.0,4.0))) 0.0 Another important function is mod(). This function returns the remainder after division.It is shown as follows: print(sess.run(tf.mod(22.0, 5.0))) 2.0 The cross product between two tensors is achieved by the cross() function. Remember that the cross product is only defined for two 3-dimensional vectors, so it only accepts two 3-dimensional tensors. The function is shown as follows: print(sess.run(tf.cross([1., 0., 0.], [0., 1., 0.]))) [ 0. 0. 1.0] Here is a compact list of the more common math functions. All of these functions operate element-wise: abs() Absolute value of one input tensor ceil() Ceiling function of one input tensor cos() Cosine function of one input tensor exp() Base e exponential of one input tensor floor() Floor function of one input tensor inv() Multiplicative inverse (1/x) of one input tensor log() Natural logarithm of one input tensor maximum() Element-wise max of two tensors minimum() Element-wise min of two tensors neg() Negative of one input tensor pow() The first tensor raised to the second tensor element-wise round() Rounds one input tensor rsqrt() One over the square root of one tensor sign() Returns -1, 0, or 1, depending on the sign of the tensor sin() Sine function of one input tensor sqrt() Square root of one input tensor square() Square of one input tensor Specialty mathematical functions: There are some special math functions that get used in machine learning that are worth mentioning and TensorFlow has built in functions for them. Again, these functions operate element-wise, unless specified otherwise: digamma() Psi function, the derivative of the lgamma() function erf() Gaussian error function, element-wise, of one tensor erfc() Complimentary error function of one tensor igamma() Lower regularized incomplete gamma function igammac() Upper regularized incomplete gamma function lbeta() Natural logarithm of the absolute value of the beta function lgamma() Natural logarithm of the absolute value of the gamma function squared_difference() Computes the square of the differences between two tensors How it works… It is important to know what functions are available to us to add to our computational graphs. Mostly we will be concerned with the preceding functions. We can also generate many different custom functions as compositions of the preceding, as follows: # Tangent function (tan(pi/4)=1) print(sess.run(tf.div(tf.sin(3.1416/4.), tf.cos(3.1416/4.)))) 1.0 There's more… If we wish to add other operations to our graphs that are not listed here, we must create our own from the preceding functions. Here is an example of an operation not listed above that we can add to our graph: # Define a custom polynomial function def custom_polynomial(value): # Return 3 * x^2 - x + 10 return(tf.sub(3 * tf.square(value), value) + 10) print(sess.run(custom_polynomial(11))) 362 Summary Thus in this article we have implemented some introductory recipes that will help us to learn the basics of TensorFlow. Resources for Article: Further resources on this subject: Data Clustering [article] The TensorFlow Toolbox [article] Implementing Artificial Neural Networks with TensorFlow [article]
Read more
  • 0
  • 0
  • 2713

article-image-introduction-deep-learning
Packt
04 Jan 2017
19 min read
Save for later

Introduction to Deep Learning

Packt
04 Jan 2017
19 min read
In this article by Dipayan Dev, the author of the book Deep Learning with Hadoop, we will see a brief introduction to concept of the deep learning and deep feed-forward networks. "By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it."                                                                                                                  - Eliezer Yudkowsky Ever thought, why it is often difficult to beat the computer in chess, even by the best players of the game? How Facebook is able to recognize your face among hundreds of millions photos? How your mobile phone can recognize your voice, and redirects the call to the correct person selecting from hundreds of contacts listed? The primary goal of this book is to deal with many of those queries, and to provide detailed solutions to the readers. This book can be used for a wide range of reasons by a variety of readers, however, we wrote the book with two main target audiences in mind. One of the primary target audiences are the undergraduate or graduate university students learning about deep learning and Artificial Intelligence; the second group of readers belongs to the software engineers who already have a knowledge of Big Data, deep learning, and statistical modeling, but want to rapidly gain the knowledge of how deep learning can be used for Big Data and vice versa. This article will mainly try to set the foundation of the readers by providing the basic concepts, terminologies, characteristics, and the major challenges of deep learning. The article will also put forward the classification of different deep network algorithms, which have been widely used by researchers in the last decade. Following are the main topics that this article will cover: Get started with deep learning Deep learning: A revolution in Artificial Intelligence Motivations for deep learning Classification of deep learning networks Ever since the dawn of civilization, people have always dreamt of building some artificial machines or robots which can behave and work exactly like human beings. From the Greek mythological characters to the ancient Hindu epics, there are numerous such examples, which clearly suggest people's interest and inclination towards creating and having an artificial life. During the initial computer generations, people had always wondered if the computer could ever become as intelligent as a human being! Going forward, even in medical science too, the need of automated machines became indispensable and almost unavoidable. With this need and constant research in the same field, Artificial Intelligence (AI) has turned out to be a flourishing technology with its various applications in several domains, such as image processing, video processing, and many other diagnosis tools in medical science too. Although there are many problems that are resolved by AI systems on a daily basis, nobody knows the specific rules for how an AI system is programmed! Few of the intuitive problems are as follows: Google search, which does a really good job of understanding what you type or speak As mentioned earlier, Facebook too, is somewhat good at recognizing your face, and hence, understanding your interests Moreover, with the integration of various other fields, for example, probability, linear algebra, statistics, machine learning, deep learning, and so on, AI has already gained a huge amount of popularity in the research field over the course of time. One of the key reasons for he early success of AI could be because it basically dealt with fundamental problems for which the computer did not require vast amount of knowledge. For example, in 1997, IBM's Deep Blue chess-playing system was able to defeat the world champion Garry Kasparov [1]. Although this kind of achievement at that time can be considered as substantial, however, chess, being limited by only a few number of rules, it was definitely not a burdensome task to train the computer with only those number of rules! Training a system with fixed and limited number of rules is termed as hard-coded knowledge of the computer. Many Artificial Intelligence projects have undergone this hard-coded knowledge about the various aspects of the world in many traditional languages. As time progresses, this hard-coded knowledge does not seem to work with systems dealing with huge amounts of data. Moreover, the number of rules that the data were following also kept changing in a frequent manner. Therefore, most of those projects following that concept failed to stand up to the height of expectation. The setbacks faced by this hard-coded knowledge implied that those artificial intelligent systems need some way of generalizing patterns and rules from the supplied raw data, without the need of external spoon-feeding. The proficiency of a system to do so is termed as machine learning. There are various successful machine learning implementations, which we use in our daily life. Few of the most common and important implementations are as follows: Spam detection: Given an e-mail in your inbox, the model can detect whether to put that e-mail in spam or in the inbox folder. A common naive Bayes model can distinguish between such e-mails. Credit card fraud detection: A model that can detect whether a number of transactions performed at a specific time interval are done by the original customer or not. One of the most popular machine learning model, given by Mor-Yosef et al. [1990], used logistic regression, which could recommend whether caesarean delivery is needed for the patient or not! There are many such models, which have been implemented with the help of machine learning techniques: The figure shows the example of different types of representation. Let's say we want to train the machine to detect some empty spaces in between the jelly beans. In the image on the right side, we have sparse jelly beans, and it would be easier for the AI system to determine the empty parts. However, in the image on the left side, we have extremely compact jelly beans, and hence, it will be an extremely difficult task for the machine to find the empty spaces. Images sourced from USC-SIPI image database. A large portion of performance of the machine learning systems depends on the data fed to the system. This is called representation of the data. All the information related to the representation is called feature of the data. For example, if logistic regression is used to detect a brain tumor in a patient, the AI system will not try to diagnose the patient directly! Rather, the concerned doctor will provide the necessary input to the systems according to the common symptoms of that patient. The AI system will then match those inputs with the already received past inputs which were used to train the system. Based on the predictive analysis of the system, it will provide its decision regarding the disease. Although logistic regression can learn and decide based on the features given, it cannot influence or modify the way features are defined. For example, if that model was provided with a cesarean patient's report instead of the brain tumor patient's report, it would surely fail to predict the outcome, as the given features would never match with the trained data. This dependency of the machine learning systems on the representation of the data is not really unknown to us! In fact, most of our computer theory performs better based on how the data is represented. For example, the quality of database is considered based on the schema design. The execution of any database query, even on a thousand of million lines of data, becomes extremely fast if the schema is indexed properly. Therefore, the dependency of data representation of the AI systems should not surprise us. There are many such daily life examples too, where the representation of the data decides our efficiency. To locate a person from among 20 people is obviously easier than to locate the same from a crowd of 500 people. A visual representation of two different types of data representation in shown in preceding figure. Therefore, if the AI systems are fed with the appropriate featured data, even the hardest problems could be resolved. However, collecting and feeding the desired data in the correct way to the system has been a serious impediment for the computer programmer. There can be numerous real-time scenarios, where extracting the features could be a cumbersome task. Therefore, the way the data are represented decides the prime factors in the intelligence of the system. Finding cats from among a group of humans and cats could be extremely complicated if the features are not appropriate. We know that cats have tails; therefore, we might like to detect the presence of tails as a prominent feature. However, given the different tail shapes and sizes, it is often difficult to describe exactly how a tail will look like in terms of pixel values! Moreover, tails could sometimes be confused with the hands of humans. Also, overlapping of some objects could omit the presence of a cat's tail, making the image even more complicated. From all the above discussions, it can really be concluded that the success of AI systems depends, mainly, on how the data is represented. Also, various representations can ensnare and cache the different explanatory factors of all the disparities behind the data. Representation learning is one of the most popular and widely practiced learning approaches used to cope with these specific problems. Learning the representations of the next layer from the existing representation of data can be defined as representation learning. Ideally, all representation learning algorithms have this advantage of learning representations, which capture the underlying factors, a subset that might be applicable for each particular sub-task. A simple illustration is given in the following figure: The figure illustrates of representation learning. The middle layers are able to discover the explanatory factors (hidden layers, in blue rectangular boxes). Some of the factors explain each task's target, whereas some explain the inputs. However, while dealing with extracting some high-level data and features from a huge amount of raw data, which requires some sort of human-level understanding, has shown its limitations. There can be many such following examples: Differentiating the cry of two similar age babies. Identifying the image of a cat's eye in both day and night times. This becomes clumsy, because a cat's eyes glow at night unlike during daytime. In all these preceding edge cases, representation learning does not appear to behave exceptionally, and shows deterrent behavior. Deep learning, a sub-field of machine learning can rectify this major problem of representation learning by building multiple levels of representations or learning a hierarchy of features from a series of other simple representations and features [2] [8]. The figure shows how a deep learning system can represent the human image through identifying various combinations such as corners, contours, which can be defined in terms of edges. The preceding figure shows an illustration of a deep learning model. It is generally a cumbersome task for the computer to decode the meaning of raw unstructured input data, as represented by this image, as a collection of different pixel values. A mapping function, which will convert the group of pixels to identify the image, is, ideally, difficult to achieve. Also, to directly train the computer for these kinds of mapping looks almost insuperable. For these types of tasks, deep learning resolves the difficulty by creating a series of subset of mappings to reach the desired output. Each subset of mapping corresponds to a different set of layer of the model. The input contains the variables that one can observe, and hence, represented in the visible layers. From the given input, we can incrementally extract the abstract features of the data. As these values are not available or visible in the given data, these layers are termed as hidden layers. In the image, from the first layer of data, the edges can easily be identified just by a comparative study of the neighboring pixels. The second hidden layer can distinguish the corners and contours from the first layer's description of the edges. From this second hidden layer, which describes the corners and contours, the third hidden layer can identify the different parts of the specific objects. Ultimately, the different objects present in the image can be distinctly detected from the third layer. Image reprinted with permission from Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, published by The MIT Press. Deep learning started its journey exclusively since 2006, Hinton et al. in 2006[2]; also Bengio et al. in 2007[3] initially focused on the MNIST digit classification problem. In the last few years, deep learning has seen major transitions from digits to object recognition in natural images. One of the major breakthroughs was achieved by Krizhevsky et al. in 2012 [4] using the ImageNet dataset 4. The scope of this book is mainly limited to deep learning, so before diving into it directly, the necessary definitions of deep learning should be provided. Many researchers have defined deep learning in many ways, and hence, in the last 10 years, it has gone through many explanations too! Following are few of the widely accepted definitions: As noted by GitHub, deep learning is a new area of machine learning research, which has been introduced with the objective of moving machine learning closer to one of its original goals: Artificial Intelligence. Deep learning is about learning multiple levels of representation and abstraction, which help to make sense of data such as images, sound, and text. As recently updated by Wikipedia, deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in the data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations. As the definitions suggest, deep learning can also be considered as a special type of machine learning. Deep learning has achieved immense popularity in the field of data science with its ability to learn complex representation from various simple features. To have an in-depth grip on deep learning, we have listed out a few terminologies. The next topic of this article will help the readers lay a foundation of deep learning by providing various terminologies and important networks used for deep learning. Getting started with deep learning To understand the journey of deep learning in this book, one must know all the terminologies and basic concepts of machine learning. However, if the reader has already got enough insight into machine learning and related terms, they should feel free to ignore this section and jump to the next topic of this article. The readers who are enthusiastic about data science, and want to learn machine learning thoroughly, can follow Machine Learning by Tom M. Mitchell (1997) [5] and Machine Learning: a Probabilistic Perspective (2012) [6]. Image shows the scattered data points of social network analysis. Image sourced from Wikipedia. Neural networks do not perform miracles. But if used sensibly, they can produce some amazing results. Deep feed-forward networks Neural networks can be recurrent as well as feed-forward. Feed-forward networks do not have any loop associated in their graph, and are arranged in a set of layers. A network with many layers is said to be a deep network. In simple words, any neural network with two or more layers (hidden) is defined as deep feed-forward network or feed-forward neural network. Figure 4 shows a generic representation of a deep feed-forward neural network. Deep feed-forward network works on the principle that with an increase in depth, the network can also execute more sequential instructions. Instructions in sequence can offer great power, as these instructions can point to the earlier instruction. The aim of a feed-forward network is to generalize some function f. For example, classifier y=f/(x) maps from input x to category y. A deep feed-forward network modified the mapping, y=f(x; α), and learns the value of the parameter α, which gives the most appropriate value of the function. The following figure shows a simple representation of the deep-forward network to provide the architectural difference with the traditional neural network. Deep neural network is feed-forward network with many hidden layers: Datasets are considered to be the building blocks of a learning process. A dataset can be defined as a collection of interrelated sets of data, which is comprised of separate entities, but which can be used as a single entity depending on the use-case. The individual data elements of a dataset are called data points. The preceding figure gives the visual representation of the following data points: Unlabeled data: This part of data consists of human-generated objects, which can be easily obtained from the surroundings. Some of the examples are X-rays, log file data, news articles, speech, videos, tweets, and so on. Labelled data: Labelled data are normalized data from a set of unlabeled data. These types of data are usually well formatted, classified, tagged, and easily understandable by human beings for further processing. From the top-level understanding, the machine learning techniques can be classified as supervised and unsupervised learning based on how their learning process is carried out. Unsupervised learning In unsupervised learning algorithms, there is no desired output from the given input datasets. The system learns meaningful properties and features from its experience during the analysis of the dataset. During deep learning, the system generally tries to learn from the whole probability distribution of the data points. There are various types of unsupervised learning algorithms too, which perform clustering, which means separating the data points among clusters of similar types of data. However, with this type of learning, there is no feedback based on the final output, that is, there won't be any teacher to correct you! Figure 6 shows a basic overview of unsupervised clustering. A real life example of an unsupervised clustering algorithm is Google News. When we open a topic under Google News, it shows us a number of hyper-links redirecting to several pages. Each of those topics can be considered as a cluster of hyper-links that point to independent links. Supervised learning In supervised learning, unlike unsupervised learning, there is an expected output associated with every step of the experience. The system is given a dataset, and it already knows what the desired output will look like, along with the correct relationship between the input and output of every associated layer. This type of learning is often used for classification problems. A visual representation is given in Figure 7. Real-life examples of supervised learning are face detection, face recognition, and so on. Although supervised and unsupervised learning look like different identities, they are often connected to each other by various means. Hence, that fine line between these two learnings is often hazy to the student fraternity. The preceding statement can be formulated with the following mathematical expression: The general product rule of probability states that for an n number of datasets n ε ℝk, the joint distribution can be given fragmented as follows: The distribution signifies that the appeared unsupervised problem can be resolved by k number of supervised problems. Apart from this, the conditional probability of p (k | n), which is a supervised problem, can be solved using unsupervised learning algorithms to experience the joint distribution of p (n, k): Although these two types are not completely separate identities, they often help to classify the machine learning and deep learning algorithms based on the operations performed. Generally speaking, cluster formation, identifying the density of a population based on similarity, and so on are termed as unsupervised learning, whereas, structured formatted output, regression, classification, and so on are recognized as supervised learning. Semi-supervised learning As the name suggests, in this type of learning, both labelled and unlabeled data are used during the training. It's a class of supervised learning, which uses a vast amount of unlabeled data during training. For example, semi-supervised learning is used in Deep belief network (explained network), a type of deep network, where some layers learn the structure of the data (unsupervised), whereas one layer learns how to classify the data (supervised learning). In semi-supervised learning, unlabeled data from p (n) and labelled data from p (n, k) are used to predict the probability of k, given the probability of n, or p (k | n): Figure shows the impact of a large amount of unlabeled data during the semi-supervised learning technique. Art the top, it shows the decision boundary that the model puts after distinguishing the white and black circle. The figure at the bottom displays another decision boundary, which the model embraces. In that dataset, in addition to two different categories of circles, a collection of unlabeled data (grey circle) is also annexed. This type of training can be viewed as creating the cluster, and then marking those with the labelled data, which moves the decision boundary away from the high-density data region. Figure obtained from Wikipedia. Deep learning networks are all about representation of data. Therefore, semi-supervised learning is, generally, about learning a representation, whose objective function is given by the following: l = f (n) The objective of the equation is to determine the representation-based cluster. The preceding figure depicts the illustration of a semi-supervised learning. Readers can refer to Chapelle et al.'s book [7] to know more about semi-supervised learning methods. So, as we have already got a foundation of what Artificial Intelligence, machine learning, representation learning are, we can move our entire focus to elaborate on deep learning with further description. From the previously mentioned definition of deep learning, two major characteristics of deep learning can be pointed out as follows: A way of experiencing unsupervised and supervised learning of the feature representation through successive knowledge from subsequent abstract layers A model comprising of multiple abstract stages of non-linear information processing Summary In this article, we have explained most of these concepts in detail, and have also classified the various algorithms of deep learning.
Read more
  • 0
  • 0
  • 2480

article-image-microsoft-cognitive-services
Packt
04 Jan 2017
16 min read
Save for later

Microsoft Cognitive Services

Packt
04 Jan 2017
16 min read
In this article by Leif Henning Larsen, author of the book Learning Microsoft Cognitive Services, we will look into what Microsoft Cognitive Services offer. You will then learn how to utilize one of the APIs by recognizing faces in images. Microsoft Cognitive Services give developers the possibilities of adding AI-like capabilities to their applications. Using a few lines of code, we can take advantage of powerful algorithms that would usually take a lot of time, effort, and hardware to do yourself. (For more resources related to this topic, see here.) Overview of Microsoft Cognitive Services Using Cognitive Services means you have 21 different APIs at your hand. These are in turn separated into 5 top-level domains according to what they do. They are vision, speech, language, knowledge, and search. Let's see more about them in the following sections. Vision APIs under the vision flags allows your apps to understand images and video content. It allows you to retrieve information about faces, feelings, and other visual content. You can stabilize videos and recognize celebrities. You can read text in images and generate thumbnails from videos and images. There are four APIs contained in the vision area, which we will see now. Computer Vision Using the Computer Vision API, you can retrieve actionable information from images. This means you can identify content (such as image format, image size, colors, faces, and more). You can detect whether an image is adult/racy. This API can recognize text in images and extract it to machine-readable words. It can detect celebrities from a variety of areas. Lastly, it can generate storage-efficient thumbnails with smart cropping functionality. Emotion The Emotion API allows you to recognize emotions, both in images and videos. This can allow for more personalized experiences in applications. The emotions that are detected are cross-cultural emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. Face We have already seen the very basic example of what the Face API can do. The rest of the API revolves around the same—to detect, identify, organize, and tag faces in photos. Apart from face detection, you can see how likely it is that two faces belong to the same person. You can identify faces and also find similar-looking faces. Video The Video API is about analyzing, editing, and processing videos in your app. If you have a video that is shaky, the API allows you to stabilize it. You can detect and track faces in videos. If a video contains a stationary background, you can detect motion. The API lets you to generate thumbnail summaries for videos, which allows users to see previews or snapshots quickly. Speech Adding one of the Speech APIs allows your application to hear and speak to your users. The APIs can filter noise and identify speakers. They can drive further actions in your application based on the recognized intent. Speech contains three APIs, which we will discuss now. Bing Speech Adding the Bing Speech API to your application allows you to convert speech to text and vice versa. You can convert spoken audio to text either by utilizing a microphone or other sources in real time or by converting audio from files. The API also offer speech intent recognition, which is trained by Language Understanding Intelligent Service to understand the intent. Speaker Recognition The Speaker Recognition API gives your application the ability to know who is talking. Using this API, you can use verify that someone speaking is who they claim to be. You can also determine who an unknown speaker is, based on a group of selected speakers. Custom Recognition To improve speech recognition, you can use the Custom Recognition API. This allows you to fine-tune speech recognition operations for anyone, anywhere. Using this API, the speech recognition model can be tailored to the vocabulary and speaking style of the user. In addition to this, the model can be customized to match the expected environment of the application. Language APIs related to language allow your application to process natural language and learn how to recognize what users want. You can add textual and linguistic analysis to your application as well as natural language understanding. The following five APIs can be found in the Language area. Bing Spell Check Bing Spell Check API allows you to add advanced spell checking to your application. Language Understanding Intelligent Service (LUIS) Language Understanding Intelligent Service, or LUIS, is an API that can help your application understand commands from your users. Using this API, you can create language models that understand intents. Using models from Bing and Cortana, you can make these models recognize common requests and entities (such as places, time, and numbers). You can add conversational intelligence to your applications. Linguistic Analysis Linguistic Analysis API lets you parse complex text to explore the structure of text. Using this API, you can find nouns, verbs, and more from text, which allows your application to understand who is doing what to whom. Text Analysis Text Analysis API will help you in extracting information from text. You can find the sentiment of a text (whether the text is positive or negative). You will be able to detect language, topic, and key phrases used throughout the text. Web Language Model Using the Web Language Model (WebLM) API, you are able to leverage the power of language models trained on web-scale data. You can use this API to predict which words or sequences follow a given sequence or word. Knowledge When talking about Knowledge APIs, we are talking about APIs that allow you to tap into rich knowledge. This may be knowledge from the Web, it may be academia, or it may be your own data. Using these APIs, you will be able to explore different nuances of knowledge. The following four APIs are contained in the Knowledge area. Academic Using the Academic API, you can explore relationships among academic papers, journals, and authors. This API allows you to interpret natural language user query strings, which allows your application to anticipate what the user is typing. It will evaluate the said expression and return academic knowledge entities. Entity Linking Entity Linking is the API you would use to extend knowledge of people, places, and events based on the context. As you may know, a single word may be used differently based on the context. Using this API allows you to recognize and identify each separate entity within a paragraph based on the context. Knowledge Exploration The Knowledge Exploration API will let you add the ability to use interactive search for structured data in your projects. It interprets natural language queries and offers auto-completions to minimize user effort. Based on the query expression received, it will retrieve detailed information about matching objects. Recommendations The Recommendations API allows you to provide personalized product recommendations for your customers. You can use this API to add frequently bought together functionality to your application. Another feature you can add is item-to-item recommendations, which allow customers to see what other customers who likes this also like. This API will also allow you to add recommendations based on the prior activity of the customer. Search Search APIs give you the ability to make your applications more intelligent with the power of Bing. Using these APIs, you can use a single call to access data from billions of web pages, images videos, and news. The following five APIs are in the search domain. Bing Web Search With Bing Web Search, you can search for details in billions of web documents indexed by Bing. All the results can be arranged and ordered according to the layout you specify, and the results are customized to the location of the end user. Bing Image Search Using Bing Image Search API, you can add advanced image and metadata search to your application. Results include URLs to images, thumbnails, and metadata. You will also be able to get machine-generated captions and similar images and more. This API allows you to filter the results based on image type, layout, freshness (how new is the image), and license. Bing Video Search Bing Video Search will allow you to search for videos and returns rich results. The results contain metadata from the videos, static- or motion- based thumbnails, and the video itself. You can add filters to the result based on freshness, video length, resolution, and price. Bing News Search If you add Bing News Search to your application, you can search for news articles. Results can include authoritative image, related news and categories, information on the provider, URL, and more. To be more specific, you can filter news based on topics. Bing Autosuggest Bing Autosuggest API is a small, but powerful one. It will allow your users to search faster with search suggestions, allowing you to connect powerful search to your apps. Detecting faces with the Face API We have seen what the different APIs can do. Now we will test the Face API. We will not be doing a whole lot, but we will see how simple it is to detect faces in images. The steps we need to cover to do this are as follows: Register for a free Face API preview subscription. Add necessary NuGet packages to our project. Add some UI to the test application. Detect faces on command. Head over to https://www.microsoft.com/cognitive-services/en-us/face-api to start the process of registering for a free subscription to the Face API. By clicking on the yellow button, stating Get started for free,you will be taken to a login page. Log in with your Microsoft account, or if you do not have one, register for one. Once logged in, you will need to verify that the Face API Preview has been selected in the list and accept the terms and conditions. With that out of the way, you will be presented with the following: You will need one of the two keys later, when we are accessing the API. In Visual Studio, create a new WPF application. Following the instructions at https://www.codeproject.com/articles/100175/model-view-viewmodel-mvvm-explained, create a base class that implements the INotifyPropertyChanged interface and a class implementing the ICommand interface. The first should be inherited by the ViewModel, the MainViewModel.cs file, while the latter should be used when creating properties to handle button commands. The Face API has a NuGet package, so we need to add that to our project. Head over to NuGet Package Manager for the project we created earlier. In the Browse tab, search for the Microsoft.ProjectOxford.Face package and install the it from Microsoft: As you will notice, another package will also be installed. This is the Newtonsoft.Json package, which is required by the Face API. The next step is to add some UI to our application. We will be adding this in the MainView.xaml file. First, we add a grid and define some rows for the grid: <Grid> <Grid.RowDefinitions> <RowDefinition Height="*" /> <RowDefinition Height="20" /> <RowDefinition Height="30" /> </Grid.RowDefinitions> Three rows are defined. The first is a row where we will have an image. The second is a line for status message, and the last is where we will place some buttons: Next, we add our image element: <Image x_Name="FaceImage" Stretch="Uniform" Source="{Binding ImageSource}" Grid.Row="0" /> We have given it a unique name. By setting the Stretch parameter to Uniform, we ensure that the image keeps its aspect ratio. Further on, we place this element in the first row. Last, we bind the image source to a BitmapImage interface in the ViewModel, which we will look at in a bit. The next row will contain a text block with some status text. The text property will be bound to a string property in the ViewModel: <TextBlock x_Name="StatusTextBlock" Text="{Binding StatusText}" Grid.Row="1" /> The last row will contain one button to browse for an image and one button to be able to detect faces. The command properties of both buttons will be bound to the DelegateCommand properties in the ViewModel: <Button x_Name="BrowseButton" Content="Browse" Height="20" Width="140" HorizontalAlignment="Left" Command="{Binding BrowseButtonCommand}" Margin="5, 0, 0, 5" Grid.Row="2" /> <Button x_Name="DetectFaceButton" Content="Detect face" Height="20" Width="140" HorizontalAlignment="Right" Command="{Binding DetectFaceCommand}" Margin="0, 0, 5, 5" Grid.Row="2"/> With the View in place, make sure that the code compiles and run it. This should present you with the following UI: The last part is to create the binding properties in our ViewModel and make the buttons execute something. Open the MainViewModel.cs file. First, we define two variables: private string _filePath; private IFaceServiceClient _faceServiceClient; The string variable will hold the path to our image, while the IFaceServiceClient variable is to interface the Face API. Next we define two properties: private BitmapImage _imageSource; public BitmapImage ImageSource { get { return _imageSource; } set { _imageSource = value; RaisePropertyChangedEvent("ImageSource"); } } private string _statusText; public string StatusText { get { return _statusText; } set { _statusText = value; RaisePropertyChangedEvent("StatusText"); } } What we have here is a property for the BitmapImage mapped to the Image element in the view. We also have a string property for the status text, mapped to the text block element in the view. As you also may notice, when either of the properties is set, we call the RaisePropertyChangedEvent method. This will ensure that the UI is updated when either of the properties has new values. Next, we define our two DelegateCommand objects and do some initialization through the constructor: public ICommand BrowseButtonCommand { get; private set; } public ICommand DetectFaceCommand { get; private set; } public MainViewModel() { StatusText = "Status: Waiting for image..."; _faceServiceClient = new FaceServiceClient("YOUR_API_KEY_HERE"); BrowseButtonCommand = new DelegateCommand(Browse); DetectFaceCommand = new DelegateCommand(DetectFace, CanDetectFace); } In our constructor, we start off by setting the status text. Next, we create an object of the Face API, which needs to be created with the API key we got earlier. At last, we create the DelegateCommand object for our command properties. Note how the browse command does not specify a predicate. This means it will always be possible to click on the corresponding button. To make this compile, we need to create the functions specified in the DelegateCommand constructors—the Browse, DetectFace, and CanDetectFace functions: private void Browse(object obj) { var openDialog = new Microsoft.Win32.OpenFileDialog(); openDialog.Filter = "JPEG Image(*.jpg)|*.jpg"; bool? result = openDialog.ShowDialog(); if (!(bool)result) return; We start the Browse function by creating an OpenFileDialog object. This dialog is assigned a filter for JPEG images, and in turn it is opened. When the dialog is closed, we check the result. If the dialog was cancelled, we simply stop further execution: _filePath = openDialog.FileName; Uri fileUri = new Uri(_filePath); With the dialog closed, we grab the filename of the file selected and create a new URI from it: BitmapImage image = new BitmapImage(fileUri); image.CacheOption = BitmapCacheOption.None; image.UriSource = fileUri; With the newly created URI, we want to create a new BitmapImage interface. We specify it to use no cache, and we set the URI source the URI we created: ImageSource = image; StatusText = "Status: Image loaded..."; } The last step we take is to assign the bitmap image to our BitmapImage property, so the image is shown in the UI. We also update the status text to let the user know the image has been loaded. Before we move on, it is time to make sure that the code compiles and that you are able to load an image into the View: private bool CanDetectFace(object obj) { return !string.IsNullOrEmpty(ImageSource?.UriSource.ToString()); } The CanDetectFace function checks whether or not the detect faces button should be enabled. In this case, it checks whether our image property actually has a URI. If it does, by extension that means we have an image, and we should be able to detect faces: private async void DetectFace(object obj) { FaceRectangle[] faceRects = await UploadAndDetectFacesAsync(); string textToSpeak = "No faces detected"; if (faceRects.Length == 1) textToSpeak = "1 face detected"; else if (faceRects.Length > 1) textToSpeak = $"{faceRects.Length} faces detected"; Debug.WriteLine(textToSpeak); } Our DetectFace method calls an async method to upload and detect faces. The return value contains an array of FaceRectangles. This array contains the rectangle area for all face positions in the given image. We will look into the function we call in a bit. After the call has finished executing, we print a line with the number of faces to the debug console window: private async Task<FaceRectangle[]> UploadAndDetectFacesAsync() { StatusText = "Status: Detecting faces..."; try { using (Stream imageFileStream = File.OpenRead(_filePath)) { In the UploadAndDetectFacesAsync function, we create a Stream object from the image. This stream will be used as input for the actual call to the Face API service: Face[] faces = await _faceServiceClient.DetectAsync(imageFileStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age }); This line is the actual call to the detection endpoint for the Face API. The first parameter is the file stream we created in the previous step. The rest of the parameters are all optional. The second parameter should be true if you want to get a face ID. The next specifies if you want to receive face landmarks or not. The last parameter takes a list of facial attributes you may want to receive. In our case, we want the age parameter to be returned, so we need to specify that. The return type of this function call is an array of faces with all the parameters you have specified: List<double> ages = faces.Select(face => face.FaceAttributes.Age).ToList(); FaceRectangle[] faceRects = faces.Select(face => face.FaceRectangle).ToArray(); StatusText = "Status: Finished detecting faces..."; foreach(var age in ages) { Console.WriteLine(age); } return faceRects; } } The first line in the previous code iterates over all faces and retrieves the approximate age of all faces. This is later printed to the debug console window, in the following foreach loop. The second line iterates over all faces and retrieves the face rectangle with the rectangular location of all faces. This is the data we return to the calling function. Add a catch clause to finish the method. In case an exception is thrown, in our API call, we catch that. You want to show the error message and return an empty FaceRectangle array. With that code in place, you should now be able to run the full example. The end result will look like the following image: The resulting debug console window will print the following text: 1 face detected 23,7 Summary In this article, we looked at what Microsoft Cognitive Services offer. We got a brief description of all the APIs available. From there, we looked into the Face API, where we saw how to detect faces in images. Resources for Article: Further resources on this subject: Auditing and E-discovery [article] The Sales and Purchase Process [article] Manage Security in Excel [article]
Read more
  • 0
  • 0
  • 16238
article-image-neural-network
Packt
04 Jan 2017
11 min read
Save for later

What is an Artificial Neural Network?

Packt
04 Jan 2017
11 min read
In this article by Prateek Joshi, author of book Artificial Intelligence with Python, we are going to learn about artificial neural networks. We will start with an introduction to artificial neural networks and the installation of the relevant library. We will discuss perceptron and how to build a classifier based on that. We will learn about single layer neural networks and multilayer neural networks. (For more resources related to this topic, see here.) Introduction to artificial neural networks One of the fundamental premises of Artificial Intelligence is to build machines that can perform tasks that require human intelligence. The human brain is amazing at learning new things. Why not use the model of the human brain to build a machine? An artificial neural network is a model designed to simulate the learning process of the human brain. Artificial neural networks are designed such that they can identify the underlying patterns in data and learn from them. They can be used for various tasks such as classification, regression, segmentation, and so on. We need to convert any given data into the numerical form before feeding it into the neural network. For example, we deal with many different types of data including visual, textual, time-series, and so on. We need to figure out how to represent problems in a way that can be understood by artificial neural networks. Building a neural network The human learning process is hierarchical. We have various stages in our brain’s neural network and each stage corresponds to a different granularity. Some stages learn simple things and some stages learn more complex things. Let’s consider an example of visually recognizing an object. When we look at a box, the first stage identifies simple things like corners and edges. The next stage identifies the generic shape and the stage after that identifies what kind of object it is. This process differs for different tasks, but you get the idea! By building this hierarchy, our human brain quickly separates the concepts and identifies the given object. To simulate the learning process of the human brain, an artificial neural network is built using layers of neurons. These neurons are inspired from the biological neurons we discussed in the previous paragraph. Each layer in an artificial neural network is a set of independent neurons. Each neuron in a layer is connected to neurons in the adjacent layer. Training a neural network If we are dealing with N-dimensional input data, then the input layer will consist of N neurons. If we have M distinct classes in our training data, then the output layer will consist of M neurons. The layers between the input and output layers are called hidden layers. A simple neural network will consist of a couple of layers and a deep neural network will consist of many layers. Consider the case where we want to use a neural network to classify the given data. The first step is to collect the appropriate training data and label it. Each neuron acts as a simple function and the neural network trains itself until the error goes below a certain value. The error is basically the difference between the predicted output and the actual output. Based on how big the error is, the neural network adjusts itself and retrains until it gets closer to the solution. You can learn more about neural networks here: http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html. We will be using a library called NeuroLab . You can find more about it here: https://pythonhosted.org/neurolab. You can install it by running the following command on your Terminal: $ pip3 install neurolab Once you have installed it, you can proceed to the next section. Building a perceptron based classifier Perceptron is the building block of an artificial neural network. It is a single neuron that takes inputs, performs computation on them, and then produces an output. It uses a simple linear function to make the decision. Let’s say we are dealing with an N-dimension input datapoint. A perceptron computes the weighted summation of those N numbers and it then adds a constant to produce the output. The constant is called the bias of the neuron. It is remarkable to note that these simple perceptrons are used to design very complex deep neural networks. Let’s see how to build a perceptron based classifier using NeuroLab. Create a new python file and import the following packages: importnumpy as np importmatplotlib.pyplot as plt importneurolab as nl Load the input data from the text file data_perceptron.txt provided to you. Each line contains space separated numbers where the first two numbers are the features and the last number is the label: # Load input data text = np.loadtxt(‘data_perceptron.txt’) Separate the text into datapoints and labels: # Separate datapoints and labels data = text[:, :2] labels = text[:, 2].reshape((text.shape[0], 1)) Plot the datapoints: # Plot input data plt.figure() plt.scatter(data[:,0], data[:,1]) plt.xlabel(‘Dimension 1’) plt.ylabel(‘Dimension 2’) plt.title(‘Input data’) Define the maximum and minimum values that each dimension can take: # Define minimum and maximum values for each dimension dim1_min, dim1_max, dim2_min, dim2_max = 0, 1, 0, 1 Since the data is separated into two classes, we just need one bit to represent the output. So the output layer will contain a single neuron. # Number of neurons in the output layer num_output = labels.shape[1] We have a dataset where the datapoints are 2-dimensional. Let’s define a perceptron with 2 input neurons where we assign one neuron for each dimension. # Define a perceptron with 2 input neurons (because we # have 2 dimensions in the input data) dim1 = [dim1_min, dim1_max] dim2 = [dim2_min, dim2_max] perceptron = nl.net.newp([dim1, dim2], num_output) Train the perceptron with the training data: # Train the perceptron using the data error_progress = perceptron.train(data, labels, epochs=100, show=20, lr=0.03) Plot the training progress using the error metric: # Plot the training progress plt.figure() plt.plot(error_progress) plt.xlabel(‘Number of epochs’) plt.ylabel(‘Training error’) plt.title(‘Training error progress’) plt.grid() plt.show() The full code is given in the file perceptron_classifier.py. If you run the code, you will get two output figures. The first figure indicates the input datapoints: The second figure represents the training progress using the error metric: As we can observe from the preceding figure, the error goes down to 0 at the end of fourth epoch. Constructing a single layer neural network A perceptron is a good start, but it cannot do much. The next step is to have a set of neurons act as a unit to see what we can achieve. Let’s create a single neural network that consists of independent neurons acting on input data to produce the output. Create a new python file and import the following packages: importnumpy as np importmatplotlib.pyplot as plt importneurolab as nl Load the input data from the file data_simple_nn.txt provided to you. Each line in this file contains 4 numbers. The first two numbers form the datapoint and the last two numbers are the labels. Why do we need to assign two numbers for labels? Because we have 4 distinct classes in our dataset, so we need two bits represent them. # Load input data text = np.loadtxt(‘data_simple_nn.txt’) Separate the data into datapoints and labels: # Separate it into datapoints and labels data = text[:, 0:2] labels = text[:, 2:] Plot the input data: # Plot input data plt.figure() plt.scatter(data[:,0], data[:,1]) plt.xlabel(‘Dimension 1’) plt.ylabel(‘Dimension 2’) plt.title(‘Input data’) Extract the minimum and maximum values for each dimension (we don’t need to hardcode it like we did in the previous section): # Minimum and maximum values for each dimension dim1_min, dim1_max = data[:,0].min(), data[:,0].max() dim2_min, dim2_max = data[:,1].min(), data[:,1].max() Define the number of neurons in the output layer: # Define the number of neurons in the output layer num_output = labels.shape[1] Define a single layer neural network using the above parameters: # Define a single-layer neural network dim1 = [dim1_min, dim1_max] dim2 = [dim2_min, dim2_max] nn = nl.net.newp([dim1, dim2], num_output) Train the neural network using training data: # Train the neural network error_progress = nn.train(data, labels, epochs=100, show=20, lr=0.03) Plot the training progress: # Plot the training progress plt.figure() plt.plot(error_progress) plt.xlabel(‘Number of epochs’) plt.ylabel(‘Training error’) plt.title(‘Training error progress’) plt.grid() plt.show() Define some sample test datapoints and run the network on those points: The full code is given in the file simple_neural_network.py. If you run the code, you will get two figures. The first figure represents the input datapoints: # Run the classifier on test datapoints print(‘nTest results:’) data_test = [[0.4, 4.3], [4.4, 0.6], [4.7, 8.1]] for item in data_test: print(item, ‘-->‘, nn.sim([item])[0]) The second figure shows the training progress: You will see the following printed on your Terminal: If you locate those test datapoints on a 2D graph, you can visually verify that the predicted outputs are correct. Constructing a multilayer neural network In order to enable higher accuracy, we need to give more freedom the neural network. This means that a neural network needs more than one layer to extract the underlying patterns in the training data. Let’s create a multilayer neural network to achieve that. Create a new python file and import the following packages: importnumpy as np importmatplotlib.pyplot as plt importneurolab as nl In the previous two sections, we saw how to use a neural network as a classifier. In this section, we will see how to use a multilayer neural network as a regressor. Generate some sample datapoints based on the equation y = 3x^2 + 5 and then normalize the points: # Generate some training data min_val = -15 max_val = 15 num_points = 130 x = np.linspace(min_val, max_val, num_points) y = 3 * np.square(x) + 5 y /= np.linalg.norm(y) Reshape the above variables to create a training dataset: # Create data and labels data = x.reshape(num_points, 1) labels = y.reshape(num_points, 1) Plot the input data: # Plot input data plt.figure() plt.scatter(data, labels) plt.xlabel(‘Dimension 1’) plt.ylabel(‘Dimension 2’) plt.title(‘Input data’) Define a multilayer neural network with 2 hidden layers. You are free to design a neural network any way you want. For this case, let’s have 10 neurons in the first layer and 6 neurons in the second layer. Our task is to predict the value, so the output layer will contain a single neuron. # Define a multilayer neural network with 2 hidden layers; # First hidden layer consists of 10 neurons # Second hidden layer consists of 6 neurons # Output layer consists of 1 neuron nn = nl.net.newff([[min_val, max_val]], [10, 6, 1]) Set the training algorithm to gradient descent: # Set the training algorithm to gradient descent nn.trainf = nl.train.train_gd Train the neural network using the training data that was generated: # Train the neural network error_progress = nn.train(data, labels, epochs=2000, show=100, goal=0.01) Run the neural network on the training datapoints: # Run the neural network on training datapoints output = nn.sim(data) y_pred = output.reshape(num_points) Plot the training progress: # Plot training error plt.figure() plt.plot(error_progress) plt.xlabel(‘Number of epochs’) plt.ylabel(‘Error’) plt.title(‘Training error progress’) Plot the predicted output: # Plot the output x_dense = np.linspace(min_val, max_val, num_points * 2) y_dense_pred = nn.sim(x_dense.reshape(x_dense.size,1)).reshape(x_dense.size) plt.figure() plt.plot(x_dense, y_dense_pred, ‘-’, x, y, ‘.’, x, y_pred, ‘p’) plt.title(‘Actual vs predicted’) plt.show() The full code is given in the file multilayer_neural_network.py. If you run the code, you will get three figures. The first figure shows the input data: The second figure shows the training progress: The third figure shows the predicted output overlaid on top of input data: The predicted output seems to follow the general trend. If you continue to train the network and reduce the error, you will see that the predicted output will match the input curve even more accurately. You will see the following printed on your Terminal: Summary In this article, we learnt more about artificial neural networks. We discussed how to build and train neural networks. We also talked about perceptron and built a classifier based on that. We also learnt about single layer neural networks as well as multilayer neural networks. Resources for Article: Further resources on this subject: Training and Visualizing a neural network with R [article] Implementing Artificial Neural Networks with TensorFlow [article] How to do Machine Learning with Python [article]
Read more
  • 0
  • 0
  • 24394

article-image-text-recognition
Packt
04 Jan 2017
7 min read
Save for later

Text Recognition

Packt
04 Jan 2017
7 min read
In this article by Fábio M. Soares and Alan M.F. Souza, the authors of the book Neural Network Programming with Java - Second Edition, we will cover pattern recognition, neural networks in pattern recognition, and text recognition (OCR). We all know that humans can read and recognize images faster than any supercomputer; however we have seen so far that neural networks show amazing capabilities of learning through data in both supervised and unsupervised way. In this article we present an additional case of pattern recognition involving an example of Optical Character Recognition (OCR). Neural networks can be trained to strictly recognize digits written in an image file. The topics of this article are: Pattern recognition Defined classes Undefined classes Neural networks in pattern recognition MLP Text recognition (OCR) Preprocessing and Classes definition (For more resources related to this topic, see here.) Pattern recognition Patterns are a bunch of data and elements that look similar to each other, in such a way that they can occur systematically and repeat from time to time. This is a task that can be solved mainly by unsupervised learning by clustering; however, when there are labelled data or defined classes of data, this task can be solved by supervised methods. We as humans perform this task more often than we can imagine. When we see objects and recognize them as belonging to a certain class, we are indeed recognizing a pattern. Also when we analyze charts discrete events and time series, we might find an evidence of some sequence of events that repeat systematically under certain conditions. In summary, patterns can be learned by data observations. Examples of pattern recognition tasks include, not liming to: Shapes recognition Objects classification Behavior clustering Voice recognition OCR Chemical reactions taxonomy Defined classes In the existence of a list of classes that has been predefined for a specific domain, then each class is considered to be a pattern, therefore every data record or occurrence is assigned one of these predefined classes. The predefinition of classes can usually be performed by an expert or based on a previous knowledge of the application domain. Also it is desirable to apply defined classes when we want the data to be classified strictly into one of the predefined classes. One illustrated example for pattern recognition using defined classes is animal recognition by image, shown in the next figure. The pattern recognizer however should be trained to catch all the characteristics that formally define the classes. In the example eight figures of animals are shown, belonging to two classes: mammals and birds. Since this is a supervised mode of learning, the neural network should be provided with a sufficient number of images that allow it to properly classify new images: Of course, sometimes the classification may fail, mainly due to similar hidden patterns in the images that neural networks may catch and also due to small nuances present in the shapes. For example, the dolphin has flippers but it is still a mammal. Sometimes in order to obtain a better classification, it is necessary to apply preprocessing and ensure that the neural network will receive the appropriate data that would allow for classification. Undefined classes When data are unlabeled and there is no predefined set of classes, it is a scenario for unsupervised learning. Shapes recognition are a good example since they may be flexible and have infinite number of edges, vertices or bindings: In the previous figure, we can see some sorts of shapes and we want to arrange them, whereby the similar ones can be grouped into the same cluster. Based on the shape information that is present in the images, it is likely for the pattern recognizer to classify the rectangle, the square and the rectangular triangle in into the same group. But if the information were presented to the pattern recognizer, not as an image, but as a graph with edges and vertices coordinates, the classification might change a little. In summary, the pattern recognition task may use both supervised and unsupervised mode of learning, basically depending of the objective of recognition. Neural networks in pattern recognition For pattern recognition the neural network architectures that can be applied are the MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is. the data should be transformed into the X-Y dataset, where for every data record in X there should be a corresponding class in Y. The output of the neural network for classification problems should have all of the possible classes, and this may require preprocessing of the output records. For the other case, the unsupervised learning, there is no need to apply labels on the output, however, the input data should be properly structured as well. To remind the reader the schema of both neural networks are shown in the next figure: Data pre-processing We have to deal with all possible types of data, that is., numerical (continuous and discrete) and categorical (ordinal or unscaled). But here we have the possibility to perform pattern recognition on multimedia content, such as images and videos. So how could multimedia be handled? The answer of this question lies in the way these contents are stored in files. Images, for example, are written with a representation of small colored points called pixels. Each color can be coded in an RGB notation where the intensity of red, green and blue define every color the human eye is able to see. Therefore an image of dimension 100x100 would have 10,000 pixels, each one having 3 values for red, green and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods, may reduce this huge number of dimensions. Afterwards an image can be treated as big matrix of numerical continuous values. For simplicity in this article we are applying only gray-scaled images with small dimension. Text recognition (OCR) Many documents are now being scanned and stored as images, making necessary the task of converting these documents back into text, for a computer to apply edition and text processing. However, this feature involves a number of challenges: Variety of text font Text size Image noise Manuscripts In the spite of that, humans can easily interpret and read even the texts written in a bad quality image. This can be explained by the fact that humans are already familiar with the text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on), in order to successfully recognize texts in images. Digits recognition Although there are a variety of tools available in the market for OCR, this remains still a big challenge for an algorithm to properly recognize texts in images. So we will be restricting our application in a smaller domain, so we could face simpler problems. Therefore, in this article we are going to implement a Neural Network to recognize digits from 0 to 9 represented on images. Also the images will have standardized and small dimensions, for simplicity purposes. Summary In this article we have covered pattern recognition, neural networks in pattern recognition, and text recognition (OCR). Resources for Article: Further resources on this subject: Training neural networks efficiently using Keras [article] Implementing Artificial Neural Networks with TensorFlow [article] Training and Visualizing a neural network with R [article]
Read more
  • 0
  • 0
  • 1859

article-image-dimensionality-reduction
Packt
03 Jan 2017
15 min read
Save for later

Dimensionality Reduction

Packt
03 Jan 2017
15 min read
In this article by Ashish Kumar and Avinash Paul the authors of the book Mastering Text Mining with R, we will Data volume and high dimensions pose an astounding challenge in text mining tasks. Inherent noise and the computational cost of processing huge amount of datasets make it even more sarduous. The science of dimensionality reduction lies in the art of losing out on only a commensurately small amount of information and still being able to reduce the high dimension space into a manageable proportion. (For more resources related to this topic, see here.) For classification and clustering techniques to be applied to text data, for different natural language processing activities, we need to reduce the dimensions and noise in the data so that each document can be represented using fewer dimensions, thus significantly reducing the noise that can hinder the performance. The curse of dimensionality Topic modeling and document clustering are common text mining activities, but the text data can be very high-dimensional, which can cause a phenomenon called curse of dimensionality. Some literature also calls it concentration of measure: Distance is attributed to all the dimensions and assumes each of them to have the same effect on the distance. The higher the dimensions, the more similar things appear to each other. The similarity measures do not take into account the association of attributes, which may result in inaccurate distance estimation. The number of samples required per attribute increases exponentially with the increase in dimensions. A lot of dimensions might be highly correlated with each other, thus causing multi-collinearity. Extra dimensions cause a rapid volume increase that can result in high sparsity, which is a major issue in any method that requires statistical significance. Also, it causes huge variance in estimates, near duplicates, and poor predictors. Distance concentration and computational infeasibility Distance concentration is a phenomenon associated with high-dimensional space wherein pairwise distances or dissimilarity between points appear indistinguishable. All the vectors in high dimensions appear to be orthogonal to each other. The distances between each data point to its neighbors, farthest or nearest, become equal. This totally jeopardizes the utility of methods that use distance based measures. Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove to be insufficient to accurately estimate the parameters. For the datasets with number of dimensions d, the number of parameters in the covariance matrix will be d^2. In an ideal scenario, n should be much larger than d^2, to avoid overfitting. In general, there is an optimal number of dimensions to use for a given fixed number of samples. While it may feel like a good idea to engineer more features, if we are not able to solve a problem with less number of features. But the computational cost and model complexity increases with the rise in number of dimensions. For instance, if n number of samples look to be dense enough for a one-dimensional feature space. For a k-dimensional feature space, n^k samples would be required. Dimensionality reduction Complex and noisy characteristics of textual data with high dimensions can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics. Though the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to have minimum number of dimensions, which can preserve the intrinsic dimensionality of the data. A textual collection is mostly represented in form of a term document matrix wherein we have the importance of each term in a document. The dimensionality of such a collection increases with the number of unique terms. If we were to suggest the simplest possible dimensionality reduction method, that would be to specify the limit or boundary on the distribution of different terms in the collection. Any term that occurs with a significantly high frequency is not going to be informative for us, and the barely present terms can undoubtedly be ignored and considered as noise. Some examples of stop words are is, was, then, and the. Words that generally occur with high frequency and have no particular meaning are referred to as stop words. Words that occur just once or twice are more likely to be spelling errors or complicated words, and hence both these and stop words should not be considered for modeling the document in the Term Document Matrix (TDM). We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R. Principal component analysis Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. princomp() fails in situations if the number of variables is larger than the number of observations. Each function returns a list whose class is prcomp() or princomp(). The information returned and terminology is summarized in the following table: prcomp() princomp() Explanation sdev sdev Standard deviation of each column Rotations Loading Principle components Center Center Subtracted value of each row or column to get the center data Scale Scale Scale factors used X Score The rotated data   n.obs Number of observations of each variable   Call The call to function that created the object Here's a list of the functions available in different R packages for performing PCA: PCA(): FactoMineR package acp(): amap package prcomp(): stats package princomp(): stats package dudi.pca(): ade4 package pcaMethods: This package from Bioconductor has various convenient  methods to compute PCA Understanding the FactoMineR package FactomineR is a R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package not only deals with quantitative data but also categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package: library(FactoMineR) data<-replicate(10,rnorm(1000)) result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T) print(result.pca) Results for the principal component analysis (PCA). The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the following objects: Name Description $eig Eigenvalues $var Results for the variables $var$coord coord. for the variables $var$cor Correlations variables - dimensions $var$cos2 cos2 for the variables $var$contrib Contributions of the variables $ind Results for the individuals $ind$coord coord. for the individuals $ind$cos2 cos2 for the individuals $ind$contrib Contributions of the individuals $call Summary statistics $call$centre Mean of the variables $call$ecart.type Standard error of the variables $call$row.w Weights for the individuals $call$col.w Weights for the variables Eigenvalue percentage of variance cumulative percentage of variance: comp 1 1.1573559 12.859510 12.85951 comp 2 1.0991481 12.212757 25.07227 comp 3 1.0553160 11.725734 36.79800 comp 4 1.0076069 11.195632 47.99363 comp 5 0.9841510 10.935011 58.92864 comp 6 0.9782554 10.869505 69.79815 comp 7 0.9466867 10.518741 80.31689 comp 8 0.9172075 10.191194 90.50808 comp 9 0.8542724 9.491916 100.00000 Amap package Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which does PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has slightly different graphic represention. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/lLibrary(amap/amap.pdf Library(amap acp(data,center=TRUE,reduce=TRUE) Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package: acpgen(data,h1,h2,center=TRUE,reduce=TRUE,kernel="gaussien") K(u,kernel="gaussien") W(x,h,D=NULL,kernel="gaussien") acprob(x,h,center=TRUE,reduce=TRUE,kernel="gaussien") Proportion of variance We look to construct components and to choose from them, the minimum number of components, which explains the variance of data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigen facts, and digits: pca_base<-prcomp(data) print(pca_base) The pca_base object contains the standard deviation and rotations of the vectors. Rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains: pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100 pr_variance [1] 11.678126 11.301480 10.846161 10.482861 10.176036 9.605907 9.498072 [8] 9.218186 8.762572 8.430598 pr_variance signifies the proportion of variance explained by each component in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components: cumsum(pr_variance) [1] 11.67813 22.97961 33.82577 44.30863 54.48467 64.09057 73.58864 [8] 82.80683 91.56940 100.00000 Components 1-8 explain the 82% variance in the data. Singular vector decomposition Singular vector decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity in recent times after the famous Netflix Movie Recommendation challenge. Since its inception, it has found its usage in many applications in statistics, mathematics, and signal processing. It is primarily a technique to factorize any matrix; it can be real or a complex matrix. A rectangular matrix can be factorized into two orthonormal matrices and a diagonal matrix of positive real values. An m*n matrix is considered as m points in n-dimensional space; SVD attempts to find the best k dimensional subspace that fits the data: SVD in R is used to compute approximations of singular values and singular vectors of large-scale data matrices. These approximations are made using different types of memory-efficient algorithm, and IRLBA is one of them (named after Lanczos bi-diagonalization (IRLBA) algorithm). We shall be using the irlba package here in order to implement SVD. Implementation of SVD using R The following code will show the implementation of SVD using R: # List of packages for the session packages = c("foreach", "doParallel", "irlba") # Install CRAN packages (if not already installed) inst <- packages %in% installed.packages() if(length(packages[!inst]) > 0) install.packages(packages[!inst]) # Load packages into session lapply(packages, require, character.only=TRUE) # register the parallel session for registerDoParallel(cores=detectCores(all.tests=TRUE)) std_svd <- function(x, k, p=25, iter=0 1 ) { m1 <- as.matrix(x) r <- nrow(m1) c <- ncol(m1) p <- min( min(r,c)-k,p) z <- k+p m2 <- matrix ( rnorm(z*c), nrow=c, ncol=z) y <- m1 %*% m2 q <- qr.Q(qr(y)) b<- t(q) %*% m1 #iterations b1<-foreach( i=i1:iter ) %dopar% { y1 <- m1 %*% t(b) q1 <- qr.Q(qr(y1)) b1 <- t(q1) %*% m1 } b1<-b1[[iter]] b2 <- b1 %*% t(b1) eigens <- eigen(b2, symmetric=T) result <- list() result$svalues <- sqrt(eigens$values)[1:k] u1=eigens$vectors[1:k,1:k] result$u <- (q %*% eigens$vectors)[,1:k] result$v <- (t(b) %*% eigens$vectors %*% diag(1/eigens$values))[,1:k] return(result) } svd<- std_svd(x=data,k=5)) # singular vectors svd$svalues [1] 35.37645 33.76244 32.93265 32.72369 31.46702 We obtain the following values after running SVD using the IRLBA algorithm: d: approximate singular values. u: nu approximate left singular vectors v: nv approximate right singular vectors iter: # of IRLBA algorithm iterations mprod: # of matrix vector products performed These values can be used for obtaining results of SVD and understanding the overall statistics about how the algorithm performed. Latent factors # svd$u, svd$v dim(svd$u) #u value after running IRLBA [1] 1000 5 dim(svd$v) #v value after running IRLBA [1] 10 5 A modified version of the previous function can be achieved by altering the power iterations for a robust implementation: foreach( i = 1:iter )%dopar% { y1 <- m1 %*% t(b) y2 <- t(y1) %*% y1 r2 <- chol(y2, pivot = T) q1 <- y2 %*% solve(r2) b1 <- t(q1) %*% m1 } b2 <- b1 %*% t(b1) Some other functions available in R packages are as follows: Functions Package svd() svd Irlba() irlba svdImpute bcv ISOMAP – moving towards non-linearity ISOMAP is a nonlinear dimension reduction method and is representative of isometric mapping methods. ISOMAP is one of the approaches for manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic manifold inter-point distances. Like multi-dimensional scaling, ISOMAP creates a visual presentation of distance of a number of objects. Geodesic is the shortest curve along the manifold connecting two points induced by a neighborhood graph. Multi-dimensional scaling uses the Euclidian distance measure; since the data is in a nonlinear format, ISOMPA uses geodesic distance. ISOMAP can be viewed as an extension of metric multi-dimensional scaling. At a very high level, ISOMAP can be describes in four steps: Determine the neighbor of each point Construct a neighborhood graph Compute the shortest distance path between all pairs Construct k-dimensional coordinate vectors by applying MDS Geodesic distance approximation is basically calculated in three ways: Neighboring points: input-space distance Faraway points: a sequence of short hops between neighboring points Method: Finding shortest paths in a graph with edges connecting neighboring data points source("http://bioconductor.org/biocLite.R") biocLite("RDRToolbox") library('RDRToolbox') swiss_Data=SwissRoll(N = 1000, Plot=TRUE) x=SwissRoll() open3d() plot3d(x, col=rainbow(1050)[-c(1:50)],box=FALSE,type="s",size=1) simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE) library(vegan)data(BCI) distance <- vegdist(BCI) tree <- spantree(dis) pl1 <- ordiplot(cmdscale(dis), main="cmdscale") lines(tree, pl1, col="red") z <- isomap(distance, k=3) rgl.isomap(z, size=4, color="red") pl2 <- plot(isomap(distance, epsilon=0.5), main="isomap epsilon=0.5") pl3 <- plot(isomap(distance, k=5), main="isomap k=5") pl4 <- plot(z, main="isomap k=3") Summary The idea of this article was to get you familiar with some of the generic dimensionality reduction methods and their implementation using R language. We discussed a few packages that provide functions to perform these tasks. We also covered a few custom functions that can be utilized to perform these tasks. Kudos, you have completed the basics of text mining with R. You must be feeling confident about various data mining methods, text mining algorithms (related to natural language processing of the texts) and after reading this article, dimensionality reduction. If you feel a little low on confidence, do not be upset. Turn a few pages back and try implementing those tiny code snippets on your own dataset and figure out how they help you understand your data. Remember this - to mine something, you have to get into it by yourself. This holds true for text as well. Resources for Article: Further resources on this subject: Data Science with R [Article] Machine Learning with R [Article] Data mining [Article]
Read more
  • 0
  • 0
  • 4433
article-image-notes-field
Packt
03 Jan 2017
7 min read
Save for later

Notes from the field

Packt
03 Jan 2017
7 min read
In this article by Donabel Santos author of the book Tableau 10 Business Intelligence Cookbook would like to offer you perhaps a personal, and maybe a not-so-conventional way to introduce Tableau. I’d like to highlight a few key concepts and tricks that I think would be useful to you as you go along. These are certainly points I highlight on the board whenever I do training on Tableau. If you feel like we are jumping too far ahead, please go ahead and start with the following section Tableau Primer. Come back to this section when you are ready for the tips and tricks. (For more resources related to this topic, see here.) Instead of thinking of Tableau as this software tool that has a steep learning curve, it is useful to think of it as a blank slate. You will draw on it, keep on adding things, removing things until something makes sense or something insightful pops out. After you work with Tableau for a while and get more comfortable with its functionalities, it might even feel like an extension of your brain to some degree. When you get access to data, you might automatically open Tableau to try and understand what’s in that data. Undo is your best friend Do not be afraid to make mistakes, and do not be afraid to explore in Tableau. Do not come in with strict prejudice – for example thinking that you can only use a time series graph when you have a measure and a date field. The best way to learn and explore how powerful Tableau is to try anything and everything. It’s one of the best tools to experiment. If you make a mistake, or if you don’t like what you see, no sweat. Just click on this friendly undo button and you are back to your previous view. If you are more of a shortcut person, it will be Ctrl + Z on a PC or Command + Z on a Mac. It doesn’t change your original data This is another common concern that comes up in my training sessions or whenever I talk to people about Tableau. No, Tableau does not write back to your data source. All the changes you make will be stored in Tableau like creating calculated fields, changing data types, editing aliases will be stored in your Tableau workbook or data source. Drag and drop Tableau is a highly drag and drop software. Although you can use the menu or a right click instead of a drag and drop for the same tasks, dragging and dropping is often faster. It also flows with your train of thought. Look for visual cues Tableau leverages its visual culture in your design area, so when you create views in Tableau, some of the visual cues and icons can help you along the way. A number of the visual cues have been discussed in this section. However, there may be some lesser known (or less noticeable) visual cues: Italicized field names mean they are Tableau-generated fields: Dual axis charts create fused pills. Notice the area when the two pills touch – they’re straight instead of curved: When you zoom in to maps, or when you search for a place, your map gets pinned (or fixed to this place) until you unpin it: Know the difference between blue (discrete) and green (continuous) Knowing the difference between blue and green will take you far in the Tableau world. The data type icons you will find beside your field names in the side bar are colored either blue or green. When you drag fields onto shelves and cards, the pills are also colored blue and green. Simply speaking, blue means discrete and green means continuous. Discrete means individual, separate, countable and finite. Continuous means range, and technically, there is an infinite number of values within this range. What’s more important is how these are manifested in Tableau. A blue discrete field will produce header, and a green continuous field will produce an axis. If dropped onto the Color shelf, for example, a blue discrete field will use individual, finite colors. A green continuous field will use a range (gradient) of colors. Some confusion also arises when we see that, by default, Tableau places numeric fields under Measures and are colored green, and categorical information under Dimensions are colored blue. These won’t always be the case. We can have numeric values that are discrete – for example an Order Number. We can also see non-numerical, discrete fields under Measures. Learn a few key shortcuts Shortcuts are great, but it’s typically faster to work when you know a few of them. Here are some of my favorite shortcuts: Shortcut What it does Right click + Drag Opens the Drop Field menu, which allows you to specify exactly which variation of the field you want to use Double click Adds the field to the view I particularly like this when creating text tables. After you place your first measure in Text, you can add more measures to your text table by double clicking on the succeeding measures Ctrl + Arrow Adjusts the height/width of the rows/columns in the view Ctrl + H Presentation mode You can find the complete list of shortcuts here: http://bit.ly/tableau-shortcuts Unpackage option The .twbx file is a Tableau packaged workbook, which means it packages local files with your Tableau workbook. When you right click a .twbx file in a machine that has Tableau Desktop installed in it, you will see a new option called Unpackage. When you unpack a .twbx file, you will get the .twb file and another folder that contains all the local files that were used in the original workbook: Just keep in mind that data (at least the file-based data sources and extracts) get packaged with your .twbx files. This is an important security and data governance consideration when you are deciding how to share your workbooks with others. Table calculations are calculations on your table. How you structure or lay out your table (or view) will affect your table calculations. Table calculations are highly influenced by: Layout Filters Scope and Direction Let’s say, for example, you are calculating Percent of Total in your view. If you swap the fields in your Rows and Columns, i.e. changing the layout, your numbers will change If you filter some of the products out, your numbers will change If you decide to compute Pane Down instead of Table Across, your numbers will change If you’re looking for the common use cases for table calculations, check out the Tableau article entitled Top 10 Tableau Table Calculations which can be found here: http://bit.ly/top10tablecalcs LODs Rock Many of the tasks that required complex table calculations or data blending have been greatly simplified by LODs (Level of Detail expressions). LODs allow us to have multiple levels of detail within a single view, and this increases the possibilities in Tableau. To learn more about Level of Detail expressions, I encourage you to check out the following: Understanding Level of Detail Expressions: http://bit.ly/UnderstandingLOD Top 15 LOD Expressions: http://bit.ly/top15LOD It is possible …. Another common question that comes up is can I do <this> or is it possible to do <this>. The answer to many of the questions is yes, and many will include calculations and/or parameters. However, not all solutions will be quick and straightforward. Some may require multiple calculated fields, table calculations, LOD expressions, regular expressions, R scripts etc. Summary In this article we have seen the basics of Tableau as this software tool that has a steep learning curve, it is useful to think of it as a blank slate. You will draw on it, keep on adding things, removing things until something makes sense or something insightful pops out. After you work with Tableau for a while and get more comfortable with its functionalities, it might even feel like an extension of your brain to some degree. When you get access to data, you might automatically open Tableau to try and understand what’s in that data. Resources for Article: Further resources on this subject: Say Hi to Tableau [article] Getting Started with Tableau Public [article] R and its Diverse Possibilities [article]
Read more
  • 0
  • 0
  • 3356

article-image-recommendation-engines-explained
Packt
02 Jan 2017
10 min read
Save for later

Recommendation Engines Explained

Packt
02 Jan 2017
10 min read
In this article written by Suresh Kumar Gorakala, author of the book Building Recommendation Engines we will learn how to build a basic recommender system using R. In this article we will learn about various types of recommender systems in detail. This article explains neighborhood-similarity-based recommendations, personalized recommendation engines, model-based recommender systems and hybrid recommendation engines. Following are the different subtypes of recommender systems covered in this article: Neighborhood-based recommendation engines User-based collaborative filtering Item-based collaborative filtering Personalized recommendation engines Content-based recommendation engines Context-aware recommendation engines (For more resources related to this topic, see here.) Neighborhood-based recommendation engines As the name suggests, neighborhood-based recommender systems considers the preferences or likes of other users in the neighborhood before making suggestions or recommendations to the active user. While considering the preferences or tastes of neighbors, we first calculate how similar the other users are to the active user and then new items from more similar users are recommended to the user. Here the active user is the person to whom the system is serving recommendations. Since similarity calculations are involved these recommender systems are also called similarity-based recommender systems. Also since preferences or tastes are considered collaboratively from a pool of users these recommender systems are also called as collaborative filtering recommender systems. In this type of systems the main actors are the users, products and users preference information such as rating/ranking/liking towards the products. Preceding image is an example from Amazon showing collaborative filtering The collaborative filtering systems come in two flavors, They are: User-based collaborative filtering Item-based collaborative filtering Collaborative filtering When we have only the users interaction data of the products such as ratings, like/unlike, view/not viewed and we have to recommend new products then we have to choose Collaborative filtering approach. User-based Collaborative Filtering The basic intuition behind user-based collaborative filtering systems is, people with similar tastes in the past like similar items in future as well. For example, if user A and user B have very similar purchase history and if user A buys a new book which user B has not yet seen then we can suggest this book to User B as they have similar tastes. Item-based Collaborative filtering In this type of recommender systems unlike user-based collaborative filtering, we use similarity between items instead of similarity between users. Basic intuition for item-based recommender systems is if a User likes item A in past he might like item B which is similar to item A. In this approach instead of calculating similarity between users we calculate similarity between items or products. The most common similarity measure used for this approach is cosine similarity. Like user-based collaborative approach, we project the data into vector space and similarity between items is calculated using cosine angle between items. Similar to user-based collaborative filtering approach there are two steps for item-based collaborative approach. They are: Calculating the similarity between items. Predicting the ratings for the non rated item for a active user by making use of previous ratings given to other similar items Advantages of user-based collaborative filtering Easy to implement Neither the content information of the products nor users profile information is required for building recommendations New items are recommended to users giving Surprise factor to the users Disadvantages of user-based collaborative filtering This approach is computationally expensive as all the user information, product, rating information is loaded into the memory for similarity calculations. This approach fails for new users where we do not have any information about the users. This problem is called cold-start problem. This approach performs very poor if we have little data. Since we do not have content information about users or products. We cannot generate recommendations accurately only based on rating information only. Content-based recommender systems The recommendations are generated by considering only the rating or interaction information of the products by the users, that is suggesting new items for the active user are based on the ratings given to those new items by similar users to the active user. Assume the case of a person has given 4 star rating to a movie. In a collaborative filtering approach we only consider this rating information for generating recommendations. In real, a person rates a movie based on the features or content of the movie such as its genre, actor, director, story, and screenplay. Also the person watches a movie based on his personal choices. When we are building a recommendation engine to target users at personal level, the recommendations should not be based on the tastes of other similar people but should be based on the individual users’ tastes and the contents of the products. A recommendation which is targeted at personalized level and which considers individual preferences and contents of the products for generating recommendations are called content-based recommender systems. Another motivation for building content-based recommendation engines are they solve the cold start problem which new users face in collaborative filtering approach. When a new user comes based on the preferences of the person we can suggest new items which are similar to his tastes. Building content-based recommender systems involves three main steps as follows: Generating content information for products. Generating a user profile, preferences with respect to the features of the products. Generating recommendations predicting list of items which the user might like. Let us discuss each step in detail: Content extraction: In this step, we extract the features that represent the product. Most commonly the content of the products is represented in the vector space model with products name as rows and features as columns. User Profile generation: In this step, we build the user profile or preference matrix or vector space model matching the products content. Generating Recommendations: Now that, we have the generated the product content and user profile, the next step will be to generate the recommendations. Recommender systems using machine learning or any other mathematical, statistical models to generate recommendations are called as model-based systems Cosine similarity In this approach we first represent the user profiles and product content in vector forms and then we take cosine angle between each vector. The product which forms less angle with the user profile is considered as the most preferable item for the user. This approach is a standard approach while using neighborhood approach for Content based recommendations. Empirical studies shown that this approach gives more accurate results compared to other similarity measures. Classification-based approach Classification-based approaches fall under model-based recommender systems. In this approach, first we build a machine learning model by using the historical information, with user profile similar to the product content as input and the like/dislike of the product as output response classes. Supervised classification tasks such as logistic regression, KNN-classification methods, probabilistic methods and so on can be used. Advantages Content-based recommender systems are targeting at individual level Recommendations are generated using the user preferences alone unlike from user community as in collaborative filtering This approaches can be employed at real time as recommendation model doesn’t need to load the entire data for processing or generating recommendations Accuracy is high compared to collaborative approaches as they deal with the content of the products instead of rating information alone Cold start problem can be easily handled Disadvantages As the system is more personalized and the generated recommendations will become narrowed down to only user preferences with more and more user information comes into the system As a result, no new products that are not related to the user preferences will be shown to the user The user will not be able to look at what is happening around or what’s trending around Context-aware recommender Systems Over the years there has been evolution in recommender systems from neighborhood approaches to personalized recommender systems which are targeted to the individual users. These personalized recommender systems have become a huge success as this is useful at end user level and for organizations these systems become catalysts to increase their business. The personalized recommender systems, also called as content-based recommender systems are also getting evolved into Context aware recommender systems. Though the personalized recommender systems are targeted at individual user level and caters recommendations based on the personal preferences of the users, still there was scope to improve or refine the systems. Same person at different places might have different requirements. Likewise same person has different requirements at different times. Our intelligent recommender systems should be evolved enough to cater to the needs of the users for different places, at different times. Recommender System should be robust enough to suggest cotton shirts to a person during summer and suggesting Leather Jacket in winter. Similarly based on the time of the day suggesting Good restaurants serving a person’s personal choice breakfast and dinner would be very helpful. These kinds of recommender systems which considers location, time, mood, and so on that defines the context of user and suggests personalized recommendations are called context aware recommender systems. At broad level, context aware recommender systems are content-based recommenders with the inclusion of new dimension called context. In context aware systems, recommendations are generated in two steps: Generating list of recommendations of products for each user based on users’ preferences, that is content-based recommendations. Filtering out the recommendations that are specific to a current context. For example, based on past transaction history, interaction information, browsing patterns, ratings information on e-wallet mobile app, assume that User A is a movie lover, Sports lover, fitness freak. Using this information the content-based recommender systems generate recommendations of products such as Movie Tickets, 4G data offer for watching Football matches, Discount offers at GYM. Now based on the GPS co-ordinates of the mobile if the User A found to be at a 10K RUN marathon, then my Context aware recommendation engine will take this location information as the context and filters out the offers that are relevant to the current context and recommends Discount Offers at GYM to the user A. Most common approaches for building Context Aware Recommender systems are: Post filtering Approaches Pre-filtering approaches Pre-filtering approaches In pre-filtering approach, context information is applied to the User profile and product content. This step will filter out all the non relevant features and final personalized recommendations are generated on remaining feature set. Since filtering of features are made before generating personalized recommendations, these are called pre-filtering approaches. Post filtering approaches In post-filtering, firstly personalized recommendations are generated based on the user profile and product catalogue then the context information is applied for filtering out the relevant products to the user for the current context. Advantages Context aware systems are much advanced than the personalized content-based recommenders as these systems will be constantly in sync with user movements and generate recommendations as per current context. These systems are more real-time nature. Disadvantages Serendipity or surprise factor as in other personalized recommenders will be missing in this type of recommendations as well. Summary In this article, we have learned about popular recommendation engine techniques such as, collaborative filtering, content-based recommendations, context aware systems, hybrid recommendations, model-based recommendation systems with their advantages and disadvantages. Different similarity methods such as cosine similarity, Euclidean distance, Pearson-coefficient. Sub categories within each of the recommendations are also explained. Resources for Article: Further resources on this subject: Building a Recommendation Engine with Spark [article] Machine Learning Tasks [article] Machine Learning with R [article]
Read more
  • 0
  • 0
  • 53536
Modal Close icon
Modal Close icon