How-To Tutorials - Data

1210 Articles

Unsupervised Learning

Packt
28 Sep 2016
11 min read
In this article by Bastiaan Sjardin, Luca Massaron, and Alberto Boschetti, the authors of the book Large Scale Machine Learning with Python, we will try to create new features and variables at scale in the observation matrix. We will introduce unsupervised methods and illustrate principal component analysis (PCA), an effective way to reduce the number of features.

Unsupervised methods

Unsupervised learning is a branch of machine learning whose algorithms reveal inferences from data without an explicit label (unlabeled data). The goal of such techniques is to extract hidden patterns and group similar data. In these algorithms, the unknown parameters of interest of each observation (the group membership and topic composition, for instance) are often modeled as latent variables: a series of hidden variables that cannot be observed directly, but only deduced from the past and present outputs of the system. Typically, the output of the system contains noise, which makes this operation harder.

Unsupervised methods are commonly used in two main situations:

With labeled datasets, to extract additional features to be processed by the classifier/regressor further down the processing chain. Enhanced by the additional features, the classifier or regressor may perform better.
With labeled or unlabeled datasets, to extract some information about the structure of the data. This class of algorithms is commonly used during the Exploratory Data Analysis (EDA) phase of modeling.

First of all, before starting with our illustration, let's import the modules that will be necessary throughout the article in our notebook:

In:
import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pylab
%matplotlib inline
import matplotlib.cm as cm
import copy
import tempfile
import os

Feature decomposition – PCA

PCA is an algorithm commonly used to decompose the dimensions of an input signal and keep just the principal ones. From a mathematical perspective, PCA performs an orthogonal transformation of the observation matrix, outputting a set of linearly uncorrelated variables, named principal components. The output variables form a basis set, where each component is orthonormal to the others. Also, it's possible to rank the output components (in order to use just the principal ones): the first component is the one containing the largest possible variance of the input dataset, the second is orthogonal to the first (by definition) and contains the largest possible variance of the residual signal, the third is orthogonal to the first two and contains the largest possible variance of the residual signal, and so on.

A generic transformation with PCA can be expressed as a projection to a space. If just the principal components are taken from the transformation basis, the output space will have a smaller dimensionality than the input one. Mathematically, it can be expressed as follows:

y = T · x

Here, x is a generic point of the training set of dimension N, T is the transformation matrix coming from PCA, and y is the output vector. Note that the symbol · indicates a dot product in this matrix equation. From a practical perspective, also note that all the features of x must be zero-centered before doing this operation. Let's now start with a practical example; later, we will explain the math behind PCA in depth.
In this example, we will create a dummy dataset composed of two blobs of points: one centered in (-5, 0) and the other one in (5, 5). Let's use PCA to transform the dataset and plot the output compared to the input. In this simple example, we will use all the features, that is, we will not perform feature reduction:

In:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(n_samples=1000, random_state=101, centers=[[-5, 0], [5, 5]])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_comp = pca.components_.T

test_point = np.matrix([5, -2])
test_point_pca = pca.transform(test_point)

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none')
plt.quiver(0, 0, pca_comp[:,0], pca_comp[:,1], width=0.02, scale=5, color='orange')
plt.plot(test_point[0, 0], test_point[0, 1], 'o')
plt.title('Input dataset')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.plot(test_point_pca[0, 0], test_point_pca[0, 1], 'o')
plt.title('After "lossless" PCA')

plt.show()

As you can see, the output is more organized than the original feature space and, if the next task were a classification, it would require just one feature of the dataset, saving almost 50% of the space and computation needed. In the image, you can clearly see the core of PCA: it's just a projection of the input dataset onto the transformation basis drawn in orange in the image on the left. Are you unsure about this? Let's test it:

In:
print "The blue point is in", test_point[0, :]
print "After the transformation is in", test_point_pca[0, :]
print "Since (X-MEAN) * PCA_MATRIX = ", np.dot(test_point - pca.mean_, pca_comp)

Out:
The blue point is in [[ 5 -2]]
After the transformation is in [-2.34969911 -6.2575445 ]
Since (X-MEAN) * PCA_MATRIX =  [[-2.34969911 -6.2575445 ]]

Now, let's dig into the core problem: how is it possible to generate T from the training set? It should contain orthonormal vectors, and the vectors should be ranked according to the quantity of variance (that is, the energy or information carried by the observation matrix) that they can explain. Many solutions have been implemented, but the most common implementation is based on Singular Value Decomposition (SVD).

SVD is a technique that decomposes any matrix M into three matrices (U, Σ, and W) with special properties and whose multiplication gives back M again:

M = U · Σ · W^T

Specifically, given M, a matrix of m rows and n columns, the resulting elements of the equivalence are as follows:

U is an m x m (square) matrix, it's unitary, and its columns form an orthonormal basis. They're named left singular vectors (or input singular vectors), and they're the eigenvectors of the matrix product M · M^T.
Σ is an m x n matrix, which has non-zero elements only on its diagonal. These values are named singular values; they are all non-negative and are the square roots of the eigenvalues of both M^T · M and M · M^T.
W is a unitary n x n (square) matrix, its columns form an orthonormal basis, and they're named right (or output) singular vectors. They are the eigenvectors of the matrix product M^T · M.

Why is this needed? The solution is pretty easy: the goal of PCA is to try to estimate the directions along which the variance of the input dataset is larger. For this, we first need to remove the mean from each feature and then operate on the covariance matrix X^T · X.
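Before continuing, here is a quick sanity check of this connection between SVD and PCA (my own addition, not code from the book excerpt). It reuses the X and pca objects from the example above and verifies that the right singular vectors of the zero-centered matrix are the components scikit-learn found (up to sign):

X_centered = X - X.mean(axis=0)                    # zero-center each feature
U, s, Wt = np.linalg.svd(X_centered, full_matrices=False)
# rows of Wt (that is, columns of W) are the principal directions
print(np.allclose(np.abs(Wt), np.abs(pca.components_)))   # True; signs may be flipped
# the singular values also recover the variance explained by each component
print(s ** 2 / (X_centered.shape[0] - 1))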
Given that, by decomposing the matrix X with SVD, we obtain the columns of the matrix W, which are the principal components of the covariance (that is, the matrix T we are looking for), the diagonal of Σ, which is related to the variance explained by each principal component, and the product U · Σ, which gives the projection of the observations onto the principal components. This is why PCA is always done with SVD.

Let's see it now on a real example. Let's test it on the Iris dataset, extracting the first two principal components (that is, passing from a dataset composed of four features to one composed of two):

In:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
print "Iris dataset contains", X.shape[1], "features"

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print "After PCA, it contains", X_pca.shape[1], "features"
print "The variance is [% of original]:", sum(pca.explained_variance_ratio_)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolors='none')
plt.title('First 2 principal components of Iris dataset')
plt.show()

Out:
Iris dataset contains 4 features
After PCA, it contains 2 features
The variance is [% of original]: 0.977631775025

This is the analysis of the outputs of the process:

The explained variance is almost 98% of the variance of the input.
The number of features has been halved, but only about 2% of the information is not in the output, hopefully just noise.
From a visual inspection, it seems that the different classes composing the Iris dataset are separated from each other. This means that a classifier working on such a reduced set will have comparable performance in terms of accuracy, but will be faster to train and run predictions.

As proof of this last point, let's now try to train and test two classifiers, one using the original dataset and another using the reduced set, and print their accuracy:

In:
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score

def test_classification_accuracy(X_in, y_in):
    X_train, X_test, y_train, y_test = train_test_split(X_in, y_in, random_state=101, train_size=0.50)
    clf = SGDClassifier('log', random_state=101)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

print "SGDClassifier accuracy on Iris set:", test_classification_accuracy(X, y)
print "SGDClassifier accuracy on Iris set after PCA (2 components):", test_classification_accuracy(X_pca, y)

Out:
SGDClassifier accuracy on Iris set: 0.586666666667
SGDClassifier accuracy on Iris set after PCA (2 components): 0.72

As you can see, this technique not only reduces the complexity and space of the learner down the chain, but also helps achieve generalization (exactly as a Ridge or Lasso regularization). Now, if you are unsure how many components should be in the output, a typical rule of thumb is to choose the minimum number that is able to explain at least 90% (or 95%) of the input variance. Empirically, such a choice usually ensures that only the noise is cut off.

So far, everything seems perfect: we found a great solution to reduce the number of features, building some with very high predictive power, and we also have a rule of thumb to guess the right number of them. Let's now check how scalable this solution is: we're investigating how it scales when the number of observations and features increases.
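Before moving on to scalability, here is a short illustration of that rule of thumb (again my own addition, not from the book excerpt). It picks the smallest number of components that explains at least 95% of the variance, reusing the Iris matrix X loaded above:

pca_full = PCA(n_components=None)                   # keep all the components
pca_full.fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1    # first index reaching the threshold
print(n_components)   # 2 for Iris: two components already explain ~97.8% of the variance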
The first thing to note is that the SVD algorithm, the core piece of PCA, is not stochastic; therefore, it needs the whole matrix in order to be able to extract its principal components. Now, let's see how scalable PCA is in practice on some synthetic datasets with an increasing number of features and observations. We will perform a full (lossless) decomposition (the n_components argument passed while instantiating the PCA object is None), as asking for a lower number of features doesn't impact the performance (it's just a matter of slicing the output matrices of SVD). In the following code, we first create matrices with 10 thousand points and 20, 50, 100, 250, 500, 1,000, and 2,500 features to be processed by PCA. Then, we create matrices with 100 features and 1, 5, 10, 25, 50, and 100 thousand observations to be processed with PCA:

In:
import time

def check_scalability(test_pca):
    pylab.rcParams['figure.figsize'] = (10, 4)

    # FEATURES
    n_points = 10000
    n_features = [20, 50, 100, 250, 500, 1000, 2500]
    time_results = []

    for n_feature in n_features:
        X, _ = make_blobs(n_points, n_features=n_feature, random_state=101)
        pca = copy.deepcopy(test_pca)
        tik = time.time()
        pca.fit(X)
        time_results.append(time.time()-tik)

    plt.subplot(1, 2, 1)
    plt.plot(n_features, time_results, 'o--')
    plt.title('Feature scalability')
    plt.xlabel('Num. of features')
    plt.ylabel('Training time [s]')

    # OBSERVATIONS
    n_features = 100
    n_observations = [1000, 5000, 10000, 25000, 50000, 100000]
    time_results = []

    for n_points in n_observations:
        X, _ = make_blobs(n_points, n_features=n_features, random_state=101)
        pca = copy.deepcopy(test_pca)
        tik = time.time()
        pca.fit(X)
        time_results.append(time.time()-tik)

    plt.subplot(1, 2, 2)
    plt.plot(n_observations, time_results, 'o--')
    plt.title('Observations scalability')
    plt.xlabel('Num. of training observations')
    plt.ylabel('Training time [s]')

    plt.show()

check_scalability(PCA(None))

Out: [two plots: training time versus number of features, and training time versus number of training observations]

As you can clearly see, PCA based on SVD is not scalable: if the number of features increases linearly, the time needed to train the algorithm increases much faster than linearly. Also, the time needed to process a matrix with a few hundred thousand observations becomes too high and (not shown in the image) the memory consumption makes the problem infeasible for a domestic computer (with 16 GB of RAM or less). It seems clear that a PCA based on SVD is not the solution for big data; fortunately, in recent years, many workarounds have been introduced.

Summary

In this article, we've introduced a popular unsupervised learner that is able to scale to cope with big data. PCA is able to reduce the number of features by creating new ones that contain the majority of the variance (that is, the principal components).

You can also refer to the following books on similar topics:

R Machine Learning Essentials: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-essentials
R Machine Learning By Example: https://www.packtpub.com/big-data-and-business-intelligence/r-machine-learning-example
Machine Learning with R - Second Edition: https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition

Resources for Article:

Further resources on this subject:

Machine Learning Tasks [article]
Introduction to Clustering and Unsupervised Learning [article]
Clustering and Other Unsupervised Learning Methods [article]

Deep Learning and Image generation: Get Started with Generative Adversarial Networks

Mohammad Pezeshki
27 Sep 2016
5 min read
In machine learning, a generative model is one that captures the observable data distribution. The objective of deep neural generative models is to disentangle the different factors of variation in data and be able to generate new or similar-looking samples of the data. For example, an ideal generative model on face images disentangles all the different factors of variation, such as illumination, pose, gender, skin color, and so on, and is also able to generate a new face by combining those factors in a very non-linear way. Figure 1 shows a trained generative model that has learned different factors, including pose and the degree of smiling. On the x-axis, as we go to the right, the pose changes, and on the y-axis, as we move upwards, smiles turn to frowns. Usually these factors are orthogonal to one another, meaning that changing one while keeping the others fixed leads to a single change in data space; for example, in the first row of Figure 1, only the pose changes, with no change in the degree of smiling. The figure is adapted from here.

Based on the assumption that these underlying factors of variation have a very simple distribution (unlike the data itself), to generate a new face, we can simply sample a random number from the assumed simple distribution (such as a Gaussian). In other words, if there are k different factors, we randomly sample from a k-dimensional Gaussian distribution (aka noise).

In this post, we will take a look at one of the recent models in the area of deep learning and generative models, called the generative adversarial network (GAN). This model can be seen as a game between two agents: the generator and the discriminator. The generator generates images from noise, and the discriminator discriminates between real images and those images that are generated by the generator. The objective is then to train the model such that, while the discriminator tries to distinguish generated images from real images, the generator tries to fool the discriminator.

To train the model, we need to define a cost. In the case of a GAN, the errors made by the discriminator are considered as the cost. Consequently, the objective of the discriminator is to minimize the cost, while the objective of the generator is to fool the discriminator by maximizing the cost. A graphical illustration of the model is shown in Figure 2.

Formally, we define the discriminator as a binary classifier D : R^m -> {0, 1} and the generator as the mapping G : R^k -> R^m, in which k is the dimension of the latent space that represents all of the factors of variation. Denoting the data by x and a point in the latent space by z, the model can be trained by playing the following minimax game:

min_G max_D V(D, G) = E_{x ~ p_data(x)} [log D(x)] + E_{z ~ p_z(z)} [log(1 - D(G(z)))]

Note that the first term encourages the discriminator to discriminate between generated images and real ones, while the second term encourages the generator to come up with images that would fool the discriminator. In practice, the log in the second term can saturate, which would hurt the flow of the gradient. As a result, the cost for the generator may be reformulated equivalently as maximizing:

E_{z ~ p_z(z)} [log D(G(z))]

At generation time, we can sample from a simple k-dimensional Gaussian distribution with zero mean and unit variance and pass it to the generator. Among the different models that can be used as the discriminator and generator, we use deep neural networks with parameters θ_D and θ_G for the discriminator and generator, respectively.
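To make the alternating optimization concrete, here is a minimal training-loop sketch. It is my own addition rather than code from the original post; it uses PyTorch purely as an illustrative framework with toy fully-connected networks, and data_loader, k, and m are assumed placeholders:

import torch
import torch.nn as nn

k, m = 100, 784                          # assumed latent and data dimensions
G = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, m), nn.Tanh())
D = nn.Sequential(nn.Linear(m, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

for x_real in data_loader:               # data_loader: hypothetical iterable of real batches of shape (n, m)
    n = x_real.size(0)
    z = torch.randn(n, k)                # sample noise from the k-dimensional Gaussian prior
    x_fake = G(z)

    # Discriminator step: push D(x_real) towards 1 and D(G(z)) towards 0
    opt_d.zero_grad()
    loss_d = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step (non-saturating form): push D(G(z)) towards 1
    opt_g.zero_grad()
    loss_g = bce(D(x_fake), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()

Each pass takes one gradient step for the discriminator on the full objective and one for the generator on the reformulated cost, which is the alternating game described above.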
Since the training boils down to updating the parameters using the backpropagation algorithm, the update rules can be written as gradient steps on the two sets of parameters (using the non-saturating generator cost from above):

θ_D ← θ_D + η ∇_{θ_D} [log D(x) + log(1 - D(G(z)))]
θ_G ← θ_G + η ∇_{θ_G} [log D(G(z))]

If we use a convolutional network as the discriminator and another convolutional network with fractionally strided convolution layers as the generator, the model is called a DCGAN (Deep Convolutional Generative Adversarial Network). Some samples of bedroom image generation from this model are shown in Figure 3.

The generator can also be a sequential model, meaning that it can generate an image through a sequence of images of increasing resolution and detail. A few examples of the images generated using such a model are shown in Figure 4.

The GAN and later variants such as the DCGAN are currently considered to be among the best models when it comes to the quality of the generated samples. The images look so realistic that you might assume that the model has simply memorized instances of the training set, but a quick KNN search reveals this not to be the case.

About the author

Mohammad Pezeshk is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Language Modeling with Deep Learning

Mohammad Pezeshki
23 Sep 2016
5 min read
Language modeling is the task of defining a joint probability distribution over a sequence of tokens (words or characters). Given a sequence of tokens {x_1, ..., x_T}, a language model defines P(x_1, ..., x_T), which can be used in many areas of natural language processing. For example, a language model can significantly improve the accuracy of a speech recognition system. In the case of two words that have the same sound but different meanings, a language model can fix the problem of recognizing the right word. In Figure 1, the speech recognizer (aka the acoustic model) has assigned the same high probabilities to the words "meet" and "meat". It is even possible that the speech recognizer assigns a higher probability to "meet" rather than "meat". However, by conditioning the language model on the first three tokens ("I cooked some"), the next word could be "fish", "pasta", or "meat" with a reasonable probability, higher than the probability of "meet". To get the final answer, we can simply multiply the two tables of probabilities and normalize them. Now the word "meat" has a very high relative probability!

One family of deep learning models that is capable of modeling sequential data (such as language) is Recurrent Neural Networks (RNNs). RNNs have recently achieved impressive results on different problems such as language modeling. In this article, we briefly describe RNNs and demonstrate how to code them using the Blocks library on top of Theano.

Consider a sequence of T input elements x_1, ..., x_T. An RNN models the sequence by applying the same operation in a recursive way:

h_t = f(h_{t-1}, x_t)    (1)
y_t = g(h_t)             (2)

where h_t is the internal hidden representation of the RNN and y_t is the output at the t-th time-step. For the very first time-step, we also have an initial state h_0. f and g are two functions, which are shared across the time axis. In the simplest case, f and g can be a linear transformation followed by a non-linearity. There are more complicated forms of f and g, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Here we skip the exact formulations of f and g and use the LSTM as a black box. Consequently, suppose we have B sequences, each with a length of T, such that each time-step is represented by a vector of size F. The input can then be seen as a 3D tensor of size T x B x F, the hidden representation as a tensor of size T x B x F', and the output as a tensor of size T x B x F''.

Let's build a character-level language model that can model the joint probability P(x_1, ..., x_T) using the chain rule:

P(x_1, ..., x_T) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_T | x_1, ..., x_{T-1})    (3)

We can model P(x_t | x_1, ..., x_{t-1}) using an RNN by predicting x_t given x_1, ..., x_{t-1}. In other words, given a sequence {x_1, ..., x_T}, the input sequence is {x_1, ..., x_{T-1}} and the target sequence is {x_2, ..., x_T}, and we can write the corresponding code to define them. Now, to define the model, we need a linear transformation from the input to the LSTM, and from the LSTM to the output. To train the model, we use the cross entropy between the model output and the true target. Assuming that data is provided to us using a data stream, we can start training by initializing the model and tuning the parameters. After the model is trained, we can condition the model on an initial sequence and start generating the next token.
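As an illustration of that generation step, here is a small framework-agnostic Python sketch (my own addition, not the Blocks code from the original post); predict_next_probs is a hypothetical callable standing in for one forward pass of the trained character-level model:

import numpy as np

def generate(seed_ids, n_chars, vocab, predict_next_probs):
    # seed_ids: list of character ids to condition on; vocab: list mapping id -> character
    sequence = list(seed_ids)
    for _ in range(n_chars):
        probs = predict_next_probs(sequence)             # distribution over the vocabulary
        next_id = np.random.choice(len(vocab), p=probs)  # sample the next character id
        sequence.append(next_id)                         # feed the prediction back in
    return ''.join(vocab[i] for i in sequence)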
We can repeatedly feed the predicted token back into the model and get the next token. We can even just start from the initial state and ask the model to hallucinate! Here is a sample of generated text from a model trained on 96 MB of Wikipedia text data (figure adapted from here). Here is a visualization of the model's output. The first line is the real data and the next six lines are the candidates with the highest output probability for each character. The more red a cell is, the higher the probability the model assigns to that character. For example, as soon as the model sees "ttp://ww", it is confident that the next character is also a "w" and the next one is a ".". But at this point, there is no more clue about the next character, so the model assigns almost the same probability to all the characters (figure adapted from here).

In this post we learned about language modeling and one of its applications in speech recognition. We also learned how to code a recurrent neural network in order to train such a model. You can find the complete code and experiments on a bunch of datasets, such as Wikipedia, on GitHub. The code is written by my close friend Eloi Zablocki and me.

About the author

Mohammad Pezeshk is a master's student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master's in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models and specifically deep learning.

Indexes and Constraints

Packt
06 Sep 2016
14 min read
In this article by Manpreet Kaur and Baji Shaik, authors of the book PostgreSQL Development Essentials, we will discuss indexes and constraints, the types of indexes and constraints, their use in the real world, and best practices on when and how to create them. Your application may have different kinds of data, and you will want to maintain data integrity across the database while, at the same time, gaining performance. This article helps you understand how best to choose indexes and constraints for your requirements and improve performance. It also covers real-world examples that will help you understand better. Of course, not all types of indexes or constraints are the best fit for your requirement; however, you can choose the required ones based on how they work.

Introduction to indexes and constraints

An index is a pointer to the actual rows in its corresponding table. It is used to find and retrieve particular rows much faster than using the standard method. Indexes help you improve the performance of queries. However, indexes get updated on every Data Manipulation Language (DML) query on the table (that is, INSERT, UPDATE, and DELETE), which is an overhead, so they should be used carefully. Constraints are basically rules restricting the values allowed in the columns, and they define certain properties that data in a database must comply with. The purpose of constraints is to maintain the integrity of data in the database.

Primary-key indexes

As the name indicates, primary key indexes are the primary way to identify a record (tuple) in a table. Obviously, the key cannot be null because a null (unknown) value cannot be used to identify a record. So, all RDBMSs prevent users from assigning a null value to the primary key. The primary key index is used to maintain uniqueness in a column. You can have only one primary key index on a table. It can be declared on multiple columns. In the real world, for example, if you take the empid column of the emp table, you will be able to see a primary key index on it, as no two employees can have the same empid value.

You can add a primary key index in two ways: one while creating the table, and the other once the table has been created. This is how you can add a primary key while creating a table:

CREATE TABLE emp(
  empid integer PRIMARY KEY,
  empname varchar,
  sal numeric);

And this is how you can add a primary key after a table has been created:

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric);

ALTER TABLE emp ADD PRIMARY KEY(empid);

Irrespective of whether the primary key is created while creating the table or after the table is created, a unique index will be created implicitly. You can check the unique index with the following command:

postgres=# select * from pg_indexes where tablename='emp';
-[ RECORD 1 ]-------------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_pkey
tablespace |
indexdef   | CREATE UNIQUE INDEX emp_pkey ON emp USING btree (empid)

Since it maintains uniqueness in the column, what happens if you try to INSERT a duplicate row, or INSERT NULL values? Let's check it out:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(100, 'ROBERT', '20000.00');
ERROR:  duplicate key value violates unique constraint "emp_pkey"
DETAIL:  Key (empid)=(100) already exists.
postgres=# INSERT INTO emp VALUES(null, 'ROBERT', '20000.00');
ERROR:  null value in column "empid" violates not-null constraint
DETAIL:  Failing row contains (null, ROBERT, 20000).

So, if empid is a duplicate value, the database throws a "duplicate key value violates unique constraint" error due to the unique constraint, and if empid is null, the error is "violates not-null constraint" due to the not-null constraint. A primary key is simply the combination of a unique and a not-null constraint.

Unique indexes

Like a primary key index, a unique index is also used to maintain uniqueness; however, it allows NULL values. The syntax is as follows:

CREATE TABLE emp(
  empid integer UNIQUE,
  empname varchar,
  sal numeric);

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric,
  UNIQUE(empid));

This is what happens if you INSERT NULL values:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1

As you can see, it allows NULL values, and they are not even considered duplicate values. If a unique index is created on a column, then there is no need for a standard index on the column. If you create one, it would just be a duplicate of the automatically created index. Currently, only B-Tree indexes can be declared unique. When a unique constraint is defined, PostgreSQL automatically creates a unique index. You can check it out using the following query:

postgres=# select * from pg_indexes where tablename ='emp';
-[ RECORD 1 ]-----------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_empid_key
tablespace |
indexdef   | CREATE UNIQUE INDEX emp_empid_key ON emp USING btree (empid)

Standard indexes

Indexes are primarily used to enhance database performance (though incorrect use can result in slower performance). An index can be created on multiple columns, and multiple indexes can be created on one table. The syntax is as follows:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table_name [ USING method ]
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    [ WHERE predicate ]

PostgreSQL supports B-Tree, hash, Generalized Search Tree (GiST), SP-GiST, and Generalized Inverted Index (GIN) indexes, which we will cover later in this article. If you do not specify an index type when creating an index, PostgreSQL creates a B-Tree index by default.

Full text indexes

Let's talk about a document, for example, a magazine article or an e-mail message. Normally, we use a text field to store a document. Searching for content within the document based on a matching pattern is called full text search. It is based on the @@ matching operator. You can check out http://www.postgresql.org/docs/current/static/textsearch-intro.html#TEXTSEARCH-MATCHING for more details. Indexes on such full text fields are nothing but full text indexes. You can create a GIN index to speed up full text searches. We will cover the GIN index later in this article. The syntax is as follows:

CREATE TABLE web_page(
  title text,
  heading text,
  body text);

CREATE INDEX web_page_title_body ON web_page USING GIN(to_tsvector('english', body));

The preceding commands create a full text index on the body column of the web_page table.

Partial indexes

Partial indexes are one of the special features of PostgreSQL. Such indexes are not supported by many other RDBMSs.
As the name suggests, if an index is created partially, typically on a subset of a table, then it's a partial index. This subset is defined by a predicate (a WHERE clause). The main purpose is to avoid common values. An index will essentially ignore a value that appears in a large fraction of a table's rows, and the search will revert to a full table scan rather than an index scan. Therefore, indexing repeated values just wastes space and incurs the expense of index updates without getting any performance benefit back at read time. So, common values should be avoided.

Let's take a common example, an emp table. Suppose you have a status column in an emp table that shows whether the employee still exists or not. In most cases, you care about the current employees of an organization. In such cases, you can create a partial index on the status column by excluding former employees. Here is an example:

CREATE TABLE emp_table (empid integer, empname varchar, status varchar);
INSERT INTO emp_table VALUES(100, 'scott', 'exists');
INSERT INTO emp_table VALUES(100, 'clark', 'exists');
INSERT INTO emp_table VALUES(100, 'david', 'not exists');
INSERT INTO emp_table VALUES(100, 'hans', 'not exists');

To create a partial index that suits our example, we will use the following query:

CREATE INDEX emp_table_status_idx ON emp_table(status) WHERE status NOT IN('not exists');

Now, let's check the queries that can use the index and those that cannot. A query that uses the index is as follows:

postgres=# explain analyze select * from emp_table where status='exists';
QUERY PLAN
-------------------------------------------------------------------
Index Scan using emp_table_status_idx on emp_table (cost=0.13..6.16 rows=2 width=17) (actual time=0.073..0.075 rows=2 loops=1)
  Index Cond: ((status)::text = 'exists'::text)

A query that will not use the index is as follows:

postgres=# explain analyze select * from emp_table where status='not exists';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on emp_table (cost=10000000000.00..10000000001.05 rows=1 width=17) (actual time=0.013..0.014 rows=1 loops=1)
  Filter: ((status)::text = 'not exists'::text)
  Rows Removed by Filter: 3

Multicolumn indexes

PostgreSQL supports multicolumn indexes. If an index is defined on more than one column simultaneously, then it is treated as a multicolumn index. The use case is pretty simple: for example, you have a website where you need a name and date of birth to fetch the required information, and the query run against the database uses both fields as the predicate (WHERE clause). In such scenarios, you can create an index on both columns.
Here is an example:

CREATE TABLE get_info(name varchar, dob date, email varchar);
INSERT INTO get_info VALUES('scott', '1-1-1971', 'scott@example.com');
INSERT INTO get_info VALUES('clark', '1-10-1975', 'clark@example.com');
INSERT INTO get_info VALUES('david', '11-11-1971', 'david@somedomain.com');
INSERT INTO get_info VALUES('hans', '12-12-1971', 'hans@somedomain.com');

To create a multicolumn index, we will use the following command:

CREATE INDEX get_info_name_dob_idx ON get_info(name, dob);

A query that uses the index is as follows:

postgres=# explain analyze SELECT * FROM get_info WHERE name='scott' AND dob='1-1-1971';
QUERY PLAN
--------------------------------------------------------------------
Index Scan using get_info_name_dob_idx on get_info (cost=0.13..4.15 rows=1 width=68) (actual time=0.029..0.031 rows=1 loops=1)
  Index Cond: (((name)::text = 'scott'::text) AND (dob = '1971-01-01'::date))
Planning time: 0.124 ms
Execution time: 0.096 ms

B-Tree indexes

Like most relational databases, PostgreSQL supports B-Tree indexes. Most RDBMS systems use B-Tree as the default index type, unless something else is specified explicitly. Basically, this index keeps data stored in a (self-balancing) tree structure. It's the default index type in PostgreSQL and fits the most common situations. The B-Tree index can be used by the optimizer whenever the indexed column is used with a comparison operator, such as <, <=, =, >=, >, and the LIKE or ~ operators; however, LIKE or ~ will only use the index if the pattern is a constant and anchored to the beginning of the string, for example, my_col LIKE 'mystring%' or my_column ~ '^mystring', but not my_column LIKE '%mystring'. Here is an example:

CREATE TABLE emp(
  empid integer,
  empname varchar,
  sal numeric);

INSERT INTO emp VALUES(100, 'scott', '10000.00');
INSERT INTO emp VALUES(100, 'clark', '20000.00');
INSERT INTO emp VALUES(100, 'david', '30000.00');
INSERT INTO emp VALUES(100, 'hans', '40000.00');

Create B-Tree indexes on the empid and empname columns:

CREATE INDEX emp_empid_idx ON emp(empid);
CREATE INDEX emp_name_idx ON emp(empname);

Here are the queries that use the indexes:

postgres=# explain analyze SELECT * FROM emp WHERE empid=100;
QUERY PLAN
---------------------------------------------------------------
Index Scan using emp_empid_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=1.015..1.304 rows=4 loops=1)
  Index Cond: (empid = 100)
Planning time: 0.496 ms
Execution time: 2.096 ms

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE 'da%';
QUERY PLAN
----------------------------------------------------------------
Index Scan using emp_name_idx on emp (cost=0.13..4.15 rows=1 width=68) (actual time=0.956..0.959 rows=1 loops=1)
  Index Cond: (((empname)::text >= 'david'::text) AND ((empname)::text < 'david'::text))
  Filter: ((empname)::text ~~ 'david%'::text)
Planning time: 2.285 ms
Execution time: 0.998 ms

Here is a query that cannot use the index, as % is used at the beginning of the string:

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE '%david';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on emp (cost=10000000000.00..10000000001.05 rows=1 width=68) (actual time=0.014..0.015 rows=1 loops=1)
  Filter: ((empname)::text ~~ '%david'::text)
  Rows Removed by Filter: 3
Planning time: 0.099 ms
Execution time: 0.044 ms

Hash indexes

These indexes can only be used with equality comparisons. So, the optimizer will consider using this index whenever an indexed column is used with the = operator.
Here is the syntax:

CREATE INDEX index_name ON table USING HASH (column);

Hash indexes are faster than B-Tree indexes for equality lookups, and they should be used if the column in question is only ever queried with = and never scanned with the < or > operators. Hash indexes are not WAL-logged, so they might need to be rebuilt after a database crash if there were any unwritten changes.

GIN and GiST indexes

GIN or GiST indexes are used for full text searches. GIN can only be created on tsvector datatype columns, and GiST on tsvector or tsquery columns. The syntax is as follows:

CREATE INDEX index_name ON table_name USING GIN (column_name);
CREATE INDEX index_name ON table_name USING GIST (column_name);

These indexes are useful when a column is queried for specific substrings on a regular basis; however, they are not mandatory. What is the use case, and when do we really need them? Let me give an example to explain. Suppose we have a requirement to implement simple search functionality for an application, say, searching through all the users in the database. Also, let's assume that we have more than 10 million users currently stored in the database. This search requirement means that we should be able to search using partial matches and multiple columns, for example, first_name and last_name. More precisely, if we have customers like Mitchell Johnson and John Smith, an input query of John should return both customers. We can solve this problem using the GIN or GiST indexes. Here is an example:

CREATE TABLE customers (first_name text, last_name text);

Create a GIN/GiST index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX customers_search_idx_gin ON customers USING gin (first_name gin_trgm_ops, last_name gin_trgm_ops);
CREATE INDEX customers_search_idx_gist ON customers USING gist (first_name gist_trgm_ops, last_name gist_trgm_ops);

So, what is the difference between these two indexes? This is what the PostgreSQL documentation says:

GiST is faster to update and build the index, and is less accurate than GIN.
GIN is slower to update and build the index, but is more accurate.

As per the documentation, the GiST index is lossy. This means that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches (PostgreSQL does this automatically when needed).

BRIN indexes

Block Range Index (BRIN) indexes were introduced in PostgreSQL 9.5. BRIN indexes are designed to handle very large tables in which certain columns have some natural correlation with their physical location within the table. The syntax is as follows:

CREATE INDEX index_name ON table_name USING brin(col);

Here is a good example: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#BRIN_Indexes

Summary

In this article, we looked at the different types of indexes and constraints that PostgreSQL supports, how they are used in the real world with examples, and what happens if you violate a constraint. This not only helps you identify the right index for your data, but also improves the performance of the database. Every type of index or constraint has its own purpose and need. Some indexes or constraints may not be portable to other RDBMSs, some may not be needed, or sometimes you might have chosen the wrong one for your needs.

Resources for Article:

Further resources on this subject:

PostgreSQL in Action [article]
PostgreSQL – New Features [article]
PostgreSQL Cookbook - High Availability and Replication [article]

Preprocessing the Data

Packt
16 Aug 2016
5 min read
In this article, Sampath Kumar Kanthala, the author of the book Practical Data Analysis, discusses how to obtain, clean, normalize, and transform raw data into a standard format like CSV or JSON using OpenRefine. In this article, we will cover:

Data scrubbing
Statistical methods
Text parsing
Data transformation

Data scrubbing

Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated. The result of the data analysis process depends not only on the algorithms, but also on the quality of the data. That's why the next step after obtaining the data is data scrubbing. In order to avoid dirty data, our dataset should possess the following characteristics:

Correctness
Completeness
Accuracy
Consistency
Uniformity

Dirty data can be detected by applying some simple statistical data validation, as well as by parsing the text or deleting duplicate values. Missing or sparse data can lead you to highly misleading results.

Statistical methods

In this method, we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous. Even when the data type matches, the values may be out of range; this can be resolved by setting such values to an average or mean value. Statistical validation can also be used to handle missing values, which can be replaced by one or more probable values using interpolation, or by reducing the dataset using decimation.

Mean: The value calculated by summing up all values and then dividing by the number of values.
Median: The median is defined as the value where 50% of the values in a range fall below it and 50% of the values fall above it.
Range constraints: Numbers or dates should fall within a certain range; that is, they have minimum and/or maximum possible values.
Clustering: Usually, when we obtain data directly from the user, some values include ambiguity or refer to the same value with a typo. For example, "Buchanan Deluxe 750ml 12x01" and "Buchanan Deluxe 750ml   12x01." differ only by a ".", and "Microsoft" or "MS" instead of "Microsoft Corporation" refer to the same company, yet all the values are valid. In those cases, grouping can help us get accurate data and eliminate duplicates, enabling a faster identification of unique values.

Text parsing

We perform parsing to help us validate whether a string of data is well formatted and to avoid syntax errors. Usually, text fields such as dates, e-mails, phone numbers, and IP addresses have to be validated this way using regular expression patterns (regex is a common abbreviation for "regular expression"). In Python, we will use the re module to implement regular expressions, with which we can perform text searches and pattern validations. First, we need to import the re module:

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).

E-mail validation:

myString = 'From: readers@packt.com (readers email)'
result = re.search('([\w.-]+)@([\w.-]+)', myString)
if result:
    print (result.group(0))
    print (result.group(1))
    print (result.group(2))

Output:

>>> readers@packt.com
>>> readers
>>> packt.com

The function search() scans through a string, searching for any location where the regex matches. The function group() helps us return the string matched by the regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_].
IP address validation:

isIP = re.compile('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = re.findall(isIP, myString)
print(result)

Output:

>>> 192.168.1.254

The function findall() finds all the substrings where the regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].

Date format:

myString = "01/04/2001"
isDate = re.match('[0-1][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

Output:

>>> 'valid'

The function match() checks whether the regex matches at the beginning of the string. The pattern implements the class [0-9] in order to parse the date format. For more information about regular expressions, see: http://docs.python.org/3.4/howto/regex.html#regex-howto

Data transformation

Data transformation is usually related to databases and data warehouses, where values from a source format are extracted, transformed, and loaded into a destination format. Extract, Transform, and Load (ETL) obtains data from various data sources, performs some transformation functions depending on our data model, and loads the resulting data into the destination.

Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases.
Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data.
Data loading allows us to load data into the destination format, such as relational databases, text files (JSON, CSV, XML), and NoSQL databases.

In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.

Summary

In this article, we explored the common data sources and implemented a web scraping example. Next, we introduced the basic concepts of data scrubbing, such as statistical methods and text parsing.

Resources for Article:

Further resources on this subject:

MicroStrategy 10 [article]
Expanding Your Data Mining Toolbox [article]
Machine Learning Using Spark MLlib [article]

Application Logging

Packt
11 Aug 2016
8 min read
In this article by Travis Marlette, the author of Splunk Best Practices, we will cover the following topics:

Log messengers
Logging formats

Within the working world of technology, there are hundreds of thousands of different applications, all (usually) logging in different formats. As Splunk experts, our job is to make all those logs speak human, which is often an impossible task. With third-party applications that provide support, log formatting is sometimes out of our control. Take, for instance, Cisco or Juniper, or any other leading application manufacturer. We won't be discussing those kinds of logs in this article, but instead the logs that we do have some control over.

The logs I am referencing belong to proprietary, in-house (also known as "home grown") applications that are often part of the middleware, and usually they control some of the most mission-critical services an organization can provide. Proprietary applications can be written in any language; however, logging is usually left up to the developers for troubleshooting, and up until now the process of manually scraping log files to troubleshoot quality assurance issues and system outages has been very specific. By that I mean that, usually, the developers are the only people who truly understand what those log messages mean. That being said, developers often write their logs in a way that they can understand, because ultimately it will be them doing the troubleshooting and code fixing when something breaks severely.

As an IT community, we haven't really started taking a look at the way we log things; instead, we have tried to limit the confusion to developers, and then have them help the other SMEs that provide operational support understand what is actually happening. This method works; however, it is slow, and the true value of any SME lies in reducing a system's MTTR (mean time to recovery) and increasing uptime. With any system, more transactions processed means a larger scale, which means that, after about 20 machines, troubleshooting begins to get more complex and time consuming with a manual process. This is where something like Splunk can be extremely valuable; however, Splunk is only as good as the information coming into it. I will say this phrase for the people who haven't heard it yet: "garbage in, garbage out".

There are some ways to turn proprietary logging into a powerful tool, and I have personally seen the value of these kinds of logs; after formatting them for Splunk, they turn into a huge asset in an organization's software life cycle. I'm not here to tell you this is easy, but I am here to give you some good practices for formatting proprietary logs. To do that, I'll start by helping you appreciate a very silent and critical piece of the application stack. To developers, the logging mechanism is a very important part of the stack, and the log itself is mission critical. What we haven't spent much time thinking about before log analyzers is how to make log events, messages, and exceptions more machine friendly, so that we can socialize the information in a system like Splunk and start to bridge the knowledge gap between development and operations. The more cleanly we format the logs, the faster Splunk can reveal information about our systems, saving everyone time and headaches.

Loggers

Here I'm giving some very high-level information on loggers.
My intention is not to recommend logging tools, but simply to raise awareness of their existence for those who are not in development, and to allow for independent research into what they do. With the right developer and the right Splunker, the logger turns into something immensely valuable to an organization. There is an array of different loggers in the IT universe, and I'm only going to touch on a couple of them here. Keep in mind that I'm only referencing these due to the ease of development I've seen from personal experience, and experiences do vary. I'm only going to touch on three loggers and then move on to formatting, as there are tons of logging mechanisms and the preference truly depends on the developer.

Anatomy of a log

I'm going to take some very broad strokes with the following explanations in order to familiarize you, the Splunker, with the logger. If you would like to learn more, please either seek out a developer to help you understand the logic better, or acquire some education on development and logging through independent study. There are some pretty basic components of logging that we need to understand to know which type of data we are looking at. I'll start with the four most common ones:

Log events: This is the entirety of the message we see within a log, often starting with a timestamp. The event itself contains all other aspects of application behavior, such as fields, exceptions, messages, and so on. Think of this as the "container", if you will, for information.
Messages: These are often written by the developer of the application and provide some human insight into what's actually happening within an application. The most common messages we see are things like "unauthorized login attempt <user>" or "Connection Timed out to <ip address>".
Message fields: These are the pieces of information that give us the who, where, and when types of information for the application's actions. They are handed to the logger by the application itself as it either attempts or completes an activity. For instance, in the log event below, the highlighted pieces are what would be fields, and often those that people look for when troubleshooting:
"2/19/2011 6:17:46 AM Using 'xplog70.dll' version '2009.100.1600' to execute extended store procedure 'xp_common_1' operation failed to connect to 'DB_XCUTE_STOR'"
Exceptions: These are the uncommon but very important pieces of the log. They are usually only written when something goes wrong and offer developer insight into the root cause at the application layer. They are usually only printed when an error occurs and are used for debugging. These exceptions can print a huge amount of information into the log, depending on the developer and the framework. The format itself is not easy, and in some cases not even possible, for a developer to manage.

Log4*

This is an open source logger that is often used in middleware applications.

Pantheios

This is a logger popularly used on Linux, known for its performance and multi-threaded handling of logging. Commonly, Pantheios is used for C/C++ applications, but it works with a multitude of frameworks.

Logging – logging facility for Python

This is a logger specifically for Python, and since Python is becoming more and more popular, this is a very common package used to log Python scripts and applications.

Each one of these loggers has its own way of logging, and the value is determined by the application developer.
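As a small illustration of the machine-friendly formatting this article advocates, here is a hedged Python example of my own (not from the book) using the logging facility just mentioned; the component and field names are purely illustrative:

import logging

# One consistent layout: timestamp, level, component, then key=value pairs
logging.basicConfig(
    format='%(asctime)s level=%(levelname)s component=%(name)s %(message)s',
    level=logging.INFO)

log = logging.getLogger('payment_service')   # 'payment_service' is a made-up component name

# key=value pairs in the message keep fields easy to extract in Splunk
log.info('action=login result=failure user=%s src_ip=%s', 'jsmith', '10.1.2.3')
log.error('action=db_connect result=timeout target=%s wait_ms=%d', 'DB_XCUTE_STOR', 5000)

The point is not these specific fields, but the consistency: every event carries the same, easily parsed structure.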
If there is no standardized logging, then one can imagine the confusion this can bring to troubleshooting.

Example of a structured log

This is an example of a Java exception in a structured log format. When Java prints such an exception, it comes in a format that the developer doesn't control. They can control some aspects of what is included within an exception, though the arrangement of the characters and how it's written is handled by the Java framework itself. I mention this last part in order to help operational people understand where the control of a developer sometimes ends. My own personal experience has taught me that attempting to change a format that is handled within the framework itself is an attempt at futility. Pick your battles, right? As a Splunker, you can save yourself a headache on this kind of thing.

Summary

While I say that, I will add an addendum: Splunk, combined with a Splunk expert and the right development resources, can also make the data I just mentioned extremely valuable. It will likely not happen as fast as they make it out to be at a presentation, and it will take more resources than you may have thought; however, at the end of your Splunk journey, you will be happy. This article was intended to help you understand the importance of log formatting and how logs are written. We often don't think about our logs proactively, and I encourage you to do so.

Resources for Article:

Further resources on this subject:

Logging and Monitoring [Article]
Logging and Reports [Article]
Using Events, Interceptors, and Logging Services [Article]

Consistency Conflicts

Packt
10 Aug 2016
11 min read
In this article by Robert Strickland, author of the book Cassandra 3.x High Availability - Second Edition, we will discuss how, for any given call, it is possible to achieve either strong consistency or eventual consistency. In the former case, we can know for certain that the copy of the data that Cassandra returns will be the latest. In the case of eventual consistency, the data returned may or may not be the latest, or there may be no data returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you're reading from has not yet received the delete request.

Depending on the read_repair_chance setting and the consistency level chosen for the read operation, Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run. How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence.

Now, let's discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following list details the various consistency levels and their effects on both read and write operations:

ANY
Reads: Not supported for reads.
Writes: Data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down.

ONE
Reads: The replica from the closest node will be returned.
Writes: Data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient.

TWO
Reads: The replicas from the two closest nodes will be returned.
Writes: The same as ONE, except two replicas must be written.

THREE
Reads: The replicas from the three closest nodes will be returned.
Writes: The same as ONE, except three replicas must be written.

QUORUM
Reads: Replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

SERIAL
Reads: Permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read.
Writes: Similar to QUORUM, except that writes are conditional based on the support for lightweight transactions.

LOCAL_ONE
Reads: Similar to ONE, except that the read will be returned by the closest replica in the local data center.
Writes: Similar to ONE, except that the write must be acknowledged by at least one node in the local data center.

LOCAL_QUORUM
Reads: Similar to QUORUM, except that only replicas in the local data center are compared.
Writes: Similar to QUORUM, except the quorum must only be met using the local data center.

LOCAL_SERIAL
Reads: Similar to SERIAL, except only local replicas are used.
Writes: Similar to SERIAL, except only writes to local replicas must be acknowledged.
EACH_QUORUM
Reads: The opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp.
Writes: The opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center.

ALL
Reads: Replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned.
Writes: Data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let's assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure. But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

Write with a consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.

Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it's up to you to determine how to best employ it for your specific data requirements.

A good rule of thumb to attain strong consistency is that the read consistency level plus the write consistency level should be greater than the replication factor. For example, with a replication factor of 3, writing and reading at QUORUM involves two replicas on each side, and 2 + 2 > 3, so at least one replica is guaranteed to have participated in both the write and the read.

If you are unsure about which consistency levels to use for your specific use case, it's typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure.

It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn't match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.
Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.

Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra's mechanism for handling deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone. Tombstones are only garbage collected during compaction, and only once gc_grace_seconds has elapsed. If you fail to run a repair within that window, a replica that missed the delete may never receive the tombstone, and you risk deleted data reappearing unexpectedly. In general, deletes should be avoided when possible, as the unfettered buildup of tombstones can cause significant issues.

In the course of normal operations, Cassandra will repair old replicas when their records are requested. Thus, it can be said that read repair operations are lazy, such that they only occur when required.

With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let's take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let's presume your desire is to maintain data availability in the case of node failure. It's important to understand exactly what your failure tolerance is, and this will likely be different depending on the nature of the data. The definition of failure will probably vary among use cases as well: one case might consider data loss a failure, whereas another accepts data loss as long as all queries return. Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application's consistency level configurations.
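As a rough sketch of how an application coordinates these settings, the following example uses the DataStax Python driver to issue a write and a read at QUORUM. The contact point, keyspace, and table names are assumptions made purely for illustration, not something prescribed by the book:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point, keyspace, and table -- for illustration only.
cluster = Cluster(['10.0.0.1'])
session = cluster.connect('ride_data')

# Write at QUORUM: a majority of the replicas must acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO riders (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, (42, 'alice'))

# Read at QUORUM: with RF = 3, QUORUM writes plus QUORUM reads satisfy
# read CL + write CL > RF, so the read sees the latest acknowledged write.
select = SimpleStatement(
    "SELECT name FROM riders WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(select, (42,)).one()

Because the consistency level is set per statement, the same application can read non-critical data at ONE for speed while keeping QUORUM for the operations that truly need strong consistency.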
In order to assist you in your efforts to achieve this balance, let's consider a single data center cluster of 10 nodes and examine the impact of various configuration combinations (where RF corresponds to the replication factor):

RF 1, write ONE/QUORUM/ALL, read ONE/QUORUM/ALL: Consistent. Doesn't tolerate any replica loss. Use when data can be lost and availability is not critical, such as analysis clusters.

RF 2, write ONE, read ONE: Eventual. Tolerates loss of one replica. Use when maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable.

RF 2, write QUORUM/ALL, read ONE: Consistent. Tolerates loss of one replica on reads, but none on writes. Use for read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies).

RF 2, write ONE, read QUORUM/ALL: Consistent. Tolerates loss of one replica on writes, but none on reads. Use for write-heavy workloads where read consistency is more important than availability.

RF 3, write ONE, read ONE: Eventual. Tolerates loss of two replicas. Use when maximum read and write performance are required, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read ONE: Eventual. Tolerates loss of one replica on writes and two on reads. Use when read throughput and availability are paramount, write performance is less important, and sometimes returning stale data is acceptable.

RF 3, write ONE, read QUORUM: Eventual. Tolerates loss of two replicas on writes and one on reads. Use when low write latencies and availability are paramount, read performance is less important, and sometimes returning stale data is acceptable.

RF 3, write QUORUM, read QUORUM: Consistent. Tolerates loss of one replica. Use when consistency is paramount, while striking a balance between availability and read/write performance.

RF 3, write ALL, read ONE: Consistent. Tolerates loss of two replicas on reads, but none on writes. Use when additional fault tolerance and consistency on reads are paramount, at the expense of write performance and availability.

RF 3, write ONE, read ALL: Consistent. Tolerates loss of two replicas on writes, but none on reads. Use when low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability.

RF 3, write ANY, read ONE: Eventual. Tolerates loss of all replicas on writes and two on reads. Use when maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE).

RF 3, write ANY, read QUORUM: Eventual. Tolerates loss of all replicas on writes and one on reads. Use when maximum write performance and availability are paramount, and sometimes returning stale data is acceptable.

RF 3, write ANY, read ALL: Consistent. Tolerates loss of all replicas on writes, but none on reads. Use when write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately.

There are also two additional consistency levels, SERIAL and LOCAL_SERIAL, which can be used to read the latest value, even if it is part of an uncommitted transaction. Otherwise, they follow the semantics of QUORUM and LOCAL_QUORUM, respectively.

As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.
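The replication factor half of these combinations is fixed when the keyspace is created. A minimal sketch, again assuming the hypothetical ride_data keyspace and data center names from the earlier example rather than anything prescribed by the book, might declare a per-data-center replication factor like this:

from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])  # hypothetical contact point
session = cluster.connect()

# RF = 3 in each data center; paired with LOCAL_QUORUM reads and writes, this
# stays consistent while tolerating the loss of one replica per data center.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS ride_data
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_east': 3,
        'dc_west': 3
    }
""")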
Summary

In this article, we introduced the foundational concept of consistency. In our discussion, we outlined the importance of the relationship between the replication factor and the consistency level, and their impact on performance, data consistency, and availability.

Resources for Article:

Further resources on this subject: Cassandra Design Patterns [Article], Cassandra Architecture [Article], About Cassandra [Article]


Expanding Your Data Mining Toolbox

Packt
09 Aug 2016
15 min read
In this article by Megan Squire, author of Mastering Data Mining with Python, when faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involve database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other various subfields of computer science, information systems, mathematics, or business, just to name a few. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns. (For more resources related to this topic, see here.) Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a cooking school, where every beginner chef is first taught how to boil water and how to use a knife before moving to more advanced skills, such as making puff pastry or deboning a raw chicken. In data mining, we also have common techniques that even the newest data miners will learn: how to build a classifier and how to find clusters in data. The aim is to teach you some of the techniques you may not have seen yet in earlier data mining projects. In this article, we will cover the following topics: What is data mining? We will situate data mining in the growing field of other similar concepts, and we will learn a bit about the history of how this discipline has grown and changed. How do we do data mining? Here, we compare several processes or methodologies commonly used in data mining projects. What are the techniques used in data mining? In this article, we will summarize each of the data analysis techniques that are typically included in a definition of data mining. How do we set up a data mining work environment? Finally, we will walk through setting up a Python-based development environment. What is data mining? We explained earlier that the goal of data mining is to find patterns in data, but this oversimplification falls apart quickly under scrutiny. After all, could we not also say that finding patterns is the goal of classical statistics, or business analytics, or machine learning, or even the newer practices of data science or big data? What is the difference between data mining and all of these other fields, anyway? And while we are at it, why is it called data mining if what we are really doing is mining for patterns? Don't we already have the data? It was apparent from the beginning that the term data mining is indeed fraught with many problems. The term was originally used as something of a pejorative by statisticians who cautioned against going on fishing expeditions, where a data analyst is casting about for patterns in data without forming proper hypotheses first. 
Nonetheless, the term rose to prominence in the 1990s, as the popular press caught wind of exciting research that was marrying the mature field of database management systems with the best algorithms from machine learning and artificial intelligence. The inclusion of the word mining inspires visions of a modern-day Gold Rush, in which the persistent and intrepid miner will discover (and perhaps profit from) previously hidden gems. The idea that data itself could be a rare and precious commodity was immediately appealing to the business and technology press, despite efforts by early pioneers to promote more the holistic term knowledge discovery in databases (KDD). The term data mining persisted, however, and ultimately some definitions of the field attempted to re-imagine the term data mining to refer to just one of the steps in a longer, more comprehensive knowledge discovery process. Today, data mining and KDD are considered very similar, closely related terms. What about other related terms, such as machine learning, predictive analytics, big data, and data science? Are these the same as data mining or KDD? Let's draw some comparisons between each of these terms: Machine learning is a very specific subfield of computer science that focuses on developing algorithms that can learn from data in order to make predictions. Many data mining solutions will use techniques from machine learning, but not all data mining is trying to make predictions or learn from data. Sometimes we just want to find a pattern in the data. In fact, in this article we will be exploring a few data mining solutions that do use machine learning techniques, and many more that do not. Predictive analytics, sometimes just called analytics, is a general term for computational solutions that attempt to make predictions from data in a variety of domains. We can think of the terms business analytics, media analytics, and so on. Some, but not all, predictive analytics solutions will use machine learning techniques to perform their predictions. But again, in data mining, we are not always interested in prediction. Big data is a term that refers to the problems and solutions of dealing with very large sets of data, irrespective of whether we are searching for patterns in that data, or simply storing it. In terms of comparing big data to data mining, many data mining problems are made more interesting when the data sets are large, so solutions discovered for dealing with big data might come in handy to solve a data mining problem. Nonetheless, these two terms are merely complementary, not interchangeable. Data science is the closest of these terms to being interchangeable with the KDD process, of which data mining is one step. Because data science is an extremely popular buzzword at this time, its meaning will continue to evolve and change as the field continues to mature. To show the relative search interest for these various terms over time, we can look at Google Trends. This tool shows how frequently people are searching for various keywords over time. In the following figure, the newcomer term data science is currently the hot buzzword, with data mining pulling into second place, followed by machine learning, data science, and predictive analytics. (I tried to include the search term knowledge discovery in databases as well, but the results were so close to zero that the line was invisible.) The y-axis shows the popularity of that particular search term as a 0-100 indexed value. 
In addition, I combined the weekly index values that Google Trends gives into a monthly average for each month in the period 2004-2015. Google Trends search results for four common data-related terms How do we do data mining? Since data mining is traditionally seen as one of the steps in the overall KDD process, and increasingly in the data science process, in this article we get acquainted with the steps involved. There are several popular methodologies for doing the work of data mining. Here we highlight four methodologies: two that are taken from textbook introductions to the theory of data mining, one taken from a very practical process used in industry, and one designed for teaching beginners. The Fayyad et al. KDD process One early version of the knowledge discovery and data mining process was defined by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth in a 1996 article (The KDD Process for Extracting Useful Knowledge from Volumes of Data). This article was important at the time for refining the rapidly-changing KDD methodology into a concrete set of steps. The following steps lead from raw data at the beginning to knowledge at the end: Data selection: The input to this step is raw data, and the output of this selection step is a smaller subset of the data, called the target data. Data pre-processing: The target data is cleaned, oddities and outliers are removed, and missing data is accounted for. The output of this step is pre-processed data, or cleaned data. Data transformation: The cleaned data is organized into a format appropriate for the mining step, and the number of features or variables is reduced if need be. The output of this step is transformed data. Data Mining: The transformed data is mined for patterns using one or more data mining algorithms appropriate to the problem at hand. The output of this step is the discovered patterns. Data Interpretation/Evaluation: The discovered patterns are evaluated for their ability to solve the problem at hand. The output of this step is knowledge. Since this process leads from raw data to knowledge, it is appropriate that these authors were the ones who were really committed to the term knowledge discovery in databases rather than simply data mining. The Han et al. KDD process Another version of the knowledge discovery process is described in the popular data mining textbook Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei as the following steps, which also lead from raw data to knowledge at the end: Data cleaning: The input to this step is raw data, and the output is cleaned data Data integration: In this step, the cleaned data is integrated (if it came from multiple sources). The output of this step is integrated data. Data selection: The data set is reduced to only the data needed for the problem at hand. The output of this step is a smaller data set. Data transformation: The smaller data set is consolidated into a form that will work with the upcoming data mining step. This is called transformed data. Data Mining: The transformed data is processed by intelligent algorithms that are designed to discover patterns in that data. The output of this step is one or more patterns. Pattern evaluation: The discovered patterns are evaluated for their interestingness and their ability to solve the problem at hand. The output of this step is an interestingness measure applied to each pattern, representing knowledge. 
Knowledge representation: In this step, the knowledge is communicated to users through various means, including visualization. In both the Fayyad and Han methodologies, it is expected that the process will iterate multiple times over steps, if such iteration is needed. For example, if during the transformation step the person doing the analysis realized that another data cleaning or pre-processing step is needed, both of these methodologies specify that the analyst should double back and complete a second iteration of the incomplete earlier step. The CRISP-DM process A third popular version of the KDD process that is used in many business and applied domains is called CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining. It consists of the following steps: Business Understanding: In this step, the analyst spends time understanding the reasons for the data mining project from a business perspective. Data Understanding: In this step, the analyst becomes familiar with the data and its potential promises and shortcomings, and begins to generate hypotheses. The analyst is tasked to reassess the business understanding (step 1) if needed. Data Preparation: This step includes all the data selection, integration, transformation, and pre-processing steps that are enumerated as separate steps in the other models. The CRISP-DM model has no expectation of what order these tasks will be done in. Modeling: This is the step in which the algorithms are applied to the data to discover the patterns. This step is closest to the actual data mining steps in the other KDD models. The analyst is tasked to reassess the data preparation step (step 3) if the modeling and mining step requires it. Evaluation: The model and discovered patterns are evaluated for their value in answering the business problem at hand. The analyst is tasked with revisiting the business understanding (step 1) if necessary. Deployment: The discovered knowledge and models are presented and put into production to solve the original problem at hand. One of the strengths of this methodology is that iteration is built in. Between specific steps, it is expected that the analyst will check that the current step is still in agreement with certain previous steps. Another strength of this method is that the analyst is explicitly reminded to keep the business problem front and center in the project, even down in the evaluation steps. The Six Steps process When I teach the introductory data science course at my university, I use a hybrid methodology of my own creation. This methodology is called the Six Steps, and I designed it to be especially friendly for teaching. My Six Steps methodology removes some of the ambiguity that inexperienced students may have with open-ended tasks from CRISP-DM, such as Business Understanding, or a corporate-focused task such as Deployment. In addition, the Six Steps method keeps the focus on developing students' critical thinking skills by requiring them to answer Why are we doing this? and What does it mean? at the beginning and end of the process. My Six Steps method looks like this: Problem statement: In this step, the students identify what the problem is that they are trying to solve. Ideally, they motivate the case for why they are doing all this work. Data collection and storage: In this step, students locate data and plan their storage for the data needed for this problem. 
They also provide information about where the data that is helping them answer their motivating question came from, as well as what format it is in and what all the fields mean. Data cleaning: In this phase, students carefully select only the data they really need, and pre-process the data into the format required for the mining step. Data mining: In this step, students formalize their chosen data mining methodology. They describe what algorithms they used and why. The output of this step is a model and discovered patterns. Representation and visualization: In this step, the students show the results of their work visually. The outputs of this step can be tables, drawings, graphs, charts, network diagrams, maps, and so on. Problem resolution: This is an important step for beginner data miners. This step explicitly encourages the student to evaluate whether the patterns they showed in step 5 are really an answer to the question or problem they posed in step 1. Students are asked to state the limitations of their model or results, and to identify parts of the motivating question that they could not answer with this method. Which data mining methodology is the best? A 2014 survey of the subscribers of Gregory Piatetsky-Shapiro's very popular data mining email newsletter KDNuggets included the question What main methodology are you using for your analytics, data mining, or data science projects? 43% of the poll respondents indicated that they were using the CRISP-DM methodology 27% of the respondents were using their own methodology or a hybrid 7% were using the traditional KDD methodology These results are generally similar to the 2007 results from the same newsletter asking the same question. My best advice is that it does not matter too much which methodology you use for a data mining project, as long as you just pick one. If you do not have any methodology at all, then you run the risk of forgetting important steps. Choose one of the methods that seems like it might work for your project and your needs, and then just do your best to follow the steps. We will vary our data mining methodology depending on which technique we are looking at in a given article. For example, even though the focus of the article as a whole is on the data mining step, we still need to motivate of project with a healthy dose of Business Understanding (CRISP-DM) or Problem Statement (Six Steps) so that we understand why we are doing the tasks and what the results mean. In addition, in order to learn a particular data mining method, we may also have to do some pre-processing, whether we call that data cleaning, integration, or transformation. But in general, we will try to keep these tasks to a minimum so that our focus on data mining remains clear. Finally, even though data visualization is typically very important for representing the results of your data mining process to your audience, we will also keep these tasks to a minimum so that we can remain focused on the primary job at hand: data mining. Summary In this article, we learned what it would take to expand our data mining toolbox to the master level. First we took a long view of the field as a whole, starting with the history of data mining as a piece of the knowledge discovery in databases (KDD) process. We also compared the field of data mining to other similar terms such as data science, machine learning, and big data. 
Next, we outlined the common tools and techniques that most experts consider to be most important to the KDD process, paying special attention to the techniques that are used most frequently in the mining and analysis steps. To really master data mining, it is important that we work on problems that are different than simple textbook examples. For this reason we will be working on more exotic data mining techniques such as generating summaries and finding outliers, and focusing on more unusual data types, such as text and networks.  Resources for Article: Further resources on this subject: Python Data Structures [article] Mining Twitter with Python – Influence and Engagement [article] Data mining [article]


Key Elements of Time Series Analysis

Packt
08 Aug 2016
7 min read
In this article by Jay Gendron, author of the book Introduction to R for Business Intelligence, we will see that time series analysis is often considered the most difficult analysis technique. It is true that this is a challenging topic. However, one may also argue that an introductory awareness of a difficult topic is better than perfect ignorance about it. Time series analysis is a technique designed to look at chronologically ordered data that may form cycles over time. Key topics covered in this article include the following:

Introducing key elements of time series analysis

(For more resources related to this topic, see here.)

Time series analysis is an upper-level college statistics course. It is also a demanding topic taught in econometrics. This article provides you with an understanding of a useful but difficult analysis technique. It provides a combination of theoretical learning and hands-on practice. The goal is to provide you a basic understanding of working with time series data and give you a foundation to learn more.

Use Case: forecasting future ridership

The finance group approached the BI team and asked for help with forecasting future trends. They heard about your great work for the marketing team and wanted to get your perspective on their problem. Once a year, they prepare an annual report that includes ridership details. They are hoping to include not only last year's ridership levels, but also a forecast of ridership levels in the coming year. These types of time-based predictions are forecasts. The Ch6_ridership_data_2011-2012.csv data file is available at the website—http://jgendron.github.io/com.packtpub.intro.r.bi/. This data is a subset of the bike sharing data. It contains two years of observations, including the date and a count of users by hour.

Introducing key elements of time series analysis

You just applied a linear regression model to time series data and saw it did not work. The biggest problem was not a failure in fitting a linear model to the trend. For this well-behaved time series, the average formed a linear plot over time. Where was the problem? The problem was in the seasonal fluctuations. The seasonal fluctuations were one year in length and then repeated. Most of the data points existed above and below the fitted line, instead of on it or near it. As we saw, the ability to make a point estimate prediction was poor. There is an old adage that says even a broken clock is correct twice a day. This is a good analogy for analyzing seasonal time series data with linear regression. The fitted linear line would be a good predictor twice every cycle. You will need to do something about the seasonal fluctuations in order to make better forecasts; otherwise, they will simply be straight lines with no account of the seasonality.

With seasonality in mind, there are functions in R that can break apart the trend, seasonality, and random components of a time series. The decompose() function found in the forecast package shows how each of these three components influences the data. You can think of this technique as being similar to creating the correlogram plot during exploratory data analysis.
It captures a greater understanding of the data in a single plot:

library(forecast)
plot(decompose(airpass))

The output of this code is shown here:

This decomposition capability is nice, as it gives you insights about the approaches you may want to take with the data. With reference to the previous output, the panels are described as follows:

The top panel provides a view of the original data for context.
The next panel shows the trend. It smooths the data and removes the seasonal component. In this case, you will see that over time, air passenger volume has increased steadily and in the same direction.
The third plot shows the seasonal component. Removing the trend helps reveal any seasonal cycles. This data shows a regular and repeated seasonal cycle through the years.
The final plot is the randomness—everything else in the data. It is like the error term in linear regression. You will see less error in the middle of the series.

The stationary assumption

There is an assumption for creating time series models. The data must be stationary. Stationary data exists when its mean and variance do not change as a function of time. If you decompose a time series and witness a trend, a seasonal component, or both, then you have non-stationary data. You can transform it into stationary data in order to meet the required assumption.

Using a linear model for comparison, there is randomness around a mean—represented by data points scattered randomly around a fitted line. The data is independent of time and it does not follow other data in a cycle. This means that the data is stationary. Not all the data lies on the fitted line, but it is not moving. In order to analyze time series data, you need your data points to stay still. Imagine trying to count a class of primary school students while they are on the playground during recess. They are running about back and forth. In order to count them, you need them to stay still—be stationary. Transforming non-stationary data into stationary data allows you to analyze it. You can transform non-stationary data into stationary data using a technique called differencing.

The differencing techniques

Differencing subtracts each data point from the data point that is immediately in front of it in the series. This is done with the diff() function. Mathematically, it works as follows:

diff(y): y'[t] = y[t] - y[t - 1], for t = 2, ..., n

Seasonal differencing is similar, but it subtracts each data point from its related data point in the next cycle. This is done with the diff() function, along with a lag parameter set to the number of data points in a cycle. Mathematically, it works as follows:

diff(y, lag = m): y'[t] = y[t] - y[t - m], for t = m + 1, ..., n

Look at the results of differencing in this toy example. Build a small sample dataset of 36 data points that include an upward trend and a seasonal component, as shown here:

seq_down <- seq(.625, .125, -0.125)
seq_up <- seq(0, 1.5, 0.25)
y <- c(seq_down, seq_up, seq_down + .75, seq_up + .75,
       seq_down + 1.5, seq_up + 1.5)

Then, plot the original data and the results obtained after calling the diff() function:

par(mfrow = c(1, 3))
plot(y, type = "b", ylim = c(-.1, 3))
plot(diff(y), ylim = c(-.1, 3), xlim = c(0, 36))
plot(diff(diff(y), lag = 12), ylim = c(-.1, 3), xlim = c(0, 36))
par(mfrow = c(1, 1))
detach(package:TSA, unload = TRUE)

These three panels show the results of differencing and seasonal differencing.
The final line detaches the TSA package to avoid conflicts with other functions in the forecast library we will use. These three panes are described as follows:

The left pane shows the n = 36 data points, with 12 points in each of the three cycles. It also shows a steadily increasing trend. Either of these characteristics breaks the stationary data assumption.
The center pane shows the results of differencing. Plotting the difference between each point and its next neighbor removes the trend. Also, notice that you get one less data point. With differencing, you get (n - 1) results.
The right pane shows seasonal differencing with a lag of 12. The data is stationary. Notice that the trend-differenced series is now the input to the seasonal differencing. Also note that you will lose a cycle of data, getting (n - lag) results.

Summary

Congratulations, you truly deserve recognition for getting through a very tough topic. You now have more awareness about time series analysis than some people with formal statistical training.

Resources for Article:

Further resources on this subject: Managing Oracle Business Intelligence [article], Self-service Business Intelligence, Creating Value from Data [article], Business Intelligence and Data Warehouse Solution - Architecture and Design [article]


Data Extracting, Transforming, and Loading

Packt
01 Aug 2016
15 min read
In this article, by Yu-Wei, Chiu, author of the book, R for Data Science Cookbook, covers the following topics: Scraping web data Accessing Facebook data (For more resources related to this topic, see here.) Before using data to answer critical business questions, the most important thing is to prepare it. Data is normally archived in files, and using Excel or text editors allows it to be easily obtained. However, data can be located in a range of different sources, such as databases, websites, and various file formats. Being able to import data from these sources is crucial. There are four main types of data. Data recorded in a text format is the most simple. As some users require storing data in a structured format, files with a .tab or .csv extension can be used to arrange data in a fixed number of columns. For many years, Excel has held a leading role in the field of data processing, and this software uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, as most data is not stored in a database, we must know how to use the web scraping technique to obtain data from the internet. As part of this chapter, we will introduce how to scrape data from the internet using the rvest package. Many experienced developers have already created packages to allow beginners to obtain data more easily, and we focus on leveraging these packages to perform data extraction, transformation, and loading. In this chapter, we will first learn how to utilize R packages to read data from a text format and scan files line by line. We then move to the topic of reading structured data from databases and Excel. Finally, we will learn how to scrape internet and social network data using the R web scraper. Scraping web data In most cases, the majority of data will not exist in your database, but it will instead be published in different forms on the internet. To dig more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/. Getting ready For this recipe, prepare your environment with R installed on a computer with internet access. How to do it... Perform the following steps to scrape data from http://www.bloomberg.com/: First, access the following link to browse the S&P 500 index on the Bloomberg Business website http://www.bloomberg.com/quote/SPX:IND: Once the page appears as shown in the preceding screenshot, we can begin installing and loading the rvest package: > install.packages("rvest") > library(rvest) Next, you can use the HTML function from rvest package to scrape and parse the HTML page of the link to the S&P 500 index at http://www.bloomberg.com/: > spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND") Use the browser's built-in web inspector to inspect the location of detail quote (marked with a red rectangle) below the index chart: You can then move the mouse over the detail quote and click on the target element that you wish to scrape down. 
As the following screenshot shows, the <div class="cell"> section holds all the information that we need:

Extract elements with the class of cell using the html_nodes function:

> cell <- spx_quote %>% html_nodes(".cell")

Furthermore, we can parse the label of the detailed quote from elements with the class of cell__label, extract text from the scraped HTML, and eventually clean spaces and newline characters from the extracted text:

> label <- cell %>%
+     html_nodes(".cell__label") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

We can also extract the value of the detailed quote from the elements with the class of cell__value, extract text from the scraped HTML, and clean spaces and newline characters:

> value <- cell %>%
+     html_nodes(".cell__value") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

Finally, we can set the extracted labels as the names of value:

> names(value) <- label

Next, we can access the energy and oil market index page at this link (http://www.bloomberg.com/energy), as shown in the following screenshot:

We can then use the web inspector to inspect the location of the table element:

Finally, we can use html_table to extract the table element with the class of data-table:

> energy <- html("http://www.bloomberg.com/energy")
> energy.table <- energy %>% html_node(".data-table") %>% html_table()

How it works...

The most difficult part of scraping data from a website is that web data is published and structured in different formats. We have to fully understand how the data is structured within the HTML tags before continuing. As HTML (Hypertext Markup Language) is a language that has similar syntax to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:

Inconsistent behavior in different browsers
It is hard to read and maintain

For these reasons, we recommend using the CSS selector over XPath when parsing HTML. Python users may be familiar with how to scrape data quickly using the requests and BeautifulSoup packages. The rvest package is the counterpart package in R, which provides the same capability to simply and efficiently harvest data from HTML pages.

In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from http://www.bloomberg.com/. Our first step is to make sure that we can access our target webpage through the internet, which is followed by installing and loading the rvest package. After installation and loading is complete, we can then use the HTML function to read the source code of the page into spx_quote.

Once we have confirmed that we can read the HTML page, we can start parsing the detailed quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (you may refer to https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-elements). Then, you can move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon). Next, click on the target element, and the DevTools window will highlight the source code of the selected area.
You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Or, you can examine the source code and find that the selected element is structured in HTML code with the class of cell. One highlight of rvest is that it is designed to work with magrittr, so that we can use a %>% pipelines operator to chain output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses CSS selector to parse elements, the function takes basic selectors with type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell. Finally, we should extract both label and value from previously parsed nodes. Here, we first extract the element of class cell__label, and we then use html_text to extract text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the element of the class__value class. As we extracted both label and value from detail quote, we can apply the label as the name to the extracted values. We have now organized data from the web to structured data. Alternatively, we can also use rvest to harvest tabular data. Similarly to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the element location of table data. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame. There's more... Instead of using the web inspector built into each browser, we can consider using SelectorGadget (http://selectorgadget.com/) to search for the CSS path. SelectorGadget is a very powerful and simple to use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks: To begin using SelectorGadget, access this link (https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb). Then, click on the green button (circled in the red rectangle as shown in the following screenshot) to install the plugin to Chrome: Next, click on the upper-right icon to open SelectorGadget, and then select the area which needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path: Finally, you can paste the extracted CSS path to html_nodes as an input argument to parse the data. Besides rvest, we can connect R to Selenium via Rselenium to scrape the web page. Selenium was originally designed as an automating web application that enables the user to command a web browser to automate processes through simple scripts. However, we can also use Selenium to scrape data from the Internet. 
The following instruction presents a sample demo on how to scrape Bloomberg.com using Rselenium: First, access this link to download the Selenium standalone server (http://www.seleniumhq.org/download/), as shown in the following screenshot: Next, start the Selenium standalone server using the following command: $ java -jar selenium-server-standalone-2.46.0.jar If you can successfully launch the standalone server, you should see the following message, which means that you can connect to the server that binds to port 4444: At this point, you can begin installing and loading RSelenium with the following command: > install.packages("RSelenium") > library(RSelenium) After RSelenium is installed, register the driver and connect to the Selenium server: > remDr <- remoteDriver(remoteServerAddr = "localhost" + , port = 4444 + , browserName = "firefox" +) Examine the status of the registered driver: > remDr$getStatus() Next, we navigate to Bloomberg.com: > remDr$open() > remDr$navigate("http://www.bloomberg.com/quote/SPX:IND ") Finally, we can scrape the data using the CSS selector. > webElem <- remDr$findElements('css selector', ".cell") > webData <- sapply(webElem, function(x){ + label <- x$findChildElement('css selector', '.cell__label') + value <- x$findChildElement('css selector', '.cell__value') + cbind(c("label" = label$getElementText(), "value" = value$getElementText())) + } + ) Accessing Facebook data Social network data is another great source for a user who is interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, we can easily access the data without the need to inspect how the data is structured. In this recipe, we will illustrate how to use rvest and rson to read and parse data from Facebook. Getting ready For this recipe, prepare your environment with R installed on a computer with Internet access. How to do it… Perform the following steps to access data from Facebook: First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/), as shown in the following screenshot: Click on Tools & Support and select Graph API Explorer: Next, click on Get Token and choose Get Access Token: On the User Data Permissions pane, select user_tagged_places and then click on Get Access Token: Copy the generated access token to the clipboard: Try to access Facebook API using rvest: > access_token <- '<access_token>' > fb_data <- html(sprintf("https://graph.facebook.com/me/tagged_places?access_token=%s",access_token)) Install and load rjson package: > install.packages("rjson") > library(rjson) Extract the text from fb_data and then use fromJSON to read JSON data: > fb_json <- fromJSON(fb_data %>% html_text()) Use sapply to extract the name and ID of the place from fb_json: > fb_place <- sapply(fb_json$data, function(e){e$place$name}) > fb_id <- sapply(fb_json$data, function(e){e$place$id}) Last, use data.frame to wrap the data: > data.frame(place = fb_place, id = fb_id) How it works… In this recipe, we covered how to retrieve social network data through Facebook's Graph API. Unlike scraping web pages, you need to obtain a Facebook access token before making any request for insight information. There are two ways to retrieve the access token: the first is to use Facebook's Graph API Explorer, and the other is to create a Facebook application. 
In this recipe, we illustrated how to use the Graph API Explorer to obtain the access token. Facebook's Graph API Explorer is where you can craft your requests URL to access Facebook data on your behalf. To access the explorer page, we first visit Facebook's developer page (https://developers.facebook.com/). The Graph API Explorer page is under the drop-down menu of Tools & Support. After entering the explorer page, we select Get Access Token from the drop-down menu of Get Token. Subsequently, a tabbed window will appear; we can check access permission to various levels of the application. For example, we can check tagged_places to access the locations that we previously tagged. After we selected the permissions that we require, we can click on Get Access Token to allow Graph API Explorer to access our insight data. After completing these steps, you will see an access token, which is a temporary and short-lived token that you can use to access Facebook API. With the access token, we can then access Facebook API with R. First, we need a HTTP request package. Similarly to the web scraping recipe, we can use the rvest package to make the request. We craft a request URL with the addition of the access_token (copied from Graph API Explorer) to the Facebook API. From the response, we should receive JSON formatted data. To read the attributes of the JSON format data, we install and load the RJSON package. We can then use the fromJSON function to read the JSON format string extracted from the response. Finally, we read places and ID information through the use of the sapply function, and we can then use data.frame to transform extracted information to the data frame. At the end of this recipe, we should see data formatted in the data frame. There's more... To learn more about Graph API, you can read the official document from Facebook (https://developers.facebook.com/docs/reference/api/field_expansion/): First, we need to install and load the Rfacebook package: > install.packages("Rfacebook") > library(Rfacebook) We can then use built-in functions to retrieve data from the user or access similar information with the provision of an access token: > getUsers("me", "<access_token>") If you want to scrape public fan pages without logging into Facebook every time, you can create a Facebook app to access insight information on behalf of the app.: To create an authorized app token, login to the Facebook developer page and click on Add a New Page: You can create a new Facebook app with any name, providing that it has not already been registered: Finally, you can copy both the app ID and app secret and craft the access token to <APP ID>|<APP SECRET>. You can now use this token to scrape public fan page information with Graph API: Similarly to Rfacebook, we can then replace the access_token with <APP ID>|<APP SECRET>: > getUsers("me", "<access_token>") Summary In this article, we learned how to utilize R packages to read data from a text format and scan files line by line. We also learned how to scrape internet and social network data using the R web scraper. Resources for Article: Further resources on this subject: Learning Data Analytics with R and Hadoop [article] Big Data Analysis (R and Hadoop) [article] Using R for Statistics, Research, and Graphics [article]

Overview of Certificate Management

Packt
18 Jul 2016
24 min read
In this article by David Steadman and Jeff Ingalls, the authors of Microsoft Identity Manager 2016 Handbook, we will look at certificate management in brief. Microsoft Identity Management (MIM)—certificate management (CM)—is deemed the outcast in many discussions. We are here to tell you that this is not the case. We see many scenarios where CM makes the management of user-based certificates possible and improved. If you are currently using FIM certificate management or considering a new certificate management deployment with MIM, we think you will find that CM is a component to consider. CM is not a requirement for using smart cards, but it adds a lot of functionality and security to the process of managing the complete life cycle of your smart cards and software-based certificates in a single forest or multiforest scenario. In this article, we will look at the following topics: What is CM? Certificate management components Certificate management agents The certificate management permission model (For more resources related to this topic, see here.) What is certificate management? Certificate management extends MIM functionality by adding management policy to a driven workflow that enables the complete life cycle of initial enrollment, duplication, and the revocation of user-based certificates. Some smart card features include offline unblocking, duplicating cards, and recovering a certificate from a lost card. The concept of this policy is driven by a profile template within the CM application. Profile templates are stored in Active Directory, which means the application already has a built-in redundancy. CM is based on the idea that the product will proxy, or be the middle man, to make a request to and get one from CA. CM performs its functions with user agents that encrypt and decrypt its communications. When discussing PKI (Public Key Infrastructure) and smart cards, you usually need to have some discussion about the level of assurance you would like for the identities secured by your PKI. For basic insight on PKI and assurance, take a look at http://bit.ly/CorePKI. In typical scenarios, many PKI designers argue that you should use Hardware Security Module (HSM) to secure your PKI in order to get the assurance level to use smart cards. Our personal opinion is that HSMs are great if you need high assurance on your PKI, but smart cards increase your security even if your PKI has medium or low assurance. Using MIM CM with HSM will not be covered in this article, but if you take a look at http://bit.ly/CMandLunSA, you will find some guidelines on how to use MIM CM and HSM Luna SA. The Financial Company has a low-assurance PKI with only one enterprise root CA issuing the certificates. The Financial Company does not use a HSM with their PKI or their MIM CM. If you are running a medium- or high-assurance PKI within your company, policies on how to issue smart cards may differ from the example. More details on PKI design can be found at http://bit.ly/PKIDesign. Certificate management components Before we talk about certificate management, we need to understand the underlying components and architecture: As depicted before, we have several components at play. We will start from the left to the right. From a high level, we have the Enterprise CA. The Enterprise CA can be multiple CAs in the environment. Communication from the CM application server to the CA is over the DCOM/RPC channel. 
End user communication can be with the CM web page or with a new REST API via a modern client to enable the requesting of smart cards and the management of these cards. From the CM perspective, the two mandatory components are the CM server and the CA modules. Looking at the logical architecture, we have the CA, and underneath this, we have the modules. The policy and exit module, once installed, control the communication and behavior of the CA based on your CM's needs. Moving down the stack, we have Active Directory integration. AD integration is the nuts and bolts of the operation. Integration into AD can be very complex in some environments, so understanding this area and how CM interacts with it is very important. We will cover the permission model later in this article, but it is worth mentioning that most of the configuration is done and stored in AD along with the database. CM uses its own SQL database, and the default name is FIMCertificateManagement. The CM application uses its own dedicated IIS application pool account to gain access to the CM database in order to record transactions on behalf of users. By default, the application pool account is granted the clmApp role during the installation of the database, as shown in the following screenshot:   In CM, we have a concept called the profile template. The profile template is stored in the configuration partition of AD, and the security permissions on this container and its contents determine what a user is authorized to see. As depicted in the following screenshot, CM stores the data in the Public Key Services (1) and the Profile Templates container. CM then reads all the stored templates and the permissions to determine what a user has the right to do (2): Profile templates are at the core of the CM logic. The three components comprising profile templates are certificate templates, profile details, and management policies. The first area of the profile template is certificate templates. Certificate templates define the extensions and data point that can be included in the certificate being requested. The next item is profile details, which determines the type of request (either a smart card or a software user-based certificate), where we will generate the certificates (either on the server or on the client side of the operations), and which certificate templates will be included in the request. The final area of a profile template is known as management policies. Management policies are the workflow engine of the process and contain the manager, the subscriber functions, and any data collection items. The e-mail function is initiated here and commonly referred to as the One Time Password (OTP) activity. Note the word "One". A trigger will only happen once here; therefore, multiple alerts using e-mail would have to be engineered through alternate means, such as using the MIM service and expiration activities. The permission model is a bit complex, but you'll soon see the flexibility it provides. Keep in mind that Service Connection Point (SCP) also has permissions applied to it to determine who can log in to the portal and what rights the user has within the portal. SCP is created upon installation during the wizard configuration. You will want to be aware of the SCP location in case you run into configuration issues with administrators not being able to perform particular functions. 
The SCP location is in the System container, within Microsoft, and within Certificate Lifecycle Manager, as shown here:
Typical location: CN=Certificate Lifecycle Manager,CN=Microsoft,CN=System,DC=THEFINANCIALCOMPANY,DC=NET
Certificate management agents
We covered several key components of the profile templates and where some of the permission model is stored. We now need to understand how the separation of duties is defined within the agent roles. The permission model provides granular control, which promotes the separation of duties. CM uses six agent accounts, and they can be named to fit your organization's requirements. We will walk through the initial setup again later in this article so that you can use our setup or alter it based on your needs. The Financial Company only requires the typical setup. We precreated the following accounts for TFC, but the wizard will create them for you if you do not. During the installation and configuration of CM, we will use the following accounts: Besides the separation of duties, CM offers enrollment by proxy. Proxy enrollment refers to providing a middle man for the request so that the end user has a fluid workflow during enrollment. Most of this proxying is accomplished via the agent accounts in one way or another. The first account is MIM CM Agent (MIMCMAgent), which is used by the CM server to encrypt data, from the smart card admin PINs to the data collection stored in the database. So, the agent account plays an important role in protecting data and communication to and from the certificate authorities. The last role the CM agent holds is the capability to revoke certificates. The agent certificate thumbprint is very important, and you need to make sure the correct value is updated in the three areas: CM, web.config, and the certificate policy module under the Signing Certificates tab on the CA. We have identified these areas in the following excerpt and screenshot. For web.config (the value attributes, shown here with placeholders, hold the agent certificate thumbprints):
<add key="Clm.SigningCertificate.Hash" value="<thumbprint>" />
<add key="Clm.Encryption.Certificate.Hash" value="<thumbprint>" />
<add key="Clm.SmartCard.ExchangeCertificate.Hash" value="<thumbprint>" />
The Signing Certificates tab is as shown in the following screenshot:
Now, when you run through the configuration wizard, these items are already updated, but it is good to know which locations need to be updated if you need to troubleshoot agent issues or even update/renew this certificate. The second account we want to look at is Key Recovery Agent (MIMCMKRAgent); this agent account is needed for CM to recover any archived private keys. Now, let's look at Enrollment Agent (MIMCMEnrollAgent); the main purpose of this agent account is to provide the enrollment of smart cards. Enrollment Agent, as we call it, is responsible for signing all smart card requests before they are submitted to the CA. Typical permission for this account on the CA is read and request. Authorization Agent (MIMCMAuthAgent), or the authentication agent as some folks call it, is responsible for determining access rights for all objects from a DACL perspective. When you log in to the CM site, it is the authorization account's job to determine what you have the right to do based on the ACLs applied to all the core components. We will go over all the agent accounts and the rights they need later in this article during our setup. CA Manager Agent (MIMCMManagerAgent) is used to perform core CA functions. More importantly, its job is to issue Certificate Revocation Lists (CRLs). This happens when a smart card or certificate is retired or revoked. 
It is up to this account to make sure the CRL is updated with this critical information. We saved the best for last: Web Pool Agent (MIMCMWebAgent). This agent is used to run the CM web application. The agent is the account that contacts the SQL server to record all user and admin transactions. The following is a good depiction of all the accounts together and the high-level functions:   The certificate management permission model In CM, we think this part is the most complex because with the implementation, you can be as granular as possible. For this reason, this area is the most difficult to understand. We will uncover the permission model so that we can begin to understand how the permission model works within CM. When looking at CM, you need to formulate the type of management model you will be deploying. What we mean by this is will you have a centralized or delegated model? This plays a key part in deployment planning for CM and the permission you will need to apply. In the centralized model, a specific set of managers are assigned all the rights for the management policy. This includes permissions on the users. Most environments use this method as it is less complex for environments. Now, within this model, we have manager-initiated permission, and this is where CM permissions are assigned to groups containing the subscribers. Subscribers are the actual users doing the enrollment or participating in the workflow. This is the model that The Financial Company will use in its configuration. The delegated model is created by updating two flags in web.config called clm.RequestSecurity.Flags and clm.RequestSecurity.Groups. These two flags work hand in hand as if you have UseGroups, then it will evaluate all the groups within the forests to include universal/global security. Now, if you use UseGroups and define clm.RequestSecurity.Groups, then it will only look for these specific groups and evaluate via the Authorization Agent . The user will tell the Authorization Agent to only read the permission on the user and ignore any group membership permissions:   When we continue to look at the permission, there are five locations that permissions can be applied in. In the preceding figure is an outline of these locations, but we will go in more depth in the subsections in a bit. The basis of the figure is to understand the location and what permission can be applied. The following are the areas and the permissions that can be set: Service Connection Point: Extended Permissions Users or Groups: Extended Permissions Profile Template Objects: Container: Read or Write Template Object: Read/Write or Enroll Certificate Template: Read or Enroll CM Management Policy within the Web application: We have multiple options based on the need, such as Initiate Request Now, let's begin to discuss the core areas to understand what they can do. So, The Financial Company can design the enrollment option they want. In the example, we will use the main scenario we encounter, such as the helpdesk, manager, and user-(subscriber) based scenarios. For example, certain functions are delegated to the helpdesk to allow them to assist the user base without giving them full control over the environment (delegated model). Remember this as we look at the five core permission areas. Creating service accounts So far, in our MIM deployment, we have created quite a few service accounts. MIM CM, however, requires that we create a few more. 
During the configuration wizard, we will get the option of having the wizard create them for us, but we always recommend creating them manually in FIM/MIM CM deployments. One reason is that a few of these need to be assigned some certificates. If we use an HSM, we have to create it manually in order to make sure the certificates are indeed using the HSM. The wizard will ask for six different service accounts (agents), but we actually need seven. In The Financial Company, we created the following seven accounts to be used by FIM/MIM CM: MIMCMAgent MIMCMAuthAgent MIMCMCAManagerAgent MIMCMEnrollAgent MIMCMKRAgent MIMCMWebAgent MIMCMService The last one, MIMCMService, will not be used during the configuration wizard, but it will be used to run the MIM CM Update service. We also created the following security groups to help us out in the scenarios we will go over: MIMCM-Helpdesk: This is the next step in OTP for subscribers MIMCM-Managers: These are the managers of the CM environment MIMCM-Subscribers: This is group of users that will enroll Service Connection Point Service Connection Point (SCP)is located under the Systems folder within Active Directory. This location, as discussed in the earlier parts of the article, defines who functions as the user as it relates to logging in to the web application. As an example, if we just wanted every user to only log in, we would give them read rights. Again, authenticated users, have this by default, but if you only wanted a subset of users to access, you should remove authenticated users and add your group. When you run the configuration wizard, SCP is decided, but the default is the one shown in the following screenshot:   If a user is assigned to any of the MIM CM permissions available on SCP, the administrative view of the MIM CM portal will be shown. The MIM CM permissions are defined in a Microsoft TechNet article at http://bit.ly/MIMCMPermission. For your convenience, we have copied parts of the information here: MIM CM Audit: This generates and displays MIM CM policy templates, defines management policies within a profile template, and generates MIM CM reports. MIM CM Enrollment Agent: This performs certificate requests for the user or group on behalf of another user. The issued certificate's subject contains the target user's name and not the requester's name. MIM CM Request Enroll: This initiates, executes, or completes an enrollment request. MIM CM Request Recover: This initiates encryption key recovery from the CA database. MIM CM Request Renew: This initiates, executes, or completes an enrollment request. The renewal request replaces a user's certificate that is near its expiration date with a new certificate that has a new validity period. MIM CM Request Revoke: This revokes a certificate before the expiration of the certificate's validity period. This may be necessary, for example, if a user's computer or smart card is stolen. MIM CM Request Unblock Smart Card: This resets a smart card's user Personal Identification Number (PIN) so that he/she can access the key material on a smart card. The Active Directory extended permissions So, even if you have the SCP defined, we still need to set up the permissions on the user or group of users that we want to manage. As in our helpdesk example, if we want to perform certain functions, the most common one is offline unblock. This would require the MIMCM-HelpDesk group. We will create this group later in this article. 
It would contain all help desk users. Then, on the SCP, we would give them MIM CM Request Unblock Smart Card and MIM CM Enrollment Agent. Then, you need to assign the corresponding extended permissions on MIMCM-Subscribers, which contains all the users we plan to manage with the helpdesk and offline unblock:
So, as you can see, we are getting into redundant permissions, but the location where a permission is applied determines what the user can do. So, planning of the model is very important. Also, it is important to document what you have, because with a slight tweak, things can and will break.
The certificate templates permission
In order for any of this to be possible, we still need to give the manager or the user permission to enroll for or read the certificate template, as this will be added to the profile template. Anyone who will manage this certificate needs read and enroll permissions. This is pretty basic, but that is it, as shown in the following screenshot:
The profile template permission
The profile template determines what a user can read within the template. To get to the profile template, we need to use Active Directory Sites and Services to manage profile templates. We need to activate the Services node, as it is not shown by default; to do this, we will click on View | Show Services Node:
As an example, if you want a user to enroll in the cert, he/she would need CM Enroll on the profile template, as shown in the following screenshot:
Now, this is for users, but let's say you want to delegate the creation of profile templates. For this, all you need to do is give the MIMCM-Managers group the delegated right to create all child items on the profile template container, as follows:
The management policy permission
For the management policy, we will break it down into two sections: a software-based policy and a smart card management policy. As we have different capabilities within CM based on the type, CM comes with two sample policies by default (take a look at the following screenshot), which we duplicate to create a new one. When configuring, it is good to know that you cannot combine software and smart card-based certificates in a policy:
The software management policy
The software-based certificate policy has the following policies available through the CM life cycle:
The Duplicate Policy panel creates a duplicate of all the certificates in the current profile. Now, if the first profile is created for the user, all the other profiles created afterwards will be considered duplicates, and the first generated policy will be primary. The Enroll Policy panel defines the initial enrollment steps for certificates, such as initiating the enroll request and data collection during enroll initiation. The Online Update Policy panel is part of the automatic policy function when key items in the policy change. This includes certificates that are about to expire, and cases when a certificate is added to or removed from the existing profile template. The Recover Policy panel allows for the recovery of the profile in the event that the user was deleted. This includes the cases where certs are deleted by accident. One thing to point out is that if the certificate was a signing cert, the recovery policy would issue a new replacement cert. However, if the cert was used for encryption, you can recover the original using this policy. The Recover On Behalf Policy panel allows managers or the helpdesk to recover certificates on behalf of the user in the event that they need any of the certificates. 
The Renew Policy panel is the workflow that defines the renewal settings, such as revocation and who can initiate a request. The Suspend and Reinstate Policy panel enables a temporary revocation of the profile and puts it in a "certificate hold" status. More information about the CRL status can be found at http://bit.ly/MIMCMCertificateStatus. The Revoke Policy panel maintains the revocation policy and the settings for the revocation reason and delay. Also, it allows the system to push a delta CRL. You can also define the initiators for this policy workflow.
The smart card management policy
The smart card policy has some similarities to the software-based policy, but it also has a few new workflows to manage the full life cycle of the smart card:
The Profile Details panel is by far the most commonly used part in this section of the policy, as it defines all the smart card certificates that will be loaded in the policy along with the type of provider. One key item is creating and destroying virtual smart cards. One final key part is diversifying the admin key. This is a best practice, as it secures the admin PIN using diversification. So, before we continue, we want to go over this setting, as we think it is an important topic.
Diversifying the admin key is important because each card or batch of cards comes with a default admin key. Smart cards may have several PINs: an admin PIN, a PIN unlock key (PUK), and a user PIN. This admin key, as CM refers to it, is also known as the administrator PIN. This PIN differs from the user's PIN. When personalizing the smart card, you configure the admin key, the PUK, and the user's PIN. The admin key and the PUK are used to reset the virtual smart card's PIN. However, you cannot use both. You must use the PUK to unlock the PIN if you assign one during the virtual smart card's creation. It is important to note that you must use the PUK to reset the PIN if you provide both a PUK and an admin key. During the configuration of the profile template, you will be asked to enter this key as follows:
The admin key is typically used by smart card management solutions that enable a challenge response approach to PIN unlocking. The card provides a set of random data that the user reads (after the verification of identity) to the deployment admin. The admin then encrypts the data with the admin key (obtained as mentioned before) and gives the encrypted data back to the user. If the encrypted data matches that produced by the card during verification, the card will allow PIN resetting. As the admin key is never in the hands of anyone other than the deployment administrator, it cannot be intercepted or recorded by any other party (including the employee) and thus has significant security benefits beyond those of using a PUK, which is an important consideration during the personalization process. When enabled, the admin key is set to a card-unique value when the card is assigned to the user. The option to diversify admin keys with the default initialization provider allows MIM CM to use an algorithm to uniquely generate a new key on the card. The key is encrypted and securely transmitted to the client. It is not stored in the database or anywhere else. MIM CM recalculates the key as needed to manage the card:
The CM profile template contains a thumbprint for the certificate to be used in admin key diversification. CM looks in the personal store of the CM agent service account for the private key of the certificate in the profile template. 
Once located, the private key is used to calculate the admin key for the smart card. The admin key allows CM to manage the smart card (issuing, revoking, retiring, renewing, and so on). Loss of the private key prevents the management of cards diversified using this certificate. More detail on the control can be found at http://bit.ly/MIMCMDiversifyAdminKey. Continuing on, the Disable Policy panel defines the termination of the smart card before expiration; you can define the reason if you choose. Once disabled, it cannot be reused in the environment. The Duplicate Policy panel, similarly to the software-based one, produces a duplicate of all the certificates that will be on the smart card. The Enroll Policy panel, similarly to the software policy, defines who can initiate the workflow and the printing options. The Online Update Policy panel, similarly to the software-based policy, allows for the updating of certificates if the profile template is updated. The update is triggered when a renewal happens or, similarly to the software policy, when a cert is added or removed. The Offline Unblock Policy panel is the configuration of a process to allow offline unblocking. This is used when a user is not connected to the network. This process only supports Microsoft-based smart cards and works through challenge questions and answers, in most cases with the user calling the helpdesk. The Recovery On Behalf Policy panel allows management or the business to recover certificates if a cert is needed to decrypt information from a user whose contract was terminated or who left the company. The Replace Policy panel is used to replace a user's certificates in the event of them losing their card. If the lost card held a signing cert, then a new signing cert would be issued on the new card. As with software certs, if the certificate type is encryption, then it would need to be restored via the replace policy. The Renew Policy panel will be used when the profile/certificate is in the renewal period; it defines the revocation details and options and the initiate permissions. The Suspend and Reinstate Policy panel is the same as in the software-based policy for putting the certificate on hold. The Retire Policy panel is similar to the disable policy, but a key difference is that this policy allows the card to be reused within the environment. The Unblock Policy panel defines the users that can perform an actual unblocking of a smart card. More in-depth detail of these policies can be found at http://bit.ly/MIMCMProfiletempates.
Summary
In this article, we uncovered the basics of certificate management and the management components that are required to successfully deploy a CM solution. Then, we discussed and outlined the agent accounts and the roles they play. Finally, we looked into the management permission model, from the policy template to the permissions and the workflow.
Resources for Article:
Further resources on this subject:
Managing Network Devices [article]
Logging and Monitoring [article]
Creating Horizon Desktop Pools [article]

MicroStrategy 10

Packt
15 Jul 2016
13 min read
In this article by Dmitry Anoshin, Himani Rana, and Ning Ma, the authors of the book Mastering Business Intelligence with MicroStrategy, we are going to talk about MicroStrategy 10, which is one of the leading platforms on the market; it can handle all data analytics demands and offers a powerful solution. We will be discussing the different concepts of MicroStrategy, such as its history, deployment, and so on. (For more resources related to this topic, see here.)
Meet MicroStrategy 10
MicroStrategy is a market leader in Business Intelligence (BI) products. It has rich functionality in order to meet the requirements of modern businesses. In 2015, MicroStrategy shipped a new release of the platform, version 10. It offers both agility and governance like no other BI product. In addition, it is easy to use and enterprise ready. At the same time, it is great for both IT and business. In other words, MicroStrategy 10 offers an analytics platform that combines an easy and empowering user experience with enterprise-grade performance, management, and security capabilities. It is true bimodal BI and moves seamlessly between styles:
Data discovery and visualization
Enterprise reporting and dashboards
In-memory high performance BI
Scales from departments to enterprises
Administration and security
MicroStrategy 10 consists of three main products: MicroStrategy Desktop, MicroStrategy Mobile, and MicroStrategy Web. MicroStrategy Desktop lets users start discovering and visualizing data instantly. It is available for Mac and PC. It allows users to connect, prepare, discover, and visualize data. In addition, we can easily promote content to a MicroStrategy Server. Moreover, MicroStrategy Desktop has a brand new HTML5 interface and includes all connection drivers. It allows us to use data blending, data preparation, and data enrichment. Finally, it has powerful advanced analytics and can be integrated with R. To cut a long story short, we want to point out the main changes in the new BI platform. Developer keeps the same functionality and look, and Architect stays the same as well; the changes concern the Web interface and the Intelligence Server. Let's look closer at what MicroStrategy 10 can show us. MicroStrategy 10 expands the analytical ecosystem by using third-party toolkits, such as:
Data visualization libraries: We can easily plug in and use any visualization from the expanding range of Java libraries
Statistical toolkits: R, SAS, SPSS, KXEN, and others
Geolocation data visualization: Uses mapping capabilities to visualize and interact with location data
MicroStrategy 10 has more than 25 new data sources that we can connect to quickly and simply. In addition, it allows us to build reports on top of other BI tools, such as SAP Business Objects, Cognos, and Oracle BI. It also has a new native connector to Hadoop. Moreover, it allows us to blend multiple data sources in-memory. We also want to note that MicroStrategy 10 has rich functionality for working with data, such as:
Streamlined workflows to parse and prepare data
Multi-table in-memory support from different sources
Automatically parse and prepare data with every refresh
100+ inbuilt functions to profile and clean data
Create custom groups on the fly without coding
In terms of connecting to Hadoop, most BI products use Hive or Impala ODBC drivers in order to use SQL to get data from Hadoop. However, this method is poor in terms of performance. MicroStrategy 10 queries directly against Hadoop. As a result, it is up to 50 times faster than via ODBC. 
Let's look at some of the main technical changes that have significantly improved MicroStrategy. The platform is now faster than ever before, because it no longer has a two-billion-row limit on in-memory datasets and allows us to create analytical cubes up to 16 times bigger in size. It publishes cubes dramatically faster. Moreover, MicroStrategy 10 has higher data throughput, and cubes can be loaded in parallel 4 times faster with multi-threaded parallel loading. In addition, the in-memory engine allows us to create cubes 80 times larger than before, and we can access data from cubes 50% faster by using up to 8 parallel threads. Look at the following table, where we compare in-memory cube functionality in version 9 versus version 10:
Feature           Ver. 9         Ver. 10
Data volume       100 GB         ~2 TB
Number of rows    2 billion      200 billion
Load rate         8 GB/hour      ~200 GB/hour
Data model        Star schema    Any schema, tabular or multiple sets
In order to make the administration of MicroStrategy more effective in the new version, MicroStrategy Operations Manager was released. It gives MicroStrategy administrators powerful development tools to monitor, automate, and control systems. Operations Manager gives us:
Centralized management in a web browser
Enterprise Manager Console within the tool
Triggers and 24/7 alerts
System health monitors
Server management
Multiple environment administration
MicroStrategy 10 education and certification
MicroStrategy 10 offers new training courses that can be conducted offline in a training center, or online at http://www.microstrategy.com/us/services/education. We believe that certification is a good thing on your journey. The following certifications now exist for version 10:
MicroStrategy 10 Certified Associate Analyst
MicroStrategy 10 Certified Application Designer
MicroStrategy 10 Certified Application Developer
MicroStrategy 10 Certified Administrator
After passing all of these exams, you will become a MicroStrategy 10 Application Engineer. More details can be found here: http://www.microstrategy.com/Strategy/media/downloads/training-events/MicroStrategy-certification-matrix_v10.pdf.
History of MicroStrategy
Let us briefly look at the history of MicroStrategy, which began in 1991:
1991: Released first BI product, which allowed users to create graphical views and analyses of information data
2000: Released MicroStrategy 7 with a web interface
2003: First to release a fully integrated reporting tool, combining list reports, BI-style dashboards, and interface analyses in a single module
2005: Released MicroStrategy 8, including one-click actions and drag-and-drop dashboard creation
2009: Released MicroStrategy 9, delivering a seamless consolidated path from department to enterprise BI
2010: Unveiled new mobile BI capabilities for iPad and iPhone, and was featured on the iTunes Bestseller List
2011: Released MicroStrategy Cloud, the first SaaS offering from a major BI vendor
2012: Released Visual Data Discovery and groundbreaking new security platform, Usher
2013: Released expanded Analytics Platform and free Analytics Desktop client
2014: Announced availability of MicroStrategy Analytics via Amazon Web Services (AWS)
2015: MicroStrategy 10 was released, the first ever enterprise analytics solution for centralized and decentralized BI
Deploying MicroStrategy 10
We know only one way to master MicroStrategy: through practical exercises. Let's start by downloading and deploying MicroStrategy 10.2. 
Overview of training architecture
In order to master MicroStrategy and learn about some BI considerations, we need to download the all-important software, deploy it, and connect it to a network. During the preparation of the training environment, we will cover the installation of MicroStrategy on a Linux operating system. This is very good practice, because many people work with Windows and are not familiar with Linux, so this chapter will provide additional knowledge of working with Linux, as well as installing MicroStrategy and a web server. Look at the training architecture: There are three main components:
Red Hat Linux 6.4: Used for deploying the web server and the Intelligence Server.
Windows machine: Uses the MicroStrategy client and the Oracle database.
Virtual machine with Hadoop: A ready virtual machine with Hadoop, which will connect to MicroStrategy using a brand new connection.
In the real world, we should use separate machines for every component, and sometimes several machines in order to run one component. This is called clustering. Let's create a virtual machine.
Creating a Red Hat Linux virtual machine
Let's create a virtual machine with Red Hat Linux, which will host our Intelligence Server:
Go to http://www.redhat.com/ and create an account
Go to the software download center: https://access.redhat.com/downloads
Download RHEL: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.2/x86_64/product-software
Choose Red Hat Enterprise Linux Server
Download Red Hat Enterprise Linux 6.4 x86_64
Choose Binary DVD
Now we can create a virtual machine with RHEL 6.4. We have several options when choosing the software for deploying the virtual machine. In our case, we will use VMware Workstation. Before starting to deploy a new VM, we should adjust the default settings, such as increasing the RAM and HDD, and adding one more network card in order to connect the external environment with the MicroStrategy client and the sample database. In addition, we should create a new network. When the deployment of the RHEL virtual machine is complete, we should activate a subscription in order to install the required packages. Let us do this with one command in the terminal:
# subscription-manager register --username <username> --password <password> --auto-attach
Performing prerequisites for MicroStrategy 10
According to the installation and configuration guide, we should deploy all the necessary packages. In order to install them, we should execute the following commands as root:
# su
# yum install compat-libstdc++-33.i686
# yum install libXp.x86_64
# yum install elfutils-devel.x86_64
# yum install libstdc++-4.4.7-3.el6.i686
# yum install krb5-libs.i686
# yum install nss-pam-ldapd.i686
# yum install ksh.x86_64
The project design process
Project design is not just about creating a project in MicroStrategy Architect; it involves several steps and thorough analysis, such as how data is stored in the data warehouse, what reports the user wants based on the data, and so on. The following are the steps involved in our project design process:
Logical data model design
Once the business requirements are documented, the user must create a fact qualifier matrix to identify the attributes, facts, and hierarchies, which are the building blocks of any logical data model. An example of a fact qualifier is as follows: A logical data model is created based on the source systems and designed before defining a data warehouse. 
So, it's good for seeing which objects the users want and checking whether the objects are in the source systems. It represents the definition, characteristics, and relationships of the data. This graphical representation of information is easily understandable by business users too. A logical data model graphically represents the following concepts: Attributes: Provides a detailed description of the data Facts: Provide numerical information about the data Hierarchies: Provide relationships between data Data warehouse schema design Physical data warehouse design is based on the logical data model and represents the storage and retrieval of data from the data warehouse. Here, we determine the optimal schema design, which ensures reporting performance and maintenance. The key components of a physical data warehouse schema are columns and tables: Columns: These store attribute and fact data. The following are the three types of columns: ID column: Stores the ID for an attribute Description column: Stores text description of the attribute Fact column: Stores fact data Tables: Physical grouping of related data. Following are the types of tables: Lookup tables: Store information about attributes such as IDs and descriptions Relationship tables: Store information about relationship between two or more attributes Fact tables: Store factual data and the level of aggregation, which is defined based on the attributes of the fact table. They contain base fact columns or derived fact columns: Base fact: Stores the data at the lowest possible level of detail. Aggregate fact: Stores data at a higher or summarized level of detail. Mobile server installation and configuration While mobile client is easy to install, mobile server is not. Here we provide a step-by-step guide on how to install mobile server: Download MicroStrategyMobile.war. Mobile server is packed in a WAR file, just like Operation Manager or Web: Copy MicroStrategyMobile.war from <Microstrategy Installation folder>/Mobile/MobileServer to /usr/local/tomcat7/webapps. Then restart Tomcat, by issuing the ./shutdown.sh and ./startup.sh commands: Connect to the mobile server. Go to http://192.168.81.134:8080/MicroStrategyMobile/servlet/mstrWebAdmin. Then add the server name localhost.localdomain and click connect: Configure mobile server. You can configure (1) Authentication settings for the mobile server application; (2) Privileges and permissions; (3) SSL encryption; (4) Client authentication with a certificate server; (5) Destination folder for the photo uploader widget and signature capture input control. Performing Pareto analysis One good thing about data discovery tools is their agile approach to the data. We can connect any data source and easily slice and dice data. Let's try to use the Pareto principle in order to answer the question: How are sales distributed among the different products? The Pareto principle states that, for many events, roughly 80% of results come from 20% of the causes. For example, 80% of profits come from 20% of the products offered. This type of analysis is very popular in product analytics. In MicroStrategy Desktop, we can use shortcut metrics in order to quickly make complex calculations such as running sums or a percent of the total. Let's build a visualization in order to see the 20% of products that bring us 80% of the money: Choose Combo Chart. Drag and drop Salesamount to the vertical and Englishproductname to the horizontal. Add Orderdate to the filters and restrict to 60 days. 
Right-click on Sales amount and choose Descending Sort. Right-click on Salesamount | Shortcut Metrics | Percent Running Total. Drag and drop Metric Names to the Color By. Change the color of Salesamount and Percent Running Total. Change the shape of Percent Running Total. As a result, we get this chart: From this chart, we can quickly identify our top 20% of products, which bring us 80% of revenue.
Splunk and MicroStrategy
MicroStrategy 10 has announced a new connection to Splunk. I suppose that Splunk is not very popular in the world of Business Intelligence. Most people who have heard about Splunk think that it is just a platform for processing logs. The answer is both true and false. Splunk was derived from the world of spelunking, because searching for root causes in logs is a kind of spelunking without light, and Splunk solves this problem by indexing machine data from a tremendous number of data sources, from applications and hardware to sensors and so on.
What is Splunk
Splunk's goal is making machine data accessible, usable, and valuable for everyone, and turning machine data into business value. It can:
Collect data from anywhere
Search and analyze everything
Gain real-time Operational Intelligence
In the BI world, everyone knows what a data warehouse is.
Creating reports from Splunk
Now we are ready to build reports using MicroStrategy Desktop and Splunk. Let's do it:
Go to MicroStrategy Desktop, click Add Data, and choose Splunk
Create a connection using the existing DSN based on the Splunk ODBC driver:
Choose one of the tables (Splunk reports):
Add other tables as new data sources.
Now we can build a dashboard using data from Splunk by dragging and dropping attributes and metrics:
Summary
In this article, we looked at MicroStrategy 10 and its features. We learned about its history and deployment. We also learned about the project design process, the Pareto analysis, and the connection between Splunk and MicroStrategy.
Resources for Article:
Further resources on this subject:
Stacked Denoising Autoencoders [article]
Creating external tables in your Oracle 10g/11g Database [article]
Clustering Methods [article]

Stacked Denoising Autoencoders

Packt
11 Jul 2016
13 min read
In this article by John Hearty, author of the book Advanced Machine Learning with Python, we discuss autoencoders as valuable tools in themselves; significant accuracy can be obtained by stacking autoencoders to form a deep network. This is achieved by feeding the representation created by the encoder on one layer into the next layer's encoder as input to that layer. (For more resources related to this topic, see here.)
Stacked Denoising Autoencoders (SdA) are currently in use in many leading data science teams for sophisticated natural language analyses as well as a broad range of signals, images, and text analyses. The implementation of SdA will be very familiar after the previous chapter's discussion of deep belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks were used. Each layer of the deep architecture will have a dA and sigmoid component, with the autoencoder component being used to pretrain the sigmoid network. The performance measure used by an SdA is the training set error, with an intensive period of layer-to-layer (layer-wise) pretraining used to gradually align network parameters before a final period of fine-tuning. During fine-tuning, the network is trained using validation and test data, over fewer epochs but with larger update steps. The goal is to have the network converge at the end of the fine-tuning to deliver an accurate result.
In addition to delivering on the typical advantages of deep networks (the ability to learn feature representations for complex or high-dimensional datasets and train a model without extensive feature engineering), stacked autoencoders have an additional, very interesting property. Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data. Successive layers of an SdA may learn increasingly high-level features. While the first layer might learn some first-order features from input data (such as learning edges in a photo image), a second layer may learn some grouping of first-order features (for instance, by learning given configurations of edges that correspond to contours or structural elements in the input image).
There's no golden rule to determine how many layers or how large the layers should be for a given problem. The best solution is usually to experiment with these model parameters until you find an optimal point. This experimentation is best done with a hyperparameter optimization technique or genetic algorithm (subjects we'll discuss in later chapters of this book).
Higher layers may learn increasingly high-order configurations, enabling an SdA to learn to recognise facial features, alphanumerical characters, or the generalised forms of objects (such as a bird). This is what gives SdAs their unique capability to learn very sophisticated, high-level abstractions of their input data. Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack autoencoders can improve the effectiveness of the deep architecture (with the main constraint becoming computing cost in time). In this chapter, we'll look at stacking three autoencoders to solve a natural language processing challenge.
Applying SdA
Now that we've had a chance to understand the advantages and power of the SdA as a deep learning architecture, let's test our skills on a real-world dataset. 
For this chapter, let's step away from image datasets and work with the OpinRank Review Dataset, a text dataset of around 259,000 hotel reviews from TripAdvisor, which is accessible via the UCI Machine Learning dataset Repository. This freely-available dataset provides review scores (as floating point numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our SdA to attempt to identify the scoring of each hotel from its review text.
We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we prepare text data in an upcoming chapter. The source data is available at https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset.
In order to get started, we're going to need a stacked denoising autoencoder (hereafter SdA) class:

class SdA(object):
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        n_ins=280,
        hidden_layers_sizes=[500, 500],
        n_outs=5,
        corruption_levels=[0.1, 0.1]
    ):

As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder as the input to the subsequent layer. This class supports the configuration of the layer count (reflected in, but not set by, the length of the hidden_layers_sizes and corruption_levels vectors). It also supports differentiated layer sizes (in nodes) at each layer, which can be set using hidden_layers_sizes. As we discussed, the ability to configure successive layers of the autoencoder is critical to developing successful representations.
Next, we need parameters to store the MLP (self.sigmoid_layers) and dA (self.dA_layers) elements of the SdA. In order to specify the depth of our architecture, we use the self.n_layers parameter to specify the number of sigmoid and dA layers required:

        self.sigmoid_layers = []
        self.dA_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

Next, we need to construct our sigmoid and dA layers. We begin by setting each layer's input size either from the input vector size or from the size of the preceding layer. Following this, the sigmoid_layers and dA_layers components are created, with the dA layer drawing from the dA class we discussed earlier in this article:

        for i in xrange(self.n_layers):
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)
            self.sigmoid_layers.append(sigmoid_layer)
            self.params.extend(sigmoid_layer.params)

            dA_layer = dA(numpy_rng=numpy_rng,
                          theano_rng=theano_rng,
                          input=layer_input,
                          n_visible=input_size,
                          n_hidden=hidden_layers_sizes[i],
                          W=sigmoid_layer.W,
                          bhid=sigmoid_layer.b)
            self.dA_layers.append(dA_layer)

Having implemented the layers of our SdA, we'll need a final, logistic regression layer to complete the MLP component of the network:

        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs
        )

        self.params.extend(self.logLayer.params)
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
        self.errors = self.logLayer.errors(self.y)

This completes the architecture of our SdA. Next up, we need to generate the training functions used by the SdA class. 
Each function will have the minibatch index (index) as an argument, together with several other elements; corruption_level and learning_rate are enabled here so that we can adjust them (for example, gradually increase or decrease them) during training. Additionally, we identify variables that help locate where the batch starts and ends: batch_begin and batch_end, respectively.

    def pretraining_functions(self, train_set_x, batch_size):
        index = T.lscalar('index')
        corruption_level = T.scalar('corruption')
        learning_rate = T.scalar('lr')
        batch_begin = index * batch_size
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for dA in self.dA_layers:
            cost, updates = dA.get_cost_updates(corruption_level,
                                                learning_rate)
            fn = theano.function(
                inputs=[
                    index,
                    theano.Param(corruption_level, default=0.2),
                    theano.Param(learning_rate, default=0.1)
                ],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin: batch_end]
                }
            )
            pretrain_fns.append(fn)

        return pretrain_fns

The ability to dynamically adjust the learning rate in particular is very helpful and may be applied in one of two ways. Once a technique has begun to converge on an appropriate solution, it is very helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in which the network oscillates between values located around the optimum, without ever hitting it. In some contexts, it can be helpful to tie the learning rate to the network's performance measure. If the error rate is high, it can make sense to make larger adjustments until the error rate begins to decrease!
The pretraining function we've created takes the minibatch index and can optionally take the corruption level or learning rate. It performs one step of pretraining and outputs the cost value and vector of weight updates. In addition to pretraining, we need to build functions to support the fine-tuning stage, where the network is run iteratively over the validation and test data to optimize network parameters. The train_fn implements a single step of fine-tuning. The valid_score is a Python function that computes a validation score using the error measure produced by the SdA over the validation data. Similarly, test_score computes the error score over the test data.
To get this process off the ground, we first need to set up training, validation, and test datasets. Each stage requires two datasets (set x and set y), containing the features and class labels, respectively. The required number of minibatches for validation and test is determined, and an index is created to track batch size (and provide a means of identifying at which entries a batch starts and ends). 
Training, validation, and testing occur for each batch and, afterward, both valid_score and test_score are calculated across all batches:

    def build_finetune_functions(self, datasets, batch_size, learning_rate):

        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches /= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches /= batch_size

        index = T.lscalar('index')

        gparams = T.grad(self.finetune_cost, self.params)

        updates = [
            (param, param - gparam * learning_rate)
            for param, gparam in zip(self.params, gparams)
        ]

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='train'
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='test'
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='valid'
        )

        def valid_score():
            return [valid_score_i(i) for i in xrange(n_valid_batches)]

        def test_score():
            return [test_score_i(i) for i in xrange(n_test_batches)]

        return train_fn, valid_score, test_score

With the training functionality in place, the following code initiates our SdA:

numpy_rng = numpy.random.RandomState(89677)
print '... building the model'
sda = SdA(
    numpy_rng=numpy_rng,
    n_ins=280,
    hidden_layers_sizes=[240, 170, 100],
    n_outs=5
)

It should be noted that, at this point, we should be trying an initial configuration of layer sizes to see how we do. In this case, the layer sizes used here are the product of some initial testing. As we discussed, training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all of the SdA's layers. The second is a process of fine-tuning over validation and test data.
To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the layers using our previously-defined pretraining_fns:

print '... getting the pretraining functions'
pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size)

print '... pre-training the model'
start_time = time.clock()
corruption_levels = [.1, .2, .2]
for i in xrange(sda.n_layers):

    for epoch in xrange(pretraining_epochs):
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        corruption=corruption_levels[i],
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()

print >> sys.stderr, ('The pretraining code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))

At this point, we're able to initialize our SdA class via calling the preceding code stored within this book's GitHub repository, MasteringMLWithPython/Chapter3/SdA.py. 
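For readers who want to see the greedy layer-wise idea stripped of the Theano machinery, the following is a minimal, library-agnostic NumPy sketch of stacking denoising autoencoders. It is not the book's implementation: the layer sizes, noise level, learning rate, epoch count, and the random stand-in data are all assumptions made purely for illustration.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder(object):
    def __init__(self, n_visible, n_hidden, corruption=0.1, lr=0.1):
        self.W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # tied weights
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.corruption = corruption
        self.lr = lr

    def encode(self, x):
        return sigmoid(x.dot(self.W) + self.b_hid)

    def train_step(self, x):
        # Corrupt the input by randomly zeroing a fraction of its entries
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask
        h = self.encode(x_tilde)
        z = sigmoid(h.dot(self.W.T) + self.b_vis)       # reconstruction
        # Gradients of the squared reconstruction error, by hand
        delta_z = (z - x) * z * (1 - z)
        delta_h = delta_z.dot(self.W) * h * (1 - h)
        grad_W = x_tilde.T.dot(delta_h) + delta_z.T.dot(h)  # tied-weight gradient
        self.W -= self.lr * grad_W / x.shape[0]
        self.b_hid -= self.lr * delta_h.mean(axis=0)
        self.b_vis -= self.lr * delta_z.mean(axis=0)
        return np.mean((z - x) ** 2)

# Greedy layer-wise pretraining: each layer is trained on the encoding
# produced by the layer below it, mirroring the SdA construction above.
X = rng.rand(500, 280)            # stand-in for 280-dimensional review features
layer_sizes = [280, 240, 170]     # illustrative, loosely echoing the sizes above
layers = []
layer_input = X
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    da = DenoisingAutoencoder(n_in, n_out, corruption=0.1, lr=0.1)
    for epoch in range(15):
        cost = da.train_step(layer_input)
    print('layer %dx%d, final cost %.4f' % (n_in, n_out, cost))
    layers.append(da)
    layer_input = da.encode(layer_input)   # feed encoding to the next layer

The point of the sketch is only the training loop's shape: each autoencoder sees a corrupted version of its input, learns to reconstruct the clean version, and its encoder output becomes the training data for the next layer, exactly as the Theano class does at scale.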
Assessing SdA performance The SdA will take a significant length of time to run. With 15 epochs per layer and each layer typically taking an average of 11 minutes, the network will run for around 500 minutes on a modern desktop system with GPU acceleration and a single-threaded GotoBLAS. On a system without GPU acceleration, the network will take substantially longer to train and it is recommended that you use the alternative, which runs over a significantly smaller input dataset, MasteringMLWithPython/Chapter3/SdA_no_blas.py. The results are of a high quality, with a validation error score of 3.22% and test error score of 3.14%. These results are particularly impressive given the ambiguous and sometimes challenging nature of natural language processing applications. It was noticeable that the network classified more correctly for the 1-star and 5-star rating cases than for the intermediate levels. This is largely due to the ambiguous nature of unpolarized or unemotional language. Part of the reason that this input data was classifiable well was via significant feature engineering. While time-consuming and sometimes problematic, we've seen that well-executed feature engineering combined with an optimized model can deliver an excellent level of accuracy. Summary In this article, we introduced the autoencoder, an effective dimensionality reduction technique with some unique applications. We focused on the theory behind the SdA, an extension of autoencoders whereby any numbers of autoencoders are stacked in a deep architecture. Resources for Article: Further resources on this subject: Exception Handling in MySQL for Python [article] Clustering Methods [article] Machine Learning Using Spark MLlib [article]

Mining Twitter with Python – Influence and Engagement

Packt
11 Jul 2016
10 min read
In this article by Marco Bonzanini, author of the book Mastering Social Media Mining with Python, we will discuss mining Twitter data. Here, we will analyze users, their connections, and their interactions. In this article, we will discuss how to measure influence and engagement on Twitter. (For more resources related to this topic, see here.)
Measuring influence and engagement
One of the most commonly mentioned characters in the social media arena is the mythical influencer. This figure is responsible for a paradigm shift in recent marketing strategies (https://en.wikipedia.org/wiki/Influencer_marketing), which focus on targeting key individuals rather than the market as a whole. Influencers are typically active users within their community. In the case of Twitter, an influencer tweets a lot about topics they care about. Influencers are well connected as they follow and are followed by many other users who are also involved in the community. In general, an influencer is also regarded as an expert in their area, and is typically trusted by other users.
This description should explain why influencers are an important part of recent trends in marketing: an influencer can increase awareness or even become an advocate of a specific product or brand and can reach a vast number of supporters. Whether your main interest is Python programming or wine tasting, and regardless of how huge (or tiny) your social network is, you probably already have an idea who the influencers in your social circles are: a friend, acquaintance, or random stranger on the Internet whose opinion you trust and value because of their expertise on the given subject.
A different, but somewhat related, concept is engagement. User engagement, or customer engagement, is the assessment of the response to a particular offer, such as a product or service. In the context of social media, pieces of content are often created with the purpose of driving traffic towards the company website or e-commerce site. Measuring engagement is important as it helps in defining and understanding strategies to maximize the interactions with your network, and ultimately bring business. On Twitter, users engage by means of retweeting or liking a particular tweet, which, in return, provides more visibility to the tweet itself.
In this section, we'll discuss some interesting aspects of social media analysis regarding the possibility of measuring influence and engagement. On Twitter, a natural thought would be to associate influence with the number of users in a particular network. Intuitively, a high number of followers means that a user can reach more people, but it doesn't tell us how a tweet is perceived. 
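Before walking through the full comparison script, the following toy helper is one way to make these per-tweet and per-follower notions concrete. The function name and the sample counts are illustrative assumptions, not part of the book's code; it assumes Python 3 (true division), consistent with the script that follows.

def engagement_summary(followers, tweets, favorites, retweets):
    # Per-tweet rates describe how well individual tweets perform;
    # per-follower rates normalize by the size of the audience.
    return {
        'favorites_per_tweet': round(favorites / tweets, 2),
        'retweets_per_tweet': round(retweets / tweets, 2),
        'favorites_per_follower': round(favorites / followers, 2),
        'retweets_per_follower': round(retweets / followers, 2),
    }

# Example with made-up counts: a small but engaged audience
print(engagement_summary(followers=282, tweets=182, favorites=268, retweets=912))

The full script below computes exactly these kinds of aggregates for two real profiles, alongside an estimate of their one-degree reach.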
The following script compares some statistics for two user profiles:

import sys
import json

def usage():
    print("Usage:")
    print("python {} <username1> <username2>".format(sys.argv[0]))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)
    screen_name1 = sys.argv[1]
    screen_name2 = sys.argv[2]

After reading the two screen names from the command line, we will build up a list of followers for each of them, including their number of followers, in order to calculate the number of reachable users:

    followers_file1 = 'users/{}/followers.jsonl'.format(screen_name1)
    followers_file2 = 'users/{}/followers.jsonl'.format(screen_name2)
    with open(followers_file1) as f1, open(followers_file2) as f2:
        reach1 = []
        reach2 = []
        for line in f1:
            profile = json.loads(line)
            reach1.append((profile['screen_name'], profile['followers_count']))
        for line in f2:
            profile = json.loads(line)
            reach2.append((profile['screen_name'], profile['followers_count']))

We will then load some basic statistics (followers and statuses count) from the two user profiles:

    profile_file1 = 'users/{}/user_profile.json'.format(screen_name1)
    profile_file2 = 'users/{}/user_profile.json'.format(screen_name2)
    with open(profile_file1) as f1, open(profile_file2) as f2:
        profile1 = json.load(f1)
        profile2 = json.load(f2)
        followers1 = profile1['followers_count']
        followers2 = profile2['followers_count']
        tweets1 = profile1['statuses_count']
        tweets2 = profile2['statuses_count']

    sum_reach1 = sum([x[1] for x in reach1])
    sum_reach2 = sum([x[1] for x in reach2])
    avg_followers1 = round(sum_reach1 / followers1, 2)
    avg_followers2 = round(sum_reach2 / followers2, 2)

We will also load the timelines for the two users, in particular to observe the number of times their tweets have been favorited or retweeted:

    timeline_file1 = 'user_timeline_{}.jsonl'.format(screen_name1)
    timeline_file2 = 'user_timeline_{}.jsonl'.format(screen_name2)
    with open(timeline_file1) as f1, open(timeline_file2) as f2:
        favorite_count1, retweet_count1 = [], []
        favorite_count2, retweet_count2 = [], []
        for line in f1:
            tweet = json.loads(line)
            favorite_count1.append(tweet['favorite_count'])
            retweet_count1.append(tweet['retweet_count'])
        for line in f2:
            tweet = json.loads(line)
            favorite_count2.append(tweet['favorite_count'])
            retweet_count2.append(tweet['retweet_count'])

The preceding numbers are then aggregated into the average number of favorites and the average number of retweets, both in absolute terms and per number of followers:

    avg_favorite1 = round(sum(favorite_count1) / tweets1, 2)
    avg_favorite2 = round(sum(favorite_count2) / tweets2, 2)
    avg_retweet1 = round(sum(retweet_count1) / tweets1, 2)
    avg_retweet2 = round(sum(retweet_count2) / tweets2, 2)
    favorite_per_user1 = round(sum(favorite_count1) / followers1, 2)
    favorite_per_user2 = round(sum(favorite_count2) / followers2, 2)
    retweet_per_user1 = round(sum(retweet_count1) / followers1, 2)
    retweet_per_user2 = round(sum(retweet_count2) / followers2, 2)
    print("----- Stats {} -----".format(screen_name1))
    print("{} followers".format(followers1))
    print("{} users reached by 1-degree connections".format(sum_reach1))
    print("Average number of followers for {}'s followers: {}".format(screen_name1, avg_followers1))
    print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count1), avg_favorite1, favorite_per_user1))
    print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count1), avg_retweet1, retweet_per_user1))
    print("----- Stats {} -----".format(screen_name2))
    print("{} followers".format(followers2))
    print("{} users reached by 1-degree connections".format(sum_reach2))
    print("Average number of followers for {}'s followers: {}".format(screen_name2, avg_followers2))
    print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count2), avg_favorite2, favorite_per_user2))
    print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count2), avg_retweet2, retweet_per_user2))

This script takes two arguments from the command line and assumes that the data has already been downloaded. In particular, for both users, we need the data about their followers and their respective user timelines. The script is somewhat verbose because it computes the same operations for two profiles and prints everything on the terminal. We can break it down into different parts.

Firstly, we will look into the followers' followers. This provides some information related to the part of the network immediately connected to the given user. In other words, it should answer the question: how many users can I reach if all my followers retweet me? We achieve this by reading the users/<user>/followers.jsonl file and keeping a list of tuples, where each tuple represents one of the followers and is in the (screen_name, followers_count) form. Keeping the screen name at this stage is useful in case we want to observe who the users with the highest number of followers are (this is not computed in the script, but it is easy to produce using sorted(); see the short sketch at the end of this section).

In the second step, we will read the user profile from the users/<user>/user_profile.json file so that we can get information about the total number of followers and the total number of tweets. With the data collected so far, we can compute the total number of users who are reachable within a degree of separation (a follower of a follower) and the average number of followers of a follower. This is achieved via the following lines:

sum_reach1 = sum([x[1] for x in reach1])
avg_followers1 = round(sum_reach1 / followers1, 2)

The first line uses a list comprehension to iterate through the list of tuples mentioned previously, while the second one is a simple arithmetic average, rounded to two decimal places.

The third part of the script reads the user timeline from the user_timeline_<user>.jsonl file and collects information about the number of retweets and favorites for each tweet. Putting everything together allows us to calculate how many times a user has been retweeted or favorited, and what the average number of retweets and favorites per tweet and per follower is.
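As mentioned when describing the followers list, ranking the followers by their own follower counts is not computed in the script, but it is easy to produce with sorted(). The following is just a minimal sketch of this optional step, assuming reach1 is the list of (screen_name, followers_count) tuples built above:

# Hypothetical extension, not part of twitter_influence.py:
# show the five followers with the largest audience of their own
top_followers = sorted(reach1, key=lambda t: t[1], reverse=True)[:5]
for screen_name, followers_count in top_followers:
    print("{}: {} followers".format(screen_name, followers_count))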
To provide an example, I'll perform some vanity analysis and compare my account, @marcobonzanini, with Packt Publishing:

$ python twitter_influence.py marcobonzanini PacktPub

The script produces the following output:

----- Stats marcobonzanini -----
282 followers
1411136 users reached by 1-degree connections
Average number of followers for marcobonzanini's followers: 5004.03
Favorited 268 times (1.47 per tweet, 0.95 per user)
Retweeted 912 times (5.01 per tweet, 3.23 per user)
----- Stats PacktPub -----
10209 followers
29961760 users reached by 1-degree connections
Average number of followers for PacktPub's followers: 2934.84
Favorited 3554 times (0.33 per tweet, 0.35 per user)
Retweeted 6434 times (0.6 per tweet, 0.63 per user)

As you can see, the raw number of followers shows no contest, with Packt Publishing having approximately 35 times more followers than me. The interesting part of this analysis comes up when we compare the average number of retweets and favorites: apparently, my followers are much more engaged with my content than PacktPub's. Is this enough to declare that I'm an influencer while PacktPub is not? Clearly not. What we observe here is a natural consequence of the fact that my tweets are probably more focused on specific topics (Python and data science), so my followers are already more interested in what I'm publishing. On the other hand, the content produced by Packt Publishing is highly diverse, as it ranges across many different technologies. This diversity is also reflected in PacktPub's followers, who include developers, designers, scientists, system administrators, and so on. For this reason, each of PacktPub's tweets is found interesting (that is, worth retweeting) by a smaller proportion of their followers.

Summary

In this article, we discussed mining data from Twitter by focusing on the analysis of user connections and interactions. In particular, we discussed how to compare influence and engagement between users. For more information on social media mining, refer to the following books by Packt Publishing:

Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/social-media-mining-r
Mastering Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-r

Further resources on this subject:

Probabilistic Graphical Models in R [article]
Machine Learning Tasks [article]
Support Vector Machines as a Classification Engine [article]


Implementing Artificial Neural Networks with TensorFlow

Packt
08 Jul 2016
12 min read
In this article by Giancarlo Zaccone, the author of Getting Started with TensorFlow, we will learn about artificial neural networks (ANNs), information processing systems whose operating mechanism is inspired by biological neural circuits. Thanks to their characteristics, neural networks are the protagonists of a real revolution in machine learning systems and, more generally, in the context of Artificial Intelligence.

An artificial neural network possesses many simple processing units, variously connected to each other according to different architectures. If we look at the schema of an ANN, we can see that the hidden units communicate with the external layers, both input and output, while the input and output units communicate only with the hidden layers of the network.

Each unit or node simulates the role of the neuron in biological neural networks. A node, called an artificial neuron, performs a very simple operation: it becomes active if the total quantity of signal it receives exceeds its activation threshold, defined by the so-called activation function. If a node becomes active, it emits a signal that is transmitted along the transmission channels up to the other units to which it is connected. A connection point acts as a filter that converts the message into an inhibitory or excitatory signal, increasing or decreasing its intensity according to its individual characteristics. The connection points simulate the biological synapses and have the fundamental function of weighing the intensity of the transmitted signals, by multiplying them by weights whose value depends on the connection itself.

ANN schematic diagram

Neural network architectures

The way the nodes are connected and the total number of layers, that is, the levels of nodes between input and output, define the architecture of a neural network. For example, in a multilayer network, one can identify the artificial neurons of the layers such that:

Each neuron is connected with all those of the next layer
There are no connections between neurons belonging to the same layer
The number of layers and of neurons per layer depends on the problem to be solved

Now we start our exploration of neural network models, introducing the simplest one: the Single Layer Perceptron, the so-called Rosenblatt's Perceptron.

Single Layer Perceptron

The Single Layer Perceptron was the first neural network model, proposed in 1958 by Frank Rosenblatt. In this model, the content of the local memory of the neuron consists of a vector of weights, W = (w1, w2, ..., wn). The computation is performed over a sum of the input vector X = (x1, x2, ..., xn), each element of which is multiplied by the corresponding element of the vector of weights; the value provided in output (that is, the weighted sum) is then the input of an activation function. This function returns 1 if the result is greater than a certain threshold, otherwise it returns -1. In the following figure, the activation function is the so-called sign function:

sign(x) = +1 if x > 0, -1 otherwise

It is possible to use other activation functions, preferably non-linear ones (such as the sigmoid function that we will see in the next section). The learning procedure of the net is iterative: at each learning cycle (called an epoch), it slightly modifies the synaptic weights by using a selected set called the training set. At each cycle, the weights must be modified so as to minimize a cost function, which is specific to the problem under consideration.
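To make these mechanics concrete, the following is a minimal NumPy sketch of a single perceptron unit. It is not the book's TensorFlow code; the update rule shown is the classic perceptron learning rule, assumed here only to illustrate the weighted sum, the sign activation, and the per-epoch weight adjustment described above:

import numpy as np

def sign(x):
    # Sign activation: +1 if the weighted sum is greater than 0, -1 otherwise
    return 1 if x > 0 else -1

def train_perceptron(X, y, epochs=20, learning_rate=0.1):
    # X: (n_samples, n_features) inputs, y: target labels in {-1, +1}
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for epoch in range(epochs):
        for xi, target in zip(X, y):
            output = sign(np.dot(xi, weights) + bias)  # weighted sum + activation
            if output != target:                       # misclassified: adjust the weights
                weights += learning_rate * target * xi
                bias += learning_rate * target
    return weights, bias

# Toy example: learning the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print([sign(np.dot(xi, w) + b) for xi in X])  # should print [-1, -1, -1, 1]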
Finally, when the perceptron has been trained on the training set, it can be tested on other inputs (the test set) in order to verify its capacity for generalization.

Schema of a Rosenblatt's Perceptron

Let's now see how to implement a single layer neural network for an image classification problem using TensorFlow.

The logistic regression

This algorithm has nothing to do with canonical linear regression; rather, it is an algorithm that allows us to solve supervised classification problems. To estimate the dependent variable, we now make use of the so-called logistic function, or sigmoid. It is precisely because of this feature that we call this algorithm logistic regression. The sigmoid function has this pattern:

As we can see, the dependent variable takes values strictly between 0 and 1, which is precisely what we need. In the case of logistic regression, we want our function to tell us the probability of an element belonging to a particular class. We recall again that supervised learning by a neural network is configured as an iterative process of optimization of the weights; these are modified on the basis of the network's performance on the training set. Indeed, the aim is to minimize the loss function, which indicates the degree to which the behavior of the network deviates from the desired one. The performance of the network is then verified on a test set, consisting of images other than those it was trained on.

The basic steps of training that we're going to implement are as follows:

The weights are initialized with random values at the beginning of the training.
For each element of the training set, the error is calculated, that is, the difference between the desired output and the actual output. This error is used to adjust the weights.
The process is repeated, resubmitting to the network, in a random order, all the examples of the training set, until the error made on the entire training set is less than a certain threshold, or until the maximum number of iterations is reached.

Let's now see in detail how to implement logistic regression with TensorFlow. The problem we want to solve is again to classify images from the MNIST dataset.

The TensorFlow implementation

First of all, we have to import all the necessary libraries:

import input_data
import tensorflow as tf
import matplotlib.pyplot as plt

We use the input_data.read_data_sets function to upload the images for our problem:

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Then we set the total number of epochs for the training phase:

training_epochs = 25

We must also define the other parameters that are necessary for model building:

learning_rate = 0.01
batch_size = 100
display_step = 1

Now we can move on to the construction of the model.

Building the model

Define x as the input tensor; it represents an MNIST data image of shape 28 x 28 = 784 pixels:

x = tf.placeholder("float", [None, 784])

We recall that our problem consists of assigning a probability value to each of the possible classes of membership (the digits from 0 to 9). At the end of this calculation, we will use a probability distribution, which gives us the value of how confident we are in our prediction. So the output we're going to get will be an output tensor with 10 probabilities, each one corresponding to a digit (and, of course, the sum of the probabilities must be one):

y = tf.placeholder("float", [None, 10])

To assign probabilities to each image, we will use the so-called softmax activation function.
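Before we wire the softmax function into the TensorFlow graph, it may help to see numerically what it does. The following standalone NumPy sketch (not part of the model code; the evidence values are made up) turns a vector of evidence scores for the 10 digit classes into a probability distribution that sums to one:

import numpy as np

# Hypothetical evidence scores for the 10 digit classes of a single image
evidence = np.array([1.2, 0.3, 4.5, 0.1, 0.0, 2.2, 0.4, 0.1, 3.0, 0.6])

# Softmax: exponentiate and normalize (subtracting the max improves numerical stability)
exp_scores = np.exp(evidence - np.max(evidence))
probabilities = exp_scores / exp_scores.sum()

print(probabilities)           # 10 values between 0 and 1
print(probabilities.sum())     # ~1.0
print(probabilities.argmax())  # 2 -> the class with the highest evidence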
The softmax function is specified in two main steps:

Calculate the evidence that a certain image belongs to a particular class.
Convert the evidence into probabilities of belonging to each of the 10 possible classes.

To evaluate the evidence, we first define the weights input tensor as W:

W = tf.Variable(tf.zeros([784, 10]))

For a given image, we could evaluate the evidence for each class i simply by multiplying the tensor W by the input tensor x. Using TensorFlow, we should have something like this:

evidence = tf.matmul(x, W)

In general, models include an extra parameter representing the bias, which indicates a certain degree of uncertainty; in our case, the final formula for the evidence is:

evidence = tf.matmul(x, W) + b

It means that for every i (from 0 to 9) we have a matrix Wi of 784 elements (28 × 28), where each element j of the matrix is multiplied by the corresponding component j of the input image (784 parts), and the corresponding bias element bi is added. So to define the evidence, we must define the following tensor of biases:

b = tf.Variable(tf.zeros([10]))

The second step is finally to use the softmax function to obtain the output vector of probabilities, namely activation:

activation = tf.nn.softmax(tf.matmul(x, W) + b)

TensorFlow's tf.nn.softmax function provides a probability-based output from the input evidence tensor. Once we have implemented the model, we can proceed to specify the necessary code to find the weights W and biases b of the network through the iterative training algorithm. In each iteration, the training algorithm takes the training data, applies the neural network, and compares the result with the expected one.

In order to train our model and to know when we have a good one, we must know how to define the accuracy of our model. Our goal is to try to obtain values of the parameters W and b that minimize the value of the metric that indicates how bad the model is. Different metrics calculate the degree of error between the desired output and the output on the training data. A common measure of error is the mean squared error (or the squared Euclidean distance). However, there are research findings that suggest using other metrics for a neural network like this one. In this example, we use the so-called cross-entropy error function, defined as follows:

cross_entropy = y*tf.log(activation)

In order to minimize cross_entropy, we can use the following combination of tf.reduce_mean and tf.reduce_sum to build the cost function:

cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy, reduction_indices=1))

Then we must minimize it using the gradient descent optimization algorithm:

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Just a few lines of code to build a neural network model!

Launching the session

It's now time to build the session and launch our neural net model. We define the following lists to visualize the training session:

avg_set = []
epoch_set = []

Then we initialize the TensorFlow variables:

init = tf.initialize_all_variables()

Start the session:

with tf.Session() as sess:
    sess.run(init)

As explained, each epoch is a training cycle:

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)

Then we loop over all the batches:

        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

Fit the training using the batch data:

            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})

Compute the average loss by running the cost operation with the given image values (x) and the real outputs (y):

            avg_cost += sess.run(cost, feed_dict={x: batch_xs, y: batch_ys})/total_batch

During the computation, we display a log once per display step (the two append calls collect the values used by the final plot):

        if epoch % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)
            avg_set.append(avg_cost)
            epoch_set.append(epoch+1)
    print "Training phase finished"

Let's get the accuracy of our model. A prediction is correct if the index with the highest value in activation is the same as the index of the 1 in the real digit vector; the mean of correct_prediction gives us the accuracy. We need to run the accuracy function on our test set (mnist.test). We use the keys images and labels for x and y:

    correct_prediction = tf.equal(tf.argmax(activation, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print "MODEL accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})

Test evaluation

We saw the training phase in the preceding sections; for each epoch, we printed the relative cost function:

Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> ======================= RESTART ============================
>>>
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Epoch: 0001 cost= 1.174406662
Epoch: 0002 cost= 0.661956009
Epoch: 0003 cost= 0.550468774
Epoch: 0004 cost= 0.496588717
Epoch: 0005 cost= 0.463674555
Epoch: 0006 cost= 0.440907706
Epoch: 0007 cost= 0.423837747
Epoch: 0008 cost= 0.410590841
Epoch: 0009 cost= 0.399881751
Epoch: 0010 cost= 0.390916621
Epoch: 0011 cost= 0.383320325
Epoch: 0012 cost= 0.376767031
Epoch: 0013 cost= 0.371007620
Epoch: 0014 cost= 0.365922904
Epoch: 0015 cost= 0.361327561
Epoch: 0016 cost= 0.357258660
Epoch: 0017 cost= 0.353508228
Epoch: 0018 cost= 0.350164634
Epoch: 0019 cost= 0.347015593
Epoch: 0020 cost= 0.344140861
Epoch: 0021 cost= 0.341420144
Epoch: 0022 cost= 0.338980592
Epoch: 0023 cost= 0.336655581
Epoch: 0024 cost= 0.334488012
Epoch: 0025 cost= 0.332488823
Training phase finished

As we saw, during the training phase the cost function is minimized. At the end of the test, we show the accuracy of the model:

Model Accuracy: 0.9475
>>>

Finally, using these lines of code, we can visualize the training phase of the net:

plt.plot(epoch_set, avg_set, 'o', label='Logistic Regression Training phase')
plt.ylabel('cost')
plt.xlabel('epoch')
plt.legend()
plt.show()

Training phase in logistic regression

Summary

In this article, we learned about the implementation of artificial neural networks with TensorFlow, starting from the Single Layer Perceptron model. We also learned how to build the model and launch the session.