Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-wrapping-opencv
Packt
11 Nov 2013
2 min read
Save for later

Wrapping OpenCV

Packt
11 Nov 2013
2 min read
(For more resources related to this topic, see here.) Architecture overview In this section we will examine and compare the architectures of OpenCV and Emgu CV. OpenCV In the hello-world project, we already knew our code had something to do with the bin folder in the Emgu library that we installed. Those files are OpenCV DLLs, which have the filename starting with opencv_. So the Emgu CV users need to have some basic knowledge about OpenCV. OpenCV is broadly structured into five main components. Four of them are described in the following section: The first one is the CV component, which includes the algorithms about computer vision and basic image processing. All the methods for basic processes are found here. ML is short for Machine Learning, which contains popular machine learning algorithms with clustering tools and statistical classifiers. HighGUI is designed to construct user-friendly interfaces to load and store media data. CXCore is the most important one. This component provides all the basic data structures and contents. The components can be seen in the following diagram: The preceding structure map does not include CvAux, which contains many areas. It can be divided into two parts: defunct areas and experimental algorithms. CvAux is not particularly well documented in the Wiki, but it covers many features. Some of them may migrate to CV in the future, others probably never will. Emgu CV Emgu CV can be seen as two layers on top of OpenCV, which are explained as follows: Layer 1 is the basic layer. It includes enumeration, structure, and function mappings. The namespaces are direct wrappers from OpenCV components. Layer 2 is an upper layer. It takes good advantage of .NET framework and mixes the classes together. It can be seen in the bridge from OpenCV to .NET. The architecture of Emgu CV can be seen in the following diagram, which includes more details: After we create our new Emgu CV project, the first thing we will do is add references. Now we can see what those DLLs are used for: Emgu.Util.dll: A collection of .NET utilities Emgu.CV.dll: Basic image-processing algorithms from OpenCV Emgu.CV.UI.dll: Useful tools for Emgu controls Emgu.CV.GPU.dll: GPU processing (Nvidia Cuda) Emgu.CV.ML.dll: Machine learning algorithms
Read more
  • 0
  • 1
  • 4241

article-image-installing-and-managing-multi-master-replication-managermmm-mysql-high-availability
Packt
04 May 2010
5 min read
Save for later

Installing and Managing Multi Master Replication Manager(MMM) for MySQL High Availability

Packt
04 May 2010
5 min read
(Read more interesting articles on MySQL High Availability here.) Multi Master Replication Manager (MMM): initial installation This setup is asynchronous, and a small number of transactions can be lost in the event of the failure of the master. If this is not acceptable, any asynchronous replication-based high availability technique is not suitable. Over the next few recipes, we shall configure a two-node cluster with MMM. It is possible to configure additional slaves and more complicated topologies. As the focus of this article is high availability, and in order to keep this recipe concise, we shall not mention these techniques (although, they all are documented in the manual available at http://mysql-mmm.org/). MMM consists of several separate Perl scripts, with two main ones: mmmd_mon: Runs on one node, monitors all nodes, and takes decisions. mmmd_agent: Runs on each node, monitors the node, and receives instructions from mmm_mon. In a group of MMM-managed machines, each node has a node IP, which is the normal server IP address. In addition, each node has a "read" IP and a "write" IP. Read and write IPs are moved around depending on the status of each node as detected and decided by mmmd_mon, which migrates these IP address around to ensure that the write IP address is always on an active and working master, and that all read IPs are connected to another master that is in sync (which does not have out-of-date data). mmmd_mon should not run on the same server as any of the databases to ensure good availability. Thus, the best practice would be to keep a minimum number of three nodes. In the examples of this article, we will configure two MySQL servers, node 5 and node 6 (10.0.0.5 and 6) with a virtual writable IP of 10.0.0.10 and two read-only IPs of 10.0.0.11 and 10.0.0.12, using a monitoring node node 4 (10.0.0.4). We will use RedHat / CentOS provided software where possible. If you are using the same nodes to try out any of the other recipes discussed in this article, be sure to remove MySQL Cluster RPMs and /etc/my.cnf before attempting to follow this recipe There are several phases to set up MMM. Firstly, the MySQL and monitoring nodes must have MMM installed, and each node must be configured to join the cluster. Secondly, the MySQL server nodes must have MySQL installed and must be configured in a master-master replication agreement. Thirdly, a monitoring node (which will monitor the cluster and take actions based on what it sees) must be configured. Finally, the MMM monitoring node must be allowed to take control of the cluster. In this article, each of the previous four steps is a recipe. The first recipe covers the initial installation of MMM on the nodes. How to do it... The MMM documentation provides a list of required Perl modules. With one exception, all Perl modules currently required for both monitoring agents and server nodes can be found in either the base CentOS / RHEL repositories, or the EPEL library (see the Appendices for instructions on configuration of this repository), and will be installed with the following yum command: [root@node6 ~]# yum -y install perl-Algorithm-Diff perl-Class-Singleton perl-DBD-MySQL perl-Log-Log4perl perl-Log-Dispatch perl-Proc-Daemon perl-MailTools Not all of the package names are obvious for each module; fortunately, the actual perl module name is stored in the Other field in the RPM spec file, which can be searched using this syntax: [root@node5 mysql-mmm-2.0.9]# yum whatprovides "*File::stat*"Loaded plugins: fastestmirror...4:perl-5.8.8-18.el5.x86_64 : The Perl programming languageMatched from:Other : perl(File::stat) = 1.00Filename : /usr/share/man/man3/File::stat.3pm.gz... This shows that the Perl File::stat module is included in the base perl package (this command will dump once per relevant file; in this case, the first file that matches is in fact the manual page). The first step is to download the MMM source code onto all nodes: [root@node4 ~]# mkdir mmm[root@node4 ~]# cd mmm[root@node4 mmm]# wget http://mysql-mmm.org/_media/:mmm2:mysql-mmm-2.0.9.tar.gz--13:44:45-- http://mysql-mmm.org/_media/:mmm2:mysql-mmm-2.0.9.tar.gz...13:44:45 (383 KB/s) - `mysql-mmm-2.0.9.tar.gz' saved [50104/50104] Then we extract it using the tar command: [root@node4 mmm]# tar zxvf mysql-mmm-2.0.9.tar.gzmysql-mmm-2.0.9/mysql-mmm-2.0.9/lib/...mysql-mmm-2.0.9/VERSIONmysql-mmm-2.0.9/LICENSE[root@node4 mmm]# cd mysql-mmm-2.0.9 Now, we need to install the software, which is simply done with the make file provided: [root@node4 mysql-mmm-2.0.9]# make installmkdir -p /usr/lib/perl5/vendor_perl/5.8.8/MMM /usr/bin/mysql-mmm /usr/sbin /var/log/mysql-mmm /etc /etc/mysql-mmm /usr/bin/mysql-mmm/agent/ /usr/bin/mysql-mmm/monitor/...[ -f /etc/mysql-mmm/mmm_tools.conf ] || cp etc/mysql-mmm/mmm_tools.conf /etc/mysql-mmm/ Ensure that the exit code is 0 and that there are no errors: [root@node4 mysql-mmm-2.0.9]# echo $?0 Any errors are likely caused as a result of dependencies—ensure that you have a working yum configuration (refer to Appendices) and have run the correct yum install command.
Read more
  • 0
  • 0
  • 4227

article-image-machine-learning-r
Packt
16 Feb 2016
39 min read
Save for later

Machine Learning with R

Packt
16 Feb 2016
39 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted." Thankfully, at the time of this writing, machines still require user input. Though your impressions of machine learning may be colored by these mass media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores. Putting popular misconceptions aside, by the end of this article, you will gain a more nuanced understanding of machine learning. You also will be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. (For more resources related to this topic, see here.) You will learn: The origins and practical applications of machine learning How computers turn data into knowledge and action How to match a machine learning algorithm to your data The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems. The origins of machine learning Since birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others. From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds, early astronomers recorded the alignment of planets and stars, and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in the ever-growing computerized databases. The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and never lets its judgment skew its perception. Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error, due to hardware limitations. Others are limited by their scope. A black and white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope. Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors record temperature and pressure data, surveillance cameras watch sidewalks and subway tunnels, and all manner of electronic behaviors are monitored: transactions, communications, friendships, and many others. This deluge of data has led some to state that we have entered an era of Big Data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting data sets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense from it all. The field of study interested in the development of computer algorithms to transform data into intelligent action is known as machine learning. This field originated in an environment where available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in data necessitated additional computing power, which in turn spurred the development of statistical methods to analyze large datasets. This created a cycle of advancement, allowing even larger and more interesting data to be collected.   A closely related sibling of machine learning, data mining, is concerned with the generation of novel insights from large databases. As the implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem. Virtually all data mining involves the use of machine learning, but not all machine learning involves data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates; on the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining. The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory. Uses and abuses of machine learning Most people have heard of the chess-playing computer Deep Blue—the first to win a game against a world champion—or Watson, the computer that defeated two human opponents on the television trivia game show Jeopardy. Based on these stunning accomplishments, some have speculated that computer intelligence will replace humans in many information technology occupations, just as machines replaced humans in the fields, and robots replaced humans on the assembly line. The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action. Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer; the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.   To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good. Machine learning successes Machine learning is most successful when it augments rather than replaces the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer, assists engineers and programmers with our efforts to create smarter homes and automobiles, and helps social scientists build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any organization that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it. Though it is impossible to list every use case of machine learning, a survey of recent success stories includes several prominent applications: Identification of unwanted spam messages in e-mail Segmentation of customer behavior for targeted advertising Forecasts of weather behavior and long-term climate changes Reduction of fraudulent credit card transactions Actuarial estimates of financial damage of storms and natural disasters Prediction of popular election outcomes Development of algorithms for auto-piloting drones and self-driving cars Optimization of energy use in homes and office buildings Projection of areas where criminal activity is most likely Discovery of genetic sequences linked to diseases The limits of machine learning Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, is not in any way a substitute for a human brain. It has very little flexibility to extrapolate outside of the strict parameters it learned and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what the algorithm has learned before setting it loose in the real-world settings. Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple common sense inferences about logical next steps. Take, for instance, the banner advertisements seen on many web sites. These may be served, based on the patterns learned by data mining the browsing history of millions of users. According to this data, someone who views the websites selling shoes should see advertisements for shoes, and those viewing websites for mattresses should see advertisements for mattresses. The problem is that this becomes a never-ending cycle in which additional shoe or mattress advertisements are served rather than advertisements for shoelaces and shoe polish, or bed sheets and blankets. Many are familiar with the deficiencies of machine learning's ability to understand or translate language or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show, The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to Beat up Martin was misinterpreted by the Newton as Eat up Martha, as depicted in the following screenshots:   Screenshots from "Lisa on Ice" The Simpsons, 20th Century Fox (1994) Machines' ability to understand language has improved enough since 1994, such that Google, Apple, and Microsoft are all confident enough to offer virtual concierge services operated via voice recognition. Still, even these services routinely struggle to answer relatively simple questions. Even more, online translation services sometimes misinterpret sentences that a toddler would readily understand. The predictive text feature on many devices has also led to a number of humorous autocorrect fail sites that illustrate the computer's ability to understand basic language but completely misunderstand context. Some of these mistakes are to be expected, for sure. Language is complicated with multiple layers of text and subtext and even human beings, sometimes, understand the context incorrectly. This said, these types of failures in machines illustrate the important fact that machine learning is only as good as the data it learns from. If the context is not directly implicit in the input data, then just like a human, the computer will have to make its best guess. Machine learning ethics At its core, machine learning is simply a tool that assists us in making sense of the world's complex data. Like any tool, it can be used for good or evil. Machine learning may lead to problems when it is applied so broadly or callously that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless may lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to consider the ethical implications of the art. Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain and constantly in flux. Caution should be exercised while obtaining or analyzing data in order to avoid breaking laws, violating terms of service or data use agreements, and abusing the trust or violating the privacy of customers or the public. The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, is "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states "above all, do no harm." Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in the store. Many have even equipped checkout lanes with devices that print coupons for promotions based on the customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless. But consider what happens when this practice is taken a little bit further. One possibly apocryphal tale concerns a large retailer in the U.S. that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers, who would later purchase profitable items like diapers, baby formula, and toys. Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty, not only whether a woman was pregnant, but also the approximate timing for when the baby was due. After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his teenage daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain's manager called to offer an apology, it was the father that ultimately apologized because, after confronting his daughter, he discovered that she was indeed pregnant! Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information such as health data is concerned. With a bit more care, the retailer could have foreseen this scenario, and used greater discretion while choosing how to reveal the pattern its machine learning analysis had discovered. Certain jurisdictions may prevent you from using racial, ethnic, religious, or other protected class data for business reasons.  Keep in mind that excluding this data from your analysis may not be enough, because machine learning algorithms might inadvertently learn this information independently. For instance, if a certain segment of people generally live in a certain region, buy a certain product, or otherwise behave in a way that uniquely identifies them as a group, some machine learning algorithms can infer the protected information from these other factors. In such cases, you may need to fully "de-identify" these people by excluding any potentially identifying data in addition to the protected information. Apart from the legal consequences, using data inappropriately may hurt the bottom line. Customers may feel uncomfortable or become spooked if the aspects of their lives they consider private are made public. In recent years, several high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed, and their data was used for purposes beyond what the users had originally agreed upon. The fact that privacy expectations differ by context, age cohort, and locale adds complexity in deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin your project. The fact that you can use data for a particular end does not always mean that you should. How machines learn A formal definition of machine learning proposed by computer scientist Tom M. Mitchellstates that a machine learns whenever it is able to utilize its an experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action—and of course learning is always easier said than done! While human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit. For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps understand, distinguish, and implement machine learning algorithms. As you compare machine learning to human learning, you may discover yourself examining your own mind in a different light. Regardless of whether the learner is a human or machine, the basic learning process is similar. It can be divided into four interrelated components: Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning. Abstraction involves the translation of stored data into broader representations and concepts. Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts. Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.  Keep in mind that although the learning process has been conceptualized as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit with the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, with computers these processes are explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, and utilized for future action. Data storage All learning must begin with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random access memory (RAM) in combination with a central processing unit (CPU). It may seem obvious to say so, but the ability to store and retrieve data alone is not sufficient for learning. Without a higher level of understanding, knowledge is limited exclusively to recall, meaning exclusively what is seen before and nothing else. The data is merely ones and zeros on a disk. They are stored memories with no broader meaning. To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to learn that perfect recall is unlikely to be of much assistance. Even if you could memorize material perfectly, your rote learning is of no use, unless you know in advance the exact questions and answers that will appear in the exam. Otherwise, you would be stuck in an attempt to memorize answers to every question that could conceivably be asked. Obviously, this is an unsustainable strategy. Instead, a better approach is to spend time selectively, memorizing a small set of representative ideas while developing strategies on how the ideas relate and how to use the stored information. In this way, large ideas can be understood without needing to memorize them by rote. Abstraction This work of assigning meaning to stored data occurs during the abstraction process, in which raw data comes to have a more abstract meaning. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images:   Source: http://collections.lacma.org/node/239578 The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("this is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that the observer's mind is able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that could be held in the hand. Abstracted connections like these are the basis of knowledge representation, the formation of logical structures that assist in turning raw sensory information into a meaningful insight. During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts. There are many different types of models. You may be already familiar with some. Examples include: Mathematical equations Relational diagrams such as trees and graphs Logical if/else rules Groupings of data known as clusters The choice of model is typically not left up to the machine. Instead, the learning task and data on hand inform model selection. Later in this article, we will discuss methods to choose the type of model in more detail. The process of fitting a model to a dataset is known as training. When the model has been trained, the data is transformed into an abstract form that summarizes the original information. You might wonder why this step is called training rather than learning. First, note that the process of learning does not end with data abstraction; the learner must still generalize and evaluate its training. Second, the word training better connotes the fact that the human teacher trains the machine student to understand the data in a specific way. It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen by supposing a concept about how data elements are related. Take for instance the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity. But the force we now know as gravity was always present. It simply wasn't recognized until Newton recognized it as an abstract concept that relates some data to others—specifically, by becoming the g term in a model that explains observations of falling objects.   Most models may not result in the development of theories that shake up scientific thought for centuries. Still, your model might result in the discovery of previously unseen relationships among data. A model trained on genomic data might find several genes that, when combined, are responsible for the onset of diabetes; banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity; and psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by simply presenting information in a different format, a new idea is conceptualized. Generalization The learning process is not complete until the learner is able to use its abstracted knowledge for future action. However, among the countless underlying patterns that might be identified during the abstraction process and the myriad ways to model these patterns, some will be more useful than others. Unless the production of abstractions is limited, the learner will be unable to proceed. It would be stuck where it started—with a large pool of information, but no actionable insight. The term generalization describes the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those it has seen before. Generalization is a somewhat vague process that is a bit difficult to describe. Traditionally, it has been imagined as a search through the entire set of models (that is, theories or inferences) that could be abstracted during training. In other words, if you can imagine a hypothetical set containing every possible theory that could be established from the data, generalization involves the reduction of this set into a manageable number of important findings. In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Generally, it is not feasible to reduce the number of patterns by examining them one-by-one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. Toward this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences. Because heuristics utilize approximations and other rules of thumb, they do not guarantee to find the single best model. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible. Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics. The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on heuristics guided by emotions. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel versus automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency of people to estimate the likelihood of an event by how easily its examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper. The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, or wrong in a predictable manner. For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with various skin tones might not be detected by the algorithm. Similarly, it could be biased toward faces with certain skin tones, face shapes, or other characteristics that do not conform to its understanding of the world.   In modern usage, the word bias has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a bit difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, some recent studies in the field of psychology have suggested that individuals born with damage to portions of the brain responsible for emotion are ineffectual in decision making, and might spend hours debating simple decisions such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data. Evaluation Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, each learner must be biased in a particular way. Consequently, each learner has its weaknesses and there is no single learning algorithm to rule them all. Therefore, the final step in the generalization process is to evaluate or measure the learner's success in spite of its biases and use this information to inform additional training if needed. Once you've had success with one machine learning technique, you might be tempted to apply it to everything. It is important to resist this temptation because no machine learning approach is the best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org. Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a new test dataset in order to judge how well its characterization of the training data generalizes to new, unseen data. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case. In parts, models fail to perfectly generalize due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as: Measurement error due to imprecise sensors that sometimes add or subtract a bit from the readings Issues with human subjects, such as survey respondents reporting random answers to survey questions, in order to finish more quickly Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values Phenomena that are so complex or so little understood that they impact the data in ways that appear to be unsystematic Trying to model noise is the basis of a problem called overfitting. Because most noisy data is unexplainable by definition, attempting to explain the noise will result in erroneous conclusions that do not generalize well to new cases. Efforts to explain the noise will also typically result in more complex models that will miss the true pattern that the learner tries to identify. A model that seems to perform well during training, but does poorly during evaluation, is said to be overfitted to the training dataset, as it does not generalize well to the test dataset.   Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well the models are able to handle noisy data is an important source of distinction among them. Machine learning in practice So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following these steps: Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database. Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learns from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset or you may need to develop measures of performance specific to the intended application. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work as in step two of this process. After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time), for projections of financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next generation learner. Types of input data The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises of a person's data for one year. The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which the inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries. Datasets that store the units of observation and their properties can be imagined as collections of data consisting of: Examples: Instances of the unit of observation for which properties have been recorded Features: Recorded properties or attributes of examples that may be useful for learning It is easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, the examples would be specific messages, and the features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be the genomic markers from biopsied cells as well as the characteristics of patient such as weight, height, or blood pressure. While examples and features do not have to be collected in any specific form, they are commonly gathered in matrix format, which means that each example has exactly the same features. The following spreadsheet shows a dataset in matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles, while the columns record various each automobile's features, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning: Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large; or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task. Types of machine learning algorithms Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action. A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features. Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours. Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output. The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether: An e-mail message is spam A person has cancer A football team will win or lose An applicant will default on a loan In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between inputs and the target, including both, the magnitude and uncertainty of the relationship. Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm. A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining. For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunglasses, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity. The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar behavior or demographic information, so that advertising campaigns could be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the result of some learnings to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible. Machine learning with R Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality. A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages. If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble. The Packages link on the left side of the page will take you to a page where you can browse packages in an alphabetical order or sorted by the publication date. At the time of writing this, a total 6,779 packages were available—a jump of over 60 percent in the time since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages as per the subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html. Installing R packages Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/ To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools available for free, which allow for the use of cross-platform applications such as Weka. For more information, and to download Java on your system, you can visit http://java.com. The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type: > install.packages("RWeka") R will then connect to CRAN and download the package in the correct format for your OS. Some packages such as RWeka require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies. The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed. The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows: > install.packages("RWeka", lib="/path/to/library") The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file, by using the following command: > ?install.packages More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function. Loading and unloading R packages In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following: > library(RWeka) To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously use the following command: > detach("package:RWeka", unload = TRUE) This will free up any resources used by the package. Summary To learn more about Machine Learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Machine Learning with R - Second Edition Machine Learning with R Cookbook  Mastering Machine Learning with R  Learning Data Mining with R Resources for Article: Further resources on this subject: Getting Started with RStudio [article] Machine learning and Python – the Dream Team [article] Deep learning in R [article]
Read more
  • 0
  • 0
  • 4225

article-image-oracle-apex-42-reporting
Packt
03 Sep 2013
20 min read
Save for later

Oracle APEX 4.2 reporting

Packt
03 Sep 2013
20 min read
(For more resources related to this topic, see here.) The objective of the first chapter is to quickly introduce you to the technology and then dive deep into the understanding the fabric of the tool. The chapter also helps you set the environment, which will be used throughout the book. Chapter 1, Know Your Horse Before You Ride It, starts with discussing the various features of APEX. This is to give heads up to the readers about the features offered by the tool and to inform them about some of the strengths of the tool. In order to understand the technology better, we discuss the various web server combinations possible with APEX, namely the Internal mod_plsql, External mod_plsql, and Listener configuration. While talking about the Internal mod_plsql configuration, we see the steps to enable the XMLDB HTTP server. In the Internal mod_plsql configuration, Oracle uses a DAD defined in EPG to talk to the database and the web server. So, we try to create a miniature APEX of our own, by creating our DAD using it to talk to the database and the web server. We then move on to learn about the External mod_plsql configuration. We discuss the architecture and the roles of the configuration files, such as dads.conf and httpd.conf. We also have a look at a typical dads.conf file and draw correlations between the configurations in the Internal and External mod_plsql configuration. We then move on to talk about the wwv_flow_epg_include_mod_local procedure that can help us use the DAD of APEX to call our own stored PL/SQL procedures. We then move on to talk about APEX Listener which is a JEE alternative to mod_plsql and is Oracle’s direction for the future. Once we are through with understanding the possible configurations, we see the steps to set up our environment. We use the APEX Listener configuration and see the steps to install the APEX engine, create a Weblogic domain, set the listener in the domain, and create an APEX workspace. With our environment in place, we straight away get into understanding the anatomy of APEX by analyzing various parts of its URL. This discussion includes a natter on the session management, request handling, debugging, error handling, use of TKPROF for tracing an APEX page execution, cache management and navigation, and value passing in APEX. We also try to understand the design behind the zero session ID in this section. Our discussions till now would have given you a brief idea about the technology, so we try to dig in a little deeper and understand the mechanism used by APEX to send web requests to its PL/SQL engine in the database by decoding the APEX page submission. We see the use of the wwv_flow.accept procedure and understand the role of page submission. We try to draw an analogy of an APEX form with a simple HTML to get a thorough understanding about the concept. The next logical thing after page submission is to see the SQL and PL/SQL queries and blocks reaching the database. We turn off the database auditing and see the OWA web toolkit requests flowing to the database as soon as we open an APEX page. We then broaden our vision by quickly knowing about some of the lesser known alternatives of mod_plsql. We end the chapter with a note of caution and try to understand the most valid criticisms of the technology, by understanding the SQL injection and Cross-site Scripting (XSS). After going through the architecture we straight away spring into action and begin the process of learning how to build the reports in APEX. The objective of this chapter is to help you understand and implement the most common reporting requirements along with introducing some interesting ways to frame the analytical queries in Oracle. The chapter also hugely focuses on the methods to implement different kinds of formatting in the APEX classic reports. We start Chapter 2, Reports, by creating the objects that will be used throughout the book and installing the reference application that contains the supporting code for the topics discussed in the second chapter. We start the chapter by setting up an authentication mechanism. We discuss external table authentication in this chapter. We then move on to see the mechanism of capturing the environment variables in APEX. These variables can help us set some logic related to a user’s session and environment. The variables also help us capture some properties of the underlying database session of an APEX session. We capture the variables using the USERENV namespace, DBMS_SESSION package, and owa_util package. After having a good idea of the ways and means to capture the environment variables, we build our understanding of developing the search functionality in a classic APEX report. This is mostly a talk about the APEX classic report features, we also use this opportunity to see the process to enable sorting in the report columns, and to create a link that helps us download the report in the CSV format. We then discuss various ways to implement the group reports in APEX. The discussion shows a way to implement this purely, by using the APEX’s feature, and then talks about getting the similar results by using the Oracle database feature. We talk about the APEX’s internal grouping feature and the Oracle grouping sets. The section also shows the first use of JavaScript to manipulate the report output. It also talks about a method to use the SQL query to create the necessary HTML, which can be used to display a data column in a classic report. We take the formatting discussion further, by talking about a few advanced ways of highlighting the report data using the APEX’s classic report features, and by editing the APEX templates. After trying our hand at formatting, we try to understand the mechanism to implement matrix reports in APEX. We use matrix reports to understand the use of the database features, such as the with clause, the pivot operator, and a number of string aggregation techniques in Oracle. The discussion on the string aggregation techniques includes the talk on the LISTAGG function, the wm_concat function, and the use of the hierarchical queries for this purpose. We also see the first use of the APEX items as the substitution variables in this book. We will see more use of the APEX items as the substitution variables to solve the vexed problems in other parts of the book as well. We do justice with the frontend as well, by talking about the use of jQuery, CSS, and APEX Dynamic Actions for making the important parts of data stand out. Then, we see the implementation of the handlers, such as this.affectedElements and the use of the jQuery functions, such as this.css() in this section. We also see the advanced formatting methods using the APEX templates. We then end this part of the discussion, by creating a matrix report using the APEX Dynamic Query report region. The hierarchical reports are always intriguing, because of the enormous ability to present the relations among different rows of data, and because of the use of the hierarchical queries in a number of unrelated places to answer the business queries. We reserve a discussion on the Oracle’s hierarchical queries for a later part of the section and start it, by understanding the implementation of the hierarchical reports by linking the values in APEX using drilldowns. We see the use of the APEX items as the substitution variable to create the dynamic messages. Since we have devised a mechanism to drill down, we should also build a ladder for the user to climb up the hierarchical chain. Now, the hierarchical chain for every person will be different depending on his position in the organization, so we build a mechanism to build the dynamic bread crumbs using the PL/SQL region in APEX. We then talk about two different methods of implementing the hierarchical queries in APEX. We talk about the connect, by clause first, and then continue our discussion by learning the use of the recursive with clause for the hierarchical reporting. We end our discussion on the hierarchical reporting, by talking about creating the APEX’s Tree region that displays the hierarchical data in the form of a tree. The reports are often associated with the supporting files that give more information about the business query. A typical user might want to upload a bunch of files while executing his piece of the task and the end user of his action might want to check out the files uploaded by him/her. To understand the implementation of this requirement, we check out the various ways to implement uploading and downloading the files in APEX. We start our discussion, by devising the process of uploading the files for the employees listed in the oehr_employees table. The solution of uploading the files involves the implementation of the dynamic action to capture the ID of the employee on whose row the user has clicked. It shows the use of the APEX items as the substitution variables for creating the dynamic labels and the use of JavaScript to feed one APEX item based on the another. We extensively talk about the implementation of jQuery in Dynamic Actions in this section as well. Finally, we check out the use of the APEX’s file browse item along with the WWV_FLOW_FILES table to capture the file uploaded by the user. Discussion on the methods to upload the files is immediately followed by talking about the ways to download these files in APEX. We nest the use of the functions, such as HTF.ANCHOR and APEX_UTIL.GET_BLOB_FILE_SRC for one of the ways to download a file, and also talk about the use of dbms_lob.getlength along with the APEX format mask for downloading the files. We then engineer our own stored procedure that can download a blob stored in the database as a file. We end this discussion, by having a look at the APEX’s p process, which can also be used for downloading. AJAX is the mantra of the new age and we try our hand at it, by implementing the soft deletion in APEX. We see the mechanism of refreshing just the report and not reloading the page as soon as the user clicks to delete a row. We use JavaScript along with the APEX page process and the APEX templates to achieve this objective. Slicing and dicing along with auditing are two of the most common requirements in the reporting world. We see the implementation of both of these using both the traditional method JavaScript with the page processes and the new method of using Dynamic Actions. We extend our use case a little further and learn about a two way interaction between the JavaScript function and the page process. We learn to pass the values back and forth between the two. While most business intelligence and reporting solutions are a one way road and are focused on the data presentation, Oracle APEX can go a step further, and can give an interface to the user for the data manipulations as well. To understand this strength of APEX, we have a look at the process of creating the tabular forms in APEX. We extend our understanding of the tabular forms to see a magical use of jQuery to convert a certain sections of a report from display-only to editable textboxes. We move our focus from implementing the interesting and tricky frontend requirements to framing the queries to display the complex data types. We use the string aggregation methods to display data in a column containing a varray. Time dimension is one of the most widely used dimension in reporting circles and comparing current performance with the past records is a favorite requirement of most businesses. With this in mind, we shift our focus to understand and implement the time series reports in APEX. We start our discussion by understanding the method to implement a report that shows the contribution of each business line in every quarter. We implement this by using partitioning dimensions on the fly. We also use the analytical functions, such as ratio_to_report, lead, and lag in the process of creating the time series reports. We use our understanding of time dimension to build a report that helps a user compare one time period to the other. The report gives the user the freedom to select the time segments, which he wishes to compare. We then outwit a limitation of this report, by using the query partition clause for the data densification. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. Chapter 3, In the APEX Mansion – Interactive Reports is all about Interactive Reports and the dynamic reporting. While the second chapter was more about the data presentation using the complex queries and the presentation methods, this chapter is about taking the presentation part a step ahead, by creating more visually appealing Interactive Reports. We start the discussion of this chapter, by talking about the ins and outs of Interactive Reports. We postmortem this feature of APEX to learn about every possible way of using Interactive Reports for making more sense of data. The chapter has a reference application of its own, which shows the code in action. We start our discussion, by exploring at various features of the Actions menu in Interactive Reports. The discussion is on search functionality, Select Columns feature, filtering, linking and filtering Interactive Reports using URLs, customizing Rows per page feature of an IR, using Control Break, creating computations in IR, creating charts in IR, using the Flashback feature and a method to see the back end flashback query, configuring the e-mail functionality for downloading a report, and for subscription of reports and methods to understand the download of reports in HTML, CSV, and PDF formats. Once we are through with understanding the Actions menu, we move on to understand the various configuration options in an IR. We talk about the Link section, the Icon View section, the Detail section, the Column Group section, and the Advanced section of the Report Attributes page of an IR. While discussing these sections, we understand the process of setting different default views of an IR for different users. Once we are through with our dissection of an IR, we put our knowledge to action, by inserting our own item in the Actions menu of an IR using Dynamic Actions and jQuery. We continue our quest for finding the newer formatting methods, by using different combinations of SQL, CSS, APEX templates, and jQuery to achieve unfathomable results. The objectives attained in this section include formatting a column of an IR based on another column, using CSS in the page header to format the APEX data, changing the font color of the alternate rows in APEX, using a user-defined CSS class in APEX, conditionally highlighting a column in IR using CSS and jQuery, and formatting an IR using the region query. After going through a number of examples on the use of CSS and jQuery in APEX, we lay down a process to use any kind of changes in an IR. We present this process by an example that changes one of the icons used in an IR to a different icon. APEX also has a number of views for IR, which can be used for intelligent programming. We talk about an example that uses the apex_application_page_ir_rpt view that show different IR reports on the user request. After a series of discussion on Interactive Reports, we move on to find a solution to an incurable problem of IR. We see a method to put multiple IR on the same page and link them as the master-child reports. We had seen an authentication mechanism (external table authentication) and data level authorization in Chapter 2, Reports. We use this chapter to see the object level authorization in APEX. We see the method to give different kinds of rights on the data of a column in a report to different user groups. After solving a number of problems and learning a number of things, we create a visual treat for ourselves. We create an Interactive report dashboard using Dynamic Actions. This dashboard presents different views of an IR as gadgets on a separate APEX page. We conclude this chapter, by looking at advanced ways of creating the dynamic reports in APEX. We look at the use of the table function in both native and interface approach, and we also look at the method to use the APEX collections for creating the dynamic IRs in APEX. Chapter 4, The Fairy Tale Begins – Advanced Reporting, is all about advanced reporting and pretty graphs. Since we are talking about advanced reporting, we see the process of setting the LDAP authentication in APEX. We also see the use of JXplorer to help us get the necessary DN for setting up the LDAP authentication. We also see the means to authenticate an LDAP user using PL/SQL in this section. This chapter also has a reference application of its own that shows the code in action. We start the reporting building in this chapter, by creating the Sparkline reports. This report uses jQuery for producing the charts. We then move on to use another jQuery library to create a report with slider. This report lets the user set the value of the salary of any employee using a slider. We then get into the world of the HTML charts. We start our talk, by looking at the various features of creating the HTML chart regions in APEX. We understand a method to implement the Top N and Bottom N chart reports in the HTML charts. We understand the APEX’s method to implement the HTML charts and use it to create an HTML chart on our own. We extend this technique of generating HTML from region source a little further, by using XMLTYPE to create the necessary HTML for displaying a report. We take our spaceship into a different galaxy of the charts world and see the use of Google Visualizations for creating the charts in APEX. We then switch back to AnyChart, which is a flash charting solution, and has been tightly integrated with APEX. It works on an XML and we talk about customizing this XML to produce different results. We put our knowledge to action, by creating a logarithmic chart and changing the style of a few display labels in APEX. We continue our discussion on AnyChart, and use the example of Doughnut Chart to understand advanced ways of using AnyChart in APEX. We use our knowledge to create Scatter chart, 3D stacked chart, Gauge chart, Gantt chart, Candlestick chart, Flash image maps, and SQL calendars. We move out of the pretty heaven of AnyChart only to get into another beautiful space of understanding the methods of displaying the reports with the images in APEX. We implement the reports with the images using the APEX’s format masks and also using HTF.IMG with the APEX_UTIL.GET_BLOB_FILE_SRC function. We then divert our attention to advanced jQuery uses such as, the creation of the Dialog box and the Context menu in APEX. We create a master-detail report using dialog boxes, where the child report is shown in the Dialog box. We close our discussion in this chapter, by talking about creating wizards in APEX and a method to show different kinds of customized error messages for problems appearing at different points of a page process. Chapter 5, Flight to Space Station: Advanced APEX, is all about advanced APEX. The topics discussed in this chapter fall into a niche category of reporting the implementation. This chapter has two reference applications of its own that shows the code in action. We start this chapter, by creating both the client-side and server-side image maps APEX. These maps are often used where regular shapes are involved. We then see the process of creating the PL/SQL Server Pages (PSPs). PSPs are similar to JSP and are used for putting the PL/SQL and HTML code in a single file. Loadpsp then converts this file into a stored procedure with the necessary calls to owa web toolkit functions. The next utility we check out is loadjava. This utility helps us load a Java class as a database object. This utility will be helpful, if some of the Java classes are required for code processing. We had seen the use of AnyChart in the previous chapter, and this chapter introduces you to FusionChart. We create a Funnel Chart using FusionChart in this chapter. We also use our knowledge of the PL/SQL region in APEX to create a Tag Cloud. We then stroll into the world of the APEX plugins. We understand the interface functions, which have to be created for plugins. We discuss the concepts and then use them to develop an Item type plugin and a Dynamic Action plugin. We understand the process of defining a Custom Attribute in a plugin, and then see the process of using it in APEX. We move on to learn about Websheets and their various features. We acquaint ourselves with the interface of the Websheet applications and understand the concept of sharing Websheets. We also create a few reports in a Websheet application and see the process of importing and using an image in Websheets. We then spend some time to learn about data grids in Websheets and the method to create them. We have a look at the Administration and View dropdowns in a Websheet application. We get into the administration mode and understand the process of configuring the sending of mails to Gmail server from APEX. We extend our administration skills further, by understanding the ways to download an APEX application using utilities, such as oracle.apex.APEXExport. Reporting and OLAP go hand in hand, so we see the method of using the OLAP cubes in APEX. We see the process of modeling a cube and understand the mechanism to use its powerful features. We then have a talk about Oracle Advanced Queues that can enable us to do reliable communication among different systems with different workloads in the enterprise, and gives improved performance. We spend a brief time to understand some of the other features of APEX, which might not be directly related to reporting, but are good to know. Some of these features include locking and unlocking of pages in APEX, the Database Object Dependencies report, Shortcuts, the Dataloading wizard, and the APEX views. We bring this exclusive chapter to an end, by discussing about the various packages that enable us to schedule the background jobs in APEX, and by discussing various other APEX and Database API, which can help us in the development process.
Read more
  • 0
  • 0
  • 4216

article-image-stata-data-analytics-software
Packt
22 Sep 2015
11 min read
Save for later

Stata as Data Analytics Software

Packt
22 Sep 2015
11 min read
In this article by Prasad Kothari, the author of the book Data Analysis with STATA, the overall goal is to cover the STATA related topics such as data management, graphs and visualization and programming in STATA. The article will give a detailed description of STATA starting with an introduction to STATA and Data analytics and then talks about STATA programming and data management. After which it takes you through Data visualization and all the important statistical tests in STATA. Then the article will cover the Linear and the Logistics regression in STATA and in the end it will take you through few analyses like Survey analysis, Time Series analysis and Survival analysis in STATA. It also teaches different types of statistical modelling techniques and how to implement these techniques in STATA. (For more resources related to this topic, see here.) These days, many people use Stata for econometric and medical research purposes, among other things. There are many people who use different packages, such as Statistical Package for the Social Sciences (SPSS) and EViews, Micro, RATS/CATS (used by time series experts), and R for Matlab/Guass/Fortan (used for hardcore analysis). One should know the usage of Stata and then apply it in their relative fields. Stata is a command-driven language; there are over 500 different commands and menu options, and each has a particular syntax required to invoke any of the various options. Learning these commands is a time-consuming process, but it is not hard. At the end of each class, your do-file will contain all the commands that we have covered, but there is no way we will cover all of these commands in this short introductory course. Stata is a combined statistical analytical tool that is intended for use by research scholars and analytics practitioners. Stata has many strengths, but we are going to talk about the most important one: managing, adjusting, and arranging large sets of data. Stata has many versions, and with every version, it keeps on improving; for example, in Stata versions 11 to 14, there are changes and progress in the computing speed, capabilities and functionalities, as well as flexible graphic capabilities. Over a period of time, Stata keeps on changing and updating the model as per users' suggestions. In short, the regression method is based on a nonstandard feature, which means that you can easily get help from the Web if another person has written a program that can be integrated with their software for the purpose of analysis. The following topics will be covered in this articler: Introducing Data analytics Introducing the Stata interface and basic techniques Introducing data analytics We analyze data everyday for various reasons. To predict an event or forecast the key indicators, such as the revenue for given organization, is fast becoming a major requirement in the industry. There are various types of techniques and tools that can be leveraged to analyze the data. Here are the techniques that will be covered in this article using Stata as a tool: Stata Programming and Data management: Before predicting anything, we need to manage and massage the data in order to make it good enough to be something through which insights can be derived. The programming aspect helps in creating new variables to treat data in such a way that finding patterns in historical data or predicting the outcome of given event becomes much easier. Data visualization: After the data preparation, we need to visualize the data for the the following: To view what patterns in the data look like To check whether there are any outliers in the data To understand the data better To draw preliminary insights from the data Important statistical tests in Stata: After data visualization, based on observations, you can try to come up with various hypotheses about the data. We need to test these hypotheses on the datasets to check whether they are statistically significant and whether we can depend on and apply these hypotheses in future situations as well. Linear regression in Stata: Once done with the hypothesis testing, there is always a business need to predict one of the variables, such as what the revenue of the financial organization will be given the specific conditions, and so on. These predictions about continuous variables, such as the revenue, the default amount on the credit card, and the number of items sold in a given store, come through linear regression. Linear regression is the most basic and widely used prediction methodology. We will go into details of linear regression in a later chapter. Logistic regression in Stata: When you need to predict the outcome of a particular event along with the probability, logistic regression is the best and most acknowledged method by far. Predicting which team will win the match in football or cricket or predicting whether a customer will default on a loan payment can be decided through the probabilities given by logistic regression. Survey analysis in Stata: Understanding the customer sentiment and consumer experience is one of the biggest requirements of the retail industry. The research industry also needs data about people's opinion in order to derive the effect of a certain event or the sentiments of the affected people. All of these can be achieved by conducting and analyzing survey datasets. Survey analysis can have various subtechniques, such as factor analysis, principle component analysis, panel data analysis, and so on. Time series analysis in Stata: When you try to forecast a time-dependent variable with reasonable cyclic behavior of seasonality, time series analysis comes handy. There are many techniques of time series analysis, but we will talk about a couple of them: Autoregressive Integrated Moving Average (ARIMA) and Box Jenkins. Forecasting the amount of rainfall depending on the amount of rainfall in the past 5 years is a classic time series analysis problem. Survival analysis in Stata: These days, lots of customers attrite from telecom plans, healthcare plans, and so on and join the competitors. When you need to develop a churn model or attrition model to check who will attrite, survival analysis is the best model. The Stata interface Let's discuss the location and layout of Stata. It is very easy to locate Stata on a computer or laptop; after installing the software, go to the start menu, go to the search menu, and type Stata. You can find out the path where the file is saved. This depends on which version has been installed. Another way to find Stata on computer is through the quick launch button as well as through start programs. The preceding diagram represents the Stata layout. The four types of processors in Stata are multiprocessor (two or four), special edition processor (flavors), intercooled, and small processor. The multiprocessor is one of the most efficient processors. Though all processor versions function in a similar fashion, only variables' repressors frequency increases with each new version. At present, Stata version 11 is in demand and is being used on various computers. It is a type of software that runs on commands. In the new versions of Stata, new ways, such as menus that can search Stata, have come in the market; however, typing a command is the most simple and quick way to learn Stata. The more you leverage the functionality of typing the command, the better your learning is. Through the typing technique method, programming becomes easy and simple for analytics. Sometimes, it is difficult to find the exact syntax in commands; therefore, it is advisable that the menu command be used. Later on, you just copy the same command for further use. There are three ways to enter the commands, as follows: Use the do-file program. This is a type of program in which one has to inform the computer (through a command) that it needs to use the do-file type. Type the command manually through typing. Enter the command interactively; just click on the menu screen. Though all the three types discussed in the preceding bullets are used, the do-file type is the most frequently used one. The reason is that for a bigger file, it is faster as compared to manual typing. Secondly, it can store the data and keep it in the same format in which it was stored. Suppose you make a mistake and want to rectify it; what would you do? In this case, do-file is useful; one can correct it and run the program once again. Generally, an interactive command is used to find out the problem and later on, do-file is used to solve it. The following is an example of an interactive command: Data-storing techniques in Stata Stata is a multipurpose program, which can serve not only its own data, but also other data in a simple format, for example, ASCII. Regardless of the data type format (Excel/statistical package), it gets automatically exported to the ASCII file. This means that all the data can now easily be imported to Stata. The data entered in Stata is in different types of variables, such as vectors with individual observations in every row; it also holds strings and numeric strings. Every row has a detailed observation of the individual, country, firm, or whatever information is entered in Stata. As the data is stored in variables, it makes Stata the most efficient way to store information. Sometimes, it is better to save the data in a different storage form, such as the following: Matrices Macros Matrices should be used carefully as they consume more memory as compared to variables, so there might be a possibility of low space memory before work is started. Another form is macros; these are similar to variables in other programming languages and are named containers, which means they contain information of any type. There are two flavors of macros: local/temporary and global. Global macros are flexible and easy to manage; once they are defined in a computer or laptop, they can be easily opened through all commands. On the other hand, local macros are temporary objects that are formed for a particular environment and cannot be use in another area. For example, if you use a local macro for do-file, that code will only exist in that particular environment. Directories and folders in Stata Stata has a tree-style structure to organize directories as well as folders similar to other operating systems, such as Windows, Linux, Unix, and Mac OS. This makes things easy and can be retrieved later on dates that are convenient. For example, the data folder is used to save entire datasets, subfolders for every single dataset, and so on. In Stata, the following commands can be leveraged: Dos Linux Unix For example, if you need to change the directory, you can use the CD command for example: CD C:Stataforlder You can also generate a new directory along with the current directory you have been using. For example: mkdir "newstata". You can leverage the dir command to get the details of the directory. If you need the current directory name along with the directory, you can utilize the pwd or cd command. The use of paths in Stata depends on the type of data; usually, there are two paths: absolute and relative. The absolute path contains the full address, denoting the folder. In the command you have seen in the earlier example, we leveraged the CD command using the path that is absolute. On the contrary, the relative path provides us with the location of the file. The following example of mkdir has used the relative path: mkdir "EStata|Stata1" The use of the relative path will be beneficial, especially when working on different devices, such as a PC at home or a library or server. To separate folders, Windows and Dos use a backslash (), whereas Linux and Unix use a slash (/). Sometimes, these connotations might be troublesome when working on the server where Stata is installed. As a general rule, it is advisable that you use slashes in the relative path as Stata can easily understand slash as a separator. The following is an example of this: mkdir "/Stata1/Data" – this is how you create the new folder for your STATA work. Summary In this Article we discussed lots of basic commands, which can be leveraged while performing Stata programming. Read Data Analysis with Stata to gain detailed knowledge of the different data management techniques and programming in detail. As you learn more about Stata, you will understand the various commands and functions and their business applications. Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] Financial Management with Microsoft Dynamics AX 2012 R3 [article] Taming Big Data using HDInsight [article]
Read more
  • 0
  • 0
  • 4208

article-image-transactions-redis
Packt
07 Jul 2015
9 min read
Save for later

Transactions in Redis

Packt
07 Jul 2015
9 min read
In this article by Vinoo Das author of the book Learning Redis, we will see how Redis as a NOSQL data store, provides a loose sense of transaction. As in a traditional RDBMS, the transaction starts with a BEGIN and ends with either COMMIT or ROLLBACK. All these RDBMS servers are multithreaded, so when a thread locks a resource, it cannot be manipulated by another thread unless and until the lock is released. Redis by default has MULTI to start and EXEC to execute the commands. In case of a transaction, the first command is always MULTI, and after that all the commands are stored, and when EXEC command is received, all the stored commands are executed in sequence. So inside the hood, once Redis receives the EXEC command, all the commands are executed as a single isolated operation. Following are the commands that can be used in Redis for transaction: MULTI: This marks the start of a transaction block EXEC: This executes all the commands in the pipeline after MULTI WATCH: This watches the keys for conditional execution of a transaction UNWATCH: This removes the WATCH keys of a transaction DISCARD: This flushes all the previously queued commands in the pipeline (For more resources related to this topic, see here.) The following figure represents how transaction in Redis works: Transaction in Redis Pipeline versus transaction As we have seen for many generic terms in pipeline the commands are grouped and executed, and the responses are queued in a block and sent. But in transaction, until the EXEC command is received, all the commands received after MULTI are queued and then executed. To understand that, it is important to take a case where we have a multithreaded environment and see the outcome. In the first case, we take two threads firing pipelined commands at Redis. In this sample, the first thread fires a pipelined command, which is going to change the value of a key multiple number of times, and the second thread will try to read the value of that key. Following is the class which is going to fire the two threads at Redis: MultiThreadedPipelineCommandTest.java: package org.learningRedis.chapter.four.pipelineandtx; public class MultiThreadedPipelineCommandTest { public static void main(String[] args) throws InterruptedException {    Thread pipelineClient = new Thread(new PipelineCommand());    Thread singleCommandClient = new Thread(new SingleCommand());    pipelineClient.start();    Thread.currentThread().sleep(50);    singleCommandClient.start(); } } The code for the client which is going to fire the pipeline commands is as follows: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; import Redis.clients.jedis.Pipeline; public class PipelineCommand implements Runnable{ Jedis jedis = ConnectionManager.get(); @Override public void run() {      long start = System.currentTimeMillis();      Pipeline commandpipe = jedis.pipelined();      for(int nv=0;nv<300000;nv++){        commandpipe.sadd("keys-1", "name"+nv);      }      commandpipe.sync();      Set<String> data= jedis.smembers("keys-1");      System.out.println("The return value of nv1 after pipeline [ " + data.size() + " ]");    System.out.println("The time taken for executing client(Thread-1) "+ (System.currentTimeMillis()-start));    ConnectionManager.set(jedis); } } The code for the client which is going to read the value of the key when pipeline is executed is as follows: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; public class SingleCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {    Set<String> data= jedis.smembers("keys-1");    System.out.println("The return value of nv1 is [ " + data.size() + " ]");    ConnectionManager.set(jedis); } } The result will vary as per machine configuration but by changing the thread sleep time and running the program couple of times, the result will be similar to the one shown as follows: The return value of nv1 is [ 3508 ] The return value of nv1 after pipeline [ 300000 ] The time taken for executing client(Thread-1) 3718 Please fire FLUSHDB command every time you run the test, otherwise you end up seeing the value of the previous test run, that is 300,000 Now we will run the sample in a transaction mode, where the command pipeline will be preceded by MULTI keyword and succeeded by EXEC command. This client is similar to the previous sample where two clients in separate threads will fire commands to a single key on Redis. The following program is a test client that gives two threads one with commands in transaction mode and the second thread will try to read and modify the same resource: package org.learningRedis.chapter.four.pipelineandtx; public class MultiThreadedTransactionCommandTest { public static void main(String[] args) throws InterruptedException {    Thread transactionClient = new Thread(new TransactionCommand());    Thread singleCommandClient = new Thread(new SingleCommand());    transactionClient.start();    Thread.currentThread().sleep(30);    singleCommandClient.start(); } } This program will try to modify the resource and read the resource while the transaction is going on: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; public class SingleCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {    Set<String> data= jedis.smembers("keys-1");    System.out.println("The return value of nv1 is [ " + data.size() + " ]");    ConnectionManager.set(jedis); } } This program will start with MULTI command, try to modify the resource, end it with EXEC command, and later read the value of the resource: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; import Redis.clients.jedis.Transaction; import chapter.four.pubsub.ConnectionManager; public class TransactionCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {      long start = System.currentTimeMillis();      Transaction transactionableCommands = jedis.multi();      for(int nv=0;nv<300000;nv++){        transactionableCommands.sadd("keys-1", "name"+nv);      }      transactionableCommands.exec();      Set<String> data= jedis.smembers("keys-1");      System.out.println("The return value nv1 after tx [ " + data.size() + " ]");    System.out.println("The time taken for executing client(Thread-1) "+ (System.currentTimeMillis()-start));    ConnectionManager.set(jedis); } } The result of the preceding program will vary as per machine configuration but by changing the thread sleep time and running the program couple of times, the result will be similar to the one shown as follows: The return code is [ 1 ] The return value of nv1 is [ null ] The return value nv1 after tx [ 300000 ] The time taken for executing client(Thread-1) 7078 Fire the FLUSHDB command every time you run the test. The idea is that the program should not pick up a value obtained because of a previous run of the program. The proof that the single command program is able to write to the key is if we see the following line: The return code is [1]. Let's analyze the result. In case of pipeline, a single command reads the value and the pipeline command sets a new value to that key as evident in the following result: The return value of nv1 is [ 3508 ] Now compare this with what happened in case of transaction when a single command tried to read the value but it was blocked because of the transaction. Hence the value will be NULL or 300,000. The return value of nv1 after tx [0] or The return value of nv1 after tx [300000] So the difference in output can be attributed to the fact that in a transaction, if we have started a MULTI command, and are still in the process of queueing commands (that is, we haven't given the server the EXEC request yet), then any other client can still come in and make a request, and the response would be sent to the other client. Once the client gives the EXEC command, then all other clients are blocked while all of the queued transaction commands are executed. Pipeline and transaction To have a better understanding, let's analyze what happened in case of pipeline. When two different connections made requests to the Redis for the same resource, we saw a result where client-2 picked up the value while client-1 was still executing: Pipeline in Redis in a multi connection environment What it tells us is that requests from the first connection which is pipeline command is stacked as one command in its execution stack, and the command from the other connection is kept in its own stack specific to that connection. The Redis execution thread time slices between these two executions stacks, and that is why client-2 was able to print a value when the client-1 was still executing. Let's analyze what happened in case of transaction here. Again the two commands (transaction commands and GET commands) were kept in their own execution stacks, but when the Redis execution thread gave time to the GET command, and it went to read the value, seeing the lock it was not allowed to read the value and was blocked. The Redis execution thread again went back to executing the transaction commands, and again it came back to GET command where it was again blocked. This process kept happening until the transaction command released the lock on the resource and then the GET command was able to get the value. If by any chance, the GET command was able to reach the resource before the transaction lock, it got a null value. Please bear in mind that Redis does not block execution to other clients while queuing transaction commands but blocks only during executing them. Transaction in Redis multi connection environment This exercise gave us an insight into what happens in the case of pipeline and transaction. Summary In this article, we saw in brief how to use Redis, not simply as a datastore, but also as pipeline the commands which is so much more like bulk processing. Apart from that, we covered areas such as transaction, messaging, and scripting. We also saw how to combine messaging and scripting, and create reliable messaging in Redis. This capability of Redis makes it different from some of the other datastore solutions. Resources for Article: Further resources on this subject: Implementing persistence in Redis (Intermediate) [article] Using Socket.IO and Express together [article] Exploring streams [article]
Read more
  • 0
  • 1
  • 4199
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-advanced-hadoop-mapreduce-administration
Packt
08 Apr 2013
6 min read
Save for later

Advanced Hadoop MapReduce Administration

Packt
08 Apr 2013
6 min read
(For more resources related to this topic, see here.) Tuning Hadoop configurations for cluster deployments Getting ready Shut down the Hadoop cluster if it is already running, by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME. How to do it... We can control Hadoop configurations through the following three configuration files: conf/core-site.xml: This contains the configurations common to whole Hadoop distribution conf/hdfs-site.xml: This contains configurations for HDFS conf/mapred-site.xml: This contains configurations for MapReduce Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag. <configuration><property><name>mapred.reduce.parallel.copies</name><value>20</value></property>...</configuration> The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks: Create a directory to store the logfiles. For example, /root/hadoop_logs. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/ hadoop-env.sh and point it to the new directory. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file: <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2 </value></property><property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2 </value></property> Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory. You can verify the number of processes created using OS process monitoring tools. If you are in Linux, run the watch ps –ef|grep hadoop command. If you are in Windows or MacOS use the Task Manager. How it works... HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred. tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks. maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment. These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site. xml files. Hadoop reloads these configurations after a restart. There's more... There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables. The configuration properties for conf/core-site.xml are listed in the following table: Name Default value Description fs.inmemory.size.mb 100 This is the amount of memory allocated to the in-memory filesystem that is used to merge map outputs at reducers in MBs. io.sort.factor 100 This is the maximum number of streams merged while sorting files. io.file.buffer.size 131072 This is the size of the read/write buffer used by sequence files. The configuration properties for conf/mapred-site.xml are listed in the following table: Name Default value Description mapred.reduce. parallel.copies 5 This is the maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs. mapred.map.child.java. opts -Xmx200M This is for passing Java options into the map JVM. mapred.reduce.child. java.opts -Xmx200M This is for passing Java options into the reduce JVM. io.sort.mb 200 The memory limit while sorting data in MBs. The configuration properties for conf/hdfs-site.xml are listed in the following table: Name Default value Description dfs.block.size 67108864 This is the HDFS block size. dfs.namenode.handler. count 40 This is the number of server threads to handle RPC calls in the NameNode. Running benchmarks to verify the Hadoop installation The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance. This recipe introduces these benchmarks and explains how to run them. Getting ready Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup. How to do it... Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job and then sort them using the sort sample. Change the directory to HADOOP_HOME. Run the randomwriter Hadoop job using the following command: >bin/hadoop jar hadoop-examples-1.0.0.jarrandomwriter-Dtest.randomwrite.bytes_per_map=100-Dtest.randomwriter.maps_per_host=10 /data/unsorted-data Here the two parameters, test.randomwrite.bytes_per_map and test. randomwriter.maps_per_host specify the size of data generated by a map and the number of maps respectively. Run the sort program: >bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data/data/sorted-data Verify the final results by running the following command: >bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data Finally, when everything is successful, the following message will be displayed: The job took 66 seconds.SUCCESS! Validated the MapReduce framework's 'sort' successfully. How it works... First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes. There's more... Hadoop includes several other benchmarks. TestDFSIO: This tests the input output (I/O) performance of HDFS nnbench: This checks the NameNode hardware mrbench: This runs many small jobs TeraSort: This sorts a one terabyte of data More information about these benchmarks can be found at http://www.michaelnoll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoopcluster- with-terasort-testdfsio-nnbench-mrbench/. Reusing Java VMs to improve the performance In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior. How to do it... Run the WordCount sample by passing the following option as an argument: >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount –Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1 Monitor the number of processes created by Hadoop (through ps –ef|grephadoop command in Unix or task manager in Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job. However, passing arguments through the –D option only works if the job implements the org.apache.hadoop.util.Tools interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method. How it works... By setting the job configuration property through mapred.job.reuse.jvm.num.tasks, we can control the number of tasks for the JVM run by Hadoop. When the value is set to -1, Hadoop runs the tasks in the same JVM.
Read more
  • 0
  • 0
  • 4186

article-image-indexing-data
Packt
18 Apr 2014
10 min read
Save for later

Indexing the Data

Packt
18 Apr 2014
10 min read
(For more resources related to this topic, see here.) Elasticsearch indexing We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like. Shards and replicas The Elasticsearch index is built of one or more shards and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with having 10 Lucene indices distributed among the cluster. Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term "replica" is somewhat misleading. It means that every shard has its copy, so it means there are five shards and five copies. Having a shard and its replica, in general, means that when we index a document, we will modify them both. That's because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows: More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance. More shards means that fewer resources are required to fetch a particular document because fewer documents are stored in a single shard compared to the documents stored in a deployment with fewer shards. More shards means more problems when searching across the index because we have to merge results from more shards and thus the aggregation phase of the query can be more resource intensive. Having more replicas results in a fault tolerance cluster, because when the original shard is not available, its copy will take the role of the original shard. Having a single replica, the cluster may lose the shard without data loss. When we have two replicas, we can lose the primary shard and its single replica and still everything will work well. The more the replicas, the higher the query throughput will be. That's because the query can use either a shard or any of its copies to execute the query. Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data. Creating indices When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command: curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New   version of Elasticsearch released!", "content": "...", "tags":["announce", "elasticsearch", "release"] }' This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command: curl -XPUT http://localhost:9200/blog/ We just told Elasticsearch that we want to create the index with the blog name. If everything goes right, you will see the following response from Elasticsearch: {"acknowledged":true} When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards. Altering automatic index creation Sometimes, you can come to the  conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line in the elasticsearch.yml configuration file: action.auto_create_index: false Note that action.auto_create_index is more complex than it looks. The value can be set to not only false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with the names beginning with a, but disallows the creation of indices starting with an. The other indices aren't allowed and must be created manually (because of -*). action.auto_create_index: -an*,+a*,-* Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used because of +a* , which will be checked first. Settings for a newly created index The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example: curl -XPUT http://localhost:9200/blog/ -d '{     "settings" : {         "number_of_shards" : 1,         "number_of_replicas" : 2     } }' The preceding command will result in the creation of the blog index with one shard and two replicas, so it makes a total of three physical Lucene indices. Also, there are other values that can be set in this way. So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. So the actual command is as follows: curl –XDELETE http://localhost:9200/posts And the response will be the same as the one we saw earlier, as follows: {"acknowledged":true} Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. It is a very important part because data indexation will affect the search process and the way in which documents are matched. Mappings configuration If you are used to SQL databases, you may know that before you can start inserting the data in the database, you need to create a schema, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fl y, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure. Type determining mechanism Before we  start describing how to create mappings  manually, we wanted to write about one thing. Elasticsearch can guess the document structure by looking at JSON, which defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document: {   "field1": 10, "field2": "10" } The preceding document has two fields. The field1 field will be determined as a number (to be precise, as long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution to this is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index: curl -XPUT http://localhost:9200/blog/?pretty -d '{   "mappings" : {     "article": {       "numeric_detection" : true     }   } }' Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition. Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify the formats array. Let's look at the following command for creating the index and type: curl -XPUT 'http://localhost:9200/blog/' -d '{   "mappings" : {     "article" : {       "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]     }   } }' The preceding command will result in the creation of an index called blog with the single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them. Remember that the dynamic_date_format property accepts an array of values. That means that we can handle several date formats simultaneously.
Read more
  • 0
  • 0
  • 4183

article-image-learning-random-forest-using-mahout
Packt
05 Mar 2015
11 min read
Save for later

Learning Random Forest Using Mahout

Packt
05 Mar 2015
11 min read
In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, which is one of the most popular techniques in classification. It starts with a machine learning technique called decision tree. In this article, we will explore the following topics: Decision tree Random forest Using Mahout for Random forest (For more resources related to this topic, see here.) Decision tree A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, then we split it further on branches. As in other classification model creation processes, we start with the training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in training datasets into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process is continued until no more partitioning can be done. The core of the algorithm is to find out the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation for ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html. Forming the explanatory variables to choose the best splitter in a node, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued in the next node until we reach a node where we can make the decision. We create a decision tree from a training dataset so it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node since all the information is available there. Later on, with each split, data is less and towards the end of the tree, a particular node can show patterns that are related to the set of data which is used to split. These patterns create problems when we use them to predict the real dataset. Pruning methods let the tree grow and remove the smaller branches that fail to generalize. Now take an example to understand the decision tree. Consider we have a iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs. As you can see in the preceding diagram, all the groups were earlier considered as Sentosa species and then the explanatory variable and petal length were further used to divide the groups. At each step, the calculation for misclassified items was also done, which shows how many items were wrongly classified. Moreover, the petal width variable was taken into account. Usually, items at leaf nodes are correctly classified. Random forest The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and also outputs the class that is the mode of the classes outputted by individual trees. Single decision trees show the bias–variance tradeoff. So they usually have high variance or high bias. The following are the parameters in the algorithm: Bias: This is an error caused by an erroneous assumption in the learning algorithm Variance: This is an error that ranges from sensitivity to small fluctuations in the training set Random forests attempt to mitigate this problem by averaging to find a natural balance between two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of a Random forest algorithm, let's work with an example. Consider we have a training dataset that has lots of features (explanatory variables) and target variables or classes: We create a sample set from the given dataset: A different set of random features were taken into account to create the random sub-dataset. Now, from these sub-datasets, different decision trees will be created. So actually we have created a forest of the different decision trees. Using these different trees, we will create a ranking system for all the classifiers. To predict the class of a new unknown item, we will use all the decision trees and separately find out which class these trees are predicting. See the following diagram for a better understanding of this concept: Different decision trees to predict the class of an unknown item In this particular case, we have four different decision trees. We predict the class of an unknown dataset with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, a Random forest will vote for each class. So we have one vote each for class 2 and class 3 and two votes for class 5. Therefore, it has decided that for the new unknown dataset, the predicted class is class 5. So the class that gets a higher vote is decided for the new dataset. A Random forest has a lot of benefits in classification and a few of them are mentioned in the following list: Combination of learning models increases the accuracy of the classification Runs effectively on large datasets as well The generated forest can be saved and used for other datasets as well Can handle a large amount of explanatory variables Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout. Using Mahout for Random forest Mahout has implementation for the Random forest algorithm. It is very easy to understand and use. So let's get started. Dataset We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset is prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program (R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation," discex, vol. 02, p. 1012, 2000). DARPA'98 is about 4 GB of compressed raw (binary) tcp dump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/. We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets. In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file. Steps to use the Random forest algorithm in Mahout The steps to implement the Random forest algorithm in Apache Mahout are as follows: Transfer the test and training datasets to hdfs using the following commands: hadoop fs -mkdir /user/hue/KDDTrainhadoop fs -mkdir /user/hue/KDDTesthadoop fs –put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrainhadoop fs –put /tmp/KDDTest+.arff /user/hue/KDDTest Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+.arff, a descriptor file is required. This is because all information in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which one is numerical and categorical. Use the following command to generate descriptor file: hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz.job.jarorg.apache.mahout.classifier.df.tools.Describe-p /user/hue/KDDTrain/KDDTrain+_20Percent.arff-f /user/hue/KDDTrain/KDDTrain+.info-d N 3 C 2 N C 4 N C 8 N 2 C 19 N L Jar: Mahout core jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here and it takes three parameters: The p path for the data to be described. The f location for the generated descriptor file. d is the information for the attribute on the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset is starting with a numeric (N), followed by three categorical attributes, and so on. In the last, L defines the label. The output of the previous command is shown in the following screenshot: Build the Random forest using the following command: hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest-Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff-ds /user/hue/KDDTrain/KDDTrain+.info-sl 5 -p -t 100 –o /user/hue/ nsl-forest Jar: Mahout example jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class build forest is used to build the forest with other arguments, which are shown in the following list: Dmapred.max.split.size indicates to Hadoop the maximum size of each partition. d stands for the data path. ds stands for the location of the descriptor file. sl is a variable to select randomly at each tree node. Here, each tree is built using five randomly selected attributes per node. p uses partial data implementation. t stands for the number of trees to grow. Here, the commands build 100 trees using partial implementation. o stands for the output path that will contain the decision forest. In the end, the process will show the following result: Use this model to classify the new dataset: hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest-i /user/hue/KDDTest/KDDTest+.arff-ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a –mr-o /user/hue/predictions Jar: Mahout example jar (xyz stands for version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest has the following parameters: I indicates the path for the test data ds stands for the location of the descriptor file m stands for the location of the generated forest from the previous command a informs to run the analyzer to compute the confusion matrix mr informs Hadoop to distribute the classification o stands for the location to store the predictions in The job provides the following confusion matrix: So, from the confusion matrix, it is clear that 9,396 instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. And the accuracy percentage is 77.7635 (correctly classified instances by the model / classified instances). The output file in the prediction folder contains the list where 0 and 1. 0 defines the normal dataset and 1 defines the anomaly. Summary In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest tree, and used it with the test dataset and generated the confusion matrix and other statistics for the output. Resources for Article: Further resources on this subject: Implementing the Naïve Bayes classifier in Mahout [article] About Cassandra [article] Tuning Solr JVM and Container [article]
Read more
  • 0
  • 1
  • 4176

article-image-practical-applications-deep-learning
Packt
14 Jan 2016
20 min read
Save for later

Practical Applications of Deep Learning

Packt
14 Jan 2016
20 min read
In this article, Yusuke Sugomori, the author of the book Deep Learning with Java, we’ll first see how deep learning is actually applied. Here, you will see that the actual cases where deep learning is utilized are still very few. But why aren't there many cases even though it is such an innovative method? What is the problem? Later on, we’ll think about the reasons. Furthermore, going forward we will also consider which fields we can apply deep learning to and will have the chance to apply deep learning and all the related areas of artificial intelligence. The topics covered in this article include: The difficulties of turning deep learning models into practical applications The possible fields where deep learning can be applied, and ideas on how to approach these fields We'll explore the potential of this big AI boom, which will lead to ideas and hints that you can utilize in deep learning for your research, business, and many sorts of activities. (For more resources related to this topic, see here.) The difficulties of deep learning Deep learning has already got higher precision than humans in the image recognition field and has been applied to quite a lot of practical applications. Similarly, in the NLP field, many models have been researched. Then, how much deep learning is utilized in other fields? Surprisingly, there are still few fields where deep learning is successfully utilized. This is because deep learning is indeed innovative compared to past algorithms and definitely lets us take a big step towards materializing AI; however, it has some problems to be used for practical applications. The first problem is that there are too many model parameters in deep learning algorithms. We didn't look into detail when you learned about the theory and implementation of algorithms, but actually deep neural networks have many hyper parameters that need to be decided compared to the past neural networks or other machine learning algorithms. This means we have to go through more trial and error to get high precision. Combinations of parameters that define a structure of neural networks, such as how many hidden layers are to be set or how many units each hidden layer should have, need lots of experiments. Also, the parameters for training and test configurations such as the learning rateneed to be determined. Furthermore, peculiar parameters for each algorithm such as the corruption level in SDA and the size of kernels in CNN need additional trial and error. Thus, the great performance that deep learning provides is supported by steady parameter-tuning. However, people only look at one side of deep learning—that it can get great precision— and they tend to forget the hard process required to reach to the point. Deep learning is not magic. In addition, deep learning often fails to train and classify data from simple problems. The shape of deep neural networks is so deep and complicated that the weights can't be well optimized. In terms of optimization, data quantities are also important. This means that deep neural networks require a significant amount of time for each training. To sum up, deep learning shows its worth when: It solves complicated and hard problems when people have no idea what feature they can be classified as There is sufficient training data to properly optimize deep neural networks Compared to applications that constantly update a model using continuously updated data, once a model is built using a large-scale data set that doesn't change drastically, applications that use the model universally are rather well suited for deep learning. Therefore, when you look at business fields, you can say that there are more cases where the existing machine learning can get better results than using deep learning. For example, let's assume we would like to recommend appropriate products to users in an EC. In this EC, many users buy a lot of products daily, so purchase data is largely updated daily. In this case, do you use deep learning to get high-precision classification and recommendations to increase the conversion rates of users' purchases using this data? Probably not, because using the existing machine learning algorithms such as Naive Bayes, collaborative filtering, SVM, and so on, we can get sufficient precision from a practical perspective and can update the model and calculate quicker, which is usually more appreciated. This is why deep learning is not applied much in business fields. Of course, getting higher precision is better in any field, but in reality, higher precision and the necessary calculation time are in a trade-off relationship. Although deep learning is significant in the research field, it has many hurdles yet to clear considering practical applications. Besides, deep learning algorithms are not perfect, and they still need many improvements to their model itself. For example, RNN, as mentioned earlier, can only satisfy either how past information can be reflected to a network or how precision can be obtained, although it's contrived with techniques such as LSTM. Also, deep learning is still far from the true AI, although it's definitely a great technique compared to the past algorithms. Research on algorithms is progressing actively, but in the meantime, we need one more breakthrough to spread out and infiltrate deep learning into broader society. Maybe this is not just the problem of a model. Deep learning is suddenly booming because it is reinforced by huge developments in hardware and software. Deep learning is closely related to development of the surrounding technology. As mentioned earlier, there are still many hurdles to clear before deep learning can be applied more practically in the real world, but this is not impossible to achieve. It isn't possible to suddenly invent AI to achieve technological singularity, but there are some fields and methods where deep learning can be applied right away. In the next section, we’ll think about what kinds of industries deep learning can be utilized in. Hopefully, it will sew the seeds for new ideas in your business or research fields. The approaches to maximize deep learning possibilities and abilities There are several approaches on how we can apply deep learning to various industries. While it is true that an approach could be different depending on the task or purpose, we can briefly categorized the approaches as the following three: Field-oriented approach: This utilizes deep learning algorithms or models that are already thoroughly researched and can lead to a great performance Breakdown-oriented approach: This replaces problems to be solved that deep learning can apparently be applied with a different problem that deep learning can be well adopted Output-oriented approach: This explores new ways on how we express the output with deep learning These approaches are all explained in detail in the following subsections. Each approach is divided into its suitable industries or not for its use, but any of them could be a big hint for your activities going forward. There are still very few use cases of deep learning and bias against fields of use, but this means there should be many chances to create innovative and new things. Start-ups who utilize deep learning have been emerging recently and some of them have already achieved success to some extent. You can have a significant impact on the world depending on your ideas. Field-oriented approach This approach doesn't require new techniques or algorithms. There are obviously fields that are well suited to the current deep learning techniques, and the concept here is to dive into these fields. As explained previously, since deep learning algorithms that have been practically studied and developed are mostly in image recognition and NLP, we'll explore some fields that can work in great harmony with them. Medicine Medical fields should be developed by deep learning. Tumors or cancers are detected on scanned images. This means nothing else but being able to utilize one of the strongest features of deep learning—the technique of image recognition. It is possible to dramatically increase precision using deep learning to help with the early detection of an illness and identifying the kind of illness. Since CNN can be applied to 3D images, 3D scanned images should be able to be analyzed relatively easily. By adopting deep learning more in the current medical field, deep learning should greatly contribute. We can also say that deep learning can be significantly useful for the future medical field. The medical field has been under strict regulations, however, there is a movement progressing to ease the regulations in some countries, probably because of the recent development of IT and its potential. Therefore, there will be opportunities in business for the medical field and IT to have a synergy effect. For example, if telemedicine is more infiltrated, there is the possibility that diagnosing or identifying a disease can be done not only by a scanned image, but also by an image shown in real time on a display. Also, if electronic charts become widespread, it would be easier to analyze medical data using deep learning. This is because medical records are compatible with deep learning as they are a dataset of texts and images. Then, the symptom of unknown disease can be found. Automobiles We can say that surroundings off running cars are image sequences and texts. Other cars and views are images and a road sign has texts. This means we can also utilize deep learning techniques here, and it is possible to reduce the risk of accidents by improving driving assistance functions. It can be said that the ultimate type of driving assistance is self-driving cars, which is being tackled mainly by Google and Tesla. An example that is both famous and fascinating was when George Hotz, the first person to hack the iPhone, built a self-driving car in his garage. The appearance of the car was introduced in an article by Bloomberg Business (http://www.bloomberg.com/features/2015-george-hotz-self-driving-car/), and the following image was included in the article: Self-driving cars have been already tested in the U.S., but since other countries have different traffic rules and road conditions, this idea requires further studying and development before self-driving cars are commonly used worldwide. The key to success in this field is in learning and recognizing surrounding cars, people, views, and traffic signs, and properly judging how to process them. In the meantime, we don't have to just focus on utilizing deep learning techniques for the actual body of a car. Let's assume we could develop a smartphone app that has the same function as we just described, that is, recognizing and classifying surrounding images and text. Then, if you just set up the smartphone in your car, you could utilize it as a car-navigation app. In addition, for example, it could be used as a navigation app for blind people, providing them with good reliable directions. Advert technologies Advert (ad) technologies could expand their coverage with deep learning. When we say ad technologies, this currently means recommendation or ad networks that optimize ad banners or products to be shown. On the other hand, when we say advertising, this doesn't only mean banners or ad networks. There are various kinds of ads in the world depending on the type of media such as TV ads, radio ads, newspaper ads, posters, flyers, and so on. We have also digital ad campaigns with YouTube, Vine, Facebook, Twitter, Snapchat, and so on. Advertising itself has changed its definition and content, but all ads have one thing in common, they consist of images and/or language. This means they are fields that deep learning is good at. Until now, we could only use user-behavior-based indicators, such as page view (PV), click through rate (CTR), and conversion rate (CVR), to estimate the effect of an ad, but if we apply deep learning technologies, we might be able to analyze the actual content of an ad and autogenerate ads going forward. Especially since movies and videos can only be analyzed as a result of image recognition and NLP, video recognition, not image recognition, will gather momentum besides ad technologies. Profession or practice Professions such as doctor, lawyer, patent attorney, and accountant are considered to be roles that deep learning can replace. For example, if NLP's precision and accuracy gets higher, any perusal that requires expertise can be left to a machine. As a machine can cover these time-consuming reading tasks, people can focus more on high-value tasks. In addition, if a machine classifies past judicial cases or medical cases on what disease caused what symptoms and so on, we would be able to build an app like Apple’s Siri that answers simple questions that usually require professional knowledge. Then, the machine could handle these professional cases to some extent if a doctor or a lawyer is too busy to help in a timely manner. It's often said that AI takes away human’s jobs, but personally, this seems incorrect. Rather, a machine takes away menial work, which should support humans. A software engineer who works on AI programming can described as having a professional job, but this work will also be changed in the future. For example, think about a car-related job, where the current work is building standard automobiles, but in the future engineers will be in a position just like pit crews for Formula 1 cars. Sports Deep learning can certainly contribute to sports as well. In the study field known as sports science, it has become increasingly important to analyze and examine data from sports. As an example, you may know the book or movie Moneyball. In this film, they hugely increased the win percentage of the team by adopting a regression model in baseball. Watching sports itself is very exciting, but on the other hand, sport can be seen as a chunk of image sequences and number data. Since deep learning is good at identifying features that humans can't find, it will become easier to find out why certain players get good scores while others don't. These fields we have mentioned are only a small part of the many fields where deep learning is capable of significantly contributing to development. We have looked into these fields from the perspective of whether a field has images or text, but of course deep learning should also show great performance for simple analysis with general number data. It should be possible to apply deep learning to various other fields such as bioinformatics, finance, agriculture, chemistry, astronomy, economy, and more. Breakdown-oriented approach This approach might be similar to the approach considered in traditional machine learning algorithms. We already talked about how feature engineering is the key to improving precision in machine learning. Now we can say that this feature engineering can be divided into the following two parts: Engineering under the constraints of a machine learning model. The typical case is to make inputs discrete or continuous. Feature engineering to increase precision by machine learning. This tends to rely on the sense of a researcher. In a narrower meaning, feature engineering is considered as the second one, and this is the part that deep learning doesn't have to focus on, whereas the first one is definitely the important part even for deep learning. For example, it's difficult to predict stock prices using deep learning. Stock prices are volatile and it’s difficult to define inputs. Besides, how to apply an output value is also a difficult problem. Enabling deep learning to handle these inputs and outputs is also said to be feature engineering in the wider sense. If there is no limitation to the value of original data and/or data you would like to predict, it’s difficult to insert these datasets into machine learning and deep learning algorithms, including neural networks. However, we can take a certain approach and apply a model to these previous problems by breaking down the inputs and/or outputs. In terms of NLP as explained earlier, you might have thought, for example, that it would be impossible to put numberless words into features in the first place, but as you already know, we can train feed-forward neural networks with words by representing them with sparse vectors and combining N-grams into them. Of course, we can not only use neural networks, but also other machine learning algorithms such as SVM here. Thus, we can cultivate a new field where deep learning hasn't been applied by engineering to fit features well into deep learning models. In the meantime, when we focus on NLP, we can see that RNN and LSTM were developed to properly resolve the difficulties or tasks encountered in NLP. This can be considered as the opposite approach to feature engineering because in this case, the problem is solved by breaking down a model to fit into features. Then, how do we do engineering for stock prediction as we just mentioned? It's actually not difficult to think of inputs, that is, features. For example, if you predict stock prices daily, it’s hard to calculate if you use daily stock prices as features, but if you use a rate of price change between a day and the day before, then it should be much easier to process as the price stays within a certain range and the gradients won't explode easily. Meanwhile, what is difficult is how to deal with outputs. Stock prices are of course continuous values, hence outputs can be various values. This means that in neural network model where the number of units in the output layer is fixed, they can't handle this problem. What should we do here—should we give up?! No, wait a minute. Unfortunately, we can't predict a stock price itself, but there is an alternative prediction method. Here, the problem is that we can classify stock prices to be predicted into infinite patterns. Then, can we make them into limited patterns? Yes, we can. Let's forcibly make them. Think about the most extreme but easy to understand case: predicting whether a tomorrow's stock price, strictly speaking a close price, is up or down using the data from the stock price up to today. For this case, we can show it with a deep learning model as follows: In the preceding image, denotes the open price of a day, ; denotes the close price, is the high price, and is the actual price. The features used here are mere examples, and need to be fine-tuned when applied to real applications. The point here is that replacing the original task with this type of problem enables deep neural networks to theoretically classify data. Furthermore, if you classify the data by how much it will go up or down, you could make more detailed predictions. For example, you could classify data as shown in the following table: Class Description Class 1 Up more than 3 percent from the closing price Class 2 Up more than 1~3 percent from the closing price Class 3 Up more than 0~1 percent from the closing price Class 4 Down more than 0~-1 percent from the closing price Class 5 Down more than -1~-3 percent from the closing price Class 6 Down more than -3 percent from the closing price Whether the prediction actually works, in other words whether the classification works, is unknown until we examine it, but the fluctuation of stock prices can be predicted in a quite narrow range by dividing the outputs into multiple classes. Once we can adopt the task into neural networks, then what we should do is just examine which model gets better results. In this example, we may apply RNN because the stock price is time sequential data. If we look at charts showing the price as image data, we can also use CNN to predict the future price. So now we've thought about the approach by referring to examples, but to sum up in general, we can say that: Feature engineering for models: This is designing inputs or adjusting values to fit deep learning models, or enabling classification by setting a limitation for the outputs Model engineering for features: This is devising new neural network models or algorithms to solve problems in a focused field The first one needs ideas for the part of designing inputs and outputs to fit to a model, whereas the second one needs to take a mathematical approach. Feature engineering might be easier to start if you are conscious of making an item prediction-limited. Output-oriented approach The previously mentioned two approaches are to increase the percentage of correct answers for a certain field's task or problem using deep learning. Of course, it is essential and the part where deep learning proves its worth, however, increasing precision to the ultimate level may not be the only way of utilizing deep learning. Another approach is to devise the outputs using deep learning by slightly changing the point of view. Let's see what this means. Deep learning is applauded as an innovative approach among researchers and technical experts of AI, but the world in general doesn't know much about its greatness yet. Rather, they pay attention to what a machine can't do. For example, people don't really focus on the image recognition capabilities of MNIST using CNN, which generates a lower error rate than humans, but they criticize that a machine can't recognize images perfectly. This is probably because people expect a lot when they hear and imagine AI. We might need to change this mindset. Let's consider DORAEMON, a Japanese national cartoon character who is also famous worldwide—a robot who has high intelligence and AI, but often makes silly mistakes. Do we criticize him? No, we just laugh it off or take it as a joke and don’t get serious. Also, think about DUMMY / DUM-E, the robot arm in the movie Iron Man. It has AI as well, but makes silly mistakes. See, they make mistakes but we still like them. In this way, it might be better to emphasize the point that machines make mistakes. Changing the expression part of a user interface could be the trigger for people to adopt AI rather than just studying an algorithm the most. Who knows? It’s highly likely that you can gain the world’s interest by thinking of ideas in creative fields, not from the perspective of precision. Deep Dream by Google is one good example. We can do more exciting things when art or design and deep learning collaborate. Summary And …congratulations! You’ve just accomplished the learning part of deep learning with Java. Although there are still some models that have not been mentioned, you can be sure there will be no problem in acquiring and utilizing them. Resources for Article: Further resources on this subject: Setup Routine for an Enterprise Spring Application[article] Support for Developers of Spring Web Flow 2[article] Using Spring JMX within Java Applications[article]
Read more
  • 0
  • 0
  • 4173
article-image-introduction-scikit-learn
Packt
16 Feb 2016
5 min read
Save for later

Introduction to Scikit-Learn

Packt
16 Feb 2016
5 min read
Introduction Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models. (For more resources related to this topic, see here.) Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of the code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests. Installing scikit-learn This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code: >>> import sklearn >>> sklearn.__version__ '0.15.1' If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python. Installing scikit-learn on Windows scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools. To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command: python setup.py install Installing scikit-learn on Linux There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option to install scikit-learn on Linux is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command: sudo pip install scikit-learn To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies: sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-numpy-dev python-scipy libatlas-dev g++ Navigate to the repository's directory and execute the following command: python setup.py install Installing scikit-learn on OS X scikit-learn can be installed on OS X using Macports: sudo port install py26-sklearn If Python 2.7 is installed, run the following command: sudo port install py27-sklearn scikit-learn can also be installed using pip with the following command: pip install scikit-learn Verifying the installation To verify that scikit-learn has been installed correctly, open a Python console and execute the following: >>> import sklearn >>> sklearn.__version__ '0.15.1' To run scikit-learn's unit tests, first install the nose library. Then execute the following: nosetest sklearn –exe Congratulations! You've successfully installed scikit-learn. Summary In this article, we had a brief introduction of Scikit. We also covered the installation of Scikit on various operating system Windows, Linux, OS X. You can also refer the following books on the similar topics: scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook) Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python) Resources for Article: Further resources on this subject: Our First Machine Learning Method – Linear Classification[article] Specialized Machine Learning Topics[article] Machine Learning[article]
Read more
  • 0
  • 0
  • 4154

article-image-getting-started-oracle-data-guard
Packt
02 Jul 2013
13 min read
Save for later

Getting Started with Oracle Data Guard

Packt
02 Jul 2013
13 min read
(For more resources related to this topic, see here.) What is Data Guard? Data Guard, which was introduced as the standby database in Oracle database Version 7.3 under the name of Data Guard with Version 9 i , is a data protection and availability solution for Oracle databases. The basic function of Oracle Data Guard is to keep a synchronized copy of a database as standby, in order to make provision, incase the primary database is inaccessible to end users. These cases are hardware errors, natural disasters, and so on. Each new Oracle release added new functionalities to Data Guard and the product became more and more popular with offerings such as data protection, high availability, and disaster recovery for Oracle databases. Using Oracle Data Guard, it's possible to direct user connections to a Data Guard standby database automatically with no data loss, in case of an outage in the primary database. Data Guard also offers taking advantage of the standby database for reporting, test, and backup offloading. Corruptions on the primary database may be fixed automatically by using the non-corrupted data blocks on the standby database. There will be minimal outages (seconds to minutes) on the primary database in planned maintenances such as patching and hardware changes by using the switchover feature of Data Guard, which changes the roles of the primary and standby databases. All of these features are available with Data Guard, which doesn't require an installation but a cloning and configuration of the Oracle database. A Data Guard configuration consists of two main components: primary database and standby database. The primary database is the database for which we want to take precaution for its inaccessibility. Fundamentally, changes on the data of the primary database are passed through the standby database and these changes are applied to the standby database in order to keep it synchronized. The following figure shows the general structure of Data Guard: Let's look at the standby database and its properties more closely. Standby database It is possible to configure a standby database simply by copying, cloning, or restoring a primary database to a different server. Then the Data Guard configurations are made on the databases in order to start the transfer of redo information from primary to standby and also to start the apply process on the standby database. Primary and standby databases may exist on the same server; however, this kind of configuration should only be used for testing. In a production environment, the primary and standby database servers are generally preferred to be on separate data centers. Data Guard keeps the primary and standby databases synchronized by using redo information. As you may know, transactions on an Oracle database produce redo records. This redo information keeps all of the changes made to the database. The Oracle database first creates redo information in memory (redo log buffers). Then they're written into online redo logfiles, and when an online redo logfile is full, its content is written into an archived redo log. An Oracle database can run in the ARCHIVELOG mode or the NOARCHIVELOG mode. In the ARCHIVELOG mode, online redo logfiles are written into archived redo logs and in the NOARCHIVELOG mode, redo logfiles are overwritten without being archived as they become full. In a Data Guard environment, the primary database must be in the ARCHIVELOG mode. In Data Guard, transfer of the changed data from the primary to standby database is achieved by redo with no alternative. However, the apply process of the redo content to the standby database may vary. The different methods on the apply process reveal different type of standby databases. There were two kinds of standby databases before Oracle database Version 11 g , which were: physical standby database and logical standby database. Within Version 11 g we should mention a third type of standby database which is snapshot standby. Let's look at the properties of these standby database types. Physical standby database The Physical standby database is a block-based copy of the primary database. In a physical standby environment, in addition to containing the same database objects and same data, the primary and standby databases are identical on a block-for-block basis. Physical standby databases use Redo Apply method to apply changes. Redo Apply uses Managed recovery process ( MRP ) in order to manage application of the change in information on redo. In Version 11 g , a physical standby database can be accessible in read-only mode while Redo Apply is working, which is called Active Data Guard. Using the Active Data Guard feature, we can offload report jobs from the primary to physical standby database. Physical standby database is the only option that has no limitation on storage vendor or data types to keep a synchronized copy of the primary database. Logical standby database Logical standby database is a feature introduced in Version 9 i R2. In this configuration, redo data is first converted into SQL statements and then applied to the standby database. This process is called SQL Apply. This method makes it possible to access the standby database permanently and allows read/write while the replication of data is active. Thus, you're also able to create database objects on the standby database that don't exist on the primary database. So a logical standby database can be used for many other purposes along with high availability and disaster recovery. Due to the basics of SQL Apply, a logical standby database will contain the same data as the primary database but in a different structure on the disks. One discouraging aspect of the logical standby database is the unsupported data types, objects, and DDLs. The following data types are not supported to be replicated in a logical standby environment: BFILE Collections (including VARRAYS and nested tables) Multimedia data types (including Spatial, Image, and Oracle Text) ROWID and UROWID User-defined types The logical standby database doesn't guarantee to contain all primary data because of the unsupported data types, objects, and DDLs. Also, SQL Apply consumes more hardware resources. Therefore, it certainly brings more performance issues and administrative complexities than Redo Apply. Snapshot standby database Principally, a snapshot standby database is a special condition of a physical standby database. Snapshot standby is a feature that is available with Oracle Database Version 11 g . When you convert a Physical standby database into a snapshot standby database, it becomes accessible for read/write. You can run tests on this database and change the data. When you're finished with the snapshot standby database, it's possible to reverse all the changes made to the database and turn it back to a physical standby again. An important point here is that a snapshot standby database can't run Redo Apply. Redo transfer continues but standby is not able to apply redo. Oracle Data Guard evolution It has been a long time that the Oracle Data Guard technology has been in the database administrator's life and it apparently evolved from the beginning until 11 g R2. Let's look at this evolution closely through the different database versions. Version 7.3 – stone age The functionality of keeping a duplicate database in a separate server, which can be synchronized with the primary database, came with Oracle database Version 7.3 under the name of standby database. This standby database was constantly in recovery mode waiting for the archived redo logs to be synchronized. However, this feature was not able to automate the transfer of archived redo logs. Database administrators had to find a way to transfer archived redo logs and apply them to the standby server continuously. This was generally accomplished by a script running in the background. The only aim of Version 7.3 of the standby database was disaster recovery. It was not possible to query the standby database or to open it for any purpose other than activating it in the event of failure of the primary database. Once the standby database was activated, it couldn't be returned to the standby recovery mode again. Version 8 i – first age Oracle database Version 8 i brought the much-awaited features to the standby database and made the archived log shipping and apply process automatic, which is now called managed standby environment and managed recovery, respectively. However, some users were choosing to apply the archived logs manually because it was not possible to set a delay in the managed recovery mode. This mode was bringing the risk of the accidental operations to reflect standby database quickly. Along with the "managed" modes, 8 i made it possible to open a standby database with the read-only option and allowed it to be used as a reporting database. Even though there were new features that made the tool more manageable and practical, there were still serious deficiencies. For example, when we added a datafile or created a tablespace on the primary database, these changes were not being replicated to the standby database. Database administrators had to take care of this maintenance on the standby database. Also when we opened the primary database with resetlogs or restored a backup control file, we had to re-create the standby database. Version 9 i – middle age First of all, with this version Oracle8 i standby database was renamed to Oracle9 i Data Guard. 9 i Data Guard includes very important new features, which makes the product much more reliable and functional. The following features were included: Oracle Data Guard Broker management framework, which is used to centralize and automate the configuration, monitoring, and management of Oracle Data Guard installations, was introduced with this version. Zero data loss on failover was guaranteed as a configuration option. Switchover was introduced, which made it possible to change the roles of primary and standby. This made it possible to accomplish a planned maintenance on the primary database with very less service outage. Standby database administration became simpler because new datafiles on the primary database are created automatically on standby and if there are missing archived logs on standby, which is called gap; Data Guard detects and transmits the missing logs to standby automatically. Delay option was added, which made it possible to configure a standby database that is always behind the primary in a specified time delay. Parallel recovery increased recovery performance on the standby database. In Version 9 i Release 2, which was introduced in May 2002, one year after Release 1, there were again very important features announced. They are as follows: Logical standby database was introduced, which we've mentioned earlier in this article Three data protection modes were ready to use: Maximum Protection, Maximum Availability, and Maximum Performance, which offered more flexibility on configuration The Cascade standby database feature made it possible to configure a second standby database, which receives its redo data from the first standby database Version 10 g – new age The 10 g version again introduced important features of Data Guard but we can say that it perhaps fell behind expectations because of the revolutionary changes in release 9 i . The following new features were introduces in Version 10 g : One of the most important features of 10 g was the Real-Time Apply. When running in Real-Time Apply mode, the standby database applies changes on the redo immediately after receiving it. Standby does not wait for the standby redo logfile to be archived. This provides faster switchover and failover. Flashback database support was introduced, which made it unnecessary to configure a delay in the Data Guard configuration. Using flashback technology, it was possible to flash back a standby database to a point in time. With 10 g Data Guard, if we open a primary database with resetlogs it was not required to re-create the standby database. Standby was able to recover through resetlogs. Version 10 g made it possible to use logical standby databases in the database software rolling upgrades of the primary database. This method made it possible to lessen the service outage time by performing switchover to the logical standby database. 10 g Release 2 also introduced new features to Data Guard, but these features again were not satisfactory enough to make a jump to the Data Guard technology. The two most important features were Fast-Start Failover and the use of Guaranteed restore point: Fast-start failover automated and accelerated the failover operation when the primary database was lost. This option strengthened the disaster recovery role of Oracle Data Guard. Guaranteed restore point was not actually a Data Guard feature. It was a database feature, which made it possible to revert a database to the moment that Guaranteed restore point was created, as long as there is sufficient disk space for the flashback logs. Using this feature following scenario became possible: Activate a physical standby database after stopping Redo Apply, use it for testing with read/write operations, then revert the changes, make it standby again and synchronize it with the primary. Using a standby database read/write was offering a great flexibility to users but the archived log shipping was not able to continue while the standby is read/write and this was causing data loss on the possible primary database failure. Version 11 g – modern age Oracle database version 11 g offered the expected jump in the Data Guard technology, especially with two new features, which are called Active Data Guard and snapshot standby. The following features were introduced: Active Data Guard has been a milestone in Data Guard history, which enables a query from a physical standby database while the media recovery is active. Snapshot standby is a feature to use a physical standby database read/write for test purposes. As we mentioned, this was possible with 10 g R2 Guaranteed restore point feature but 11 g provided the continuous archived log shipping in the time period that standby is read/write with snapshot standby. It has been possible to compress redo traffic in a Data Guard configuration, which is useful in excessive redo generation rates and resolving gaps. Compression of redo when resolving gaps was introduced in 11 g R1 and compression of all redo data was introduced in 11 g R2. Use of the physical standby databases for the rolling upgrades of database software was enabled, aka Transient Logical Standby. It became possible to include different operating systems in a Data Guard configuration such as Windows and Linux. Lost-write, which is a serious data corruption type arising from the misinformation of storage subsystem on completing the write of a block, can be detected in an 11 g Data Guard configuration. Recovery is automatically stopped in such a case. RMAN fast incremental backup feature "Block Change Tracking" can be run on an Active Data Guard enabled standby database. Another very important enhancement in 11 g was Automatic Block Corruption Repair feature that was introduced with 11 g R2. With this feature, a corrupted data block in the primary database can be automatically replaced with an uncorrupted copy from a physical standby database in Active Data Guard mode and vice versa. We've gone through the evolution of Oracle Data Guard from its beginning until today. As you may notice, Data Guard started its life as a very simple database property revealed to keep a synchronized database copy with a lot of manual work and now it's a complicated tool with advanced automation, precaution, and monitoring features. Now let's move on with the architecture and components of Oracle Data Guard 11 g R2.
Read more
  • 0
  • 0
  • 4153

article-image-installing-coherence-35-and-accessing-data-grid-part-1
Packt
31 Mar 2010
10 min read
Save for later

Installing Coherence 3.5 and Accessing the Data Grid: Part 1

Packt
31 Mar 2010
10 min read
When I first started evaluating Coherence, one of my biggest concerns was how easy it would be to set up and use, especially in a development environment. The whole idea of having to set up a cluster scared me quite a bit, as any other solution I had encountered up to that point that had the word "cluster" in it was extremely difficult and time consuming to configure. My fear was completely unfounded—getting the Coherence cluster up and running is as easy as starting Tomcat. You can start multiple Coherence nodes on a single physical machine, and they will seamlessly form a cluster. Actually, it is easier than starting Tomcat. Installing Coherence In order to install Coherence you need to download the latest release from the Oracle Technology Network (OTN) website. The easiest way to do so is by following the link from the main Coherence page on OTN. At the time of this writing, this page was located at http://www.oracle.com/technology/products/coherence/index.html, but that might change. If it does, you can find its new location by searching for 'Oracle Coherence' using your favorite search engine. In order to download Coherence for evaluation, you will need to have an Oracle Technology Network (OTN) account. If you don't have one, registration is easy and completely free. Once you are logged in, you will be able to access the Coherence download page, where you will find the download links for all available Coherence releases: one for Java, one for .NET, and one for each of the supported C++ platforms. You can download any of the Coherence releases you are interested in while you are there, but for the remainder of this article you will only need the first one. The latter two (.NET and C++) are client libraries that allow .NET and C++ applications to access the Coherence data grid. Coherence ships as a single ZIP archive. Once you unpack it you should see the README.txt file containing the full product name and version number, and a single directory named coherence. Copy the contents of the coherence directory to a location of your choice on your hard drive. The common location on Windows is c:coherence and on Unix/Linux /opt/coherence, but you are free to put it wherever you want. The last thing you need to do is to configure the environment variable COHERENCE_HOME to point to the top-level Coherence directory created in the previous step, and you are done. Coherence is a Java application, so you also need to ensure that you have the Java SDK 1.4.2 or later installed and that JAVA_HOME environment variable is properly set to point to the Java SDK installation directory. If you are using a JVM other than Sun's, you might need to edit the scripts used in the following section. For example, not all JVMs support the -server option that is used while starting the Coherence nodes, so you might need to remove it. What's in the box? The first thing you should do after installing Coherence is become familiar with the structure of the Coherence installation directory. There are four subdirectories within the Coherence home directory: bin: This contains a number of useful batch files for Windows and shell scripts for Unix/Linux that can be used to start Coherence nodes or to perform various network tests doc: This contains the Coherence API documentation, as well as links to online copies of Release Notes, User Guide, and Frequently Asked Questions documents examples: This contains several basic examples of Coherence functionality lib: This contains JAR files that implement Coherence functionality Shell scripts on UnixIf you are on a Unix-based system, you will need to add execute permission to the shell scripts in the bin directory by executing the following command: $ chmod u+x *.sh Starting up the Coherence cluster In order to get the Coherence cluster up and running, you need to start one or more Coherence nodes. The Coherence nodes can run on a single physical machine, or on many physical machines that are on the same network. The latter will definitely be the case for a production deployment, but for development purposes you will likely want to limit the cluster to a single desktop or laptop. The easiest way to start a Coherence node is to run cache-server.cmd batch file on Windows or cache-server.sh shell script on Unix. The end result in either case should be similar to the following screenshot: There is quite a bit of information on this screen, and over time you will become familiar with each section. For now, notice two things: At the very top of the screen, you can see the information about the Coherence version that you are using, as well as the specific edition and the mode that the node is running in. Notice that by default you are using the most powerful, Grid Edition, in development mode. The MasterMemberSet section towards the bottom lists all members of the cluster and provides some useful information about the current and the oldest member of the cluster. Now that we have a single Coherence node running, let's start another one by running the cache-server script in a different terminal window. For the most part, the output should be very similar to the previous screen, but if everything has gone according to the plan, the MasterMemberSet section should reflect the fact that the second node has joined the cluster: MasterMemberSet ( ThisMember=Member(Id=2, ...) OldestMember=Member(Id=1, ...) ActualMemberSet=MemberSet(Size=2, BitSetCount=2 Member(Id=1, ...) Member(Id=2, ...) )RecycleMillis=120000RecycleSet=MemberSet(Size=0, BitSetCount=0)) You should also see several log messages on the first node's console, letting you know that another node has joined the cluster and that some of the distributed cache partitions were transferred to it. If you can see these log messages on the first node, as well as two members within the ActualMemberSet on the second node, congratulations—you have a working Coherence cluster. Troubleshooting cluster start-up In some cases, a Coherence node will not be able to start or to join the cluster. In general, the reason for this could be all kinds of networking-related issues, but in practice a few issues are responsible for the vast majority of problems. Multicast issues By far the most common issue is that multicast is disabled on the machine. By default, Coherence uses multicast for its cluster join protocol, and it will not be able to form the cluster unless it is enabled. You can easily check if multicast is enabled and working properly by running the multicast-test shell script within the bin directory. If you are unable to start the cluster on a single machine, you can execute the following command from your Coherence home directory: $ . bin/multicast-test.sh –ttl 0 This will limit time-to-live of multicast packets to the local machine and allow you to test multicast in isolation. If everything is working properly, you should see a result similar to the following: Starting test on ip=Aleks-Mac-Pro.home/192.168.1.7,group=/237.0.0.1:9000, ttl=0Configuring multicast socket...Starting listener...Fri Aug 07 13:44:44 EDT 2009: Sent packet 1.Fri Aug 07 13:44:44 EDT 2009: Received test packet 1 from selfFri Aug 07 13:44:46 EDT 2009: Sent packet 2.Fri Aug 07 13:44:46 EDT 2009: Received test packet 2 from selfFri Aug 07 13:44:48 EDT 2009: Sent packet 3.Fri Aug 07 13:44:48 EDT 2009: Received test packet 3 from self If the output is different from the above, it is likely that multicast is not working properly or is disabled on your machine. This is frequently the result of a firewall or VPN software running, so the first troubleshooting step would be to disable such software and retry. If you determine that was indeed the cause of the problem you have two options. The first, and obvious one, is to turn the offending software off while using Coherence. However, for various reasons that might not be an acceptable solution, in which case you will need to change the default Coherence behavior, and tell it to use the Well-Known Addresses (WKA) feature instead of multicast for the cluster join protocol. Doing so on a development machine is very simple—all you need to do is add the following argument to the JAVA_OPTS variable within the cache-server shell script: -Dtangosol.coherence.wka=localhost With that in place, you should be able to start Coherence nodes even if multicastis disabled. Localhost and loopback addressOn some systems, localhost maps to a loopback address, 127.0.0.1. If that's the case, you will have to specify the actual IP address or host name for the tangosol.coherence.wka configuration parameter. The host name should be preferred, as the IP address can change as you move from network to network, or if your machine leases an IP address from a DHCP server. As a side note, you can tell whether the WKA or multicast is being used for the cluster join protocol by looking at the section above the MasterMemberSet section when the Coherence node starts. If multicast is used, you will see something similar to the following: Group{Address=224.3.5.1, Port=35461, TTL=4} The actual multicast group address and port depend on the Coherence version being used. As a matter of fact, you can even tell the exact version and the build number from the preceding information. In this particular case, I am using Coherence 3.5.1 release, build 461. This is done in order to prevent accidental joins of cluster members into an existing cluster. For example, you wouldn't want a node in the development environment using newer version of Coherence that you are evaluating to join the existing production cluster, which could easily happen if the multicast group address remained the same. On the other hand, if you are using WKA, you should see output similar to the following instead: WellKnownAddressList(Size=1, WKA{Address=192.168.1.7, Port=8088} ) Using the WKA feature completely disables multicast in a Coherence cluster, and is recommended for most production deployments, primarily due to the fact that many production environments prohibit multicast traffic altogether, and that some network switches do not route multicast traffic properly. That said, configuring WKA for production clusters is out of the scope of this article, and you should refer to Coherence product manuals for details. Binding issues Another issue that sometimes comes up is that one of the ports that Coherence attempts to bind to is already in use and you see a bind exception when attempting to start the node. By default, Coherence starts the first node on port 8088, and increments port number by one for each subsequent node on the same machine. If for some reason that doesn't work for you, you need to identify a range of available ports for as many nodes as you are planning to start (both UDP and TCP ports with the same numbers must be available), and tell Coherence which port to use for the first node by specifying the tangosol.coherence.localport system property. For example, if you want Coherence to use port 9100 for the first node, you will need to add the following argument to the JAVA_OPTS variable in the cache-server shell script: -Dtangosol.coherence.localport=9100
Read more
  • 0
  • 0
  • 4114
article-image-oracle-when-use-log-miner
Packt
07 May 2010
6 min read
Save for later

Oracle: When to use Log Miner

Packt
07 May 2010
6 min read
Log Miner has both a GUI interface in OEM as well as the database package, DBMS_LOGMNR. When this utility is used by the DBA, its primary focus is to mine data from the online and archived redo logs. Internally Oracle uses the Log Miner technology for several other features, such as Flashback Transaction Backout, Streams, and Logical Standby Databases. This section is not on how to run Log Miner, but looks at the task of identifying the information to restore. The Log Miner utility comes into play when you need to retrieve an older version of selected pieces of data without completely recovering the entire database. A complete recovery is usually a drastic measure that means downtime for all users and the possibility of lost transactions. Most often Log Miner is used for recovery purposes when the data consists of just a few tables or a single code change. Make sure supplemental logging is turned on (see the Add Supplemental Logging section). In this case, you discover that one or more of the following conditions apply when trying to recover a small amount of data that was recently changed: Flashback is not enabled Flashback logs that are needed are no longer available Data that is needed is not available in the online redo logs Data that is needed has been overwritten in the undo segments Go to the last place available: archived redo logs. This requires the database to be in archivelog mode and for all archive logs that are needed to still be available or recoverable. Identifying the data needed to restore One of the hardest parts of restoring data is determining what to restore, the basic question being when did the bad data become part of the collective? Think the Borg from Star Trek! When you need to execute Log Miner to retrieve data from a production database, you will need to act fast. The older the transactions the longer it will take to recover and traverse with Log Miner. The newest (committed) transactions are processed first, proceeding backwards. The first question to ask is when do you think the bad event happened? Searching for data can be done in several different ways: SCN, timestamp, or log sequence number> Pseudo column ORA_ROWSCN SCN, timestamp, or log sequence number If you are lucky, the application also writes a timestamp of when the data was last changed. If that is the case, then you determine the archive log to mine by using the following queries. It is important to set the session NLS_DATE_FORMAT so that the time element is displayed along with the date, otherwise you will just get the default date format of DD-MMM-RR. The data format comes from the database startup parameters— the NLS_TERRITORY setting. Find the time when a log was archived and match that to the archive log needed. Pseudo column ORA_ROWSCN While this method seems very elegant, it does not work perfectly, meaning it won't always return the correct answer. As it may not work every time or accurately, it is generally not recommended for Flashback Transaction Queries. It is definitely worth trying to narrow the window that you will have to search. It uses the SCN information that was stored for the associated transaction in the Interested Transaction List. You know that delayed block cleanout is involved. The pseudo column ORA_ROWSCN contains information for the approximate time this table was updated for each row. In the following example the table has three rows, with the last row being the one that was most recently updated. It gives me the time window to search the archive logs with Log Miner. Log Miner is the basic technology behind several of the database Maximum Availability Architecture capabilities—Logical Standby, Streams, and the following Flashback Transaction Backout exercise. Flashback Transaction Query and Backout Flashback technology was first introduced in Oracle9i Database. This feature allows you to view data at different points in time and with more recent timestamps (versions), and thus provides the capability to recover previous versions of data. In this article, we are dealing with Flashback Transaction Query (FTQ) and Flashback Transaction Backout (FTB), because they both deal with transaction IDs and integrate with the Log Miner utility. See the MOS document: "What Do All 10g Flashback Features Rely on and what are their Limitations?" (Doc ID 435998.1). Flashback Transaction Query uses the transaction ID (Xid) that is stored with each row version in a Flashback Versions Query to display every transaction that changed the row. Currently, the only Flashback technology that can be used when the object(s) in question have been changed by DDL is Flashback Data Archive. There are other restrictions to using FTB with certain data types (VARRAYs, BFILES), which match the data type restrictions for Log Miner. This basically means if data types aren't supported, then you can't use Log Miner to find the undo and redo log entries. When would you use FTQ or FTB instead of the previously described methods? The answer is when the data involves several tables with multiple constraints or extensive amounts of information. Similar to Log Miner, the database can be up and running while people are working online in other schemas of the database to accomplish this restore task. An example of using FTB or FTQ would be to reverse a payroll batch job that was run with the wrong parameters. Most often a batch job is a compiled code (like C or Cobol) run against the database, with parameters built in by the application vendor. A wrong parameter could be the wrong payroll period, wrong set of employees, wrong tax calculations, or payroll deductions. Enabling flashback logs First off all flashback needs to be enabled in the database. Oracle Flashback is the database technology intended for a point-in-time recovery (PITR) by saving transactions in flashback logs. A flashback log is a temporary Oracle file and is required to be stored in the FRA, as it cannot be backed up to any other media. Extensive information on all of the ramifications of enabling flashback is found in the documentation labeled: Oracle Database Backup and Recovery User's Guide. See the following section for an example of how to enable flashback: SYS@NEWDB>ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='/backup/flash_recovery_area/NEWDB' SCOPE=BOTH;SYS@NEWDB>ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100M SCOPE=BOTH;--this is sized for a small test databaseSYS@NEWDB> SHUTDOWN IMMEDIATE;SYS@NEWDB> STARTUP MOUNT EXCLUSIVE;SYS@NEWDB> ALTER DATABASE FLASHBACK ON;SYS@NEWDB> ALTER DATABASE OPEN;SYS@NEWDB> SHOW PARAMETER RECOVERY; The following query would then verify that FLASHBACK had been turned on: SYS@NEWDB>SELECT FLASHBACK_ON FROM V$DATABASE;
Read more
  • 0
  • 0
  • 4101

article-image-azure-feature-pack
Packt
05 Jul 2017
9 min read
Save for later

Azure Feature Pack

Packt
05 Jul 2017
9 min read
In this article by Christian Cote, Matija Lah, and Dejan Sarka, the author of the book SQL Server 2016 Integration Services Cookbook, we will see how to install Azure Feature Pack that in turn, will install Azure control flow tasks and data flow components And we will also see how to use the Fuzzy Lookup transformation for identity mapping. (For more resources related to this topic, see here.) In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrator (DBA) to interact with the data: data transformation services (DTS). The tool was very primitive compared to SSIS and it was mostly relying on ActiveX and TSQL to transform the data. SSIS 1.0 appears in 2005. The tool was a game changer in the ETL world at the time. It was a professional and (pretty much) reliable tool for 2005. 2008/2008R2 versions were much the same as 2005 in a sense that they didn't add much functionality but they made the tool more scalable. In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using a SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that give us access to metadata or metaprocess information's in our projects. Version 2014 didn't have anything for SSIS. Version 2016 brings other set of features as you will see. We now also have the possibility to integrate with big data. Business intelligence projects many times reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment, or data profiling, data cleansing, and maintaining high quality over time. In SSIS, the data profiling task helps you with finding unclean data. The Data Profiling task is not like the other tasks in SSIS because it is not intended to be run over and over again through a scheduled operation. Think about SSIS being the wrapper for this tool. You use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profile task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process Azure tasks and transforms Azure ecosystem is becoming predominant in Microsoft ecosystem and SSIS has not been left over in the past few years. The Azure Feature Pack is not a SSIS 2016 specific feature. It's also available for SSIS version 2012 and 2014. It's worth mentioning that it appeared in July 2015, a few months before SSIS 2016 release. Getting ready This section assumes that you have installed SQL Server Data Tools 2015. How to do it... We'll start SQL Server Data Tools, and open the CustomLogging project if not already done: In the SSIS toolbox, scroll to the Azure group. Since the Azure tools are not installed with SSDT, the Azure group is disabled in the toolbox. The tools must be downloaded using a separate installer. Click on Azure group to expand it and click on Download Azure Feature Pack as shown in the following screenshot: Your default browser opens and the Microsoft SQL Server 2016 Integration Services Feature Pack for Azure opens. Click on Download as shown in the following screenshot: From the popup that appears, select both 32-bit and 64-bit version. The 32-bit version is necessary for SSIS package development since SSDT is a 32-bit program. Click Next as shown in the following screenshot: As shown in the following screenshot, the files are downloaded: Once the download completes, run on one the installers downloaded. The following screen appears. In that case, the 32-bit (x86) version is being installed. Click Next to start the installation process: As shown in the following screenshot, check the box near I accept the terms in the License Agreement and click Next. Then the installation starts. The following screen appears once the installation is completed. Click Finish to close the screen: Install the other feature pack you downloaded. If SSDT is opened, close it. Start SSDT again and open the CustomLogging project. In the Azure group in the SSIS toolbox, you should now see the Azure tasks as in the following screenshot: Using SSIS fuzzy components SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for mapping the identities. The Fuzzy Grouping Transformation is used for de-duplicating. Both of them use the same algorithm for comparing the strings and other data. Identity mapping and de-duplication are actually the same problem. For example, instead for mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or vice versa, you can join a table to itself and then do identity mapping instead of de-duplication. Getting ready This recipe assumes that you have successfully finished the previous recipe. How to do it… In SSMS, create a new table in the DQS_STAGING_DATA database in the dbo schema and name it dbo.FuzzyMatchingResults. Use the following code: CREATE TABLE dbo.FuzzyMatchingResults ( CustomerKey INT NOT NULL PRIMARY KEY, FullName NVARCHAR(200) NULL, StreetAddress NVARCHAR(200) NULL, Updated INT NULL, CleanCustomerKey INT NULL ); Switch to SSDT. Continue editing the DataMatching package. Add a Fuzzy Lookup transformation below the NoMatch Multicast transformation. Rename it FuzzyMatches and connect it to the NoMatch Multicast transformation with the regular data flow path. Double-click the transformation to open its editor. On the Reference Table tab, select the connection manager you want to use to connect to your DQS_STAGING_DATA database and select the dbo.CustomersClean table. Do not store a new index or use an existing index. When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer datatype to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns. The transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing. Click the Columns tab. Delete the mapping between the two CustomerKey columns. Clear the check box next to the CleanCustomerKey input column. Select the check box next to the CustomerKey lookup column. Rename the output alias for this column to CleanCustomerKey. You are replacing the original column with the one retrieved during the lookup. Your mappings should resemble those shown in the following screenshot: Click the Advanced tab. Raise the Similarity threshold to 0.50 to reduce the matching search space. With similarity threshold of 0.00, you would get a full cross join. Click OK. Drag the Union All transformation below the Fuzzy Lookup transformation. Connect it to an output of the Match Multicast transformation and an output of the FuzzyMatches Fuzzy Lookup transformation. You will combine the exact and approximate matches in a single row set. Drag an OLE DB Destination below the Union All transformation. Rename it FuzzyMatchingResults and connect it with the Union All transformation. Double-click it to open the editor. Connect to your DQS_STAGING_DATA database and select the dbo.FuzzyMatchingResults table. Click the Mappings tab. Click OK. The completed data flow is shown in the following screenshot: You need to add restartability to your package. You will truncate all destination tables. Click the Control Flow tab. Drag the Execute T-SQL Statement task above the data flow task. Connect the tasks with the green precedence constraint from the Execute T-SQL Statement task to the data flow task. The Execute T-SQL Statement task must finish successfully before the data flow task starts. Double-click the Execute T-SQL Statement task. Use the connection manager to your DQS_STAGING_DATA database. Enter the following code in the T-SQL statement textbox, and then click OK: TRUNCATE TABLE dbo.CustomersDirtyMatch; TRUNCATE TABLE dbo.CustomersDirtyNoMatch; TRUNCATE TABLE dbo.FuzzyMatchingResults; Save the solution. Execute your package in debug mode to test it. Review the results of the Fuzzy Lookup transformation in SSMS. Look for rows for which the transformation did not find a match, and for any incorrect matches. Use the following code: -- Not matched SELECT * FROM FuzzyMatchingResults WHERE CleanCustomerKey IS NULL; -- Incorrect matches SELECT * FROM FuzzyMatchingResults WHERE CleanCustomerKey <> CustomerKey * (-1); You can use the following code to clean up the AdventureWorksDW2014 and DQS_STAGING_DATA databases: USE AdventureWorksDW2014; DROP TABLE IF EXISTS dbo.Chapter05Profiling; DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries; USE DQS_STAGING_DATA; DROP TABLE IF EXISTS dbo.CustomersCh05; DROP TABLE IF EXISTS dbo.CustomersCh05DQS; DROP TABLE IF EXISTS dbo.CustomersClean; DROP TABLE IF EXISTS dbo.CustomersDirty; DROP TABLE IF EXISTS dbo.CustomersDirtyMatch; DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch; DROP TABLE IF EXISTS dbo.CustomersDQSMatch; DROP TABLE IF EXISTS dbo.DQSMatchingResults; DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults; DROP TABLE IF EXISTS dbo.FuzzyMatchingResults; When you are done, close SSMS and SSDT. SQL Server data quality services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data—for example, customer data. The idea of data quality services is to mitigate the cleansing process. While the amount of time you need to spend on cleansing decreases, you will achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on. You don't throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process. Summary We have seen how to install Azure Feature Pack, Azure control flow tasks and data flow components, and Fuzzy Lookup transformation. Resources for Article: Further resources on this subject: Building A Recommendation System with Azure [article] Introduction to Microsoft Azure Cloud Services [article] Windows Azure Service Bus: Key Features [article]
Read more
  • 0
  • 0
  • 4100
Modal Close icon
Modal Close icon