
How-To Tutorials - Data

1210 Articles

Indexing the Data

Packt
18 Apr 2014
10 min read
(For more resources related to this topic, see here.)

Elasticsearch indexing

We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting data there, you need to create a structure that describes what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like.

Shards and replicas

The Elasticsearch index is built of one or more shards, and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values, either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on the Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with 10 Lucene indices distributed among the cluster.

Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term "replica" is somewhat misleading. It means that every shard has its copy, so there are five shards and five copies. Having a shard and its replica generally means that when we index a document, we will modify both of them. That's because, to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows:

- More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance.
- More shards mean that fewer resources are required to fetch a particular document, because fewer documents are stored in a single shard compared to a deployment with fewer shards.
- More shards mean more problems when searching across the index, because we have to merge results from more shards, so the aggregation phase of the query can be more resource intensive.
- More replicas result in a fault-tolerant cluster, because when the original shard is not available, its copy will take over its role. With a single replica, the cluster can lose a shard without data loss. With two replicas, we can lose the primary shard and its single replica and everything will still work well.
- The more replicas there are, the higher the query throughput will be. That's because a query can be executed against either a shard or any of its copies.

Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. So, how many shards and replicas should we have for our indices? That depends.
We believe that the defaults are quite good, but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data.

Creating indices

When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of Elasticsearch released!", "content": "...", "tags": ["announce", "elasticsearch", "release"]}'

This is fine. If such an index does not exist, Elasticsearch automatically creates it for us. We can also create the index ourselves by running the following command:

curl -XPUT http://localhost:9200/blog/

We just told Elasticsearch that we want to create the index with the blog name. If everything goes right, you will see the following response from Elasticsearch:

{"acknowledged":true}

When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings, such as the index structure or the number of shards.

Altering automatic index creation

Sometimes, you can come to the conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line to the elasticsearch.yml configuration file:

action.auto_create_index: false

Note that action.auto_create_index is more complex than it looks. The value can be set not only to false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with names beginning with a, but disallows the creation of indices starting with an. The other indices aren't allowed and must be created manually (because of -*).

action.auto_create_index: -an*,+a*,-*

Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used, because +a* will be checked first.

Settings for a newly created index

The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example:

curl -XPUT http://localhost:9200/blog/ -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 2
    }
}'

The preceding command will result in the creation of the blog index with one shard and two replicas, making a total of three physical Lucene indices. There are also other values that can be set in this way. So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach: we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE.
So the actual command is as follows:

curl -XDELETE http://localhost:9200/blog

And the response will be the same as the one we saw earlier, as follows:

{"acknowledged":true}

Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. This is a very important part, because the way data is indexed will affect the search process and the way in which documents are matched.

Mappings configuration

If you are used to SQL databases, you may know that before you can start inserting data into the database, you need to create a schema that describes what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure.

Type determining mechanism

Before we start describing how to create mappings manually, we wanted to write about one thing. Elasticsearch can guess the document structure by looking at the JSON that defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document:

{
  "field1": 10,
  "field2": "10"
}

The preceding document has two fields. The field1 field will be determined as a number (to be precise, as the long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index:

curl -XPUT http://localhost:9200/blog/?pretty -d '{
  "mappings" : {
    "article": {
      "numeric_detection" : true
    }
  }
}'

Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from text. In such cases, when a change of the source format is impossible, we can only define the field directly in the mappings definition. Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify a formats array. Let's look at the following command for creating the index and type:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]
    }
  }
}'

The preceding command will result in the creation of an index called blog with a single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them.
Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
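For readers who prefer scripting the index creation, here is a minimal sketch (not from the original article) that sends the same kind of request from Python using the third-party requests library; the index name, type name, and date formats are illustrative assumptions, and an Elasticsearch node is assumed to be listening on localhost:9200.

import json
import requests  # third-party HTTP client; assumed to be installed

# Assumption: an Elasticsearch node from the same era as the article (1.x) on localhost:9200.
index_url = "http://localhost:9200/blog"

mappings = {
    "mappings": {
        "article": {
            "numeric_detection": True,
            "dynamic_date_formats": ["yyyy-MM-dd hh:mm", "yyyy-MM-dd"]
        }
    }
}

# Equivalent to the curl -XPUT commands shown above, but sent from Python.
response = requests.put(index_url, data=json.dumps(mappings))
print(response.json())  # expect {"acknowledged": true} on success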


Learning Random Forest Using Mahout

Packt
05 Mar 2015
11 min read
In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, which is one of the most popular techniques in classification. It builds on a machine learning technique called a decision tree. In this article, we will explore the following topics:

- Decision tree
- Random forest
- Using Mahout for Random forest

(For more resources related to this topic, see here.)

Decision tree

A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on branches. As in other classification model creation processes, we start with a training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process is continued until no more partitioning can be done. The core of the algorithm is to find the rule that determines the initial split. There are algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation of ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.

To choose the best splitter at a node from the explanatory variables, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in the diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued at the next node until we reach a node where we can make the decision.

Because we create a decision tree from a training dataset, it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove the branches and leaves of the tree to improve the performance. Algorithms used to build the tree work best at the starting or root node, since all the information is available there. Later on, with each split, there is less data, and towards the end of the tree a particular node can show patterns that are specific to the subset of data used to split it. These patterns create problems when we use them to predict on the real dataset. Pruning methods let the tree grow and then remove the smaller branches that fail to generalize.

Now let's take an example to understand the decision tree. Consider that we have an iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs. As you can see in the preceding diagram, all the groups were at first considered to be of the Setosa species, and then the explanatory variable petal length was used to divide the groups further. At each step, the number of misclassified items was also calculated, which shows how many items were wrongly classified.
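To make the split-selection rule concrete, here is a minimal Python sketch (not from the article, which uses Mahout) that scores a candidate binary split by its decrease in Gini impurity, one common measure of the "diversity" mentioned above; the tiny label lists are made-up examples.

def gini(labels):
    # Gini impurity of a list of class labels: 1 - sum of squared class proportions.
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_decrease(parent, left, right):
    # How much a split reduces impurity, weighted by partition sizes.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Made-up labels: the candidate split below separates the classes fairly well.
parent = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
left = ["setosa", "setosa", "versicolor"]
right = ["versicolor", "virginica", "virginica"]
print(gini_decrease(parent, left, right))  # the best splitter maximizes this value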
Moreover, the petal width variable was then taken into account. Usually, items at leaf nodes are correctly classified.

Random forest

The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. Single decision trees show the bias-variance tradeoff, so they usually have either high variance or high bias. The two sources of error are:

- Bias: This is an error caused by an erroneous assumption in the learning algorithm
- Variance: This is an error caused by sensitivity to small fluctuations in the training set

Random forests attempt to mitigate this problem by averaging to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of the Random forest algorithm, let's work with an example. Consider that we have a training dataset that has lots of features (explanatory variables) and target variables or classes. We create sample sets from the given dataset, and a different set of random features is taken into account to create each random sub-dataset. From these sub-datasets, different decision trees are created, so we have actually created a forest of different decision trees. Using these different trees, we will create a ranking system for all the classifiers. To predict the class of a new, unknown item, we will use all the decision trees and separately find out which class each tree predicts. See the following diagram for a better understanding of this concept: Different decision trees to predict the class of an unknown item.

In this particular case, we have four different decision trees. We predict the class of an unknown dataset with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, the Random forest will vote for each class. So we have one vote each for class 2 and class 3 and two votes for class 5. Therefore, it has decided that for the new unknown dataset, the predicted class is class 5. The class that gets the most votes is chosen for the new dataset. A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:

- The combination of learning models increases the accuracy of the classification
- It runs effectively on large datasets as well
- The generated forest can be saved and used for other datasets as well
- It can handle a large number of explanatory variables

Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout.

Using Mahout for Random forest

Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use. So let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program
(R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation," DISCEX, vol. 02, p. 1012, 2000). DARPA'98 is about 4 GB of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/. We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets. In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.

Steps to use the Random forest algorithm in Mahout

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

Transfer the test and training datasets to HDFS using the following commands:

hadoop fs -mkdir /user/hue/KDDTrain
hadoop fs -mkdir /user/hue/KDDTest
hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+.arff, a descriptor file is required. This is because all information in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:

hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz.job.jar org.apache.mahout.classifier.df.tools.Describe -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff -f /user/hue/KDDTrain/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Jar: Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here and it takes three parameters:

- -p: the path of the data to be described
- -f: the location for the generated descriptor file
- -d: the information about the attributes of the data; N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes (C), and so on; the final L defines the label

The output of the previous command is shown in the following screenshot:

Build the Random forest using the following command:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff -ds /user/hue/KDDTrain/KDDTrain+.info -sl 5 -p -t 100 -o /user/hue/nsl-forest

Jar: Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest with the following arguments:

- -Dmapred.max.split.size indicates to Hadoop the maximum size of each partition
- -d stands for the data path
- -ds stands for the location of the descriptor file
- -sl specifies the number of variables selected randomly at each tree node; here, each tree is built using five randomly selected attributes per node
- -p uses the partial data implementation
- -t stands for the number of trees to grow; here, the command builds 100 trees using the partial implementation
- -o stands for the output path that will contain the decision forest

In the end, the process will show the following result:

Use this model to classify the new dataset:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i /user/hue/KDDTest/KDDTest+.arff -ds /user/hue/KDDTrain/KDDTrain+.info -m /user/hue/nsl-forest -a -mr -o /user/hue/predictions

Jar: Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest takes the following parameters:

- -i indicates the path of the test data
- -ds stands for the location of the descriptor file
- -m stands for the location of the forest generated by the previous command
- -a tells it to run the analyzer to compute the confusion matrix
- -mr tells Hadoop to distribute the classification
- -o stands for the location to store the predictions in

The job provides the following confusion matrix. From the confusion matrix, it is clear that 9,396 instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. The accuracy percentage is 77.7635 (correctly classified instances divided by all classified instances). The output file in the predictions folder contains a list of 0s and 1s, where 0 denotes the normal class and 1 denotes an anomaly.

Summary

In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest, used it with the test dataset, and generated the confusion matrix and other statistics for the output.

Resources for Article:

Further resources on this subject:
- Implementing the Naïve Bayes classifier in Mahout [article]
- About Cassandra [article]
- Tuning Solr JVM and Container [article]
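Returning to the voting step described earlier, the following minimal Python sketch (a conceptual illustration, not Mahout's actual implementation) shows how a forest combines the predictions of individual trees by majority vote; the example votes mirror the four-tree example in the article.

from collections import Counter

def forest_predict(tree_votes):
    # Majority vote: the class predicted by the most trees wins.
    counts = Counter(tree_votes)
    winner, _ = counts.most_common(1)[0]
    return winner

# The four trees from the article's example predict these classes.
votes = ["class 2", "class 5", "class 5", "class 3"]
print(forest_predict(votes))  # prints "class 5", matching the article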


Practical Applications of Deep Learning

Packt
14 Jan 2016
20 min read
In this article by Yusuke Sugomori, the author of the book Deep Learning with Java, we'll first see how deep learning is actually applied. Here, you will see that the actual cases where deep learning is utilized are still very few. But why aren't there many cases even though it is such an innovative method? What is the problem? Later on, we'll think about the reasons. Furthermore, going forward, we will also consider which fields we can apply deep learning to, and we will have the chance to apply deep learning and all the related areas of artificial intelligence. The topics covered in this article include:

- The difficulties of turning deep learning models into practical applications
- The possible fields where deep learning can be applied, and ideas on how to approach these fields

We'll explore the potential of this big AI boom, which will lead to ideas and hints that you can utilize in deep learning for your research, business, and many sorts of activities.

(For more resources related to this topic, see here.)

The difficulties of deep learning

Deep learning has already achieved higher precision than humans in the image recognition field and has been applied to quite a lot of practical applications. Similarly, in the NLP field, many models have been researched. Then, how much is deep learning utilized in other fields? Surprisingly, there are still few fields where deep learning is successfully utilized. This is because deep learning is indeed innovative compared to past algorithms and definitely lets us take a big step towards materializing AI; however, it has some problems when used for practical applications.

The first problem is that there are too many model parameters in deep learning algorithms. We didn't look into this in detail when you learned about the theory and implementation of the algorithms, but deep neural networks actually have many hyperparameters that need to be decided, compared to past neural networks or other machine learning algorithms. This means we have to go through more trial and error to get high precision. Combinations of parameters that define the structure of a neural network, such as how many hidden layers are to be set or how many units each hidden layer should have, need lots of experiments. Also, the parameters for training and test configurations, such as the learning rate, need to be determined. Furthermore, parameters peculiar to each algorithm, such as the corruption level in SDA and the size of kernels in CNN, need additional trial and error. Thus, the great performance that deep learning provides is supported by steady parameter tuning. However, people only look at one side of deep learning, that it can get great precision, and they tend to forget the hard process required to reach that point. Deep learning is not magic.

In addition, deep learning often fails to train and classify data from simple problems. The shape of deep neural networks is so deep and complicated that the weights can't be well optimized. In terms of optimization, data quantities are also important. This means that deep neural networks require a significant amount of time for each training run.
To sum up, deep learning shows its worth when:

- It solves complicated and hard problems where people have no idea what features the data could be classified by
- There is sufficient training data to properly optimize deep neural networks

Compared to applications that constantly update a model using continuously updated data, applications where a model is built once from a large-scale dataset that doesn't change drastically and is then used universally are rather well suited to deep learning. Therefore, when you look at business fields, you can say that there are more cases where existing machine learning gets better results than deep learning. For example, let's assume we would like to recommend appropriate products to users on an e-commerce (EC) site. On this site, many users buy a lot of products daily, so purchase data is largely updated daily. In this case, do you use deep learning to get high-precision classification and recommendations to increase the conversion rates of users' purchases using this data? Probably not, because using existing machine learning algorithms such as Naive Bayes, collaborative filtering, SVM, and so on, we can get sufficient precision from a practical perspective and can update the model and calculate more quickly, which is usually more appreciated. This is why deep learning is not applied much in business fields. Of course, getting higher precision is better in any field, but in reality, higher precision and the necessary calculation time are in a trade-off relationship.

Although deep learning is significant in the research field, it has many hurdles yet to clear for practical applications. Besides, deep learning algorithms are not perfect, and they still need many improvements to the models themselves. For example, RNN, as mentioned earlier, can only satisfy either how past information can be reflected in a network or how precision can be obtained, although it is helped by techniques such as LSTM. Also, deep learning is still far from true AI, although it's definitely a great technique compared to past algorithms. Research on algorithms is progressing actively, but in the meantime, we need one more breakthrough to spread out and infiltrate deep learning into broader society. Maybe this is not just a problem of the model. Deep learning is suddenly booming because it is reinforced by huge developments in hardware and software. Deep learning is closely related to the development of the surrounding technology.

As mentioned earlier, there are still many hurdles to clear before deep learning can be applied more practically in the real world, but this is not impossible to achieve. It isn't possible to suddenly invent AI that achieves technological singularity, but there are some fields and methods where deep learning can be applied right away. In the next section, we'll think about what kinds of industries deep learning can be utilized in. Hopefully, it will sow the seeds for new ideas in your business or research fields.

The approaches to maximize deep learning possibilities and abilities

There are several approaches to how we can apply deep learning to various industries.
While it is true that the approach may differ depending on the task or purpose, we can briefly categorize the approaches into the following three:

- Field-oriented approach: This utilizes deep learning algorithms or models that are already thoroughly researched and can lead to great performance
- Breakdown-oriented approach: This recasts problems to which deep learning seems hard to apply as different problems to which deep learning can be applied well
- Output-oriented approach: This explores new ways of expressing the output of deep learning

These approaches are all explained in detail in the following subsections. Whether or not each approach fits your own industry, any of them could be a big hint for your activities going forward. There are still very few use cases of deep learning, and their fields of use are biased, but this means there should be many chances to create innovative and new things. Start-ups that utilize deep learning have been emerging recently, and some of them have already achieved success to some extent. You can have a significant impact on the world depending on your ideas.

Field-oriented approach

This approach doesn't require new techniques or algorithms. There are obviously fields that are well suited to current deep learning techniques, and the concept here is to dive into those fields. As explained previously, since the deep learning algorithms that have been practically studied and developed are mostly in image recognition and NLP, we'll explore some fields that can work in great harmony with them.

Medicine

Medical fields should be developed by deep learning. Tumors and cancers are detected on scanned images, which is nothing other than being able to utilize one of the strongest features of deep learning: image recognition. It is possible to dramatically increase precision using deep learning to help with the early detection of an illness and with identifying the kind of illness. Since CNN can be applied to 3D images, 3D scanned images should be analyzable relatively easily. By being adopted more in the current medical field, deep learning should contribute greatly. We can also say that deep learning can be significantly useful for the future medical field. The medical field has been under strict regulations; however, a movement to ease the regulations is progressing in some countries, probably because of the recent development of IT and its potential. Therefore, there will be business opportunities for the medical field and IT to create a synergy effect. For example, if telemedicine becomes more widespread, there is the possibility that diagnosing or identifying a disease can be done not only from a scanned image, but also from an image shown in real time on a display. Also, if electronic charts become widespread, it will be easier to analyze medical data using deep learning. This is because medical records are compatible with deep learning, as they are a dataset of texts and images. Then, the symptoms of unknown diseases could be found.

Automobiles

We can say that the surroundings of running cars are image sequences and text. Other cars and views are images, and a road sign carries text. This means we can also utilize deep learning techniques here, and it is possible to reduce the risk of accidents by improving driving assistance functions. It can be said that the ultimate type of driving assistance is the self-driving car, which is being tackled mainly by Google and Tesla.
An example that is both famous and fascinating was when George Hotz, the first person to hack the iPhone, built a self-driving car in his garage. The car was introduced in an article by Bloomberg Business (http://www.bloomberg.com/features/2015-george-hotz-self-driving-car/), which included an image of it. Self-driving cars have already been tested in the U.S., but since other countries have different traffic rules and road conditions, this idea requires further study and development before self-driving cars are commonly used worldwide. The key to success in this field is learning and recognizing surrounding cars, people, views, and traffic signs, and properly judging how to react to them. In the meantime, we don't have to focus only on utilizing deep learning techniques for the actual body of a car. Let's assume we could develop a smartphone app that has the same function as we just described, that is, recognizing and classifying surrounding images and text. Then, if you just set up the smartphone in your car, you could utilize it as a car-navigation app. In addition, it could be used, for example, as a navigation app for blind people, providing them with reliable directions.

Advert technologies

Advert (ad) technologies could expand their coverage with deep learning. When we say ad technologies, this currently means recommendation or ad networks that optimize which ad banners or products are shown. On the other hand, when we say advertising, this doesn't only mean banners or ad networks. There are various kinds of ads in the world depending on the type of media, such as TV ads, radio ads, newspaper ads, posters, flyers, and so on. We also have digital ad campaigns with YouTube, Vine, Facebook, Twitter, Snapchat, and so on. Advertising itself has changed its definition and content, but all ads have one thing in common: they consist of images and/or language. This means they are fields that deep learning is good at. Until now, we could only use user-behavior-based indicators, such as page views (PV), click-through rate (CTR), and conversion rate (CVR), to estimate the effect of an ad, but if we apply deep learning technologies, we might be able to analyze the actual content of an ad and even auto-generate ads going forward. Especially since movies and videos can only be analyzed as a combination of image recognition and NLP, video recognition, not just image recognition, will gather momentum alongside ad technologies.

Profession or practice

Professions such as doctor, lawyer, patent attorney, and accountant are considered to be roles that deep learning can replace. For example, if NLP's precision and accuracy get higher, any perusal that requires expertise can be left to a machine. As a machine can cover these time-consuming reading tasks, people can focus more on high-value tasks. In addition, if a machine classifies past judicial cases or medical cases, such as what disease caused what symptoms, we would be able to build an app like Apple's Siri that answers simple questions that usually require professional knowledge. Then, the machine could handle these professional cases to some extent if a doctor or a lawyer is too busy to help in a timely manner. It's often said that AI takes away humans' jobs, but personally, this seems incorrect. Rather, a machine takes away menial work, which should support humans. A software engineer who works on AI programming can be described as having a professional job, but this work will also change in the future.
For example, think about a car-related job, where the current work is building standard automobiles, but in the future engineers will be in a position just like pit crews for Formula 1 cars.

Sports

Deep learning can certainly contribute to sports as well. In the study field known as sports science, it has become increasingly important to analyze and examine data from sports. As an example, you may know the book or movie Moneyball, in which the win percentage of a baseball team was hugely increased by adopting a regression model. Watching sports itself is very exciting, but on the other hand, sport can be seen as a chunk of image sequences and numerical data. Since deep learning is good at identifying features that humans can't find, it will become easier to find out why certain players get good scores while others don't.

These fields we have mentioned are only a small part of the many fields where deep learning is capable of contributing significantly to development. We have looked into these fields from the perspective of whether a field has images or text, but of course deep learning should also show great performance for simple analysis of general numerical data. It should be possible to apply deep learning to various other fields such as bioinformatics, finance, agriculture, chemistry, astronomy, economics, and more.

Breakdown-oriented approach

This approach might be similar to the approach taken with traditional machine learning algorithms. We already talked about how feature engineering is the key to improving precision in machine learning. Now we can say that this feature engineering can be divided into the following two parts:

- Engineering under the constraints of a machine learning model. The typical case is making inputs discrete or continuous.
- Feature engineering to increase precision with machine learning. This tends to rely on the intuition of a researcher.

In the narrower sense, feature engineering refers to the second one, and this is the part that deep learning doesn't have to focus on, whereas the first one is definitely an important part even for deep learning. For example, it's difficult to predict stock prices using deep learning. Stock prices are volatile and it's difficult to define the inputs. Besides, how to represent the output value is also a difficult problem. Enabling deep learning to handle these inputs and outputs is also said to be feature engineering in the wider sense. If there is no limitation on the values of the original data and/or the data you would like to predict, it's difficult to feed these datasets into machine learning and deep learning algorithms, including neural networks. However, we can take a certain approach and apply a model to these problems by breaking down the inputs and/or outputs. In terms of NLP, as explained earlier, you might have thought, for example, that it would be impossible to put countless words into features in the first place, but as you already know, we can train feed-forward neural networks with words by representing them as sparse vectors and combining N-grams into them. Of course, we can use not only neural networks but also other machine learning algorithms such as SVM here. Thus, we can cultivate new fields where deep learning hasn't been applied by engineering features to fit well into deep learning models. In the meantime, when we focus on NLP, we can see that RNN and LSTM were developed to properly resolve the difficulties or tasks encountered in NLP.
This can be considered the opposite approach to feature engineering, because in this case the problem is solved by breaking down the model to fit the features. Then, how do we do this engineering for stock prediction, as we just mentioned? It's actually not difficult to think of inputs, that is, features. For example, if you predict stock prices daily, it's hard to work with raw daily stock prices as features, but if you use the rate of price change between a day and the day before, it becomes much easier to process, as the value stays within a certain range and the gradients won't explode easily. Meanwhile, what is difficult is how to deal with the outputs. Stock prices are, of course, continuous values, so the outputs can take countless values. This means that a neural network model, where the number of units in the output layer is fixed, can't handle this problem. What should we do here? Should we give up? No, wait a minute. Unfortunately, we can't predict a stock price itself, but there is an alternative prediction method. Here, the problem is that the stock prices to be predicted fall into infinite patterns. Then, can we make them into a limited number of patterns? Yes, we can. Let's forcibly make them so.

Think about the most extreme but easy to understand case: predicting whether tomorrow's stock price (strictly speaking, the close price) goes up or down, using the data on the stock price up to today. For this case, we can model it with a deep learning model whose inputs are daily price values such as the open price, the close price, and the high price. The features used here are mere examples and need to be fine-tuned when applied to real applications. The point here is that replacing the original task with this type of problem enables deep neural networks to classify the data. Furthermore, if you classify the data by how much it will go up or down, you can make more detailed predictions. For example, you could classify the data as shown in the following table:

- Class 1: Up more than 3 percent from the closing price
- Class 2: Up 1 to 3 percent from the closing price
- Class 3: Up 0 to 1 percent from the closing price
- Class 4: Down 0 to 1 percent from the closing price
- Class 5: Down 1 to 3 percent from the closing price
- Class 6: Down more than 3 percent from the closing price

Whether the prediction actually works, in other words whether the classification works, is unknown until we examine it, but the fluctuation of stock prices can be predicted within a quite narrow range by dividing the outputs into multiple classes. Once we can adapt the task to neural networks, what we should do is just examine which model gets better results. In this example, we may apply RNN, because the stock price is time-sequential data. If we look at charts showing the price as image data, we can also use CNN to predict the future price. So now we've thought about the approach by referring to examples, but to sum up in general, we can say that:

- Feature engineering for models: This is designing inputs or adjusting values to fit deep learning models, or enabling classification by setting a limitation on the outputs
- Model engineering for features: This is devising new neural network models or algorithms to solve problems in a focused field

The first one needs ideas for designing inputs and outputs to fit a model, whereas the second one needs a mathematical approach.
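The following short Python sketch (not from the book) illustrates the first approach, bucketing a daily percentage change of the closing price into the six classes in the preceding table; the thresholds follow that table and the sample values are made up.

def change_to_class(pct_change):
    # Map a daily close-to-close change (in percent) to one of six classes,
    # following the table above.
    if pct_change > 3:
        return 1
    elif pct_change > 1:
        return 2
    elif pct_change >= 0:
        return 3
    elif pct_change >= -1:
        return 4
    elif pct_change >= -3:
        return 5
    else:
        return 6

# Made-up daily percentage changes for illustration.
for change in [4.2, 0.5, -0.3, -2.7]:
    print(change, "-> Class", change_to_class(change))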
Feature engineering for models might be easier to start with if you consciously limit what the model is asked to predict.

Output-oriented approach

The two approaches mentioned previously aim to increase the percentage of correct answers for a certain field's task or problem using deep learning. Of course, that is essential and is where deep learning proves its worth; however, pushing precision to the ultimate level may not be the only way of utilizing deep learning. Another approach is to devise the outputs of deep learning by slightly changing the point of view. Let's see what this means. Deep learning is applauded as an innovative approach among researchers and technical experts in AI, but the world in general doesn't know much about its greatness yet. Rather, people pay attention to what a machine can't do. For example, people don't really focus on the image recognition capabilities of CNN on MNIST, which achieve a lower error rate than humans; instead, they criticize that a machine can't recognize images perfectly. This is probably because people expect a lot when they hear about and imagine AI. We might need to change this mindset. Let's consider DORAEMON, a Japanese national cartoon character who is also famous worldwide: a robot who has high intelligence and AI, but often makes silly mistakes. Do we criticize him? No, we just laugh it off or take it as a joke and don't get serious. Also, think about DUMMY / DUM-E, the robot arm in the movie Iron Man. It has AI as well, but makes silly mistakes. See, they make mistakes but we still like them. In this way, it might be better to emphasize the point that machines make mistakes. Changing how the output is expressed in the user interface, rather than just studying the algorithm, could be the trigger for people to adopt AI. Who knows? It's highly likely that you can gain the world's interest by thinking of ideas in creative fields, not just from the perspective of precision. Deep Dream by Google is one good example. We can do more exciting things when art or design and deep learning collaborate.

Summary

And... congratulations! You've just accomplished the learning part of deep learning with Java. Although there are still some models that have not been mentioned, you can be sure there will be no problem in acquiring and utilizing them.

Resources for Article:

Further resources on this subject:
- Setup Routine for an Enterprise Spring Application [article]
- Support for Developers of Spring Web Flow 2 [article]
- Using Spring JMX within Java Applications [article]


Introduction to Scikit-Learn

Packt
16 Feb 2016
5 min read
Introduction

Since its release in 2007, scikit-learn has become one of the most popular open source machine learning libraries for Python. scikit-learn provides algorithms for machine learning tasks including classification, regression, dimensionality reduction, and clustering. It also provides modules for extracting features, processing data, and evaluating models.

(For more resources related to this topic, see here.)

Conceived as an extension to the SciPy library, scikit-learn is built on the popular Python libraries NumPy and matplotlib. NumPy extends Python to support efficient operations on large arrays and multidimensional matrices. matplotlib provides visualization tools, and SciPy provides modules for scientific computing. scikit-learn is popular for academic research because it has a well-documented, easy-to-use, and versatile API. Developers can use scikit-learn to experiment with different algorithms by changing only a few lines of code. scikit-learn wraps some popular implementations of machine learning algorithms, such as LIBSVM and LIBLINEAR. Other Python libraries, including NLTK, include wrappers for scikit-learn. scikit-learn also includes a variety of datasets, allowing developers to focus on algorithms rather than obtaining and cleaning data. Licensed under the permissive BSD license, scikit-learn can be used in commercial applications without restrictions. Many of scikit-learn's algorithms are fast and scalable to all but massive datasets. Finally, scikit-learn is noted for its reliability; much of the library is covered by automated tests.

Installing scikit-learn

This book is written for version 0.15.1 of scikit-learn; use this version to ensure that the examples run correctly. If you have previously installed scikit-learn, you can retrieve the version number with the following code:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

If you have not previously installed scikit-learn, you can install it from a package manager or build it from the source. We will review the installation processes for Linux, OS X, and Windows in the following sections, but refer to http://scikit-learn.org/stable/install.html for the latest instructions. The following instructions only assume that you have installed Python 2.6, Python 2.7, or Python 3.2 or newer. Go to http://www.python.org/download/ for instructions on how to install Python.

Installing scikit-learn on Windows

scikit-learn requires Setuptools, a third-party package that supports packaging and installing software for Python. Setuptools can be installed on Windows by running the bootstrap script at https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py. Windows binaries for the 32- and 64-bit versions of scikit-learn are also available. If you cannot determine which version you need, install the 32-bit version. Both versions depend on NumPy 1.3 or newer. The 32-bit version of NumPy can be downloaded from http://sourceforge.net/projects/numpy/files/NumPy/. The 64-bit version can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. A Windows installer for the 32-bit version of scikit-learn can be downloaded from http://sourceforge.net/projects/scikit-learn/files/. An installer for the 64-bit version of scikit-learn can be downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn. scikit-learn can also be built from the source code on Windows. Building requires a C/C++ compiler such as MinGW (http://www.mingw.org/), NumPy, SciPy, and Setuptools.
To build, clone the Git repository from https://github.com/scikit-learn/scikit-learn and execute the following command:

python setup.py install

Installing scikit-learn on Linux

There are several options to install scikit-learn on Linux, depending on your distribution. The preferred option is to use pip. You may also install it using a package manager, or build scikit-learn from its source. To install scikit-learn using pip, execute the following command:

sudo pip install scikit-learn

To build scikit-learn, clone the Git repository from https://github.com/scikit-learn/scikit-learn. Then install the following dependencies:

sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++

Navigate to the repository's directory and execute the following command:

python setup.py install

Installing scikit-learn on OS X

scikit-learn can be installed on OS X using MacPorts:

sudo port install py26-sklearn

If Python 2.7 is installed, run the following command:

sudo port install py27-sklearn

scikit-learn can also be installed using pip with the following command:

pip install scikit-learn

Verifying the installation

To verify that scikit-learn has been installed correctly, open a Python console and execute the following:

>>> import sklearn
>>> sklearn.__version__
'0.15.1'

To run scikit-learn's unit tests, first install the nose library. Then execute the following:

nosetests sklearn --exe

Congratulations! You've successfully installed scikit-learn.

Summary

In this article, we had a brief introduction to scikit-learn. We also covered the installation of scikit-learn on various operating systems: Windows, Linux, and OS X. You can also refer to the following books on similar topics:

- scikit-learn Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/scikit-learn-cookbook)
- Learning scikit-learn: Machine Learning in Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-scikit-learn-machine-learning-python)

Resources for Article:

Further resources on this subject:
- Our First Machine Learning Method – Linear Classification [article]
- Specialized Machine Learning Topics [article]
- Machine Learning [article]
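Beyond checking the version string, a minimal end-to-end smoke test (not part of the original article) is to fit a classifier on one of the bundled datasets; the snippet below uses only long-stable scikit-learn APIs, and the choice of a k-nearest neighbors classifier is just an illustrative assumption.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()

# Fit a simple k-nearest neighbors classifier on all of the data.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.data, iris.target)

# Score on the training data; this only confirms the install works,
# it is not a proper evaluation of generalization.
print(clf.score(iris.data, iris.target))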


Getting Started with Oracle Data Guard

Packt
02 Jul 2013
13 min read
(For more resources related to this topic, see here.)

What is Data Guard?

Data Guard, which was introduced as the standby database feature in Oracle database Version 7.3 and has carried the name Data Guard since Version 9i, is a data protection and availability solution for Oracle databases. The basic function of Oracle Data Guard is to keep a synchronized copy of a database as a standby, in order to provide for cases in which the primary database becomes inaccessible to end users, such as hardware errors, natural disasters, and so on. Each new Oracle release added new functionality to Data Guard, and the product became more and more popular with offerings such as data protection, high availability, and disaster recovery for Oracle databases.

Using Oracle Data Guard, it's possible to direct user connections to a Data Guard standby database automatically, with no data loss, in case of an outage in the primary database. Data Guard also lets you take advantage of the standby database for reporting, testing, and backup offloading. Corruptions on the primary database may be fixed automatically by using the non-corrupted data blocks on the standby database. There will be minimal outages (seconds to minutes) on the primary database during planned maintenance, such as patching and hardware changes, by using the switchover feature of Data Guard, which changes the roles of the primary and standby databases. All of these features are available with Data Guard, which doesn't require a separate installation but rather a cloning and configuration of the Oracle database.

A Data Guard configuration consists of two main components: the primary database and the standby database. The primary database is the database for which we want to take precautions against inaccessibility. Fundamentally, changes to the data of the primary database are passed to the standby database, and these changes are applied there in order to keep it synchronized. The following figure shows the general structure of Data Guard. Let's look at the standby database and its properties more closely.

Standby database

It is possible to configure a standby database simply by copying, cloning, or restoring a primary database to a different server. Then the Data Guard configuration is made on the databases in order to start the transfer of redo information from primary to standby, and also to start the apply process on the standby database. Primary and standby databases may exist on the same server; however, this kind of configuration should only be used for testing. In a production environment, the primary and standby database servers are generally preferred to be in separate data centers.

Data Guard keeps the primary and standby databases synchronized by using redo information. As you may know, transactions on an Oracle database produce redo records. This redo information keeps all of the changes made to the database. The Oracle database first creates redo information in memory (in the redo log buffers). It is then written into online redo logfiles, and when an online redo logfile is full, its content is written into an archived redo log. An Oracle database can run in the ARCHIVELOG mode or the NOARCHIVELOG mode. In the ARCHIVELOG mode, online redo logfiles are written into archived redo logs, and in the NOARCHIVELOG mode, redo logfiles are overwritten without being archived as they become full. In a Data Guard environment, the primary database must be in the ARCHIVELOG mode.
In Data Guard, the transfer of changed data from the primary to the standby database is achieved with redo, with no alternative. However, the process of applying the redo content to the standby database may vary. The different apply methods give rise to different types of standby databases. There were two kinds of standby databases before Oracle Database Version 11g: the physical standby database and the logical standby database. With Version 11g, we should mention a third type of standby database, which is the snapshot standby. Let's look at the properties of these standby database types.

Physical standby database

The physical standby database is a block-based copy of the primary database. In a physical standby environment, in addition to containing the same database objects and the same data, the primary and standby databases are identical on a block-for-block basis. Physical standby databases use the Redo Apply method to apply changes. Redo Apply uses the managed recovery process (MRP) to manage the application of the change information in the redo. In Version 11g, a physical standby database can be accessible in read-only mode while Redo Apply is working, which is called Active Data Guard. Using the Active Data Guard feature, we can offload reporting jobs from the primary to the physical standby database. The physical standby database is the only option that has no limitation on storage vendor or data types for keeping a synchronized copy of the primary database.

Logical standby database

The logical standby database is a feature introduced in Version 9i R2. In this configuration, redo data is first converted into SQL statements and then applied to the standby database. This process is called SQL Apply. This method makes it possible to access the standby database permanently and allows read/write while the replication of data is active. Thus, you're also able to create database objects on the standby database that don't exist on the primary database. So a logical standby database can be used for many other purposes along with high availability and disaster recovery. Due to the basics of SQL Apply, a logical standby database will contain the same data as the primary database, but in a different structure on the disks. One discouraging aspect of the logical standby database is the unsupported data types, objects, and DDLs. The following data types are not supported for replication in a logical standby environment:
- BFILE
- Collections (including VARRAYs and nested tables)
- Multimedia data types (including Spatial, Image, and Oracle Text)
- ROWID and UROWID
- User-defined types

The logical standby database doesn't guarantee that it contains all primary data, because of the unsupported data types, objects, and DDLs. Also, SQL Apply consumes more hardware resources. Therefore, it certainly brings more performance issues and administrative complexity than Redo Apply.

Snapshot standby database

In principle, a snapshot standby database is a special state of a physical standby database. Snapshot standby is a feature that is available with Oracle Database Version 11g. When you convert a physical standby database into a snapshot standby database, it becomes accessible for read/write. You can run tests on this database and change the data. When you're finished with the snapshot standby database, it's possible to reverse all the changes made to the database and turn it back into a physical standby again. An important point here is that a snapshot standby database can't run Redo Apply. Redo transfer continues, but the standby is not able to apply the redo.
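To make the distinction concrete, the conversion between the two roles boils down to a pair of SQL statements run on the standby. The following is a minimal sketch, not a complete procedure; the exact sequence of stopping recovery and restarting the instance in MOUNT mode depends on your configuration:

-- on the physical standby, with Redo Apply stopped
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
-- ... run read/write tests on the snapshot standby ...
-- to revert, restart the database in MOUNT mode, then:
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
-- verify the current role
SELECT DATABASE_ROLE FROM V$DATABASE;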
Oracle Data Guard evolution

The Oracle Data Guard technology has been part of database administrators' lives for a long time, and it has evolved considerably from its beginnings up to 11g R2. Let's look at this evolution closely through the different database versions.

Version 7.3 – stone age

The functionality of keeping a duplicate database on a separate server, which can be synchronized with the primary database, came with Oracle database Version 7.3 under the name of standby database. This standby database was constantly in recovery mode, waiting for the archived redo logs to be synchronized. However, this feature was not able to automate the transfer of archived redo logs. Database administrators had to find a way to transfer archived redo logs and apply them to the standby server continuously. This was generally accomplished by a script running in the background. The only aim of the Version 7.3 standby database was disaster recovery. It was not possible to query the standby database or to open it for any purpose other than activating it in the event of failure of the primary database. Once the standby database was activated, it couldn't be returned to the standby recovery mode again.

Version 8i – first age

Oracle database Version 8i brought the much-awaited features to the standby database and made the archived log shipping and apply processes automatic, which are now called the managed standby environment and managed recovery, respectively. However, some users chose to apply the archived logs manually, because it was not possible to set a delay in the managed recovery mode. This mode carried the risk that accidental operations would be reflected on the standby database quickly. Along with the "managed" modes, 8i made it possible to open a standby database with the read-only option and allowed it to be used as a reporting database. Even though there were new features that made the tool more manageable and practical, there were still serious deficiencies. For example, when we added a datafile or created a tablespace on the primary database, these changes were not replicated to the standby database. Database administrators had to take care of this maintenance on the standby database. Also, when we opened the primary database with resetlogs or restored a backup control file, we had to re-create the standby database.

Version 9i – middle age

First of all, with this version the Oracle8i standby database was renamed to Oracle9i Data Guard. 9i Data Guard includes very important new features, which make the product much more reliable and functional. The following features were included:
- The Oracle Data Guard Broker management framework, which is used to centralize and automate the configuration, monitoring, and management of Oracle Data Guard installations, was introduced with this version.
- Zero data loss on failover was guaranteed as a configuration option.
- Switchover was introduced, which made it possible to change the roles of the primary and standby. This made it possible to accomplish planned maintenance on the primary database with very little service outage.
- Standby database administration became simpler, because new datafiles on the primary database are created automatically on the standby, and if there are missing archived logs on the standby (which is called a gap), Data Guard detects and transmits the missing logs to the standby automatically.
- The Delay option was added, which made it possible to configure a standby database that always lags behind the primary by a specified time delay.
- Parallel recovery increased recovery performance on the standby database.

In Version 9i Release 2, which was introduced in May 2002, one year after Release 1, more very important features were announced. They are as follows:
- The logical standby database was introduced, which we've mentioned earlier in this article.
- Three data protection modes were ready to use: Maximum Protection, Maximum Availability, and Maximum Performance, which offered more flexibility in configuration.
- The cascade standby database feature made it possible to configure a second standby database, which receives its redo data from the first standby database.

Version 10g – new age

The 10g version again introduced important Data Guard features, but we can say that it perhaps fell behind expectations because of the revolutionary changes in release 9i. The following new features were introduced in Version 10g:
- One of the most important features of 10g was Real-Time Apply. When running in Real-Time Apply mode, the standby database applies changes from the redo immediately after receiving it; the standby does not wait for the standby redo logfile to be archived. This provides faster switchover and failover.
- Flashback database support was introduced, which made it unnecessary to configure a delay in the Data Guard configuration. Using flashback technology, it was possible to flash back a standby database to a point in time.
- With 10g Data Guard, if we open a primary database with resetlogs, it is not required to re-create the standby database. The standby is able to recover through resetlogs.
- Version 10g made it possible to use logical standby databases for rolling upgrades of the primary database's software. This method made it possible to lessen the service outage time by performing a switchover to the logical standby database.

10g Release 2 also introduced new features to Data Guard, but these features again were not compelling enough to drive a jump to the Data Guard technology. The two most important features were Fast-Start Failover and the use of the Guaranteed restore point:
- Fast-Start Failover automated and accelerated the failover operation when the primary database was lost. This option strengthened the disaster recovery role of Oracle Data Guard.
- The Guaranteed restore point was not actually a Data Guard feature. It was a database feature, which made it possible to revert a database to the moment that the Guaranteed restore point was created, as long as there is sufficient disk space for the flashback logs. Using this feature, the following scenario became possible: activate a physical standby database after stopping Redo Apply, use it for testing with read/write operations, then revert the changes, make it a standby again, and synchronize it with the primary. Using a standby database read/write offered great flexibility to users, but archived log shipping was not able to continue while the standby was read/write, and this would cause data loss in the event of a primary database failure during that window.
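The restore-point part of that 10g R2 scenario comes down to a couple of statements. The following is a hedged, minimal sketch only; the restore point name is illustrative, and the surrounding steps of stopping Redo Apply, activating the standby, and converting it back to a physical standby are omitted and release-dependent:

-- on the standby, before opening it read/write for testing
CREATE RESTORE POINT before_testing GUARANTEE FLASHBACK DATABASE;
-- ... activate the standby and run the tests ...
-- afterwards, revert every change made during testing
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO RESTORE POINT before_testing;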
Version 11g – modern age

Oracle database Version 11g offered the expected jump in the Data Guard technology, especially with two new features, called Active Data Guard and snapshot standby. The following features were introduced:
- Active Data Guard has been a milestone in Data Guard history; it enables queries on a physical standby database while media recovery is active.
- Snapshot standby is a feature for using a physical standby database read/write for test purposes. As we mentioned, this was possible with the 10g R2 Guaranteed restore point feature, but with snapshot standby, 11g provides continuous archived log shipping during the period that the standby is read/write.
- It has become possible to compress redo traffic in a Data Guard configuration, which is useful with excessive redo generation rates and when resolving gaps. Compression of redo when resolving gaps was introduced in 11g R1, and compression of all redo data was introduced in 11g R2.
- Use of physical standby databases for rolling upgrades of the database software was enabled, also known as Transient Logical Standby.
- It became possible to include different operating systems, such as Windows and Linux, in a single Data Guard configuration.
- Lost-write, a serious type of data corruption arising from the storage subsystem misreporting the completed write of a block, can be detected in an 11g Data Guard configuration. Recovery is automatically stopped in such a case.
- The RMAN fast incremental backup feature, Block Change Tracking, can be run on an Active Data Guard-enabled standby database.
- Another very important enhancement in 11g was the Automatic Block Corruption Repair feature, introduced with 11g R2. With this feature, a corrupted data block on the primary database can be automatically replaced with an uncorrupted copy from a physical standby database in Active Data Guard mode, and vice versa.

We've gone through the evolution of Oracle Data Guard from its beginning until today. As you may notice, Data Guard started its life as a very simple database property, intended to keep a synchronized database copy with a lot of manual work, and now it's a sophisticated tool with advanced automation, precaution, and monitoring features. Now let's move on to the architecture and components of Oracle Data Guard 11g R2.


Installing Coherence 3.5 and Accessing the Data Grid: Part 1

Packt
31 Mar 2010
10 min read
When I first started evaluating Coherence, one of my biggest concerns was how easy it would be to set up and use, especially in a development environment. The whole idea of having to set up a cluster scared me quite a bit, as any other solution I had encountered up to that point that had the word "cluster" in it was extremely difficult and time consuming to configure. My fear was completely unfounded: getting the Coherence cluster up and running is as easy as starting Tomcat. You can start multiple Coherence nodes on a single physical machine, and they will seamlessly form a cluster. Actually, it is easier than starting Tomcat.

Installing Coherence

In order to install Coherence you need to download the latest release from the Oracle Technology Network (OTN) website. The easiest way to do so is by following the link from the main Coherence page on OTN. At the time of this writing, this page was located at http://www.oracle.com/technology/products/coherence/index.html, but that might change. If it does, you can find its new location by searching for 'Oracle Coherence' using your favorite search engine. In order to download Coherence for evaluation, you will need to have an Oracle Technology Network (OTN) account. If you don't have one, registration is easy and completely free. Once you are logged in, you will be able to access the Coherence download page, where you will find the download links for all available Coherence releases: one for Java, one for .NET, and one for each of the supported C++ platforms. You can download any of the Coherence releases you are interested in while you are there, but for the remainder of this article you will only need the first one. The latter two (.NET and C++) are client libraries that allow .NET and C++ applications to access the Coherence data grid.

Coherence ships as a single ZIP archive. Once you unpack it you should see the README.txt file containing the full product name and version number, and a single directory named coherence. Copy the contents of the coherence directory to a location of your choice on your hard drive. The common location on Windows is c:\coherence and on Unix/Linux /opt/coherence, but you are free to put it wherever you want. The last thing you need to do is to configure the environment variable COHERENCE_HOME to point to the top-level Coherence directory created in the previous step, and you are done. Coherence is a Java application, so you also need to ensure that you have Java SDK 1.4.2 or later installed and that the JAVA_HOME environment variable is properly set to point to the Java SDK installation directory. If you are using a JVM other than Sun's, you might need to edit the scripts used in the following section. For example, not all JVMs support the -server option that is used while starting the Coherence nodes, so you might need to remove it.
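On a Unix/Linux machine, for example, the environment setup might look something like the following sketch; the paths shown are assumptions and should be replaced with your own installation locations:

$ export COHERENCE_HOME=/opt/coherence
$ export JAVA_HOME=/usr/java/jdk1.6.0
$ export PATH=$JAVA_HOME/bin:$PATH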
What's in the box?

The first thing you should do after installing Coherence is become familiar with the structure of the Coherence installation directory. There are four subdirectories within the Coherence home directory:
- bin: This contains a number of useful batch files for Windows and shell scripts for Unix/Linux that can be used to start Coherence nodes or to perform various network tests
- doc: This contains the Coherence API documentation, as well as links to online copies of the Release Notes, User Guide, and Frequently Asked Questions documents
- examples: This contains several basic examples of Coherence functionality
- lib: This contains JAR files that implement Coherence functionality

Shell scripts on Unix
If you are on a Unix-based system, you will need to add execute permission to the shell scripts in the bin directory by executing the following command:

$ chmod u+x *.sh

Starting up the Coherence cluster

In order to get the Coherence cluster up and running, you need to start one or more Coherence nodes. The Coherence nodes can run on a single physical machine, or on many physical machines that are on the same network. The latter will definitely be the case for a production deployment, but for development purposes you will likely want to limit the cluster to a single desktop or laptop. The easiest way to start a Coherence node is to run the cache-server.cmd batch file on Windows or the cache-server.sh shell script on Unix. The end result in either case should be similar to the following screenshot:

There is quite a bit of information on this screen, and over time you will become familiar with each section. For now, notice two things:
- At the very top of the screen, you can see the information about the Coherence version that you are using, as well as the specific edition and the mode that the node is running in. Notice that by default you are using the most powerful, Grid Edition, in development mode.
- The MasterMemberSet section towards the bottom lists all members of the cluster and provides some useful information about the current and the oldest member of the cluster.

Now that we have a single Coherence node running, let's start another one by running the cache-server script in a different terminal window. For the most part, the output should be very similar to the previous screen, but if everything has gone according to plan, the MasterMemberSet section should reflect the fact that the second node has joined the cluster:

MasterMemberSet
  (
  ThisMember=Member(Id=2, ...)
  OldestMember=Member(Id=1, ...)
  ActualMemberSet=MemberSet(Size=2, BitSetCount=2
    Member(Id=1, ...)
    Member(Id=2, ...)
    )
  RecycleMillis=120000
  RecycleSet=MemberSet(Size=0, BitSetCount=0)
  )

You should also see several log messages on the first node's console, letting you know that another node has joined the cluster and that some of the distributed cache partitions were transferred to it. If you can see these log messages on the first node, as well as two members within the ActualMemberSet on the second node, congratulations: you have a working Coherence cluster.

Troubleshooting cluster start-up

In some cases, a Coherence node will not be able to start or to join the cluster. In general, the cause could be any of a number of networking-related issues, but in practice a few issues are responsible for the vast majority of problems.

Multicast issues

By far the most common issue is that multicast is disabled on the machine. By default, Coherence uses multicast for its cluster join protocol, and it will not be able to form the cluster unless it is enabled. You can easily check if multicast is enabled and working properly by running the multicast-test shell script within the bin directory.
If you are unable to start the cluster on a single machine, you can execute the following command from your Coherence home directory:

$ . bin/multicast-test.sh -ttl 0

This will limit the time-to-live of multicast packets to the local machine and allow you to test multicast in isolation. If everything is working properly, you should see a result similar to the following:

Starting test on ip=Aleks-Mac-Pro.home/192.168.1.7, group=/237.0.0.1:9000, ttl=0
Configuring multicast socket...
Starting listener...
Fri Aug 07 13:44:44 EDT 2009: Sent packet 1.
Fri Aug 07 13:44:44 EDT 2009: Received test packet 1 from self
Fri Aug 07 13:44:46 EDT 2009: Sent packet 2.
Fri Aug 07 13:44:46 EDT 2009: Received test packet 2 from self
Fri Aug 07 13:44:48 EDT 2009: Sent packet 3.
Fri Aug 07 13:44:48 EDT 2009: Received test packet 3 from self

If the output is different from the above, it is likely that multicast is not working properly or is disabled on your machine. This is frequently the result of a firewall or VPN software running, so the first troubleshooting step would be to disable such software and retry. If you determine that was indeed the cause of the problem, you have two options. The first, and obvious one, is to turn the offending software off while using Coherence. However, for various reasons that might not be an acceptable solution, in which case you will need to change the default Coherence behavior and tell it to use the Well-Known Addresses (WKA) feature instead of multicast for the cluster join protocol. Doing so on a development machine is very simple: all you need to do is add the following argument to the JAVA_OPTS variable within the cache-server shell script:

-Dtangosol.coherence.wka=localhost

With that in place, you should be able to start Coherence nodes even if multicast is disabled.

Localhost and loopback address
On some systems, localhost maps to a loopback address, 127.0.0.1. If that's the case, you will have to specify the actual IP address or host name for the tangosol.coherence.wka configuration parameter. The host name should be preferred, as the IP address can change as you move from network to network, or if your machine leases an IP address from a DHCP server.

As a side note, you can tell whether WKA or multicast is being used for the cluster join protocol by looking at the section above the MasterMemberSet section when the Coherence node starts. If multicast is used, you will see something similar to the following:

Group{Address=224.3.5.1, Port=35461, TTL=4}

The actual multicast group address and port depend on the Coherence version being used. As a matter of fact, you can even tell the exact version and build number from the preceding information. In this particular case, I am using the Coherence 3.5.1 release, build 461. This is done in order to prevent accidental joins of cluster members into an existing cluster. For example, you wouldn't want a node in the development environment using a newer version of Coherence that you are evaluating to join the existing production cluster, which could easily happen if the multicast group address remained the same.
On the other hand, if you are using WKA, you should see output similar to the following instead:

WellKnownAddressList(Size=1,
  WKA{Address=192.168.1.7, Port=8088}
  )

Using the WKA feature completely disables multicast in a Coherence cluster, and is recommended for most production deployments, primarily because many production environments prohibit multicast traffic altogether, and because some network switches do not route multicast traffic properly. That said, configuring WKA for production clusters is out of the scope of this article, and you should refer to the Coherence product manuals for details.

Binding issues

Another issue that sometimes comes up is that one of the ports that Coherence attempts to bind to is already in use, and you see a bind exception when attempting to start the node. By default, Coherence starts the first node on port 8088, and increments the port number by one for each subsequent node on the same machine. If for some reason that doesn't work for you, you need to identify a range of available ports for as many nodes as you are planning to start (both UDP and TCP ports with the same numbers must be available), and tell Coherence which port to use for the first node by specifying the tangosol.coherence.localport system property. For example, if you want Coherence to use port 9100 for the first node, you will need to add the following argument to the JAVA_OPTS variable in the cache-server shell script:

-Dtangosol.coherence.localport=9100
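Putting the two overrides together, the edit to a development copy of the script might look something like the following sketch. How JAVA_OPTS is defined varies between Coherence releases, so treat the exact line as an assumption and adapt it to the script you actually have:

$ vi $COHERENCE_HOME/bin/cache-server.sh
# inside the script, extend the JAVA_OPTS definition, for example:
JAVA_OPTS="$JAVA_OPTS -Dtangosol.coherence.wka=localhost -Dtangosol.coherence.localport=9100"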

Oracle: When to use Log Miner

Packt
07 May 2010
6 min read
Log Miner has both a GUI interface in OEM as well as the database package, DBMS_LOGMNR. When this utility is used by the DBA, its primary focus is to mine data from the online and archived redo logs. Internally, Oracle uses the Log Miner technology for several other features, such as Flashback Transaction Backout, Streams, and Logical Standby Databases. This section is not about how to run Log Miner, but looks at the task of identifying the information to restore. The Log Miner utility comes into play when you need to retrieve an older version of selected pieces of data without completely recovering the entire database. A complete recovery is usually a drastic measure that means downtime for all users and the possibility of lost transactions. Most often, Log Miner is used for recovery purposes when the data consists of just a few tables or a single code change. Make sure supplemental logging is turned on (see the Add Supplemental Logging section). In this case, you discover that one or more of the following conditions apply when trying to recover a small amount of data that was recently changed:
- Flashback is not enabled
- Flashback logs that are needed are no longer available
- Data that is needed is not available in the online redo logs
- Data that is needed has been overwritten in the undo segments

Go to the last place available: the archived redo logs. This requires the database to be in archivelog mode and for all archive logs that are needed to still be available or recoverable.

Identifying the data needed to restore

One of the hardest parts of restoring data is determining what to restore, the basic question being: when did the bad data become part of the collective? Think the Borg from Star Trek! When you need to execute Log Miner to retrieve data from a production database, you will need to act fast. The older the transactions, the longer it will take to recover and traverse with Log Miner. The newest (committed) transactions are processed first, proceeding backwards. The first question to ask is: when do you think the bad event happened? Searching for data can be done in several different ways:
- SCN, timestamp, or log sequence number
- The pseudo column ORA_ROWSCN

SCN, timestamp, or log sequence number

If you are lucky, the application also writes a timestamp of when the data was last changed. If that is the case, then you determine the archive log to mine by using the following queries. It is important to set the session NLS_DATE_FORMAT so that the time element is displayed along with the date, otherwise you will just get the default date format of DD-MMM-RR. The date format comes from the database startup parameters (the NLS_TERRITORY setting). Find the time when a log was archived and match that to the archive log needed.

Pseudo column ORA_ROWSCN

While this method seems very elegant, it does not work perfectly, meaning it won't always return the correct answer. As it may not work every time or accurately, it is generally not recommended for Flashback Transaction Queries. It is definitely worth trying in order to narrow the window that you will have to search. It uses the SCN information that was stored for the associated transaction in the Interested Transaction List. You know that delayed block cleanout is involved. The pseudo column ORA_ROWSCN contains information about the approximate time this table was updated for each row. In the following example, the table has three rows, with the last row being the one that was most recently updated. It gives me the time window to search the archive logs with Log Miner.
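The queries referred to above are not reproduced here, so the following is only a hedged sketch of the general idea: set the session date format, look up the archived logs that cover the suspected time window, and check the approximate change time of the rows via ORA_ROWSCN. The table name and the one-day window are illustrative assumptions:

SQL> ALTER SESSION SET NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS';
SQL> SELECT SEQUENCE#, NAME, FIRST_TIME, NEXT_TIME
       FROM V$ARCHIVED_LOG
      WHERE FIRST_TIME BETWEEN SYSDATE - 1 AND SYSDATE;
SQL> SELECT ORA_ROWSCN, SCN_TO_TIMESTAMP(ORA_ROWSCN) AS APPROX_CHANGE_TIME, T.*
       FROM MY_SCHEMA.MY_TABLE T;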
Log Miner is the basic technology behind several of the database Maximum Availability Architecture capabilities: Logical Standby, Streams, and the following Flashback Transaction Backout exercise.

Flashback Transaction Query and Backout

Flashback technology was first introduced in Oracle9i Database. This feature allows you to view data at different points in time and with more recent timestamps (versions), and thus provides the capability to recover previous versions of data. In this article, we are dealing with Flashback Transaction Query (FTQ) and Flashback Transaction Backout (FTB), because they both deal with transaction IDs and integrate with the Log Miner utility. See the MOS document: "What Do All 10g Flashback Features Rely on and what are their Limitations?" (Doc ID 435998.1). Flashback Transaction Query uses the transaction ID (Xid) that is stored with each row version in a Flashback Versions Query to display every transaction that changed the row. Currently, the only Flashback technology that can be used when the object(s) in question have been changed by DDL is Flashback Data Archive. There are other restrictions to using FTB with certain data types (VARRAYs, BFILEs), which match the data type restrictions for Log Miner. This basically means that if the data types aren't supported, then you can't use Log Miner to find the undo and redo log entries.

When would you use FTQ or FTB instead of the previously described methods? The answer is when the data involves several tables with multiple constraints or extensive amounts of information. Similar to Log Miner, the database can be up and running while people are working online in other schemas of the database to accomplish this restore task. An example of using FTB or FTQ would be to reverse a payroll batch job that was run with the wrong parameters. Most often a batch job is compiled code (like C or Cobol) run against the database, with parameters built in by the application vendor. A wrong parameter could be the wrong payroll period, wrong set of employees, wrong tax calculations, or wrong payroll deductions.

Enabling flashback logs

First of all, flashback needs to be enabled in the database. Oracle Flashback is the database technology intended for point-in-time recovery (PITR) by saving transactions in flashback logs. A flashback log is a temporary Oracle file and is required to be stored in the FRA, as it cannot be backed up to any other media. Extensive information on all of the ramifications of enabling flashback is found in the documentation labeled: Oracle Database Backup and Recovery User's Guide. See the following section for an example of how to enable flashback:

SYS@NEWDB> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='/backup/flash_recovery_area/NEWDB' SCOPE=BOTH;
SYS@NEWDB> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100M SCOPE=BOTH;
-- this is sized for a small test database
SYS@NEWDB> SHUTDOWN IMMEDIATE;
SYS@NEWDB> STARTUP MOUNT EXCLUSIVE;
SYS@NEWDB> ALTER DATABASE FLASHBACK ON;
SYS@NEWDB> ALTER DATABASE OPEN;
SYS@NEWDB> SHOW PARAMETER RECOVERY;

The following query would then verify that FLASHBACK had been turned on:

SYS@NEWDB> SELECT FLASHBACK_ON FROM V$DATABASE;
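Once flashback is enabled, a hedged sketch of how FTQ is typically driven looks like the following. The schema, table name, time window, and placeholder transaction ID are illustrative assumptions and not part of the original example:

SYS@NEWDB> SELECT VERSIONS_XID, VERSIONS_STARTTIME, VERSIONS_OPERATION, T.*
             FROM HR.EMPLOYEES
             VERSIONS BETWEEN TIMESTAMP SYSTIMESTAMP - INTERVAL '1' HOUR AND SYSTIMESTAMP T;
SYS@NEWDB> SELECT OPERATION, TABLE_NAME, UNDO_SQL
             FROM FLASHBACK_TRANSACTION_QUERY
            WHERE XID = HEXTORAW('placeholder_xid');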


Azure Feature Pack

Packt
05 Jul 2017
9 min read
In this article by Christian Cote, Matija Lah, and Dejan Sarka, the authors of the book SQL Server 2016 Integration Services Cookbook, we will see how to install the Azure Feature Pack, which in turn installs the Azure control flow tasks and data flow components. We will also see how to use the Fuzzy Lookup transformation for identity mapping.

(For more resources related to this topic, see here.)

In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrators (DBAs) interact with data: Data Transformation Services (DTS). The tool was very primitive compared to SSIS, and it relied mostly on ActiveX and T-SQL to transform the data. SSIS 1.0 appeared in 2005. The tool was a game changer in the ETL world at the time: a professional and (pretty much) reliable tool for 2005. The 2008/2008 R2 versions were much the same as 2005, in the sense that they didn't add much functionality, but they made the tool more scalable. In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using an SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that give us access to metadata or meta-process information in our projects. Version 2014 didn't add anything for SSIS. Version 2016 brings another set of features, as you will see. We now also have the possibility to integrate with big data.

Business intelligence projects often reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment, or data profiling, data cleansing, and maintaining high quality over time. In SSIS, the Data Profiling task helps you find unclean data. The Data Profiling task is not like the other tasks in SSIS, because it is not intended to be run over and over again through a scheduled operation. Think of SSIS as being the wrapper for this tool. You use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process.

Azure tasks and transforms

The Azure ecosystem is becoming predominant in the Microsoft ecosystem, and SSIS has not been left behind over the past few years. The Azure Feature Pack is not an SSIS 2016-specific feature; it's also available for SSIS versions 2012 and 2014. It's worth mentioning that it appeared in July 2015, a few months before the SSIS 2016 release.

Getting ready

This section assumes that you have installed SQL Server Data Tools 2015.

How to do it...

We'll start SQL Server Data Tools and open the CustomLogging project, if not already done. In the SSIS toolbox, scroll to the Azure group. Since the Azure tools are not installed with SSDT, the Azure group is disabled in the toolbox. The tools must be downloaded using a separate installer. Click on the Azure group to expand it and click on Download Azure Feature Pack as shown in the following screenshot. Your default browser opens and the Microsoft SQL Server 2016 Integration Services Feature Pack for Azure page opens.
Click on Download as shown in the following screenshot. From the popup that appears, select both the 32-bit and 64-bit versions. The 32-bit version is necessary for SSIS package development, since SSDT is a 32-bit program. Click Next as shown in the following screenshot. As shown in the following screenshot, the files are downloaded. Once the download completes, run one of the installers you downloaded. The following screen appears; in this case, the 32-bit (x86) version is being installed. Click Next to start the installation process. As shown in the following screenshot, check the box near I accept the terms in the License Agreement and click Next. The installation then starts. The following screen appears once the installation is completed; click Finish to close the screen. Install the other feature pack you downloaded. If SSDT is open, close it. Start SSDT again and open the CustomLogging project. In the Azure group in the SSIS toolbox, you should now see the Azure tasks as in the following screenshot.

Using SSIS fuzzy components

SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for mapping identities. The Fuzzy Grouping transformation is used for de-duplicating. Both of them use the same algorithm for comparing strings and other data. Identity mapping and de-duplication are actually the same problem. For example, instead of mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or vice versa, you can join a table to itself and then do identity mapping instead of de-duplication.

Getting ready

This recipe assumes that you have successfully finished the previous recipe.

How to do it…

In SSMS, create a new table in the DQS_STAGING_DATA database in the dbo schema and name it dbo.FuzzyMatchingResults. Use the following code:

CREATE TABLE dbo.FuzzyMatchingResults
(
  CustomerKey INT NOT NULL PRIMARY KEY,
  FullName NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL,
  Updated INT NULL,
  CleanCustomerKey INT NULL
);

Switch to SSDT and continue editing the DataMatching package. Add a Fuzzy Lookup transformation below the NoMatch Multicast transformation. Rename it FuzzyMatches and connect it to the NoMatch Multicast transformation with the regular data flow path. Double-click the transformation to open its editor. On the Reference Table tab, select the connection manager you want to use to connect to your DQS_STAGING_DATA database and select the dbo.CustomersClean table. Do not store a new index or use an existing index. When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer datatype to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns. The transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing. Click the Columns tab. Delete the mapping between the two CustomerKey columns. Clear the check box next to the CleanCustomerKey input column.
Select the check box next to the CustomerKey lookup column. Rename the output alias for this column to CleanCustomerKey. You are replacing the original column with the one retrieved during the lookup. Your mappings should resemble those shown in the following screenshot. Click the Advanced tab. Raise the Similarity threshold to 0.50 to reduce the matching search space; with a similarity threshold of 0.00, you would get a full cross join. Click OK.

Drag a Union All transformation below the Fuzzy Lookup transformation. Connect it to an output of the Match Multicast transformation and an output of the FuzzyMatches Fuzzy Lookup transformation. You will combine the exact and approximate matches in a single row set. Drag an OLE DB Destination below the Union All transformation. Rename it FuzzyMatchingResults and connect it with the Union All transformation. Double-click it to open the editor. Connect to your DQS_STAGING_DATA database and select the dbo.FuzzyMatchingResults table. Click the Mappings tab. Click OK. The completed data flow is shown in the following screenshot.

You need to add restartability to your package; you will truncate all destination tables. Click the Control Flow tab. Drag an Execute T-SQL Statement task above the data flow task. Connect the tasks with the green precedence constraint from the Execute T-SQL Statement task to the data flow task. The Execute T-SQL Statement task must finish successfully before the data flow task starts. Double-click the Execute T-SQL Statement task. Use the connection manager to your DQS_STAGING_DATA database. Enter the following code in the T-SQL statement textbox, and then click OK:

TRUNCATE TABLE dbo.CustomersDirtyMatch;
TRUNCATE TABLE dbo.CustomersDirtyNoMatch;
TRUNCATE TABLE dbo.FuzzyMatchingResults;

Save the solution. Execute your package in debug mode to test it. Review the results of the Fuzzy Lookup transformation in SSMS. Look for rows for which the transformation did not find a match, and for any incorrect matches. Use the following code:

-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey * (-1);

You can use the following code to clean up the AdventureWorksDW2014 and DQS_STAGING_DATA databases:

USE AdventureWorksDW2014;
DROP TABLE IF EXISTS dbo.Chapter05Profiling;
DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries;
USE DQS_STAGING_DATA;
DROP TABLE IF EXISTS dbo.CustomersCh05;
DROP TABLE IF EXISTS dbo.CustomersCh05DQS;
DROP TABLE IF EXISTS dbo.CustomersClean;
DROP TABLE IF EXISTS dbo.CustomersDirty;
DROP TABLE IF EXISTS dbo.CustomersDirtyMatch;
DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch;
DROP TABLE IF EXISTS dbo.CustomersDQSMatch;
DROP TABLE IF EXISTS dbo.DQSMatchingResults;
DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults;
DROP TABLE IF EXISTS dbo.FuzzyMatchingResults;

When you are done, close SSMS and SSDT.

SQL Server Data Quality Services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data, for example, customer data. The idea of Data Quality Services is to mitigate the cleansing process. While the amount of time you need to spend on cleansing decreases, you will achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on.
You don't throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process.

Summary

We have seen how to install the Azure Feature Pack with its Azure control flow tasks and data flow components, and how to use the Fuzzy Lookup transformation.

Resources for Article:

Further resources on this subject:
- Building A Recommendation System with Azure [article]
- Introduction to Microsoft Azure Cloud Services [article]
- Windows Azure Service Bus: Key Features [article]


Signal Processing Techniques

Packt
12 Jun 2014
6 min read
(For more resources related to this topic, see here.)

Introducing the Sunspot data

Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages. In order to determine sunspot cycles, scientists successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMF). You can compare this to the way the Fast Fourier Transform decomposes a signal into a superposition of sine and cosine terms.

Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out the components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima, respectively:

mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]

Now we need to concatenate these arrays and use the indices to select the corresponding values. The following code accomplishes that and also plots the data:

import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))
year_range = np.arange(1700, 1700 + len(data))
plt.plot(1700 + extrema, data[extrema], 'go')
plt.plot(year_range, data)
plt.show()

We will see the following chart:

In this plot, you can see the extrema indicated with dots.

Sifting continued

The next steps in the sifting process require us to interpolate the minima and the maxima with a cubic spline. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate the minima with the following code snippet:

spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)

Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we are interpolating. This range is defined by the first occurrence of a minimum/maximum and ends at the last occurrence of a minimum/maximum. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges.
Have a look at the following code:

import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal
from scipy import interpolate

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))
year_range = np.arange(1700, 1700 + len(data))
spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)
spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic')
max_rng = np.arange(maxs.min(), maxs.max())
u_env = spl_max(max_rng)
inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1], max_rng[-1]))
mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng))/2
plt.plot(year_range, data)
plt.plot(1700 + min_rng, l_env, '-x')
plt.plot(1700 + max_rng, u_env, '-x')
plt.plot(1700 + inclusive_rng, mid, '--')
plt.show()

The code produces the following chart:

What you see is the observed data, with the computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The algorithm is a bit more involved, so we will leave it up to you whether or not you want to continue with this analysis on your own.

Moving averages

Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward by one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable. We can express these exponentially decreasing weights as follows in NumPy code:

weights = np.exp(np.linspace(-1., 0., N))
weights /= weights.sum()

A simple moving average uses equal weights, which, in code, looks as follows:

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

The following code plots the simple moving average for the 11- and 22-year sunspot cycles:

import numpy as np
import sys
import matplotlib.pyplot as plt

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
# reverse order
data = data[::-1]
year_range = np.arange(1700, 1700 + len(data))

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

sma11 = sma(data, 11)
sma22 = sma(data, 22)
plt.plot(year_range, data, label='Data')
plt.plot(year_range[10:], sma11, '-x', label='SMA 11')
plt.plot(year_range[21:], sma22, '--', label='SMA 22')
plt.legend()
plt.show()

In the following plot, we see the original data and the simple moving averages for the 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data.
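As a complement to the sma function above, the exponential weights shown earlier can be turned into a full exponential moving average. The following is one possible sketch, not code from the original article; it builds the windows explicitly rather than using np.convolve, so that it is clear the newest sample in each window receives the largest weight:

import numpy as np

def ema(arr, n):
    # exponentially decreasing weights; weights[-1] == 1 is the largest
    weights = np.exp(np.linspace(-1., 0., n))
    weights /= weights.sum()
    # arr[i:i+n] is ordered oldest to newest, so the largest weight
    # multiplies the newest value in each window
    return np.array([np.dot(weights, arr[i:i + n]) for i in range(len(arr) - n + 1)])

# usage, following the plotting pattern above:
# ema11 = ema(data, 11)
# plt.plot(year_range[10:], ema11, label='EMA 11')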
Summary

This article gave us examples of signal processing and time-series analysis. We looked at the sifting process, which performs the first iteration of the EMD procedure. We also learned about moving averages, which are tools commonly used to analyze time-series data.

Resources for Article:

Further resources on this subject:
- Advanced Indexing and Array Concepts [Article]
- Fast Array Operations with NumPy [Article]
- Move Further with NumPy Modules [Article]


Using R for Statistics, Research, and Graphics

Packt
16 Sep 2014
12 min read
In this article by David Alexander Lillis, author of R Graph Essentials, we will talk about R. Developed by Professor Ross Ihaka and Dr. Robert Gentleman at Auckland University (New Zealand) during the early 1990s, the R statistics environment is a real success story. R is open source software, which you can download in a couple of minutes from the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/), and it combines a powerful programming language, outstanding graphics, and a comprehensive range of useful statistical functions. If you need a statistics environment that includes a programming language, R is ideal. It's true that the learning curve is longer than for spreadsheet-based packages, but once you master the R programming syntax, you can develop your own very powerful analytic tools. Many contributed packages are available on the web for use with R, and very often the analytic tools you need can be downloaded at no cost.

(For more resources related to this topic, see here.)

The main problem for those new to R is the time required to master the programming language, but several nice graphical user interfaces, such as John Fox's R Commander package, are available, which make it much easier for the newcomer to develop proficiency in R than it used to be. For many statisticians and researchers, R is the package of choice because of its powerful programming language, the easy availability of code, and because it can import Excel spreadsheets, comma separated variable (.csv) spreadsheets, and text files, as well as SPSS files, STATA files, and files produced within other statistical packages. You may be looking for a tool for your own data analysis. If so, let's take a brief look at what R can do for you.

Some basic R syntax

Data can be created in R or else read in from .csv or other files as objects. For example, you can read in the data contained within a .csv file called mydata.csv as follows:

A <- read.csv("mydata.csv", h=T)
A

The object A now contains all the data of the original file. The syntax A[3,7] picks out the element in row 3 and column 7. The syntax A[14, ] selects the fourteenth row and A[,6] selects the sixth column. The functions mean(A) and sd(A) find the mean and standard deviation of each column. The syntax B <- 3*A + 7 triples each element of A, adds 7 to each element, and stores the new array as the object B. Now you could save this array as a .csv file called Outputfile.csv as follows:

write.csv(B, file="Outputfile.csv")

Statistical modeling

R provides a comprehensive range of basic statistical functions relating to the commonly-used distributions (the normal distribution, t-distribution, Poisson, gamma, and so on), and many less well known distributions. It also provides a range of non-parametric tests that are appropriate when your data are not distributed normally. Linear and non-linear regressions are easy to perform, and finding the optimum model (that is, by eliminating non-significant independent variables and non-significant factor interactions) is particularly easy. Implementing Generalized Linear Models and other commonly-used models such as Analysis of Variance, Multivariate Analysis of Variance, and Analysis of Covariance is also straightforward and, once you know the syntax, you may find that such tasks can be done more quickly in R than in other packages.
The usual post-hoc tests for identifying factor levels that are significantly different from the other levels (for example, the Tukey and Scheffé tests) are available, and testing for interactions between factors is easy. Factor Analysis, and the related Principal Components Analysis, are well known data reduction techniques that enable you to explain your data in terms of smaller sets of independent variables (or factors). Both methods are available in R, and code for complex designs, including One and Two Way Repeated Measures, and Four Way ANOVA (for example, two repeated measures and two between-subjects), can be written relatively easily or downloaded from various websites (for example, http://www.personality-project.org/r/). Other analytic tools include Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, and Correspondence Analysis. R also provides various methods for fitting analytic models to data and smoothing (for example, lowess and spline-based methods).

Miscellaneous packages for specialist methods

You can find some very useful packages of R code for fields as diverse as biometry, epidemiology, astrophysics, econometrics, financial and actuarial modeling, the social sciences, and psychology. For example, if you are interested in astrophysics, the Penn State Astrophysics School offers a nice website that includes both tutorials and code (http://www.iiap.res.in/astrostat/RTutorials.html). Here I'll mention just a few of the popular techniques.

Monte Carlo methods

A number of sources give excellent accounts of how to perform Monte Carlo simulations in R (that is, drawing samples from multidimensional distributions and estimating expected values). A valuable text is Christian Robert's book Introducing Monte Carlo Methods with R. Murali Haran gives another interesting astrophysical example on the CAStR website (http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html).

Structural Equation Modeling

Structural Equation Modeling (SEM) is becoming increasingly popular in the social sciences and economics as an alternative to other modeling techniques such as multiple regression, factor analysis, and analysis of covariance. Essentially, SEM is a kind of multiple regression that takes account of factor interactions, nonlinearities, measurement error, multiple latent independent variables, and latent dependent variables. Useful references for conducting SEM in R include those of Revelle, Farnsworth (2008), and Fox (2002 and 2006).

Data mining

A number of very useful resources are available for anyone contemplating data mining using R. For example, Luis Torgo has just published a book on data mining using R, and presents case studies, along with the datasets and code, which the interested student can work through. Torgo's book provides the usual analytic and graphical techniques used every day by data miners, including visualization techniques, dealing with missing values, developing prediction models, and methods for evaluating the performance of your models. Also of interest to the data miner is the Rattle GUI (R Analytical Tool to Learn Easily). Rattle is a data mining facility for analyzing very large data sets. It provides many useful statistical and graphical data summaries, presents mechanisms for developing a variety of models, and summarizes the performance of your models.

Graphics in R

Quite simply, the quality and range of graphics available through R is superb and, in my view, vastly superior to those of any other package I have encountered.
Of course, you have to write the necessary code, but once you have mastered this skill, you have access to wonderful graphics. You can write your own code from scratch, but many websites provide helpful examples, complete with code, which you can download and modify to suit your own needs. R's base graphics (graphics created without the use of any additional contributed packages) are superb, but various graphics packages such as ggplot2 (and the associated qplot function) help you to create wonderful graphs. R's graphics capabilities include, but are not limited to, the following:

Base graphics in R
- Basic graphics techniques and syntax
- Creating scatterplots and line plots
- Customizing axes, colors, and symbols
- Adding text – legends, titles, and axis labels
- Adding lines – interpolation lines, regression lines, and curves
- Increasing complexity – graphing three variables, multiple plots, or multiple axes
- Saving your plots to multiple formats – PDF, postscript, and JPG
- Including mathematical expressions on your plots
- Making graphs clear and pretty – including a grid, point labels, and shading
- Shading and coloring your plot
- Creating bar charts, histograms, boxplots, pie charts, and dotcharts
- Adding loess smoothers
- Scatterplot matrices
- R's color palettes
- Adding error bars

Creating graphs using qplot
- Using basic qplot graphics techniques and syntax to customize in easy steps
- Creating scatterplots and line plots in qplot
- Mapping symbol size, symbol type, and symbol color to categorical data
- Including regressions and confidence intervals on your graphs
- Shading and coloring your graph
- Creating bar charts, histograms, boxplots, pie charts, and dotcharts
- Labelling points on your graph

Creating graphs using ggplot
- Plotting options – backgrounds, sizes, transparencies, and colors
- Superimposing points
- Controlling symbol shapes and using pretty color schemes
- Stacked, clustered, and paneled bar charts
- Methods for detailed customization of lines, point labels, smoothers, confidence bands, and error bars

The following graph records information on the heights in centimeters and weights in kilograms of patients in a medical study. The curve in red gives a smoothed version of the data, created using locally weighted scatterplot smoothing. Both the graph, and the modelling required to produce the smoothed curve, were performed in R.

Here is another graph. It gives the heights and body masses of female patients receiving treatment in a hospital. Each patient is identified by name. This graph was created very easily using ggplot, and shows the default background produced by ggplot (a grey plotting background and white grid lines).

Next, we see a histogram of patients' heights and body masses, partitioned by gender. The bars are given in orange and ivory. The ggplot package provides a wide range of colors and hues, as well as a wide range of color palettes.

Finally, we see a line graph of height against age for a group of four children. The graph includes both points and lines, and we have a unique color for each child. The ggplot package makes it possible to create attractive and effective graphs for research and data analysis.
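To give a flavour of the ggplot2 syntax behind graphs like these, here is a small hedged sketch. The data frame and variable names are invented purely for illustration; they are not the datasets used to produce the figures described above:

library(ggplot2)

# hypothetical patient data
patients <- data.frame(
  height = c(162, 175, 158, 181, 169, 172, 166, 178),
  weight = c(65, 82, 58, 90, 74, 79, 70, 85),
  gender = c("F", "M", "F", "M", "F", "M", "F", "M")
)

# scatterplot coloured by gender, with a fitted regression line
# and its confidence band
ggplot(patients, aes(x = height, y = weight)) +
  geom_point(aes(colour = gender), size = 3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Height (cm)", y = "Weight (kg)")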
Some level of mastery of R has become, for many applications, essential for taking advantage of these developments. Spatial analysis, where R provides integrated access to capabilities that are spread across many different computer programs, is a good example. A few years ago, I would not have recommended R as a statistics environment for generalist data analysts or postgraduate students, except those working directly in areas involving statistical modeling. However, many tutorials are downloadable from the Internet and a number of organizations provide online tutorials and/or face-to-face workshops (for example, The Analysis Factor, http://www.theanalysisfactor.com/). In addition, the appearance of GUIs, such as R Commander and the new iNZight GUI (designed for use in schools), makes it easier for non-specialists to learn and use R effectively. I am most happy to provide advice to anyone contemplating learning to use this outstanding statistical and research tool.

References
Some useful material on R is as follows:

L'analyse des données. Tome 1: La taxinomie, Tome 2: L'analyse des correspondances, Benzécri, J. P., Dunod, Paris, 1973.
Computation of Correspondence Analysis, Blasius, J. and Greenacre, M. J. (1994). In M. J. Greenacre and J. Blasius (eds.), Correspondence Analysis in the Social Sciences, pp. 53-75, Academic Press, London.
Statistics: An Introduction using R, Crawley, M. J. (m.crawley@imperial.ac.uk), Imperial College, Silwood Park, Ascot, Berks. Published in 2005 by John Wiley & Sons, Ltd. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470022973,subjectCd-ST05.html (ISBN 0-470-02297-3). http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr.
Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression, Fox, John, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf.
Getting Started with the R Commander, Fox, John, 26 August 2006.
The R Commander: A Basic-Statistics Graphical User Interface to R, Fox, John, Journal of Statistical Software, September 2005, Volume 14, Issue 9. http://www.jstatsoft.org/.
Structural Equation Modeling With the sem Package in R, Fox, John, Structural Equation Modeling, 13(3), 465-486, Lawrence Erlbaum Associates, Inc., 2006.
Biplots in Biomedical Research, Gabriel, K. R. and Odoroff, C., Statistics in Medicine, 9, 469-485, 1990.
Theory and Applications of Correspondence Analysis, Greenacre, M. J., Academic Press, London, 1984.
Using R for Data Analysis and Graphics: Introduction, Code and Commentary, Maindonald, J. H., Centre for Mathematics and its Applications, Australian National University.
Introducing Monte Carlo Methods with R, Series Use R, Robert, Christian P. and Casella, George, 2010, XX, 284 p., Softcover, ISBN 978-1-4419-1575-7.

Useful tutorials available on the web are as follows:

An Introduction to R: examples for Actuaries, De Silva, N., 2006, http://toolkit.pbworks.com/f/R%20Examples%20for%20Actuaries%20v0.1-1.pdf.
Econometrics in R, Farnsworth, Grant V., October 26, 2008, http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf.
An Introduction to the R Language, Harte, David, Statistics Research Associates Limited, www.statsresearch.co.nz.
Quick R, Kabakoff, Rob, http://www.statmethods.net/index.html.
R for SAS and SPSS Users, Muenchen, Bob, http://RforSASandSPSSusers.com.
Statistical Analysis with R - a quick start, Nenadić, C. and Zucchini, Walter.
R for Beginners, Paradis, Emmanuel (paradis@isem.univ-montp2.fr), Institut des Sciences de l'Évolution, Université Montpellier II, F-34095 Montpellier cedex 05, France.
Data Mining with R: learning by case studies, Torgo, Luis, http://www.liaad.up.pt/~ltorgo/DataMiningWithR/.
SimpleR - Using R for Introductory Statistics, Verzani, John, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
Time Series Analysis and Its Applications: With R Examples, http://www.stat.pitt.edu/stoffer/tsa2/textRcode.htm#ch2.
The irises of the Gaspé peninsula, E. Anderson, Bulletin of the American Iris Society, 59, 2-5, 1935.

Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining.

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets. We will cover the following topics in this article:

- Downloading data from social network APIs
- Transformers for text
- Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

Disambiguation
Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax, and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and an index, but the degree is significantly lower than that of a database.

One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet. When people talk about Python, they could be talking about the following things:

- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network
We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage.
It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one. Next, you'll need to ensure that you don't make too many requests in a given time window. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open—you'll need the access keys that are on this page.

Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this article for various purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file.
It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
    search_results = t.search.tweets(q="python", count=100)['statuses']
    for tweet in search_results:
        if 'text' in tweet:
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset
After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists, and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library, but we will first have to split the file by newlines to get the actual tweet objects themselves.

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we received from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python).
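Before building the labeling form, a quick optional check (a minimal sketch, assuming the file parsed correctly and contains at least one tweet) confirms what was loaded:

print("Loaded {} tweets".format(len(tweets)))
print(tweets[0]['text'])  # the text of the first tweet we collected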
We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not. The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels will record whether or not the given tweet refers to the programming language Python, and they will allow our classifier to learn how to differentiate between meanings. We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
    return tweets[len(labels)]['text']

The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in Python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
    var kernel = IPython.notebook.kernel;
    kernel.execute("labels.append(" + label + ")");
    load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callbacks is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.
The code is as follows:

function load_next_tweet(){
    var code_input = "get_next_tweet()";
    var kernel = IPython.notebook.kernel;
    var callbacks = { 'iopub' : {'output' : handle_output}};
    kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
    var res = out.content.data["text/plain"];
    $("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
Tweet: <div id="tweet_text" value="text"></div><br>
<input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags.

We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
    if(e.which == 48) {
        set_label(0);
        $("input#capture").val("");
    } else if (e.which == 49){
        set_label(1);
        $("input#capture").val("");
    }
});

All other key presses are ignored. As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text.
Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, if it relates to the programming language Python) and 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter
In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.

Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y, doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset to the one I created and used. The main reason is that Twitter will return different search results for you than me, based on the time you performed the search. Even after that, your labeling of tweets might be different from mine. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did labeling, but there is a new filename where we can store the replicable dataset.
The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
    json.dump(dataset, outf)

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book.

Loading the preceding dataset is not difficult, but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweet_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

Then load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
    tweet_ids = json.load(inf)

Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open the file and save the tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels would be incorrectly indexed against the data.

As an example, I tried to recreate the dataset just one day after collecting it and already two of the tweets were missing (they might have been deleted or made private by the user). For this reason, it is important to only print out the labels that we need. To do this, we first create an empty actual_labels list to store the labels for tweets that we actually recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)

Next, we are going to set up a connection to Twitter to collect all of these tweets. This is going to take a little longer.
Import the twitter library that we used before, create an authorization token, and use that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Iterate over each of the twitter IDs by extracting the IDs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweet_filename, 'a') as output_file:

The Twitter API allows us to get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

    for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

        id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100])

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

        for tweet in search_results:
            if 'text' in tweet:
                output_file.write(json.dumps(tweet))
                output_file.write("\n\n")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

                actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)

Text transformers
Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are lots of feature types, such as word occurrence, which we will now investigate.

Bag-of-words
One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document. Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.
- J.R.R. Tolkien's epigraph to The Lord of the Rings
The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. We can create a dataset from this, choosing a subset of words and counting the frequency:

Word        the   one   ring   to
Frequency     9     4      3    4

We can use the Counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from few words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution, as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this article.

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in which the term appears in the corpus.

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

N-grams
A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values. As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words. Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally.
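For a concrete feel of word n-grams, here is a minimal sketch that extracts the trigrams from the preceding quote using NLTK's ngrams helper (assuming NLTK is installed and the punkt tokenizer data has been downloaded with nltk.download('punkt')):

from nltk import word_tokenize
from nltk.util import ngrams

quote = "Always look on the bright side of life."
tokens = word_tokenize(quote.lower())

# Each trigram is a tuple of three consecutive tokens.
trigrams = list(ngrams(tokens, 3))
print(trigrams[:3])
# Starts with: [('always', 'look', 'on'), ('look', 'on', 'the'), ('on', 'the', 'bright')]

The same idea extends to character n-grams by sliding a window over the string itself rather than over the token list.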
A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!). Especially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is the character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article.

Other features
There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.

Naive Bayes
Naive Bayes is a probabilistic model that is, unsurprisingly, built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem
For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead models how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows:

P(A|B) = P(B|A) x P(A) / P(B)

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such an e-mail may be pharmaceutical spam). A, in this context, is the probability that this e-mail is spam. We can compute P(A), called the prior belief, directly from a training dataset by computing the percentage of e-mails in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3.

B, in this context, is this e-mail contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of e-mails in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1.
Note that we don't care whether the e-mail is spam or not when computing this value. P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that an e-mail containing the word drugs is spam. Using the previous equation, we see the result is 0.2 x 0.3 / 0.1 = 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam.

Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view would rely on us creating a distribution of the probability of words in e-mails to compute similar equations.

Naive Bayes algorithm
Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm. With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of the others. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, and so on. Then, we multiply them together:

P(D|C) = P(D1|C) x P(D2|C) x ... x P(Dn|C)

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times it is equal in our sample dataset. In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models.

From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works
As an example, suppose we have the following (binary) feature values from a sample in our dataset: [1, 0, 0, 1]. Our training dataset contains two classes, with 75 percent of samples belonging to class 0 and 25 percent belonging to class 1.
The likelihoods of the feature values for each class are as follows:

For class 0: [0.3, 0.4, 0.4, 0.7]
For class 1: [0.7, 0.3, 0.4, 0.9]

These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0. We can now compute the probability that this sample should belong to class 0. P(C=0) = 0.75, which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0)
         = 0.3 x 0.6 x 0.6 x 0.7
         = 0.0756

The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 - P(1).

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of class 1. Let's take a look at the calculation:

P(C=0|D) = P(C=0) x P(D|C=0)
         = 0.75 x 0.0756
         = 0.0567

Now, we compute the same values for class 1:

P(C=1) = 0.25

P(D) isn't needed for Naive Bayes. Let's take a look at the calculation:

P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1)
         = 0.7 x 0.7 x 0.6 x 0.9
         = 0.2646

P(C=1|D) = P(C=1) x P(D|C=1)
         = 0.25 x 0.2646
         = 0.06615

Normally, P(C=0|D) + P(C=1|D) should equal 1. After all, those are the only two possible options! However, the probabilities here do not sum to 1 because we haven't included the computation of P(D) in our equations.

The data point should be classified as belonging to class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for class 1. This is because we introduced a prior belief that most samples generally belong to class 0. If the classes had been of equal sizes, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again.

Application
We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well.

To get NLTK on your computer, use pip to install the package:

pip3 install nltk

If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

1. Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
2. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
3. Train the Naive Bayes classifier, as we have seen in previous articles.

We will need to create another Notebook (the last one for this article!) called ch6_classify_twitter for performing the classification.

Extracting word counts
We are going to use NLTK to extract our word counts.
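NLTK's word_tokenize function splits a string into individual tokens. As a minimal sketch (the example string is made up, purely for illustration, and assumes the punkt tokenizer data has been downloaded):

from nltk import word_tokenize

# A hypothetical example string, just to show what the tokenizer produces.
print(word_tokenize("I love the Python programming language!"))
# ['I', 'love', 'the', 'Python', 'programming', 'language', '!']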
We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer with both fit and transform methods, enabling us to use it in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self, which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using binary features here—True if the word is in the document, False otherwise. If we wanted to use the frequency, we would set up counting dictionaries. Let's take a look at the code:

from nltk import word_tokenize
from sklearn.base import TransformerMixin

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]

The result is a list of dictionaries, where the first dictionary contains the words in the first tweet, and so on. Each dictionary has a word as a key and the value True to indicate that this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False would also work, but would take up needless space to store.

Converting dictionaries to a matrix
This step converts the dictionaries built in the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer. The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class.

In our dataset, each dictionary has words as keys, and a word only appears if it actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet. To use DictVectorizer, simply import it using the following command:

from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier
Finally, we need to set up a classifier, and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier, which is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline:

from sklearn.naive_bayes import BernoulliNB

Putting it all together
Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that.
The code is as follows:

import json

tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

- The NLTKBOW transformer we created
- A DictVectorizer transformer
- A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())])

We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that, though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, accuracy is not adequate for datasets where the number of samples in each class is different.

Evaluation using the F1-score
When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy but poor utility. While our dataset of tweets contains about 50 percent programming-related and 50 percent nonprogramming tweets (your results may vary), many datasets aren't as balanced as this.

As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called the f1-score (also called f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and were actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the relevant tweets. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall:

F1 = 2 x (precision x recall) / (precision + recall)

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following lines of code:

from sklearn.model_selection import cross_val_score  # in older scikit-learn releases: sklearn.cross_validation
scores = cross_val_score(pipeline, tweets, labels, scoring='f1')

We then print out the average of the scores:

import numpy as np
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programming language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it.
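For intuition, here is a minimal sketch (with made-up predictions, not our tweet data) showing how precision, recall, and the f1-score relate, using scikit-learn's metrics functions:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 3 of the 4 predicted positives are correct: 0.75
r = recall_score(y_true, y_pred)     # 3 of the 4 actual positives were found: 0.75
print(p, r, f1_score(y_true, y_pred))  # harmonic mean: 2 x (0.75 x 0.75) / (0.75 + 0.75) = 0.75

With only 200 labeled tweets, both precision and recall are quite sensitive to a handful of misclassified examples.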
Go back and collect more data and you will find that the results increase! More data usually means better accuracy, but it is not guaranteed!

Getting useful features from models
One question you may ask is: what are the best features for determining if a tweet is relevant or not? We can extract this information from our Naive Bayes model and find out which features are the best individually, according to Naive Bayes.

First, we fit a new model. While cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows:

model = pipeline.fit(tweets, labels)

Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate them on a separate test split. We skip over that here for the sake of clarity.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model and, from it, the probabilities for each word (the feature_log_prob_ attribute of the trained BernoulliNB model):

nb = model.named_steps['naive-bayes']
feature_probabilities = nb.feature_log_prob_

These are stored as log probabilities, which is simply log(P(A|f)), where f is a given feature. The reason these are stored as log probabilities is that the actual values are very low. For instance, the first value is -3.486, which correlates to a probability under 0.03 percent. Logarithm probabilities are used in computations involving small probabilities like this, as they stop underflow errors where very small values are just rounded to zero. Given that all of the probabilities are multiplied together, a single value of 0 would result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is.

We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows:

top_features = np.argsort(-feature_probabilities[1])[:50]

The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the features' indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily, this also records the mapping, allowing us to find the feature names that correlate to different columns. We can extract that transformer from the pipeline:

dv = model.named_steps['vectorizer']

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

for i, feature_index in enumerate(top_features):
    print(i, dv.feature_names_[feature_index],
          np.exp(feature_probabilities[1][feature_index]))

The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues.
Looking through the list, though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one is usually in the format: We're looking for a candidate for this job. Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter phrase for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary
In this article, we looked at text mining—how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context—was this tweet mentioning python referring to the programming language?

We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook.

We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweets' IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced.

We used a Naive Bayes classifier to perform our text classification. This is built upon Bayes' theorem, which uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update with new data, and to incorporate a prior belief. In addition, the naive part allows us to easily compute the probabilities without dealing with complex correlations between features.

The features we extracted were word occurrences—did this word occur in this tweet? This model is called bag-of-words. While it discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline to use before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering).

In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.
Resources for Article: Further resources on this subject: Putting the Fun in Functional Python [article] Python Data Analysis Utilities [article] Leveraging Python in the World of Big Data [article]

article-image-crypto-cash-is-missing-from-the-wallet-of-dead-cryptocurrency-entrepreneur-gerald-cotten-find-it-and-you-could-get-100000
Richard Gall
05 Mar 2019
3 min read
Save for later

Crypto-cash is missing from the wallet of dead cryptocurrency entrepreneur Gerald Cotten - find it, and you could get $100,000

In theory, stealing cryptocurrency should be impossible. But a mystery has emerged that seems to throw all that into question and even suggests a bigger, much stranger conspiracy. Gerald Cotten, the founder of cryptocurrency exchange QuadrigaCX, died in December in India. He was believed to have left $136 million USD worth of crypto-cash in 'cold wallets' on his own laptop, to which only he had access. However, investigators from EY, who have been working on closing QuadrigaCX following Cotten's death, were surprised to find that the wallets were empty. In fact, it's believed the crypto-cash had disappeared from them months before Cotten died.

A cryptocurrency mystery now involving the FBI
The only lead in this mystery is the fact that the EY investigators have found other user accounts that appear to be linked to Gerald Cotten. There's a chance that Cotten used these to trade on his own exchange, but the nature of these exchanges remains a little unclear. To add to the intrigue, Fortune reported yesterday that the FBI is working with the Royal Canadian Mounted Police to investigate the missing money. This information came from Jesse Powell, CEO of another cryptocurrency company called Kraken. Powell told Fortune that both the FBI and the Mounted Police have been in touch with him about the mystery surrounding QuadrigaCX. Powell has offered a reward of $100,000 to anyone who can locate the missing cryptocurrency funds.

So what actually happened to Gerald Cotten and his crypto-cash?
The story has many layers of complexity. There are rumors that Cotten faked his own death. For example, Cotten filed a will just 12 days before his death, leaving a significant amount of wealth and assets to his wife. And while sources from the hospital in India where Cotten is believed to have died say he died of cardiac arrest, as Fortune explains, "Cotten's body was handled by hotel staff after an embalmer refused to receive it" - something which is, at the very least, strange. It should be noted that there is certainly no clear evidence that Cotten faked his own death - only missing pieces that encourage such rumors. A further subplot, which might or might not be useful in cracking this case, emerged late last week when Canada's Globe and Mail reported that QuadrigaCX's co-founder has a history of identity theft and using digital currencies to launder money.

Where could the money be?
There is, as you might expect, no shortage of theories about where the cash could be. A few days ago, it was suggested that it might be possible to locate Cotten's Ethereum funds: a blog post by James Edwards, the editor of cryptocurrency blog zerononcense, claimed that Ethereum linked to QuadrigaCX can be found in Bitfinex, Poloniex, and Jesse Powell's Kraken. "It appears that a significant amount of Ethereum (600,000+ ETH) was transferred to these exchanges as a means of 'storage' during the years that QuadrigaCX was in operation and offering Ethereum on their exchange," Edwards writes. Edwards is keen for his findings to be the starting point for a clearer line of inquiry, free from speculation and conspiracy. He wrote that he hoped that it would be "a helpful addition to the QuadrigaCX narrative, rather than a conspiratorial piece that speculates on whether the exchange or its owners have been honest."

article-image-top-announcements-from-the-tensorflow-dev-summit-2019
Sugandha Lahoti
08 Mar 2019
5 min read
Save for later

Top announcements from the TensorFlow Dev Summit 2019

The two-day TensorFlow Dev Summit 2019 has just wrapped up, leaving in its wake major updates to the TensorFlow ecosystem. The major announcement was the release of the first alpha version of the much-anticipated TensorFlow 2.0. Also announced were TensorFlow Lite 1.0, TensorFlow Federated, TensorFlow Privacy, and more.

TensorFlow Federated
In a Medium blog post, Alex Ingerman (Product Manager) and Krzys Ostrowski (Research Scientist) introduced the TensorFlow Federated framework on the first day. This open source framework is useful for experimenting with machine learning and other computations on decentralized data. As the name suggests, this framework uses Federated Learning, a learning approach introduced by Google in 2017. This technique enables ML models to collaboratively learn a shared prediction model while keeping all the training data on the device, thus freeing machine learning from the need to store the data in the cloud. The authors note that TFF is based on their experiences with developing federated learning technology at Google. TFF uses the Federated Learning API to express an ML model architecture, and then train it across data provided by multiple developers, while keeping each developer's data separate and local. It also uses the Federated Core (FC) API, a set of lower-level primitives, which enables the expression of a broad range of computations over a decentralized dataset. The authors conclude, "With TFF, we are excited to put a flexible, open framework for locally simulating decentralized computations into the hands of all TensorFlow users. You can try out TFF in your browser, with just a few clicks, by walking through the tutorials."

TensorFlow 2.0.0-alpha0
The event also saw the release of the first alpha version of the TensorFlow 2.0 framework, which comes with fewer APIs. First introduced last August by Martin Wicke, an engineer at Google, TensorFlow 2.0 is expected to come with:
Easy model building with Keras and eager execution.
Robust model deployment in production on any platform.
Powerful experimentation for research.
API simplification by reducing duplication and removing deprecated endpoints.
The first teaser, the TensorFlow 2.0.0-alpha0 version, comes with the following changes:
API clean-up, including removing tf.app, tf.flags, and tf.logging in favor of absl-py.
No more global variables with helper methods like tf.global_variables_initializer and tf.get_global_step.
Functions, not sessions (tf.Session and session.run -> tf.function).
Added support for TensorFlow Lite in TensorFlow 2.0.
tf.contrib has been deprecated, and its functionality has been either migrated to the core TensorFlow API, to tensorflow/addons, or removed entirely.
Checkpoint breakage for RNNs and for Optimizers.
Minor bug fixes have also been made to the Keras and Python API and tf.estimator. Read the full list of bug fixes in the changelog.
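To give a feel for the "functions, not sessions" change listed above, here is a minimal, hedged sketch of TF 2.x-style code. It is purely illustrative (not taken from the summit announcements) and assumes a TensorFlow 2.x installation:

import tensorflow as tf

# TF 2.x style: eager execution by default, no tf.Session or session.run.
# Decorating a Python function with tf.function traces it into a graph.
@tf.function
def weighted_sum(x, w):
    return tf.reduce_sum(x * w)

x = tf.constant([1.0, 2.0, 3.0])
w = tf.constant([0.5, 0.25, 0.25])
print(weighted_sum(x, w))  # tf.Tensor(1.75, shape=(), dtype=float32)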
TensorFlow Lite 1.0
The TF-Lite framework is designed to aid developers in deploying machine learning and artificial intelligence models on mobile and IoT devices. Lite was first introduced at the I/O developer conference in May 2017 and in developer preview later that year. At the TensorFlow Dev Summit, the team announced a new version of this framework, TensorFlow Lite 1.0. According to a post by VentureBeat, improvements include selective registration and quantization during and after training for faster, smaller models. The team behind TF-Lite 1.0 says that quantization has helped them achieve up to 4 times compression of some models.

TensorFlow Privacy
Another interesting library released at the TensorFlow Dev Summit was TensorFlow Privacy. This Python-based open source library helps developers train their machine-learning models with strong privacy guarantees. To achieve this, it takes inspiration from the principles of differential privacy. This technique offers strong mathematical guarantees that models do not learn or remember details about any specific user when training on user data. TensorFlow Privacy includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. For more information, you can go through the technical whitepaper describing its privacy mechanisms in more detail. The creators also note that "no expertise in privacy or its underlying mathematics should be required for using TensorFlow Privacy. Those using standard TensorFlow mechanisms should not have to change their model architectures, training procedures, or processes."

TensorFlow Replicator
TF-Replicator, also released at the TensorFlow Dev Summit, is a software library that helps researchers deploy their TensorFlow models on GPUs and Cloud TPUs. To do this, the creators assure that developers require minimal effort and need no previous experience with distributed systems. For multi-GPU computation, TF-Replicator relies on an "in-graph replication" pattern, where the computation for each device is replicated in the same TensorFlow graph. When TF-Replicator builds an in-graph replicated computation, it first builds the computation for each device independently and leaves placeholders where cross-device computation has been specified by the user. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation. For a more comprehensive description, you can go through the research paper.

These were the top announcements made at the TensorFlow Dev Summit 2019. You can go through the keynote and other videos of the announcements and tutorials on this YouTube playlist.

TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf.function and more.
TensorFlow 2.0 is coming. Here's what we can expect.
Google introduces and open-sources Lingvo, a scalable TensorFlow framework for Sequence-to-Sequence Modeling
article-image-introduction-practical-business-intelligence
Packt
10 Nov 2016
20 min read
Save for later

Introduction to Practical Business Intelligence

In this article, Ahmed Sherif, author of the book Practical Business Intelligence, explains what business intelligence is. Before answering this question, I want to pose and answer another one: what isn't business intelligence? It is not spreadsheet analysis done on transactional data with thousands of rows. One of the goals of Business Intelligence, or BI, is to shield the users of the data from the intelligent logic lurking behind the scenes of the application that is delivering the data to them. If the integrity of the data is compromised in any way by an individual not intimately familiar with the data source, then there cannot, by definition, be intelligence in the business decisions made with that same data. The single source of truth is the key for any Business Intelligence operation, whether it is a mom and pop soda shop or a Fortune 500 company. Any report, dashboard, or analytical application that delivers information to a user through a BI tool, but whose numbers cannot be tied back to the original source, will break the trust between the user and the data and defeat the purpose of Business Intelligence.

(For more resources related to this topic, see here.)

In my opinion, the most successful tools used for business intelligence directly shield the business user from the query logic used for displaying that same data in some form of visual manner. Business Intelligence has taken many forms in terms of labels over the years. Business Intelligence is the process of delivering actionable business decisions from analytical manipulation and presentation of data within the confines of a business environment. This delivery process is where we will focus our attention. The beauty of BI is that it is not owned by any one particular tool that is proprietary to a specific industry or company. Business Intelligence can be delivered using many different tools, some of which were not originally intended to be used for BI. The tool itself should not be the source where the query logic is applied to generate the business logic of the data. The tool should primarily serve as the delivery mechanism of the query that is generated by the data warehouse that houses both the data and the logic.

In this chapter we will cover the following topics:
Understanding the Kimball method
Understanding business intelligence
Data and SQL
Working with data and SQL
Working with business intelligence tools
Downloading and installing MS SQL Server 2014
Downloading and installing AdventureWorks

Understanding the Kimball method
As we discuss the data warehouse where our data is being housed, we would be remiss not to bring up Ralph Kimball, one of the original architects of the data warehouse. Kimball's methodology incorporated dimensional modeling, which has become the standard for modeling a data warehouse for Business Intelligence purposes. Dimensional modeling incorporates joining tables that have detail data and tables that have lookup data.

A detail table is known as a fact table in dimensional modeling. An example of a fact table would be a table holding thousands of rows of transactional sales from a retail store. The table will house several IDs affiliated with the product, the sales person, the purchase date, and the purchaser, just to name a few. Additionally, the fact table will store numeric data for each individual transaction, such as sales quantity or sales amount, to name a few examples. These numeric values will be referred to as measures.

While there is usually one fact table, there will also be several lookup or dimensional tables, with one table for each ID that is used in the fact table. So, for example, there would be one dimensional table for the product name affiliated with a product ID. There would be one dimensional table for the month, week, day, and year of the ID affiliated with the date. These dimensional tables are also referred to as Lookup Tables, because they look up the name that a dimension ID is affiliated with. Usually, you would find as many dimensional tables as there are IDs in the fact table. The dimensional tables would all be joined to the one fact table, creating something of a 'star' look. Hence, this type of table join is known as a star schema, which is represented diagrammatically in the following figure.

It is customary that the fact table will be the largest table in a data warehouse, while the lookup tables will all be quite small in rows, some as small as one row. The tables are joined by IDs, also known as surrogate keys. Surrogate keys allow for the most efficient join between a fact table and a dimensional table, as they usually have an integer data type. As more and more detail is added to a dimensional table, each new dimension is just given the next number in line, usually starting with 1. Query performance between table joins suffers when we introduce non-numeric characters into the join or, worse, symbols (although most databases will not allow that).
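As an illustrative aside (not part of the original article, which builds these joins in SQL Server), the same fact-to-dimension pattern can be sketched in Python with pandas; the table and column names below are hypothetical:

import pandas as pd

# Hypothetical star schema: a fact table keyed by surrogate integer IDs,
# plus a small product dimension (lookup) table.
fact_sales = pd.DataFrame({
    "ProductID":     [1, 2, 1, 3],
    "SalesQuantity": [5, 2, 7, 1],
    "SalesAmount":   [50.0, 40.0, 70.0, 15.0],
})
dim_product = pd.DataFrame({
    "ProductID":   [1, 2, 3],
    "ProductName": ["Road Bike", "Helmet", "Water Bottle"],
})

# Join the fact table to the dimension on the surrogate key, then aggregate
# the measures by the dimension attribute.
joined = fact_sales.merge(dim_product, on="ProductID", how="left")
print(joined.groupby("ProductName")[["SalesQuantity", "SalesAmount"]].sum())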
Understanding business intelligence architecture
I will continuously hammer home the point that the various tools utilized to deliver the visual and graphical BI components should not house any internal logic to filter data out of the tool, nor should they be the source of any built-in calculations. The tools themselves should not house this logic, as they will be utilized by many different users. If each user who develops a BI app off of the tool incorporates different internal filters within the tool, the single source of truth tying back to the data warehouse will become multiple sources of truth. Any logic applied to the data to filter out a specific dimension or to calculate a specific measure should be applied in the data warehouse and then pulled into the tool. For example, if the requirement for a BI dashboard was to show current year and prior year sales for US regions only, the filter for region code would ideally be applied in the data warehouse as opposed to inside of the tool.

The following is a query written in SQL joining two tables from the AdventureWorks database that highlights the difference between dimensions and measures. The 'region' column is a dimension column and the 'SalesYTD' and 'SalesPY' columns are measure columns. In this example, 'TerritoryID' is serving as the key join between 'SalesTerritory' and 'SalesPerson'. Since the measures are coming from the 'SalesPerson' table, that table will serve as the fact table and 'SalesPerson.TerritoryID' will serve as the fact ID. Since the Region column is dimensional and coming from the 'SalesTerritory' table, that table will serve as the dimensional or lookup table and 'SalesTerritory.TerritoryID' will serve as the dimension ID. In a finely-tuned data warehouse, both the fact ID and dimension ID would be indexed to allow for efficient query performance.
This performance is obtained by sorting the IDs numerically, so that a row from one table that is being joined to another table does not have to be searched against the entire table but only a subset of that table. When the table is only a few hundred rows, it may not seem necessary to index columns, but when the table grows to a few hundred million rows, it may become necessary.

Select region.Name as Region
,round(sum(sales.SalesYTD),2) as SalesYTD
,round(sum(sales.SalesLastYear),2) as SalesPY
FROM [AdventureWorks2014].[Sales].[SalesTerritory] region
left outer join [AdventureWorks2014].[Sales].[SalesPerson] sales
on sales.TerritoryID = region.TerritoryID
where region.CountryRegionCode = 'US'
Group by region.Name
order by region.Name asc

There are several reasons why applying the logic at the database level is considered a best practice. Most of the time, these requests for filtering data or manipulating calculations are done at the BI tool level because it is easier for the developer than going to the source. However, if these filters are being performed due to data quality issues, then applying logic at the reporting level is only masking an underlying data issue that needs to be addressed across the entire data warehouse. You would be doing yourself a disservice in the long run, as you would be establishing a precedent that data quality is handled by the report developer as opposed to the database administrator. You are just adding additional work onto your plate.

Ideal BI tools will quickly connect to the data source and then allow for slicing and dicing of your dimensions and measures in a manner that will quickly inform the business of useful and practical information. Ultimately, the choice of a BI tool by an individual or an organization will come down to the ease of use of the tool as well as the flexibility to showcase the data through various components such as graphs, charts, widgets, and infographics.

Management
If you are a Business Intelligence manager looking to establish a department with a variety of tools to help flesh out your requirements, this article could serve as a good source for interview questions to weed out unqualified candidates. A manager could use it to distinguish some of the nuances between these different skillsets and prioritize hiring based on immediate needs.

Data Scientist
The term Data Scientist has been misused in the BI industry, in my humble opinion. It has been lumped in with Data Analyst as well as BI Developer. Unfortunately, these three positions have separate skillsets, and you will do yourself a disservice by assuming one person can do multiple positions successfully. A Data Scientist will be able to apply statistical algorithms to the data that is being extracted from the BI tools and make predictions about what will happen in the future with that same data set. Due to this skillset, a Data Scientist may find the chapters focusing on R and Python to be of particular importance because of their abilities to leverage predictive capabilities within their BI delivery mechanisms.

Data Analyst
A Data Analyst is probably the second most misused position behind a Data Scientist. Typically, a Data Analyst should be analyzing the data that is coming out of the BI tools that are connected to the data warehouse. Most Data Analysts are comfortable working with Microsoft Excel. Often, they are asked to take on additional roles in developing dashboards that require additional programming skills.
This is where they would find some comfort using a tool like Power BI, Tableau, or QlikView. These tools allow a Data Analyst to quickly develop a storyboard or visualization that enables quick analysis with minimal programming skills.

Visualization Developer
A 'dataviz' developer is someone who can create complex visualizations out of data and showcase interesting interactions between different measures inside of a dataset that cannot necessarily be seen with a traditional chart or graph. More often than not, these developers possess some programming background such as JavaScript, HTML, or CSS. These developers are also used to developing applications directly for the web and would therefore find D3.js a comfortable environment to program in.

Working with Data and SQL
The examples and exercises will come from the AdventureWorks database. The AdventureWorks database has a comprehensive list of tables that mimics an actual bicycle retailer. The examples will draw on different tables from the database to highlight BI reporting from the various segments appropriate for the AdventureWorks Company. These segments include Human Resources, Manufacturing, Sales, Purchasing, and Contact Management. A different segment of the data will be highlighted in each chapter, utilizing a specific set of tools. A cursory understanding of SQL (Structured Query Language) will be helpful to get a grasp of how data is being aggregated with dimensions and measures. Additionally, an understanding of the SQL statements used will help with the validation process to ensure a single source of truth between the source data and the output inside of the BI tool of choice. For more information about learning SQL, visit the following website: www.sqlfordummies.com

Working with business intelligence tools
Over the course of the last 20 years, there has been a growing number of software products released that are geared towards Business Intelligence. In addition, there have been a number of software products and programming languages that were not initially built for BI but later became a staple for the industry. The tools used here were chosen based on the fact that they are either built on open source technology or they are products from companies that provide free versions of their software for development purposes. Many of the big enterprise firms have their own BI tools, and they are quite popular. However, unless you have a license with them, it is unlikely that you will be able to use their tool without having to shell out a small fortune.

Power BI and Excel
Power BI is one of the relatively newer BI tools from Microsoft. It is known as a self-service solution and integrates seamlessly with other data sources such as Microsoft Excel and Microsoft SQL Server. Our primary purpose in using Power BI will be to generate interactive dashboards, reports, and datasets for users. In addition to using Power BI, we will also focus on utilizing Microsoft Excel to assist with some data analysis and validation of results that are being pulled from our data warehouse. Pivot tables are very popular within MS Excel and will be used to validate aggregation done inside of the data warehouse.

D3.js
D3.js, also known as data-driven documents, is a JavaScript library known for delivering beautiful visualizations by manipulating documents based on data. Since D3 is rooted in JavaScript, all visualizations make a seamless transition to the web.
D3 allows for major customization of any part of a visualization, and because of this flexibility, it will require a steeper learning curve than probably any other software program covered. D3 can consume data easily as a .json or a .csv file. Additionally, the data can also be embedded directly within the JavaScript code that renders the visualization on the web.

R
R is a free and open source statistical programming language that produces beautiful graphics. The R language has been widely used among the statistical community and, more recently, in the data science and machine learning community as well. Due to this fact, it has picked up steam in recent years as a platform for displaying and delivering effective and practical BI. In addition to visualizing BI, R has the ability to also visualize predictive analysis with algorithms and forecasts. While R is a bit raw in its interface, there are some IDEs (Integrated Development Environments) that have been developed to ease the user experience. RStudio will be used to deliver the visualizations developed within R.

Python
Python is considered the most traditional programming language of all the different languages used here. It is a widely used, general-purpose programming language with several modules that are very powerful for analyzing and visualizing data. Similar to R, Python is a bit raw in its own form for delivering beautiful graphics as a BI tool; however, with the incorporation of an IDE, the user interface becomes a much more pleasurable development experience. PyCharm will be the IDE used to develop BI with Python. PyCharm is free to use and allows creation of the IPython Notebook, which delivers seamless integration between Python and the powerful modules that will assist with BI. As a note, all code in Python will be developed using Python 3 syntax.

QlikView
QlikView is a software company specializing in delivering business intelligence solutions using their desktop tool. QlikView is one of the leaders in delivering quick visualizations based on data and queries through their desktop application. They advertise themselves as self-service BI for business users. While they do offer solutions that target more enterprise organizations, they also offer a free version of their tool for personal use. Tableau is probably the closest competitor in terms of delivering similar BI solutions.

Tableau
Tableau is a software company specializing in delivering business intelligence solutions using their desktop tool. If this sounds familiar to QlikView, it's probably because it's true. Both are leaders in the field of establishing a delivery mechanism with easy installation, setup, and connectivity to the available data. Tableau has a free version of their desktop tool. Again, Tableau excels at delivering both beautiful visualizations quickly as well as self-service data discovery for more advanced business users.

Microsoft SQL Server
Microsoft SQL Server will serve as the data warehouse for the examples that we will work through with the BI tools. Microsoft SQL Server is relatively simple to install and set up, and it is free to download. Additionally, there are example databases that configure seamlessly with it, such as the AdventureWorks database.

Downloading and Installing MS SQL Server 2014
First things first. We will need to get started with getting our database and data warehouse up and running so that we can begin to develop our BI environment. We will visit the Microsoft website below to start the download selection process.
https://www.microsoft.com/en-us/download/details.aspx?id=42299

Select the language that is applicable to you and also select the MS SQL Server Express version with Advanced features, 64-bit edition, as shown in the following screenshot. Ideally, you'll want to be working with a 64-bit edition when dealing with servers. After selecting the file, the download process should begin. Depending on your connection speed, it could take some time, as the file is slightly larger than 1 GB.

The next step in the process is selecting a new stand-alone instance of SQL Server 2014, unless you already have a version and wish to upgrade instead, as shown in the following screenshot. After accepting the license terms, continue through the steps in the Global Rules as well as the Product Updates to get to the setup installation files. For the feature selection tab, make sure the following features are selected for your installation, as shown in the following screenshot.

Our preference is to give the named instance of this database a name related to the work we are doing. Since this will be used for Business Intelligence, I went ahead and named this instance 'SQLBI', as shown in the following screenshot.

The default Server Configuration settings are sufficient for now; there is no need to change anything under that section, as shown in the following screenshot. Unless you are required to do so within your company or organization, for personal use it is sufficient to just go with Windows Authentication mode for sign-on, as shown in the following screenshot.

We will not need to do any configuring of Reporting Services, so it is sufficient for our purposes to just go with installing Reporting Services Native mode without any need for configuration at this time. At this point the installation will proceed and may take anywhere between 20-30 minutes, depending on the CPU resources. If you continue to have issues with your installation, you can visit the following website from Microsoft for additional help.

http://social.technet.microsoft.com/wiki/contents/articles/23878.installing-sql-server-2014-step-by-step-tutorial.aspx

Ultimately, if everything with the installation is successful, you'll want to see that all portions of the installation have a green check mark next to their name and are labeled 'Successful', as shown in the following screenshot.

Downloading and Installing AdventureWorks
We are almost finished with getting our business intelligence data warehouse complete. We are now at the stage where we will extract and load data into our data warehouse. The last part is to download and install the AdventureWorks database from Microsoft. The zipped file for AdventureWorks 2014 is located at the following website from Microsoft:

https://msftdbprodsamples.codeplex.com/downloads/get/880661

Once the file is downloaded and unzipped, you will find a file named the following: AdventureWorks2014.bak

Copy that file and paste it into the following folder, where it will be incorporated with your Microsoft SQL Server 2014 Express Edition:

C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup

Also note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014.
Once that has been copied over, we can fire up Management Studio for SQL Server 2014 and start a blank new query by going to File | New | Query with Current Connection. Once you have a blank query set up, copy and paste the following code into the query window and execute it:

use [master]
Restore database AdventureWorks2014
from disk = 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\Backup\AdventureWorks2014.bak'
with move 'AdventureWorks2014_data'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.mdf',
Move 'AdventureWorks2014_log'
to 'C:\Program Files\Microsoft SQL Server\MSSQL12.SQLBI\MSSQL\DATA\AdventureWorks2014.ldf'
, replace

Once again, please note that the MSSQL12.SQLBI subfolder will vary user by user, depending on what you named your SQL instance when you were installing MS SQL Server 2014. At this point, you should have received a message within the database saying that Microsoft SQL Server has processed 24248 pages for database 'AdventureWorks2014'. Once you have refreshed your database tab in the upper left-hand corner of SQL Server, the AdventureWorks database will become visible, as well as all of the appropriate tables, as shown in the following screenshot.

One final step is to verify that your login account has all of the appropriate server settings. When you right-click on the SQL Server name in the upper left-hand portion of Management Studio, select Properties. Select Permissions inside Properties. Find your username and check all of the rights under the Grant column, as shown in the following screenshot.

Finally, we also need to ensure that the folder that houses Microsoft SQL Server 2014 has the appropriate rights enabled for your current user. That specific folder is located under C:\Program Files\Microsoft SQL Server. For the purposes of our exercises, we will assign all rights for the SQL Server user to this folder, as shown in the following screenshot.

We are now ready to begin connecting our BI tools to our data!

Summary
The emphasis will be placed on implementing Business Intelligence best practices within the various tools that will be used, based on the different levels of data that are provided within the AdventureWorks database. In the next chapter, we will cover extracting additional data from the web that will be joined to the AdventureWorks database. This process is known as web scraping and can be performed with great success using tools such as Python and R. In addition to collecting the data, we will focus on transforming the collected data for optimal query performance.

Resources for Article: Further resources on this subject: LabVIEW Basics [article] Thinking Probabilistically [article] Clustering Methods [article]

article-image-including-charts-and-graphics-pentaho-reports-part-2
Packt
29 Oct 2009
6 min read
Save for later

Including Charts and Graphics in Pentaho Reports (Part 2)

Ring chart
The ring chart is identical to the pie chart, except that it renders as a ring versus a complete pie. In addition to sharing all the properties similar to the pie chart, it also defines the following rendering property:

Options Property Group
section-depth: This property defines the percentage of the radius to render the section as. The default value is set to 0.5.

Ring chart example
For this example, simply open the defined pie chart example and select the Ring chart type. Also, set the section-depth to 0.1, in order to generate the following effect:

Multi pie chart
The multi pie chart renders a group of pie charts, based on a category dataset. This meta-chart renders individual series data as a pie chart, each broken into individual categories within the individual pie charts. The multi pie chart utilizes the common properties defined above, including the category dataset properties. In addition to the standard set of properties, it also defines the following two properties:

Options Property Group
label-format: This label defines how each item within a chart is rendered. The default value is set to "{0}". The format string may also contain any of the following: {0} to render the item name, {1} to render the item value, and {2} to render the item percentage in relation to the entire pie chart.
by-row: This value defaults to True. If set to False, the series and category fields are reversed, and individual charts render series information.

Note that the horizontal, series-color, stacked and stacked-percent properties do not apply to this chart type.

Multi pie chart example
This example demonstrates the distribution of purchased item types, based on payment type. To begin, create a new report. You'll reuse the bar chart's SQL query. Now, place a new Chart element into the Report Header. Edit the chart, selecting Multi Pie as the chart type. To configure the dataset for this chart, select ITEMCATEGORY as the category-column. Set the value-columns property to QUANTITY and the series-by-field to PAYMENTTYPE.

Waterfall chart
The waterfall chart displays a unique stacked bar chart that spans categories. This chart is useful when comparing categories to one another. The last category in a waterfall chart normally equals the total of all the other categories to render appropriately, but this is based on the dataset, not the chart rendering. The waterfall chart utilizes the common properties defined above, including the category dataset properties. The stacked property is not available for this chart. There are no additional properties defined for the waterfall chart.

Waterfall chart example
In this example, you'll compare, by type, the quantity of items in your inventory. Normally, the last category would be used to display the total values. The chart will render the data provided with or without a summary series, so you'll just use the example SQL query from the bar chart example. Add a Chart element to the Report Header and select Waterfall as the chart type. Set the category-column to ITEMCATEGORY, the value-columns to QUANTITY, and the series-by-value property to Quantity. Now, apply your changes and preview the results.

Bar line chart
The bar line chart combines the bar and line charts, allowing visualization of trends with categories, along with comparisons. The bar line chart is unique in that it requires two category datasets to populate the chart.
The first dataset populates the bar chart, and the second dataset populates the line chart. The bar line chart utilizes the common properties defined above, including the category dataset properties. This chart also inherits the properties from both the bar chart as well as the line chart. This chart also has certain additional properties, which are listed in the following table:

Required Property Group
bar-data-source: The name of the first dataset required by the bar line chart, which populates the bars in the chart. This value is automatically populated with the correct name.
line-data-source: The name of the second dataset required by the bar line chart, which populates the lines in the chart. This value is automatically populated with the correct name.

Bar Options Property Group
ctgry-tick-font: Defines the Java font that renders the Categories.

Line Options Property Group
line-series-color: Defines the color in which to render each line series.
line-tick-fmt: Specifies the Java DecimalFormat string for rendering the Line Axis Labels.
lines-label-font: Defines the Java font to use when rendering line labels.
line-tick-font: Defines the Java font to use when rendering the Line Axis Labels.

As part of the bar line chart, a second y-axis is defined for the lines. The property group Y2-Axis is available with similar properties as the standard y-axis.

Bar line chart example
To demonstrate the bar line chart, you'll reuse the SQL query from the area chart example. Create a new report, and add a Chart element to the Report Header. Edit the chart, and select Bar Line as the chart type. You'll begin by configuring the first dataset. Set the category-column to ITEMCATEGORY, the value-columns to COST, and the series-by-value property to Cost. To configure the second dataset, set the category-column to ITEMCATEGORY, the value-columns to SALEPRICE, and the series-by-value property to Sale Price. Set the x-axis-label-width to 2.0, and reduce the x-font size to 7. Also, set show-legend to True. You're now ready to preview the bar line chart.

Bubble chart
The bubble chart allows you to view three dimensions of data. The first two dimensions are your traditional X and Y dimensions, also known as domain and range. The third dimension is expressed by the size of the individual bubbles rendered. The bubble chart utilizes the common properties defined above, including the XY series dataset properties. The bubble chart also defines the following properties:

Options Property Group
max-bubble-size: This value defines the diameter of the largest bubble to render. All other bubble sizes are relative to the maximum bubble size. The default value is 0, so this value must be set to a reasonable value for rendering of bubbles to take place. Note that this value is based on pixels, not the domain or range values.

The bubble chart defines the following additional dataset property:

Required Property Group
z-value-columns: This is the data source column to use for the Z value, which specifies the bubble diameter relative to the maximum bubble size.

Bubble chart example
In this example, you need to define a three-dimensional SQL query to populate the chart.
You'll use inventory information, and calculate Item Category Margin:

SELECT
"INVENTORY"."ITEMCATEGORY",
"INVENTORY"."ONHAND",
"INVENTORY"."ONORDER",
"INVENTORY"."COST",
"INVENTORY"."SALEPRICE",
"INVENTORY"."SALEPRICE" - "INVENTORY"."COST" MARGIN
FROM
"INVENTORY"
ORDER BY
"INVENTORY"."ITEMCATEGORY" ASC

Now that you have a SQL query to work with, add a Chart element to the Report Header and select Bubble as the chart type. First, you'll populate the correct dataset fields. Set the series-by-field property to ITEMCATEGORY. Now, set the X, Y, and Z value columns to ONHAND, SALEPRICE, and MARGIN. You're now ready to customize the chart rendering. Set the x-title to On Hand, the y-title to Sales Price, the max-bubble-size to 100, and the show-legend property to True. The final result should look like this: