Data | Tech News, Tutorials & Expert Insights

article-image-signal-processing-techniques

12 Jun 2014

6 min read

Signal Processing Techniques

12 Jun 2014

(For more resources related to this topic, see here.) Introducing the Sunspot data Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages. In order to determine sunspot cycles, scientists successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMF). You can compare this to the way Fast Fourier Transform decomposes a signal in a superposition of sine and cosine terms. Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima respectively: mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] Now we need to concatenate these arrays and use the indices to select the corresponding values. The following code accomplishes that and also plots the data: import numpy as np import sys import matplotlib.pyplot as plt from scipy import signal data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True,skiprows=1) #reverse order data = data[::-1] mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] extrema = np.concatenate((mins, maxs)) year_range = np.arange(1700, 1700 + len(data)) plt.plot(1700 + extrema, data[extrema], 'go') plt.plot(year_range, data) plt.show() We will see the following chart: In this plot, you can see the extrema is indicated with dots. Sifting continued The next steps in the sifting process require us to interpolate with a cubic spline of the minima and maxima. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate minima with the following code snippet: spl_min = interpolate.interp1d(mins, data[mins], kind='cubic') min_rng = np.arange(mins.min(), mins.max()) l_env = spl_min(min_rng) Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we are interpolating. This range is defined by the first occurrence of a minima/maxima and ends at the last occurrence of a minima/maxima. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges. Have a look at the following code: import numpy as np import sys import matplotlib.pyplot as plt from scipy import signal from scipy import interpolate data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True,skiprows=1) #reverse order data = data[::-1] mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] extrema = np.concatenate((mins, maxs)) year_range = np.arange(1700, 1700 + len(data)) spl_min = interpolate.interp1d(mins, data[mins], kind='cubic') min_rng = np.arange(mins.min(), mins.max()) l_env = spl_min(min_rng) spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic') max_rng = np.arange(maxs.min(), maxs.max()) u_env = spl_max(max_rng) inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1],max_rng[-1])) mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng))/2 plt.plot(year_range, data) plt.plot(1700 + min_rng, l_env, '-x') plt.plot(1700 + max_rng, u_env, '-x') plt.plot(1700 + inclusive_rng, mid, '--') plt.show() The code produces the following chart: What you see is the observed data, with computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The algorithm is a bit more involved, so we will leave it up to you whether or not you want to continue with this analysis on your own. Moving averages Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable. We can express an equal-weight strategy for the simple moving average as follows in the NumPy code: weights = np.exp(np.linspace(-1., 0., N)) weights /= weights.sum() A simple moving average uses equal weights which, in code, looks as follows: def sma(arr, n): weights = np.ones(n) / n return np.convolve(weights, arr)[n-1:-n+1] The following code plots the simple moving average for the 11- and 22-year sunspot cycle: import numpy as np import sys import matplotlib.pyplot as plt data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1) #reverse order data = data[::-1] year_range = np.arange(1700, 1700 + len(data)) def sma(arr, n): weights = np.ones(n) / n return np.convolve(weights, arr)[n-1:-n+1] sma11 = sma(data, 11) sma22 = sma(data, 22) plt.plot(year_range, data, label='Data') plt.plot(year_range[10:], sma11, '-x', label='SMA 11') plt.plot(year_range[21:], sma22, '--', label='SMA 22') plt.legend() plt.show() In the following plot, we see the original data and the simple moving averages for 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data. Summary This article gave us examples of signal processing and time series analysis. We looked at shifting continued that performs the first iteration of the EMD process. We also learned about Moving averages, which are tools commonly used to analyze time-series data. Resources for Article: Further resources on this subject: Advanced Indexing and Array Concepts [Article] Fast Array Operations with NumPy [Article] Move Further with NumPy Modules [Article]

0
0
4085

article-image-using-r-statistics-research-and-graphics

Packt

16 Sep 2014

12 min read

Using R for Statistics, Research, and Graphics

Packt

16 Sep 2014

12 min read

In this article by David Alexander Lillis, author of the R Graph Essentials, we will talk about R. Developed by Professor Ross Ihaka and Dr. Robert Gentleman at Auckland University (New Zealand) during the early 1990s, the R statistics environment is a real success story. R is open source software, which you can download in a couple of minutes from the Comprehensive R Network (CRAN) website (http://cran.r-project.org/), and combines a powerful programming language, outstanding graphics, and a comprehensive range of useful statistical functions. If you need a statistics environment that includes a programming language, R is ideal. It's true that the learning curve is longer than for spreadsheet-based packages, but once you master the R programming syntax, you can develop your own very powerful analytic tools. Many contributed packages are available on the web for use with R, and very often the analytic tools you need can be downloaded at no cost. (For more resources related to this topic, see here.) The main problem for those new to R is the time required to master the programming language, but several nice graphical user interfaces, such as John Fox's R Commander package, are available, which make it much easier for the newcomer to develop proficiency in R than it used to be. For many statisticians and researchers, R is the package of choice because of its powerful programming language, the easy availability of code, and because it can import Excel spreadsheets, comma separated variable (.csv) spreadsheets, and text files, as well as SPSS files, STATA files, and files produced within other statistical packages. You may be looking for a tool for your own data analysis. If so, let's take a brief look at what R can do for you. Some basic R syntax Data can be created in R or else read in from .csv or other files as objects. For example, you can read in the data contained within a .csv file called mydata.csv as follows: A <- read.csv(mydata.csv, h=T) A The object A now contains all the data of the original file. The syntax A[3,7] picks out the element in row 3 and column 7. The syntax A[14, ] selects the fourteenth row and A[,6] selects the sixth column. The functions mean(A) and sd(A) find the mean and standard deviation of each column. The syntax 3*A + 7 would triple each element of A and add 7 to each element and store the new array as the object B Now you could save this array as a .csv file called Outputfile.csv as follows: write.csv(B, file="Outputfile.csv") Statistical modeling R provides a comprehensive range of basic statistical functions relating to the commonly-used distributions (normal distribution, t-distribution, Poisson, gamma, and so on), and many less-well known distributions. It also provides a range of non-parametric tests that are appropriate when your data are not distributed normally. Linear and non-linear regressions are easy to perform, and finding the optimum model (that is, by eliminating non-significant independent variables and non-significant factor interactions) is particularly easy. Implementing Generalized Linear Models and other commonly-used models such as Analysis of Variance, Multivariate Analysis of Variance, and Analysis of Covariance is also straightforward and, once you know the syntax, you may find that such tasks can be done more quickly in R than in other packages. The usual post-hoc tests for identifying factor levels that are significantly different from the other levels (for example, Tukey and Sheffe tests) are available, and testing for interactions between factors is easy. Factor Analysis, and the related Principal Components Analysis, are well known data reduction techniques that enable you to explain your data in terms of smaller sets of independent variables (or factors). Both methods are available in R, and code for complex designs, including One and Two Way Repeated Measures, and Four Way ANOVA (for example, two repeated measures and two between-subjects), can be written relatively easily or downloaded from various websites (for example, http://www.personality-project.org/r/). Other analytic tools include Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, and Correspondence Analysis. R also provides various methods for fitting analytic models to data and smoothing (for example, lowess and spline-based methods). Miscellaneous packages for specialist methods You can find some very useful packages of R code for fields as diverse as biometry, epidemiology, astrophysics, econometrics, financial and actuarial modeling, the social sciences, and psychology. For example, if you are interested in Astrophysics, Penn State Astrophysics School offers a nice website that includes both tutorials and code (http://www.iiap.res.in/astrostat/RTutorials.html). Here I'll mention just a few of the popular techniques: Monte Carlo methods A number of sources give excellent accounts of how to perform Monte Carlo simulations in R (that is, drawing samples from multidimensional distributions and estimating expected values). A valuable text is Christian Robert's book Introducing Monte Carlo Methods with R. Murali Haran gives another interesting Astrophysical example in the CAStR website (http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html). Structural Equation Modeling Structural Equation Modelling (SEM) is becoming increasingly popular in the social sciences and economics as an alternative to other modeling techniques such as multiple regression, factor analysis and analysis of covariance. Essentially, SEM is a kind of multiple regression that takes account of factor interactions, nonlinearities, measurement error, multiple latent independent variables, and latent dependent variables. Useful references for conducting SEM in R include those of Revelle, Farnsworth (2008), and Fox (2002 and 2006). Data mining A number of very useful resources are available for anyone contemplating data mining using R. For example, Luis Torgo has just published a book on data mining using R, and presents case studies, along with the datasets and code, which the interested student can work through. Torgo's book provides the usual analytic and graphical techniques used every day by data miners, including visualization techniques, dealing with missing values, developing prediction models, and methods for evaluating the performance of your models. Also of interest to the data miner is the Rattle GUI (R Analytical Tool to Learn Easily). Rattle is a data mining facility for analyzing very large data sets. It provides many useful statistical and graphical data summaries, presents mechanisms for developing a variety of models, and summarizes the performance of your models. Graphics in R Quite simply, the quality and range of graphics available through R is superb and, in my view, vastly superior to those of any other package I have encountered. Of course, you have to write the necessary code, but once you have mastered this skill, you have access to wonderful graphics. You can write your own code from scratch, but many websites provide helpful examples, complete with code, which you can download and modify to suit your own needs. R's base graphics (graphics created without the use of any additional contributed packages) are superb, but various graphics packages such as ggplot2 (and the associated qplot function) help you to create wonderful graphs. R's graphics capabilities include, but are not limited to, the following: Base graphics in R Basic graphics techniques and syntax Creating scatterplots and line plots Customizing axes, colors, and symbols Adding text – legends, titles, and axis labels Adding lines – interpolation lines, regression lines, and curves Increasing complexity – graphing three variables, multiple plots, or multiple axes Saving your plots to multiple formats – PDF, postscript, and JPG Including mathematical expressions on your plots Making graphs clear and pretty – including a grid, point labels, and shading Shading and coloring your plot Creating bar charts, histograms, boxplots, pie charts, and dotcharts Adding loess smoothers Scatterplot matrices R's color palettes Adding error bars Creating graphs using qplot Using basic qplot graphics techniques and syntax to customize in easy steps Creating scatterplots and line plots in qplot Mapping symbol size, symbol type and symbol color to categorical data Including regressions and confidence intervals on your graphs Shading and coloring your graph Creating bar charts, histograms, boxplots, pie charts, and dotcharts Labelling points on your graph Creating graphs using ggplot Ploting options – backgrounds, sizes, transparencies, and colors Superimposing points Controlling symbol shapes and using pretty color schemes Stacked, clustered, and paneled bar charts Methods for detailed customization of lines, point labels, smoothers, confidence bands, and error bars The following graph records information on the heights in centimeters and weights in kilograms of patients in a medical study. The curve in red gives a smoothed version of the data, created using locally weighted scatterplot smoothing. Both the graph and the modelling required to produce the smoothed curve, were performed in R. Here is another graph. It gives the heights and body masses of female patients receiving treatment in a hospital. Each patient is identified by name. This graph was created very easily using ggplot, and shows the default background produced by ggplot (a grey plotting background and white grid lines). Next, we see a histogram of patients' heights and body masses, partitioned by gender. The bars are given in an orange and an ivory color. The ggplot package provides a wide range of colors and hues, as well as a wide range of color palettes. Finally, we see a line graph of height against age for a group of four children. The graph includes both points and lines and we have a unique color for each child. The ggplot package makes it possible to create attractive and effective graphs for research and data analysis. Summary For many scientists and data analysts, mastery of R could be an investment for the future, particularly for those who are beginning their careers. The technology for handling scientific computation is advancing very quickly, and is a major impetus for scientific advance. Some level of mastery of R has become, for many applications, essential for taking advantage of these developments. Spatial analysis, where R provides an integrated framework access to abilities that are spread across many different computer programs, is a good example. A few years ago, I would not have recommended R as a statistics environment for generalist data analysts or postgraduate students, except those working directly in areas involving statistical modeling. However, many tutorials are downloadable from the Internet and a number of organizations provide online tutorials and/or face-to-face workshops (for example, The Analysis Factor http://www.theanalysisfactor.com/). In addition, the appearance of GUIs, such as R Commander and the new iNZight GUI33 (designed for use in schools), makes it easier for non-specialists to learn and use R effectively. I am most happy to provide advice to anyone contemplating learning to use this outstanding statistical and research tool. References Some useful material on R are as follows: L'analyse des donn´ees. Tome 1: La taxinomie, Tome 2: L'analyse des correspondances, Dunod, Paris, Benz´ecri, J. P (1973). Computation of Correspondence Analysis, Blasius J, Greenacre, M. J (1994). In M J Greenacre, J Blasius (eds.), Correspondence Analysis in the Social Sciences, pp. 53–75, Academic Press, London. Statistics: An Introduction using R, Crawley, M. J. (m.crawley@imperial.ac.uk), Imperial College, Silwood Park, Ascot, Berks, Published in 2005 by John Wiley & Sons, Ltd. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470022973,subjectCd-ST05.html (ISBN 0-470-02297-3). http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr. Structural Equation Models Appendix to An R and S-PLUS Companion to Applied Regression, Fox, John, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf. Getting Started with the R Commander, Fox, John, 26 August 2006. The R Commander: A Basic-Statistics Graphical User Interface to R, Fox, John, Journal of Statistical Software, September 2005, Volume 14, Issue 9. http://www.jstatsoft.org/. Structural Equation Modeling With the sem Package in R, Fox, John, Structural Equation Modeling, 13(3), 465–486. Lawrence Erlbaum Associates, Inc. 2006. Biplots in Biomedical Research, Gabriel, K, R and Odoroff, C, 9, 469–485, Statistics in Medicine, 1990. Theory and Applications of Correspondence Analysis, Greenacre M. J., Academic Press, London, 1984. Using R for Data Analysis and Graphics Introduction, Code and Commentary, Maindonald, J. H, Centre for Mathematics and its Applications, Australian National University. Introducing Monte Carlo Methods with R, Series Use R, Robert, Christian P., Casella, George, 2010, XX, 284 p., Softcover, ISBN 978-1-4419-1575-7. <p>Useful tutorials available on the web are as follows:</p> An Introduction to R: examples for Actuaries, De Silva, N, 2006, http://toolkit.pbworks.com/f/R%20Examples%20for%20Actuaries%20v0.1-1.pdf. Econometrics in R, Farnsworth, Grant, V, October 26, 2008, http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf. An Introduction to the R Language, Harte, David, Statistics Research Associates Limited, www.statsresearch.co.nz. Quick R, Kabakoff, Rob, http://www.statmethods.net/index.html. R for SAS and SPSS Users, Muenchen, Bob, http://RforSASandSPSSusers.com. Statistical Analysis with R - a quick start, Nenadi´,C and Zucchini, Walter. R for Beginners, Paradis, Emannuel (paradis@isem.univ-montp2.fr), Institut des Sciences de l' Evolution, Universite Montpellier II, F-34095 Montpellier c_edex 05, France. Data Mining with R learning by case studies, Torgo, Luis, http://www.liaad.up.pt/~ltorgo/DataMiningWithR/. SimpleR - Using R for Introductory Statistics, Verzani, John, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf. Time Series Analysis and Its Applications: With R Examples, http://www.stat.pitt.edu/stoffer/tsa2/textRcode.htm#ch2. The irises of the Gaspé peninsula, E. Anderson, Bulletin of the American Iris Society, 59, 2-5. 1935. Introducing Monte Carlo Methods with R, Series Use R, Robert, Christian P., Casella, George. 2010, XX, 284 p., Softcover, ISBN: 978-1-4419-1575-7. Resources for Article: Further resources on this subject: Aspects of Data Manipulation in R [Article] Learning Data Analytics with R and Hadoop [Article] First steps with R [Article]

0
0
4073

article-image-social-media-insight-using-naive-bayes

Packt

22 Feb 2016

48 min read

Social Media Insight Using Naive Bayes

Packt

22 Feb 2016

48 min read

0
0
4053

article-image-crypto-cash-is-missing-from-the-wallet-of-dead-cryptocurrency-entrepreneur-gerald-cotten-find-it-and-you-could-get-100000

Richard Gall

05 Mar 2019

3 min read

Crypto-cash is missing from the wallet of dead cryptocurrency entrepreneur Gerald Cotten - find it, and you could get $100,000

Richard Gall

05 Mar 2019

3 min read

In theory, stealing cryptocurrency should be impossible. But a mystery has emerged that seems to throw all that into question and even suggests a bigger, much stranger conspiracy. Gerald Cotten, the founder of cryptocurrency exchange QadrigaCX, died in December in India. He was believed to have left $136 million USD worth of crypto-cash in 'cold wallets' on his own laptop, to which only he had access. However, investigators from EY, who have been working on closing QuadrigaCX following Cotten's death, were surprised to find that the wallets were empty. In fact, it's believed crypto-cash had disappeared from them months before Cotten died. A cryptocurrency mystery now involving the FBI The only lead in this mystery is the fact that the EY investigators have found other user accounts that appear to be linked to Gerald Cotten. There's a chance that Cotten used these to trade on his own exchange, but the nature of these exchanges remain a little unclear. To add to the intrigue, Fortune reported yesterday that the FBI are working with Canada's Mounted Police Force to investigate the missing money. This information came from Jesse Powell, CEO of another cryptocurrency company called Kraken. Powell told Fortune that both the FBI and the Mounted Police have been in touch with him about the mystery surrounding QuadrigaCX. Powell has offered a reward of $100,000 to anyone that can locate the missing cryptocurrency funds. So what actually happened to Gerald Cotten and his crypto-cash? The story has many layers of complexity. There are rumors that Cotten faked his own death. For example, Cotten filed a will just 12 days before his death, leaving a significant amount of wealth and assets to his wife. And while sources from the hospital in India where Cotten is believed to have died say he died of cardiac arrest, as Fortune explains, "Cotten’s body was handled by hotel staff after an embalmer refused to receive it" - something which is, at the very least, strange. It should be noted that there is certainly no clear evidence that Cotten faked his own death - only missing pieces that encourage such rumors. A further subplot - that might or night not be useful in cracking this case - emerged late last week when Canada's Globe and Mail reported that QuadrigaCX's co-founder has a history of identity theft and using digital currencies to launder money. Where could the money be? There are, as you might expect, no shortage of theories about where the cash could be. A few days ago, it was suggested that it might be possible to locate Cotten's Ethereum funds - a blog post by James Edwards, who is the editor of cryptocurrency blog zerononcense claimed that Ethereum linked to QuadrigaCX can be found in Bitfinex, Poloniex, and Jesse Powell's Kraken. "It appears that a significant amount of Ethereum (600,000+ ETH) was transferred to these exchanges as a means of ‘storage’ during the years that QuadrigaCX was in operation and offering Ethereum on their exchange," Edwards writes. Edwards is keen for his findings to be the starting point for a clearer line of inquiry, free from speculation and conspiracy. He wrote that he hoped that it would be "a helpful addition to the QuadrigaCX narrative, rather than a conspiratorial piece that speculates on whether the exchange or its owners have been honest."

0
0
4038

article-image-top-announcements-from-the-tensorflow-dev-summit-2019

Sugandha Lahoti

08 Mar 2019

5 min read

Top announcements from the TensorFlow Dev Summit 2019

Sugandha Lahoti

08 Mar 2019

5 min read

The two-days long TensorFlow Dev Summit 2019 just got over, leaving in its wake major updates being made to the TensorFlow ecosystem. The major announcement included the release of the first alpha version of most coveted release TensorFlow 2.0. Also announced were, TensorFlow Lite 1.0, TensorFlow Federated, TensorFlow Privacy and more. TensorFlow Federated In a medium blog post, Alex Ingerman (Product Manager) and Krzys Ostrowski (Research Scientist) introduced the TensorFlow Federated framework on the first day. This open source framework is useful for experimenting with machine learning and other computations on decentralized data. As the name suggests, this framework uses Federated Learning, a learning approach introduced by Google in 2017. This technique enables ML models to collaboratively learn a shared prediction model while keeping all the training data on the device. Thus eliminating machine learning from the need to store the data in the cloud. The authors note that TFF is based on their experiences with developing federated learning technology at Google. TFF uses the Federated Learning API to express an ML model architecture, and then train it across data provided by multiple developers, while keeping each developer’s data separate and local. It also uses the Federated Core (FC) API, a set of lower-level primitives, which enables the expression of a broad range of computations over a decentralized dataset. The authors conclude, “With TFF, we are excited to put a flexible, open framework for locally simulating decentralized computations into the hands of all TensorFlow users. You can try out TFF in your browser, with just a few clicks, by walking through the tutorials.” TensorFlow 2.0.0- alpha0 The event also the release of the first alpha version of the TensorFlow 2.0 framework which came with fewer APIs. First introduced last year in August by Martin Wicke, engineer at Google, TensorFlow 2.0, is expected to come with: Easy model building with Keras and eager execution. Robust model deployment in production on any platform. Powerful experimentation for research. API simplification by reducing duplication removing deprecated endpoints. The first teaser, TensorFlow 2.0.0- alpha0 version comes with the following changes: API clean-up included removing tf.app, tf.flags, and tf.logging in favor of absl-py. No more global variables with helper methods like tf.global_variables_initializer and tf.get_global_step. Functions, not sessions (tf.Session and session.run -> tf.function). Added support for TensorFlow Lite in TensorFlow 2.0. tf.contrib has been deprecated, and functionality has been either migrated to the core TensorFlow API, to tensorflow/addons, or removed entirely. Checkpoint breakage for RNNs and for Optimizers. Minor bug fixes have also been made to the Keras and Python API and tf.estimator. Read the full list of bug fixes in the changelog. TensorFlow Lite 1.0 The TF-Lite framework is basically designed to aid developers in deploying machine learning and artificial intelligence models on mobile and IoT devices. Lite was first introduced at the I/O developer conference in May 2017 and in developer preview later that year. At the TensorFlow Dev Summit, the team announced a new version of this framework, the TensorFlow Lite 1.0. According to a post by VentureBeat, improvements include selective registration and quantization during and after training for faster, smaller models. The team behind TF-Lite 1.0 says that quantization has helped them achieve up to 4 times compression of some models. TensorFlow Privacy Another interesting library released at the TensorFlow dev summit was TensorFlow Privacy. This Python-based open source library aids developers to train their machine-learning models with strong privacy guarantees. To achieve this, it takes inspiration from the principles of differential privacy. This technique offers strong mathematical guarantees that models do not learn or remember the details about any specific user when training the user data. TensorFlow Privacy includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. For more information, you can go through the technical whitepaper describing its privacy mechanisms in more detail. The creators also note that “no expertise in privacy or its underlying mathematics should be required for using TensorFlow Privacy. Those using standard TensorFlow mechanisms should not have to change their model architectures, training procedures, or processes.” TensorFlow Replicator TF Replicator also released at the TensorFlow Dev Summit, is a software library that helps researchers deploy their TensorFlow models on GPUs and Cloud TPUs. To do this, the creators assure that developers would require minimal effort and need not have previous experience with distributed systems. For multi-GPU computation, TF-Replicator relies on an “in-graph replication” pattern, where the computation for each device is replicated in the same TensorFlow graph. When TF-Replicator builds an in-graph replicated computation, it first builds the computation for each device independently and leaves placeholders where cross-device computation has been specified by the user. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation. For a more comprehensive description, you can go through the research paper. These were the top announcements made at the TensorFlow Dev Summit 2019. You can go through the Keynote and other videos of the announcements and tutorials on this YouTube playlist. TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tffunction and more. TensorFlow 2.0 is coming. Here’s what we can expect. Google introduces and open-sources Lingvo, a scalable TensorFlow framework for Sequence-to-Sequence Modeling

0
0
4037

article-image-introduction-practical-business-intelligence

Packt

10 Nov 2016

20 min read

Introduction to Practical Business Intelligence

Packt

10 Nov 2016

20 min read

0
1
4035

article-image-including-charts-and-graphics-pentaho-reports-part-2

Packt

29 Oct 2009

6 min read

Including Charts and Graphics in Pentaho Reports (Part 2)

Packt

29 Oct 2009

6 min read

Ring chart The ring chart is identical to the pie chart, except that it renders as a ring versus a complete pie. In addition to sharing all the properties similar to the pie chart, it also defines the following rendering property : Options Property Group Property name Description section-depth This property defines the percentage of the radius to render the section as. The default value is set to 0.5. Ring chart example For this example, simply open the defined pie chart example and select the Ring chart type. Also, set the section-depth to 0.1, in order to generate the following effect: Multi pie chart The multi pie chart renders a group of pie charts, based on a category dataset. This meta-chart renders individual series data as a pie chart, each broken into individual categories within the individual pie charts. The multi pie chart utilizes the common properties defined above, including the category dataset properties. In addition to the standard set of properties, it also defines the following two properties: Options Property Group Property name Description label-format This label defines how each item within a chart is rendered. The default value is set to "{0}". The format string may also contain any of the following: {0}: To render the item name {1}: To render the item value {2}: To render the item percentage in relation to the entire pie chart by-row This value defaults to True. If set to False, the series and category fields are reversed, and individual charts render series information. Note that the horizontal, series-color, stacked and stacked-percent properties do not apply to this chart type. Multi pie chart example This example demonstrates the distribution of purchased item types, based on payment type. To begin, create a new report. You'll reuse the bar chart's SQL query. Now, place a new Chart element into the Report Header. Edit the chart, selecting Multi Pie as the chart type. To configure the dataset for this chart, select ITEMCATEGORY as the category-column. Set the value-columns property to QUANTITY and the series-by-field to PAYMENTTYPE. Waterfall chart The waterfall chart displays a unique stacked bar chart that spans categories. This chart is useful when comparing categories to one another. The last category in a waterfall chart normally equals the total of all the other categories to render appropriately, but this is based on the dataset, not the chart rendering. The waterfall chart utilizes the common properties defined above, including the category dataset properties. The stacked property is not available for this chart. There are no additional properties defined for the waterfall chart. Waterfall chart example In this example, you'll compare by type, the quantity of items in your inventory. Normally, the last category would be used to display the total values. The chart will render the data provided with or without a summary series, so you'll just use the example SQL query from the bar chart example. Add a Chart element to the Report Header and select Waterfall as the chart type. Set the category-column to ITEMCATEGORY, the value-columns to QUANTITY, and the series-by-value property to Quantity. Now, apply your changes and preview the results. Bar line chart The bar line chart combines the bar and line charts, allowing visualization of trends with categories, along with comparisons. The bar line chart is unique in that it requires two category datasets to populate the chart. The first dataset populates the bar chart, and the second dataset populates the line chart. The bar line chart utilizes the common properties defined above, including the category dataset properties. This chart also inherits the properties from both the bar chart, as well as the line chart. This chart also has certain additional properties, which are listed in the following table: Required Property Group Property name Description bar-data-source The name of the first dataset required by the bar line chart, which populates the bars in the chart. This value is automatically populated with the correct name. line-data-source The name of the second dataset required by the bar line chart, which populates the lines in the chart. This value is automatically populated with the correct name. Bar Options Property Group Property name Description ctgry-tick-font Defines the Java font that renders the Categories. Line Options Property Group Property name Description line-series-color Defines the color in which to render each line series. line-tick-fmt Specifies the Java DecimalFormat string for rendering the Line Axis Labels lines-label-font Defines the Java font to use when rendering line labels. line-tick-font Defines the Java font to use when rendering the Line Axis Labels. As part of the bar line chart, a second y-axis is defined for the lines. The property group Y2-Axis is available with similar properties as the standard y-axis. Bar line chart example To demonstrate the bar line chart, you'll reuse the SQL query from the area chart example. Create a new report, and add a Chart element to the Report Header. Edit the chart, and select Bar Line as the chart type. You'll begin by configuring the first dataset. Set the category-column to ITEMCATEGORY, the value-columns to COST, and the series-by-value property to Cost. To configure the second dataset, set the category-column to ITEMCATEGORY, the value-columns to SALEPRICE, and the series-by-value property to Sale Price. Set the x-axis-label-width to 2.0, and reduce the x-font size to 7. Also, set show-legend to True. You're now ready to preview the bar line chart. Bubble chart The bubble chart allows you to view three dimensions of data. The first two dimensions are your traditional X and Y dimensions, also known as domain and range. The third dimension is expressed by the size of the individual bubbles rendered. The bubble chart utilizes the common properties defined above, including the XY series dataset properties. The bubble chart also defines the following properties: Options Property Group Property name Description max-bubble-size This value defines the diameter of the largest bubble to render. All other bubble sizes are relative to the maximum bubble size. The default value is 0, so this value must be set to a reasonable value for rendering of bubbles to take place. Note that this value is based on pixels, not the domain or range values. The bubble chart defines the following additional dataset property: Required Property Group Property name Description z-value-columns This is the data source column to use for Z value, which specifies the bubble diameter relative to the maximum bubble size. Bubble chart example In this example, you need to define a three dimensional SQL query to populate the chart. You'll use inventory information, and calculate Item Category Margin: SELECT"INVENTORY"."ITEMCATEGORY","INVENTORY"."ONHAND","INVENTORY"."ONORDER","INVENTORY"."COST","INVENTORY"."SALEPRICE","INVENTORY"."SALEPRICE" - "INVENTORY"."COST" MARGINFROM"INVENTORY"ORDER BY"INVENTORY"."ITEMCATEGORY" ASC Now that you have a SQL query to work with, add a Chart element to the Report Header and select Bubble as the chart type. First, you'll populate the correct dataset fields. Set the series-by-field property to ITEMCATEGORY. Now, set the X, Y, and Z value columns to ONHAND, SALEPRICE, and MARGIN. You're now ready to customize the chart rendering. Set the x-title to On Hand, the y-title to Sales Price, the max-bubble-size to 100, and the show-legend property to True. The final result should look like this:

0
0
4011

Packt

10 Aug 2015

17 min read

The Splunk Interface

Packt

10 Aug 2015

17 min read

In this article by Vincent Bumgarner & James D. Miller, author of the book, Implementing Splunk - Second Edition, we will walk through the most common elements in the Splunk interface, and will touch upon concepts that will be covered in greater detail. You may want to dive right into the search section, but an overview of the user interface elements might save you some frustration later. We will cover the following topics: Logging in and app selection A detailed explanation of the search interface widgets A quick overview of the admin interface (For more resources related to this topic, see here.) Logging into Splunk The Splunk GUI interface (Splunk is also accessible through its command-line interface [CLI] and REST API) is web-based, which means that no client needs to be installed. Newer browsers with fast JavaScript engines, such as Chrome, Firefox, and Safari, work better with the interface. As of Splunk Version 6.2.0, no browser extensions are required. Splunk Versions 4.2 and earlier require Flash to render graphs. Flash can still be used by older browsers, or for older apps that reference Flash explicitly. The default port for a Splunk installation is 8000. The address will look like: http://mysplunkserver:8000 or http://mysplunkserver.mycompany.com:8000. The Splunk interface If you have installed Splunk on your local machine, the address can be some variant of http://localhost:8000, http://127.0.0.1:8000, http://machinename:8000, or http://machinename.local:8000. Once you determine the address, the first page you will see is the login screen. The default username is admin with the password changeme. The first time you log in, you will be prompted to change the password for the admin user. It is a good idea to change this password to prevent unwanted changes to your deployment. By default, accounts are configured and stored within Splunk. Authentication can be configured to use another system, for instance Lightweight Directory Access Protocol (LDAP). By default, Splunk authenticates locally. If LDAP is set up, the order is as follows: LDAP / Local. The home app After logging in, the default app is the Launcher app (some may refer to this as Home). This app is a launching pad for apps and tutorials. In earlier versions of Splunk, the Welcome tab provided two important shortcuts, Add data and the Launch search app. In version 6.2.0, the Home app is divided into distinct areas, or panes, that provide easy access to Explore Splunk Enterprise (Add Data, Splunk Apps, Splunk Docs, and Splunk Answers) as well as Apps (the App management page) Search & Reporting (the link to the Search app), and an area where you can set your default dashboard (choose a home dashboard). The Explore Splunk Enterprise pane shows links to: Add data: This links Add Data to the Splunk page. This interface is a great start for getting local data flowing into Splunk (making it available to Splunk users). The Preview data interface takes an enormous amount of complexity out of configuring dates and line breaking. Splunk Apps: This allows you to find and install more apps from the Splunk Apps Marketplace (http://apps.splunk.com). This marketplace is a useful resource where Splunk users and employees post Splunk apps, mostly free but some premium ones as well. Splunk Answers: This is one of your links to the wide amount of Splunk documentation available, specifically http://answers.splunk.com, where you can engage with the Splunk community on Splunkbase (https://splunkbase.splunk.com/) and learn how to get the most out of your Splunk deployment. The Apps section shows the apps that have GUI elements on your instance of Splunk. App is an overloaded term in Splunk. An app doesn't necessarily have a GUI at all; it is simply a collection of configurations wrapped into a directory structure that means something to Splunk. Search & Reporting is the link to the Splunk Search & Reporting app. Beneath the Search & Reporting link, Splunk provides an outline which, when you hover over it, displays a Find More Apps balloon tip. Clicking on the link opens the same Browse more apps page as the Splunk Apps link mentioned earlier. Choose a home dashboard provides an intuitive way to select an existing (simple XML) dashboard and set it as part of your Splunk Welcome or Home page. This sets you at a familiar starting point each time you enter Splunk. The following image displays the Choose Default Dashboard dialog: Once you select an existing dashboard from the dropdown list, it will be part of your welcome screen every time you log into Splunk – until you change it. There are no dashboards installed by default after installing Splunk, except the Search & Reporting app. Once you have created additional dashboards, they can be selected as the default. The top bar The bar across the top of the window contains information about where you are, as well as quick links to preferences, other apps, and administration. The current app is specified in the upper-left corner. The following image shows the upper-left Splunk bar when using the Search & Reporting app: Clicking on the text takes you to the default page for that app. In most apps, the text next to the logo is simply changed, but the whole block can be customized with logos and alternate text by modifying the app's CSS. The upper-right corner of the window, as seen in the previous image, contains action links that are almost always available: The name of the user who is currently logged in appears first. In this case, the user is Administrator. Clicking on the username allows you to select Edit Account (which will take you to the Your account page) or to Logout (of Splunk). Logout ends the session and forces the user to login again. The following screenshot shows what the Your account page looks like: This form presents the global preferences that a user is allowed to change. Other settings that affect users are configured through permissions on objects and settings on roles. (Note: preferences can also be configured using the CLI or by modifying specific Splunk configuration files). Full name and Email address are stored for the administrator's convenience. Time zone can be changed for the logged-in user. This is a new feature in Splunk 4.3. Setting the time zone only affects the time zone used to display the data. It is very important that the date is parsed properly when events are indexed. Default app controls the starting page after login. Most users will want to change this to search. Restart backgrounded jobs controls whether unfinished queries should run again if Splunk is restarted. Set password allows you to change your password. This is only relevant if Splunk is configured to use internal authentication. For instance, if the system is configured to use Windows Active Directory via LDAP (a very common configuration), users must change their password in Windows. Messages allows you to view any system-level error messages you may have pending. When there is a new message for you to review, a notification displays as a count next to the Messages menu. You can click the X to remove a message. The Settings link presents the user with the configuration pages for all Splunk Knowledge objects, Distributed Environment settings, System and Licensing, Data, and Users and Authentication settings. If you do not see some of these options, you do not have the permissions to view or edit them. The Activity menu lists shortcuts to Splunk Jobs, Triggered Alerts, and System Activity views. You can click Jobs (to open the search jobs manager window, where you can view and manage currently running searches), click Triggered Alerts (to view scheduled alerts that are triggered) or click System Activity (to see dashboards about user activity and the status of the system). Help lists links to video Tutorials, Splunk Answers, the Splunk Contact Support portal, and online Documentation. Find can be used to search for objects within your Splunk Enterprise instance. For example, if you type in error, it returns the saved objects that contain the term error. These saved objects include Reports, Dashboards, Alerts, and so on. You can also search for error in the Search & Reporting app by clicking Open error in search. The search & reporting app The Search & Reporting app (or just the search app) is where most actions in Splunk start. This app is a dashboard where you will begin your searching. The summary view Within the Search & Reporting app, the user is presented with the Summary view, which contains information about the data which that user searches for by default. This is an important distinction—in a mature Splunk installation, not all users will always search all data by default. But at first, if this is your first trip into Search & Reporting, you'll see the following: From the screen depicted in the previous screenshot, you can access the Splunk documentation related to What to Search and How to Search. Once you have at least some data indexed, Splunk will provide some statistics on the available data under What to Search (remember that this reflects only the indexes that this particular user searches by default; there are other events that are indexed by Splunk, including events that Splunk indexes about itself.) This is seen in the following image: In previous versions of Splunk, panels such as the All indexed data panel provided statistics for a user's indexed data. Other panels gave a breakdown of data using three important pieces of metadata—Source, Sourcetype, and Hosts. In the current version—6.2.0—you access this information by clicking on the button labeled Data Summary, which presents the following to the user: This dialog splits the information into three tabs—Hosts, Sources and Sourcetypes. A host is a captured hostname for an event. In the majority of cases, the host field is set to the name of the machine where the data originated. There are cases where this is not known, so the host can also be configured arbitrarily. A source in Splunk is a unique path or name. In a large installation, there may be thousands of machines submitting data, but all data on the same path across these machines counts as one source. When the data source is not a file, the value of the source can be arbitrary, for instance, the name of a script or network port. A source type is an arbitrary categorization of events. There may be many sources across many hosts, in the same source type. For instance, given the sources /var/log/access.2012-03-01.log and /var/log/access.2012-03-02.log on the hosts fred and wilma, you could reference all these logs with source type access or any other name that you like. Let's move on now and discuss each of the Splunk widgets (just below the app name). The first widget is the navigation bar. As a general rule, within Splunk, items with downward triangles are menus. Items without a downward triangle are links. Next we find the Search bar. This is where the magic starts. We'll go into great detail shortly. Search Okay, we've finally made it to search. This is where the real power of Splunk lies. For our first search, we will search for the word (not case specific); error. Click in the search bar, type the word error, and then either press Enter or click on the magnifying glass to the right of the bar. Upon initiating the search, we are taken to the search results page. Note that the search we just executed was across All time (by default); to change the search time, you can utilize the Splunk time picker. Actions Let's inspect the elements on this page. Below the Search bar, we have the event count, action icons, and menus. Starting from the left, we have the following: The number of events matched by the base search. Technically, this may not be the number of results pulled from disk, depending on your search. Also, if your query uses commands, this number may not match what is shown in the event listing. Job: This opens the Search job inspector window, which provides very detailed information about the query that was run. Pause: This causes the current search to stop locating events but keeps the job open. This is useful if you want to inspect the current results to determine whether you want to continue a long running search. Stop: This stops the execution of the current search but keeps the results generated so far. This is useful when you have found enough and want to inspect or share the results found so far. Share: This shares the search job. This option extends the job's lifetime to seven days and sets the read permissions to everyone. Export: This exports the results. Select this option to output to CSV, raw events, XML, or JavaScript Object Notation (JSON) and specify the number of results to export. Print: This formats the page for printing and instructs the browser to print. Smart Mode: This controls the search experience. You can set it to speed up searches by cutting down on the event data it returns and, additionally, by reducing the number of fields that Splunk will extract by default from the data (Fast mode). You can, otherwise, set it to return as much event information as possible (Verbose mode). In Smart mode (the default setting) it toggles search behavior based on the type of search you're running. Timeline Now we'll skip to the timeline below the action icons. Along with providing a quick overview of the event distribution over a period of time, the timeline is also a very useful tool for selecting sections of time. Placing the pointer over the timeline displays a pop-up for the number of events in that slice of time. Clicking on the timeline selects the events for a particular slice of time. Clicking and dragging selects a range of time. Once you have selected a period of time, clicking on Zoom to selection changes the time frame and reruns the search for that specific slice of time. Repeating this process is an effective way to drill down to specific events. Deselect shows all events for the time range selected in the time picker. Zoom out changes the window of time to a larger period around the events in the current time frame The field picker To the left of the search results, we find the field picker. This is a great tool for discovering patterns and filtering search results. Fields The field list contains two lists: Selected Fields, which have their values displayed under the search event in the search results Interesting Fields, which are other fields that Splunk has picked out for you Above the field list are two links: Hide Fields and All Fields. Hide Fields: Hides the field list area from view. All Fields: Takes you to the Selected Fields window. Search results We are almost through with all the widgets on the page. We still have a number of items to cover in the search results section though, just to be thorough. As you can see in the previous screenshot, at the top of this section, we have the number of events displayed. When viewing all results in their raw form, this number will match the number above the timeline. This value can be changed either by making a selection on the timeline or by using other search commands. Next, we have the action icons (described earlier) that affect these particular results. Under the action icons, we have four results tabs: Events list, which will show the raw events. This is the default view when running a simple search, as we have done so far. Patterns streamlines the event pattern detection. It displays a list of the most common patterns among the set of events returned by your search. Each of these patterns represents the number of events that share a similar structure. Statistics populates when you run a search with transforming commands such as stats, top, chart, and so on. The previous keyword search for error does not display any results in this tab because it does not have any transforming commands. Visualization transforms searches and also populates the Visualization tab. The results area of the Visualization tab includes a chart and the statistics table used to generate the chart. Not all searches are eligible for visualization. Under the tabs described just now, is the timeline. Options Beneath the timeline, (starting at the left) is a row of option links that include: Show Fields: shows the Selected Fields screen List: allows you to select an output option (Raw, List, or Table) for displaying the search results Format: provides the ability to set Result display options, such as Show row numbers, Wrap results, the Max lines (to display) and Drilldown as on or off. NN Per Page: is where you can indicate the number of results to show per page (10, 20, or 50). To the right are options that you can use to choose a page of results, and to change the number of events per page. In prior versions of Splunk, these options were available from the Results display options popup dialog. The events viewer Finally, we make it to the actual events. Let's examine a single event. Starting at the left, we have: Event Details: Clicking here (indicated by the right facing arrow) opens the selected event, providing specific information about the event by type, field, and value, and allows you the ability to perform specific actions on a particular event field. In addition, Splunk version 6.2.0 offers a button labeled Event Actions to access workflow actions, a few of which are always available. Build Eventtype: Event types are a way to name events that match a certain query. Extract Fields: This launches an interface for creating custom field extractions. Show Source: This pops up a window with a simulated view of the original source. The event number: Raw search results are always returned in the order most recent first. Next to appear are any workflow actions that have been configured. Workflow actions let you create new searches or links to other sites, using data from an event. Next comes the parsed date from this event, displayed in the time zone selected by the user. This is an important and often confusing distinction. In most installations, everything is in one time zone—the servers, the user, and the events. When one of these three things is not in the same time zone as the others, things can get confusing. Next, we see the raw event itself. This is what Splunk saw as an event. With no help, Splunk can do a good job finding the date and breaking lines appropriately, but as we will see later, with a little help, event parsing can be more reliable and more efficient. Below the event are the fields that were selected in the field picker. Clicking on the value adds the field value to the search. Summary As you have seen, the Splunk GUI provides a rich interface for working with search results. We have really only scratched the surface and will cover more elements. Resources for Article: Further resources on this subject: The Splunk Web Framework [Article] Loading data, creating an app, and adding dashboards and reports in Splunk [Article] Working with Apps in Splunk [Article]

0
0
4002

article-image-article-oracle-bi-publisher-11g-learning-new-xpt-format

Packt

31 Oct 2011

4 min read

Oracle BI Publisher 11g: Learning the new XPT format

Packt

31 Oct 2011

4 min read

0
0
3995

Packt

03 Sep 2012

18 min read

Overview of FIM 2010 R2

Packt

03 Sep 2012

18 min read

0
0
3966

article-image-exact-inference-using-graphical-models

Packt

25 Jun 2014

7 min read

Exact Inference Using Graphical Models

Packt

25 Jun 2014

7 min read

(For more resources related to this topic, see here.) Complexity of inference A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use this model is to generate the joint distribution and sum out all the variables, except the ones we are interested in. However, we need to determine and specify the joint distribution where an exponential blowup happens. In worst-case scenarios, we need to determine the exact inference in NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimals). Suppose we tone down our precision requirements (for example, only up to two digits after the decimals). Now, is the (approximate) inference task any easier? Unfortunately not—even approximate inference is NP-hard, that is, getting values is far better than random guessing (50 percent or a probability of 0.5), which takes exponential time. It might seem like inference is a hopeless task, but that is only in the worst case. In general cases, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference. Real-world issues Since inference is a task that is NP-hard, inference engines are written in languages that are as close to bare metal as possible; usually in C or C++. Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon. Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice serves a good balance between running the Python code and a fast inference implementation. Use inference engines that do not have a Python interface, which is true for majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume. In the article on inference, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with Python interfaces to try out a more real-world problem. The tree algorithm We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree-message passing algorithm (which is sometimes called the junction tree algorithm too). Other versions of the message passing algorithm are used in approximate inference as well. We initiate the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network where groups of variables are placed in the cluster. It is similar to a factor where each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D. In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains variables common to both clusters. Using the preceding example, the two clusters and are connected by the sepset , which contains the only variable common to both clusters. In the next section, we shall learn about the implementation details of the junction tree algorithm. We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective. The four stages of the junction tree algorithm In this section, we will discuss the four stages of the junction tree algorithm. In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or a clique tree). The transformation from the Bayes network to junction tree proceeds as per the following steps: We will construct a moral graph by changing all the directed edges to undirected edges. All nodes that have V-structures that enter the said node have their parents connected with an edge. We have seen an example of this process (in the VE algorithm) called moralization, which is a possible reference to connect (apparently unmarried) parents that have a child (node). Then, we will selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph where the maximum cycle length between the nodes is 3. From the triangulated graph, we will identify the subsets of nodes (called cliques). Starting with the cliques as clusters, we will arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it should also appear in all the nodes on the path that connect the two cliques. In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table. They have a list of values against each assignment to a variable in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probabilities because in Markov networks, unlike probabilities, the values of the potentials are not obliged to sum to 1. This stage consists of message passing or belief propagation between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Each message can be passed asynchronously, but it has to wait for information from other clusters before it collates that information and passes it to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two stages: an upward pass stage and a downward pass stage. Only after a node receives messages from the leaf nodes, will it send the message to its parent (in the "upward pass"), and only after the node receives a message from its parents will it send a message to its children (in the "downward pass"). The message passing stage completes when each cluster sepset has consistent beliefs. Recall that a cluster connected to a sepset has common variables. For example, cluster C and sepset S have and variables in its scope. Then, the potential against obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated. Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph. Summary We first explored the inference problem where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible. Resources for Article: Further resources on this subject: Getting Started with Spring Python [article] Python Testing: Installing the Robot Framework [article] Discovering Python's parallel programming tools [article]

0
0
3946

article-image-securing-data-rest-oracle-11g

Packt

23 Oct 2012

11 min read

Securing Data at Rest in Oracle 11g

Packt

23 Oct 2012

11 min read

0
0
3919

Packt

20 Jun 2014

11 min read

What is Quantitative Finance?

Packt

20 Jun 2014

11 min read

(For more resources related to this topic, see here.) Discipline 1 – finance (financial derivatives) In general, a financial derivative is a contract between two parties who agree to exchange one or more cash flows in the future. The value of these cash flows depends on some future event, for example, that the value of some stock index or interest rate being above or below some predefined level. The activation or triggering of this future event thus depends on the behavior of a variable quantity known as the underlying. Financial derivatives receive their name because they derive their value from the behavior of another financial instrument. As such, financial derivatives do not have an intrinsic value in themselves (in contrast to bonds or stocks); their price depends entirely on the underlying. A critical feature of derivative contracts is thus that their future cash flows are probabilistic and not deterministic. The future cash flows in a derivative contract are contingent on some future event. That is why derivatives are also known as contingent claims. This feature makes these types of contracts difficult to price. The following are the most common types of financial derivatives: Futures Forwards Options Swaps Futures and forwards are financial contracts between two parties. One party agrees to buy the underlying from the other party at some predetermined date (the maturity date) for some predetermined price (the delivery price). An example could be a one-month forward contract on one ounce of silver. The underlying is the price of one ounce of silver. No exchange of cash flows occur at inception (today, t=0), but it occurs only at maturity (t=T). Here t represents the variable time. Forwards are contracts negotiated privately between two parties (in other words, Over The Counter (OTC)), while futures are negotiated at an exchange. Options are financial contracts between two parties. One party (called the holder of the option) pays a premium to the other party (called the writer of the option) in order to have the right, but not the obligation, to buy some particular asset (the underlying) for some particular price (the strike price) at some particular date in the future (the maturity date). This type of contract is called a European Call contract. Example 1 Consider a one-month call contract on the S&P 500 index. The underlying in this case will be the value of the S&P 500 index. There are cash flows both at inception (today, t=0) and at maturity (t=T). At inception, (t=0) the premium is paid, while at maturity (t=T), the holder of the option will choose between the following two possible scenarios, depending on the value of the underlying at maturity S(T): Scenario A: To exercise his/her right and buy the underlying asset for K Scenario B: To do nothing if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K The option holder will choose Scenario A if the value of the underlying at maturity is above the value of the strike, that is, S(T)>K. This will guarantee him/her a profit of S(T)-K. The option holder will choose Scenario B if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K. This will guarantee him/her to limit his/her losses to zero. Example 2 An Interest Rate Swap (IRS) is a financial contract between two parties A and B who agree to exchange cash flows at regular intervals during a given period of time (the life of a contract). Typically, the cash flows from A to B are indexed to a fixed rate of interest, while the cash flows from B to A are indexed to a floating interest rate. The set of fixed cash flows is known as the fixed leg, while the set of floating cash flows is known as the floating leg. The cash flows occur at regular intervals during the life of the contract between inception (t=0) and maturity (t=T). An example could be a fixed-for-floating IRS, who pays a rate of 5 percent on the agreed notional N every three months and receives EURIBOR3M on the agreed notional N every three months. Example 3 A futures contract on a stock index also involves a single future cash flow (the delivery price) to be paid at the maturity of the contract. However, the payoff in this case is uncertain because how much profit I will get from this operation will depend on the value of the underlying at maturity. If the price of the underlying is above the delivery price, then the payoff I get (denoted by function H) is positive (indicating a profit) and corresponds to the difference between the value of the underlying at maturity S(T) and the delivery price K. If the price of the underlying is below the delivery price, then the payoff I get is negative (indicating a loss) and corresponds to the difference between the delivery price K and the value of the underlying at maturity S(T). This characteristic can be summarized in the following payoff formula: Equation 1 Here, H(S(T)) is the payoff at maturity, which is a function of S(T). Financial derivatives are very important to the modern financial markets. According to the Bank of International Settlements (BIS) as of December 2012, the amounts outstanding for OTC derivative contracts worldwide were Foreign exchange derivatives with 67,358 billion USD, Interest Rate Derivatives with 489,703 billion USD, Equity-linked derivatives with 6,251 billion USD, Commodity derivatives with 2,587 billion USD, and Credit default swaps with 25,069 billion USD. For more information, see http://www.bis.org/statistics/dt1920a.pdf. Discipline 2 – mathematics We need mathematical models to capture both the future evolution of the underlying and the probabilistic nature of the contingent cash flows we encounter in financial derivatives. Regarding the contingent cash flows, these can be represented in terms of the payoff function H(S(T)) for the specific derivative we are considering. Because S(T) is a stochastic variable, the value of H(S(T)) ought to be computed as an expectation E[H(S(T))]. And in order to compute this expectation, we need techniques that allow us to predict or simulate the behavior of the underlying S(T) into the future, so as to be able to compute the value of ST and finally be able to compute the mean value of the payoff E[H(S(T))]. Regarding the behavior of the underlying, typically, this is formalized using Stochastic Differential Equations (SDEs), such as Geometric Brownian Motion (GBM), as follows: Equation 2 The previous equation fundamentally says that the change in a stock price (dS), can be understood as the sum of two effects—a deterministic effect (first term on the right-hand side) and a stochastic term (second term on the right-hand side). The parameter is called the drift, and the parameter is called the volatility. S is the stock price, dt is a small time interval, and dW is an increment in the Wiener process. This model is the most common model to describe the behavior of stocks, commodities, and foreign exchange. Other models exist, such as jump, local volatility, and stochastic volatility models that enhance the description of the dynamics of the underlying. Regarding the numerical methods, these correspond to ways in which the formal expression described in the mathematical model (usually in continuous time) is transformed into an approximate representation that can be used for calculation (usually in discrete time). This means that the SDE that describes the evolution of the price of some stock index into the future, such as the FTSE 100, is changed to describe the evolution at discrete intervals. An approximate representation of an SDE can be calculated using the Euler approximation as follows: Equation 3 The preceding equation needs to be solved in an iterative way for each time interval between now and the maturity of the contract. If these time intervals are days and the contract has a maturity of 30 days from now, then we compute tomorrow's price in terms of todays. Then we compute the day after tomorrow as a function of tomorrow's price and so on. In order to price the derivative, we require to compute the expected payoff E[H(ST)] at maturity and then discount it to the present. In this way, we would be able to compute what should be the fair premium associated with a European option contract with the help of the following equation: Equation 4 Discipline 3 – informatics (C++ programming) What is the role of C++ in pricing derivatives? Its role is fundamental. It allows us to implement the actual calculations that are required in order to solve the pricing problem. Using the preceding techniques to describe the dynamics of the underlying, we require to simulate many potential future scenarios describing its evolution. Say we ought to price a futures contract on the EUR/USD exchange rate with one year maturity. We have to simulate the future evolution of EUR/USD for each day for the next year (using equation 3). We can then compute the payoff at maturity (using equation 1). However, in order to compute the expected payoff (using equation 4), we need to simulate thousands of such possible evolutions via a technique known as Monte Carlo simulation. The set of steps required to complete this process is known as an algorithm. To price a derivative, we ought to construct such algorithm and then implement it in an advanced programming language such as C++. Of course C++ is not the only possible choice, other languages include Java, VBA, C#, Mathworks Matlab, and Wolfram Mathematica. However, C++ is an industry standard because it's flexible, fast, and portable. Also, through the years, several numerical libraries have been created to conduct complex numerical calculations in C++. Finally, C++ is a powerful modern object-oriented language. It is always difficult to strike a balance between clarity and efficiency. We have aimed at making computer programs that are self-contained (not too object oriented) and self-explanatory. More advanced implementations are certainly possible, particularly in the context of larger financial pricing libraries in a corporate context. In this article, all the programs are implemented with the newest standard C++11 using Code::Blocks (http://www.codeblocks.org) and MinGW (http://www.mingw.org). The Bento Box template A Bento Box is a single portion take-away meal common in Japanese cuisine. Usually, it has a rectangular form that is internally divided in compartments to accommodate the various types of portions that constitute a meal. In this article, we use the metaphor of the Bento Box to describe a visual template to facilitate, organize, and structure the solution of derivative problems. The Bento Box template is simply a form that we will fill sequentially with the different elements that we require to price derivatives in a logical structured manner. The Bento Box template when used to price a particular derivative is divided into four areas or boxes, each containing information critical for the solution of the problem. The following figure illustrates a generic template applicable to all derivatives: The Bento Box template – general case The following figure shows an example of the Bento Box template as applied to a simple European Call option: The Bento Box template – European Call option In the preceding figure, we have filled the various compartments, starting in the top-left box and proceeding clockwise. Each compartment contains the details about our specific problem, taking us in sequence from the conceptual (box 1: derivative contract) to the practical (box 4: algorithm), passing through the quantitative aspects required for the solution (box 2: mathematical model and box 3: numerical method). Summary This article gave an overview of the main elements of Quantitative Finance as applied to pricing financial derivatives. The Bento Box template technique will be used to organize our approach to solve problems in pricing financial derivatives. We will assume that we are in possession with enough information to fill box 1 (derivative contract). Resources for Article: Further resources on this subject: Application Development in Visual C++ - The Tetris Application [article] Getting Started with Code::Blocks [article] Creating and Utilizing Custom Entities [article]

0
0
3915

article-image-using-execnet-parallel-and-distributed-processing-nltk

Packt

09 Nov 2010

8 min read

Using Execnet for Parallel and Distributed Processing with NLTK

Packt

09 Nov 2010

8 min read

Python Text Processing with NLTK 2.0 Cookbook Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond Learn how machines and crawlers interpret and process natural languages Easily work with huge amounts of data and learn how to handle distributed processing Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible Introduction NLTK is great for in-memory single-processor natural language processing. However, there are times when you have a lot of data to process and want to take advantage of multiple CPUs, multi-core CPUs, and even multiple computers. Or perhaps you want to store frequencies and probabilities in a persistent, shared database so multiple processes can access it simultaneously. For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions and more. Distributed tagging with execnet Execnet is a distributed execution library for python. It allows you to create gateways and channels for remote code execution. A gateway is a connection from the calling process to a remote environment. The remote environment can be a local subprocess or an SSH connection to a remote node. A channel is created from a gateway and handles communication between the channel creator and the remote code. Since many NLTK processes require 100 percent CPU utilization during computation, execnet is an ideal way to distribute that computation for maximum resource usage. You can create one gateway per CPU core, and it doesn't matter whether the cores are in your local computer or spread across remote machines. In many situations, you only need to have the trained objects and data on a single machine, and can send the objects and data to the remote nodes as needed. Getting ready You'll need to install execnet for this to work. It should be as simple as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.0.8. The execnet homepage, which has API documentation and examples, is at http://codespeak.net/execnet/. How to do it... We start by importing the required modules, as well as an additional module remote_tag. py that will be explained in the next section. We also need to import pickle so we can serialize the tagger. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so we must dump the tagger to a string using pickle.dumps(). We'll use the default tagger that's used by the nltk.tag.pos_tag() function, but you could load and dump any pre-trained part-of-speech tagger as long as it implements the TaggerI interface. Once we have a serialized tagger, we start execnet by making a gateway with execnet. makegateway(). The default gateway creates a Python subprocess, and we can call the remote_exec() method with the remote_tag module to create a channel. With an open channel, we send over the serialized tagger and then the first tokenized sentence of the treebank corpus. You don't have to do any special serialization of simple types such as lists and tuples, since execnet already knows how to handle serializing the built-in types. Now if we call channel.receive(), we get back a tagged sentence that is equivalent to the first tagged sentence in the treebank corpus, so we know the tagging worked. We end by exiting the gateway, which closes the channel and kills the subprocess. >>> import execnet, remote_tag, nltk.tag, nltk.data >>> from nltk.corpus import treebank >>> import cPickle as pickle >>> tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER)) >>> gw = execnet.makegateway() >>> channel = gw.remote_exec(remote_tag) >>> channel.send(tagger) >>> channel.send(treebank.sents()[0]) >>> tagged_sentence = channel.receive() >>> tagged_sentence == treebank.tagged_sents()[0] True >>> gw.exit() Visually, the communication process looks like this: How it works... The gateway's remote_exec() method takes a single argument that can be one of the following three types: A string of code to execute remotely. The name of a pure function that will be serialized and executed remotely. The name of a pure module whose source will be executed remotely. We use the third option with the remote_tag.py module, which is defined as follows: import cPickle as pickle if __name__ == '__channelexec__': tagger = pickle.loads(channel.receive()) for sentence in channel: channel.send(tagger.tag(sentence)) A pure module is a module that is self-contained. It can only access Python modules that are available where it executes, and does not have access to any variables or states that exist wherever the gateway is initially created. To detect that the module is being executed by execnet, you can look at the __name__ variable. If it's equal to '__channelexec__', then it is being used to create a remote channel. This is similar to doing if __name__ == '__ main__' to check if a module is being executed on the command line. The first thing we do is call channel.receive() to get the serialized tagger, which we load using pickle.loads(). You may notice that channel is not imported anywhere—that's because it is included in the global namespace of the module. Any module that execnet executes remotely has access to the channel variable in order to communicate with the channel` creator. Once we have the tagger, we iteratively tag() each tokenized sentence that we receive from the channel. This allows us to tag as many sentences as the sender wants to send, as iteration will not stop until the channel is closed. What we've essentially created is a compute node for part-of-speech tagging that dedicates 100 percent of its resources to tagging whatever sentences it receives. As long as the channel remains open, the node is available for processing. There's more... This is a simple example that opens a single gateway and channel. But execnet can do a lot more, such as opening multiple channels to increase parallel processing, as well as opening gateways to remote hosts over SSH to do distributed processing. Multiple channels We can create multiple channels, one per gateway, to make the processing more parallel. Each gateway creates a new subprocess (or remote interpreter if using an SSH gateway) and we use one channel per gateway for communication. Once we've created two channels, we can combine them using the MultiChannel class, which allows us to iterate over the channels, and make a receive queue to receive messages from each channel. After creating each channel and sending the tagger, we cycle through the channels to send an even number of sentences to each channel for tagging. Then we collect all the responses from the queue. A call to queue.get() will return a 2-tuple of (channel, message) in case you need to know which channel the message came from. If you don't want to wait forever, you can also pass a timeout keyword argument with the maximum number of seconds you want to wait, as in queue.get(timeout=4). This can be a good way to handle network errors. Once all the tagged sentences have been collected, we can exit the gateways. Here's the code: >>> import itertools >>> gw1 = execnet.makegateway() >>> gw2 = execnet.makegateway() >>> ch1 = gw1.remote_exec(remote_tag) >>> ch1.send(tagger) >>> ch2 = gw2.remote_exec(remote_tag) >>> ch2.send(tagger) >>> mch = execnet.MultiChannel([ch1, ch2]) >>> queue = mch.make_receive_queue() >>> channels = itertools.cycle(mch) >>> for sentence in treebank.sents()[:4]: ... channel = channels.next() ... channel.send(sentence) >>> tagged_sentences = [] >>> for i in range(4): ... channel, tagged_sentence = queue.get() ... tagged_sentences.append(tagged_sentence) >>> len(tagged_sentences) 4 >>> gw1.exit() >>> gw2.exit() Local versus remote gateways The default gateway spec is popen, which creates a Python subprocess on the local machine. This means execnet.makegateway() is equivalent to execnet. makegateway('popen'). If you have passwordless SSH access to a remote machine, then you can create a remote gateway using execnet.makegateway('ssh=remotehost') where remotehost should be the hostname of the machine. A SSH gateway spawns a new Python interpreter for executing the code remotely. As long as the code you're using for remote execution is pure, you only need a Python interpreter on the remote machine. Channels work exactly the same no matter what kind of gateway is used; the only difference will be communication time. This means you can mix and match local subprocesses with remote interpreters to distribute your computations across many machines in a network. There are many more details on gateways in the API documentation at http://codespeak.net/execnet/basics.html.

0
0
3905

article-image-integrating-ibm-cognos-tm1-ibm-cognos-8-bi

Packt

16 Dec 2011

4 min read

Integrating IBM Cognos TM1 with IBM Cognos 8 BI

Packt

16 Dec 2011

4 min read

(For more resources on IBM, see here.) Before proceeding with the actual steps of the recipe, we will take a note of the following integration considerations: The measured Dimension in the TM1 Cube needs to be explicitly identified. The Data Source needs to be created in IBM Cognos Connection which points to the TM1 Cube. New Data Source can also be created from IBM Cognos Framework Manager, but for the sake of simplicity we will be creating that from IBM Cognos Connection itself. The created Data Source is used in IBM Cognos Framework Manager Model to create a Metadata Package and publish to IBM Cognos Connection. Metadata Package can be used to create reports, generate queries, slice and dice, or event management using one of the designer studios available in IBM Cognos BI. We will focus on each of the above steps in this recipe, where we will be using one of the Cubes created as part of demodata TM1 Server application and we will be using the Cube as a Data Source in the IBM Cognos BI layer. Getting ready Ensure that the TM1 Admin Server service is started and demodata TM1 Server is running. We should have IBM Cognos 8 BI Server running and IBM Cognos 8 Framework Manager installed. How to do it... Open the TM1 Architect and right-click on the Sales_Plan Cube. Click on Properties. In the Measures Dimension box, click on Sales_Plan_Measures and then for Time Dimension click on Months. Note that the preceding step is compulsory if we want to use the Cube as a Data Source for the BI layer. We need to explicitly define a measures dimension and a time dimension. Click on OK and minimize the TM1 Architect, keep the server running. Now from the Start menu, open IBM Cognos Framework Manager, which is desktop-based tool used to create metadata models. Create a new project from IBM Cognos 8 Framework Manager. Enter the Project name as Demodata and provide the Location where the model file will be located. Note that each project generates a .cpf file which can be opened in the IBM Cognos Framework Manager. Provide valid user credentials so that IBM Cognos Framework Manager can link to a running IBM Cognos BI Server setup. Users and roles are defined by IBM Cognos BI admin user. Choose English as the authoring language when the Select Language list comes up. This will open the Metadata Wizard - Select Metadata Source. We use the Metadata Wizard to create a new Data Source or point to an existing Data Source. In the Metadata Wizard make sure that Data Sources is selected and click on the Next button. In the next screen, click on the New button to create a new Data Source by the name of TM1_Demodata_Sales_Plan. This will open a New data source wizard, where we need to specify the name of the Data Source. On next screen, it will ask for the Data Source Type for which we will specify TM1 from the drop-down, as we want to create a new Data Source based on the TM1 Cube Sales_Data. On the next screen specify the connection parameters. For Administration Host we can specify a name or localhost, depending on the name of the server. In our case, we have specified name of the server as ankitgar, hence we are using an actual name instead of a localhost. In the case of TM1 sitting on another server within the network, we will provide the IP address or name of the host in UNC format. Test the connection to test whether the connection to the TM1 Cube is successful. Click on Close and proceed. Click on the Finish button to complete the creation of the Data Source. The new Data Source is created on the Cognos 8 Server and now can be used by anyone with valid privileges given by the admin user. It's just a connection to the Sales_Plan TM1 Cube which now can be used to create metadata models and, hence, reports and queries perform the various functions suggested in the preceding sections. Now it will return to Metadata Wizard as shown, with the new Data Source appearing with the list of already created Data Sources. Click on the newly created Data Source and on the Next button. It will display all available Cubes on the DemoData TM1 Server, the machine name being the server name (localhost/ankitgar). Click on the Sales_Plan cube and then on Next.

0
0
3894

How-To Tutorials - Data

Signal Processing Techniques

Using R for Statistics, Research, and Graphics

Social Media Insight Using Naive Bayes

Crypto-cash is missing from the wallet of dead cryptocurrency entrepreneur Gerald Cotten - find it, and you could get $100,000

Top announcements from the TensorFlow Dev Summit 2019

Introduction to Practical Business Intelligence

Including Charts and Graphics in Pentaho Reports (Part 2)

The Splunk Interface

Oracle BI Publisher 11g: Learning the new XPT format

Overview of FIM 2010 R2

Trending Topics

Exact Inference Using Graphical Models

Securing Data at Rest in Oracle 11g

What is Quantitative Finance?

Using Execnet for Parallel and Distributed Processing with NLTK

Integrating IBM Cognos TM1 with IBM Cognos 8 BI

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access