How-To Tutorials - Data


A quick start – OpenCV fundamentals

Packt
12 Jun 2013
8 min read
The OpenCV library has a modular structure, and the following diagram depicts the different modules available in it. A brief description of all the modules is as follows:

Core: A compact module defining basic data structures, including the dense multidimensional array Mat, and the basic functions used by all other modules.
Imgproc: An image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.
Video: A video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.
Calib3d: Basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.
Features2d: Salient feature detectors, descriptors, and descriptor matchers.
Objdetect: Detection of objects and instances of predefined classes; for example, faces, eyes, mugs, people, and cars.
Highgui: An easy-to-use interface to video capturing, image and video codecs, as well as simple UI capabilities.
Gpu: GPU-accelerated algorithms from different OpenCV modules.

Task 1 – image basics

When we try to recreate the physical world around us in digital form, via a camera for example, the computer sees the image only as an array of numbers. A digital image is nothing but a collection of pixels (picture elements), which OpenCV stores in matrices for further manipulation. In these matrices, each element holds the information for one pixel of the image, and the pixel value decides how bright that pixel is or what color it should be. Based on this, we can classify images as:

Greyscale
Color/RGB

Greyscale
Here the pixel value can range from 0 to 255, so we can see the various shades of gray, as shown in the following diagram. Here, 0 represents black and 255 represents white. A special case of grayscale is the binary, or black and white, image, in which every pixel is either black or white, as shown in the following diagram.

Color/RGB
Red, Blue, and Green are the primary colors, and by mixing them in different proportions we can obtain new colors. A pixel in a color image has three separate channels, one each for Red, Blue, and Green, and the value ranges from 0 to 255 for each channel, as shown in the following diagram.
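If you want to poke at these pixel values yourself, the same ideas can be checked quickly from OpenCV's Python bindings. The following is only an illustrative sketch; it assumes the cv2 and NumPy packages are installed and that lena.jpg is in the working directory (note that OpenCV loads color pixels in Blue-Green-Red order):

import cv2

color = cv2.imread("lena.jpg")                       # color image: one matrix of shape (rows, cols, 3)
gray = cv2.imread("lena.jpg", cv2.IMREAD_GRAYSCALE)  # greyscale image: one matrix of shape (rows, cols)

print(color.shape, gray.shape)
print(gray[0, 0])    # a single intensity in the 0-255 range
print(color[0, 0])   # three channel values (Blue, Green, Red), each 0-255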
Task 2 – reading and displaying an image

We are now going to write a very simple and basic program using the OpenCV library to read and display an image. This will help you understand the basics.

Code
A simple program to read and display an image is as follows:

// opencv header files
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/core/core.hpp"

// namespaces declaration
using namespace cv;
using namespace std;

// create a variable to store the image
Mat image;

int main( int argc, char** argv )
{
    // open the image and store it in the 'image' variable
    // Replace the path with where you have downloaded the image
    image = imread("<path to image>/lena.jpg");

    // create a window to display the image
    namedWindow( "Display window", CV_WINDOW_AUTOSIZE );

    // display the image in the window created
    imshow( "Display window", image );

    // wait for a keystroke
    waitKey(0);

    return 0;
}

Code explanation
Now let us understand how the code works. Short comments have also been included in the code itself to increase readability.

#include "opencv2/highgui/highgui.hpp"
#include "opencv2/core/core.hpp"

The preceding two header files will be a part of almost every program we write using the OpenCV library. As explained earlier, the highgui header is used for window creation and management, while the core header is used to access the Mat data structure in OpenCV.

using namespace cv;
using namespace std;

The preceding two lines declare the required namespaces for this code so that we don't have to use the :: (scope resolution) operator every time we access a function.

Mat image;

This creates a variable image of the datatype Mat, which is frequently used in OpenCV to store images.

image = imread("<path to image>/lena.jpg");

This command opens the image lena.jpg and stores it in the image variable. Replace <path to image> with the location of that picture on your PC.

namedWindow( "Display window", CV_WINDOW_AUTOSIZE );

We now need a window to display our image, which is what this function creates. It takes two parameters; the first is the name of the window, in our case Display window. The second parameter is optional: CV_WINDOW_AUTOSIZE sizes the window to match the image so that the image is not cropped.

imshow( "Display window", image );

Finally, we display our image in the window we just created. This function also takes two parameters: the name of the window in which the image has to be displayed (Display window) and the variable containing the image we want to display (image).

waitKey(0);

Last but not least, it is advised that you use this function in most of the code you write using the OpenCV library. Without it, the image would be displayed for a fraction of a second and the program would terminate immediately, too fast for you to see the image. What this function does is wait for a keystroke from the user and thereby delay the termination of the program. Its argument is a delay in milliseconds; passing 0 makes it wait indefinitely for a key press.

Output
The image can be displayed as follows:
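One habit worth building around imread and waitKey is checking that the image was actually loaded, since imread fails silently when the path is wrong. A minimal sketch using the Python bindings (the path is a placeholder, and cv2 is assumed to be installed):

import cv2

image = cv2.imread("<path to image>/lena.jpg")
if image is None:                # in Python, imread returns None when the file cannot be read
    raise SystemExit("Could not read the image; check the path")

cv2.imshow("Display window", image)
cv2.waitKey(0)                   # 0 waits indefinitely; a positive value waits that many milliseconds
cv2.destroyAllWindows()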
Task 3 – resizing and saving an image

We are now going to write a very simple and basic program using the OpenCV library to resize and save an image.

Code
The following code helps you to resize a given image:

// opencv header files
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/core/core.hpp"

// namespaces declaration
using namespace std;
using namespace cv;

int main(int argc, char** argv)
{
    // create variables to store the images
    Mat org, resized, saved;

    // open the image and store it in the 'org' variable
    // Replace the path with where you have downloaded the image
    org = imread("<path to image>/lena.png");

    // create a window to display the image
    namedWindow("Original Image", CV_WINDOW_AUTOSIZE);

    // display the image
    imshow("Original Image", org);

    // resize the image
    resize(org, resized, Size(), 0.5, 0.5, INTER_LINEAR);
    namedWindow("Resized Image", CV_WINDOW_AUTOSIZE);
    imshow("Resized Image", resized);

    // save the image
    // Replace <path> with your desired location
    imwrite("<path>/saved.png", resized);

    // read the saved image back and display it
    namedWindow("Image saved", CV_WINDOW_AUTOSIZE);
    saved = imread("<path>/saved.png");
    imshow("Image saved", saved);

    // wait for a keystroke
    waitKey(0);

    return 0;
}

Code explanation
Only the new functions and concepts will be explained in this case.

#include "opencv2/imgproc/imgproc.hpp"

Imgproc is another useful header that gives us access to the various transformations, color conversions, filters, histograms, and so on.

Mat org, resized, saved;

We have created three variables: org and resized store the original and resized images respectively, while saved holds the saved image read back from disk.

resize(org, resized, Size(), 0.5, 0.5, INTER_LINEAR);

We have used this function to resize the image. It takes six parameters. The first is the variable containing the source image to be modified, and the second is the variable that stores the resized image. The third parameter is the output image size; we have not specified it here but have instead passed Size(), so it is calculated automatically from the fourth and fifth parameters, which are the scale factors along the horizontal and vertical axes respectively. The sixth parameter selects the interpolation method; we have used bilinear interpolation, which is the default method.

imwrite("<path>/saved.png", resized);

Finally, this function saves an image to a particular location on your PC. It takes two parameters: the location where you want to store the image and the variable in which the image is stored (here, resized). This function is very useful when you want to perform multiple operations on an image and save the result for future reference. Replace <path> with your desired location.

Output
Resizing can be demonstrated through the following output:

Summary
This section showed you how to perform a few of the basic tasks in OpenCV, as well as how to write your first OpenCV program.


Extending ElasticSearch with Scripting

Packt
06 Feb 2015
21 min read
In article by Alberto Paro, the author of ElasticSearch Cookbook Second Edition, we will cover about the following recipes: (For more resources related to this topic, see here.) Installing additional script plugins Managing scripts Sorting data using scripts Computing return fields with scripting Filtering a search via scripting Introduction ElasticSearch has a powerful way of extending its capabilities with custom scripts, which can be written in several programming languages. The most common ones are Groovy, MVEL, JavaScript, and Python. In this article, we will see how it's possible to create custom scoring algorithms, special processed return fields, custom sorting, and complex update operations on records. The scripting concept of ElasticSearch can be seen as an advanced stored procedures system in the NoSQL world; so, for an advanced usage of ElasticSearch, it is very important to master it. Installing additional script plugins ElasticSearch provides native scripting (a Java code compiled in JAR) and Groovy, but a lot of interesting languages are also available, such as JavaScript and Python. In older ElasticSearch releases, prior to version 1.4, the official scripting language was MVEL, but due to the fact that it was not well-maintained by MVEL developers, in addition to the impossibility to sandbox it and prevent security issues, MVEL was replaced with Groovy. Groovy scripting is now provided by default in ElasticSearch. The other scripting languages can be installed as plugins. Getting ready You will need a working ElasticSearch cluster. How to do it... In order to install JavaScript language support for ElasticSearch (1.3.x), perform the following steps: From the command line, simply enter the following command: bin/plugin --install elasticsearch/elasticsearch-lang-javascript/2.3.0 This will print the following result: -> Installing elasticsearch/elasticsearch-lang-javascript/2.3.0... Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-lang-javascript/ elasticsearch-lang-javascript-2.3.0.zip... Downloading ....DONE Installed lang-javascript If the installation is successful, the output will end with Installed; otherwise, an error is returned. To install Python language support for ElasticSearch, just enter the following command: bin/plugin -install elasticsearch/elasticsearch-lang-python/2.3.0 The version number depends on the ElasticSearch version. Take a look at the plugin's web page to choose the correct version. How it works... Language plugins allow you to extend the number of supported languages to be used in scripting. During the ElasticSearch startup, an internal ElasticSearch service called PluginService loads all the installed language plugins. In order to install or upgrade a plugin, you need to restart the node. The ElasticSearch community provides common scripting languages (a list of the supported scripting languages is available on the ElasticSearch site plugin page at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html), and others are available in GitHub repositories (a simple search on GitHub allows you to find them). The following are the most commonly used languages for scripting: Groovy (http://groovy.codehaus.org/): This language is embedded in ElasticSearch by default. It is a simple language that provides scripting functionalities. This is one of the fastest available language extensions. 
Groovy is a dynamic, object-oriented programming language with features similar to those of Python, Ruby, Perl, and Smalltalk. It also provides support to write a functional code. JavaScript (https://github.com/elasticsearch/elasticsearch-lang-javascript): This is available as an external plugin. The JavaScript implementation is based on Java Rhino (https://developer.mozilla.org/en-US/docs/Rhino) and is really fast. Python (https://github.com/elasticsearch/elasticsearch-lang-python): This is available as an external plugin, based on Jython (http://jython.org). It allows Python to be used as a script engine. Considering several benchmark results, it's slower than other languages. There's more... Groovy is preferred if the script is not too complex; otherwise, a native plugin provides a better environment to implement complex logic and data management. The performance of every language is different; the fastest one is the native Java. In the case of dynamic scripting languages, Groovy is faster, as compared to JavaScript and Python. In order to access document properties in Groovy scripts, the same approach will work as in other scripting languages: doc.score: This stores the document's score. doc['field_name'].value: This extracts the value of the field_name field from the document. If the value is an array or if you want to extract the value as an array, you can use doc['field_name'].values. doc['field_name'].empty: This returns true if the field_name field has no value in the document. doc['field_name'].multivalue: This returns true if the field_name field contains multiple values. If the field contains a geopoint value, additional methods are available, as follows: doc['field_name'].lat: This returns the latitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lats method. doc['field_name'].lon: This returns the longitude of a geopoint. If you need the value as an array, you can use the doc['field_name'].lons method. doc['field_name'].distance(lat,lon): This returns the plane distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].distanceInKm(lat,lon) method. doc['field_name'].arcDistance(lat,lon): This returns the arc distance, in miles, from a latitude/longitude point. If you need to calculate the distance in kilometers, you should use the doc['field_name'].arcDistanceInKm(lat,lon) method. doc['field_name'].geohashDistance(geohash): This returns the distance, in miles, from a geohash value. If you need to calculate the same distance in kilometers, you should use doc['field_name'] and the geohashDistanceInKm(lat,lon) method. By using these helper methods, it is possible to create advanced scripts in order to boost a document by a distance that can be very handy in developing geolocalized centered applications. Managing scripts Depending on your scripting usage, there are several ways to customize ElasticSearch to use your script extensions. In this recipe, we will see how to provide scripts to ElasticSearch via files, indexes, or inline. Getting ready You will need a working ElasticSearch cluster populated with the populate script (chapter_06/populate_aggregations.sh), available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... 
To manage scripting, perform the following steps:

1. Dynamic scripting is disabled by default for security reasons; we need to activate it in order to use dynamic scripting languages such as JavaScript or Python. To do this, turn off the disable flag (script.disable_dynamic: false) in the ElasticSearch configuration file (config/elasticsearch.yml) and restart the cluster. To increase security, ElasticSearch does not allow you to specify scripts for non-sandboxed languages. Scripts can be placed in the scripts directory inside the configuration directory.

2. To provide a script in a file, we'll put a my_script.groovy script in the config/scripts location with the following content:

doc["price"].value * factor

3. If dynamic scripting is enabled (as done in the first step), ElasticSearch allows you to store scripts in a special index, .scripts. To put my_script in the index, execute the following command in the command terminal:

curl -XPOST localhost:9200/_scripts/groovy/my_script -d '{
  "script" : "doc[\"price\"].value * factor"
}'

4. The script can then be used by simply referencing it in the script_id field; use the following command:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script_id" : "my_script",
      "lang" : "groovy",
      "type" : "number",
      "ignore_unmapped" : true,
      "params" : {
        "factor" : 1.1
      },
      "order" : "asc"
    }
  }
}'

How it works...
ElasticSearch allows you to load your scripts in different ways; each of these methods has its pros and cons. The most secure way to load or import scripts is to provide them as files in the config/scripts directory. This directory is continuously scanned for new files (by default, every 60 seconds). The scripting language is automatically detected by the file extension, and the script name depends on the filename. If the file is put in subdirectories, the directory path becomes part of the script name; for example, for config/scripts/mysub1/mysub2/my_script.groovy, the script name will be mysub1_mysub2_my_script. If the script is provided via the filesystem, it can be referenced in the code via the "script": "script_name" parameter.

Scripts can also be stored in the special .scripts index. These are the REST endpoints:

To retrieve a script: GET http://<server>/_scripts/<language>/<id>
To store a script: PUT http://<server>/_scripts/<language>/<id>
To delete a script: DELETE http://<server>/_scripts/<language>/<id>

An indexed script can be referenced in the code via the "script_id": "id_of_the_script" parameter.

The recipes that follow will use inline scripting because it's easier to use during the development and testing phases. Generally, a good practice is to develop using inline dynamic scripting in a request, because it's faster to prototype. Once the script is ready and no further changes are needed, it can be stored in the index, since it is simpler to call and manage. In production, a best practice is to disable dynamic scripting and store the scripts on disk (generally, dumping the indexed scripts to disk).

See also
The scripting page on the ElasticSearch website at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Sorting data using script
ElasticSearch provides scripting support for the sorting functionality.
In real world applications, there is often a need to modify the default sort by the match score using an algorithm that depends on the context and some external variables. Some common scenarios are given as follows: Sorting places near a point Sorting by most-read articles Sorting items by custom user logic Sorting items by revenue Getting ready You will need a working ElasticSearch cluster and an index populated with the script, which is available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... In order to sort using scripting, perform the following steps: If you want to order your documents by the price field multiplied by a factor parameter (that is, sales tax), the search will be as shown in the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "match_all": {} }, "sort": {    "_script" : {      "script" : "doc["price"].value * factor",      "lang" : "groovy",      "type" : "number",      "ignore_unmapped" : true,    "params" : {        "factor" : 1.1      },            "order" : "asc"        }    } }' In this case, we have used a match_all query and a sort script. If everything is correct, the result returned by ElasticSearch should be as shown in the following code: { "took" : 7, "timed_out" : false, "_shards" : {    "total" : 5,    "successful" : 5,    "failed" : 0 }, "hits" : {    "total" : 1000,    "max_score" : null,    "hits" : [ {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "161",      "_score" : null, "_source" : … truncated …,      "sort" : [ 0.0278578661440021 ]    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "634",      "_score" : null, "_source" : … truncated …,     "sort" : [ 0.08131364254827411 ]    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "465",      "_score" : null, "_source" : … truncated …,      "sort" : [ 0.1094966959069832 ]    } ] } } How it works... The sort scripting allows you to define several parameters, as follows: order (default "asc") ("asc" or "desc"): This determines whether the order must be ascending or descending. script: This contains the code to be executed. type: This defines the type to convert the value. params (optional, a JSON object): This defines the parameters that need to be passed. lang (by default, groovy): This defines the scripting language to be used. ignore_unmapped (optional): This ignores unmapped fields in a sort. This flag allows you to avoid errors due to missing fields in shards. Extending the sort with scripting allows the use of a broader approach to score your hits. ElasticSearch scripting permits the use of every code that you want. You can create custom complex algorithms to score your documents. There's more... 
Groovy provides a lot of built-in functions (mainly taken from Java's Math class) that can be used in scripts, as shown in the following table:

time(): The current time in milliseconds
sin(a): Returns the trigonometric sine of an angle
cos(a): Returns the trigonometric cosine of an angle
tan(a): Returns the trigonometric tangent of an angle
asin(a): Returns the arc sine of a value
acos(a): Returns the arc cosine of a value
atan(a): Returns the arc tangent of a value
toRadians(angdeg): Converts an angle measured in degrees to an approximately equivalent angle measured in radians
toDegrees(angrad): Converts an angle measured in radians to an approximately equivalent angle measured in degrees
exp(a): Returns Euler's number raised to the power of a value
log(a): Returns the natural logarithm (base e) of a value
log10(a): Returns the base 10 logarithm of a value
sqrt(a): Returns the correctly rounded positive square root of a value
cbrt(a): Returns the cube root of a double value
IEEEremainder(f1, f2): Computes the remainder operation on two arguments, as prescribed by the IEEE 754 standard
ceil(a): Returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer
floor(a): Returns the largest (closest to positive infinity) value that is less than or equal to the argument and is equal to a mathematical integer
rint(a): Returns the value that is closest in value to the argument and is equal to a mathematical integer
atan2(y, x): Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta)
pow(a, b): Returns the value of the first argument raised to the power of the second argument
round(a): Returns the closest integer to the argument
random(): Returns a random double value
abs(a): Returns the absolute value of a value
max(a, b): Returns the greater of the two values
min(a, b): Returns the smaller of the two values
ulp(d): Returns the size of the unit in the last place of the argument
signum(d): Returns the signum function of the argument
sinh(x): Returns the hyperbolic sine of a value
cosh(x): Returns the hyperbolic cosine of a value
tanh(x): Returns the hyperbolic tangent of a value
hypot(x, y): Returns sqrt(x^2+y^2) without an intermediate overflow or underflow

If you want to retrieve records in a random order, you can use a script with a random method, as shown in the following code:

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script" : {
      "script" : "Math.random()",
      "lang" : "groovy",
      "type" : "number",
      "params" : {}
    }
  }
}'

In this example, for every hit, the new sort value is computed by executing the Math.random() scripting function.

See also
The official ElasticSearch documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Computing return fields with scripting
ElasticSearch allows you to define complex expressions that can be used to return a new calculated field value. These special fields are called script_fields, and they can be expressed with a script in every available ElasticSearch scripting language.
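If you drive ElasticSearch from Python rather than curl, a script_fields request like the one walked through in this recipe can also be sent with the elasticsearch-py client. The following is only an illustrative sketch; it assumes a client version matching this 1.x-era cluster and reuses the test-index data and field names from the examples below:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])
body = {
    "size": 3,
    "query": {"match_all": {}},
    "script_fields": {
        "my_calc_field2": {
            "script": "doc['price'].value * discount",
            "params": {"discount": 0.8}
        }
    }
}
result = es.search(index="test-index", doc_type="test-type", body=body)
for hit in result["hits"]["hits"]:
    print(hit["_id"], hit["fields"])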
Getting ready You will need a working ElasticSearch cluster and an index populated with the script (chapter_06/populate_aggregations.sh), which is available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... In order to compute return fields with scripting, perform the following steps: Return the following script fields: "my_calc_field": This concatenates the text of the "name" and "description" fields "my_calc_field2": This multiplies the "price" value by the "discount" parameter From the command line, execute the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/ _search?&pretty=true&size=3' -d '{ "query": {    "match_all": {} }, "script_fields" : {    "my_calc_field" : {      "script" : "doc["name"].value + " -- " + doc["description"].value"    },    "my_calc_field2" : {      "script" : "doc["price"].value * discount",      "params" : {       "discount" : 0.8      }    } } }' If everything works all right, this is how the result returned by ElasticSearch should be: { "took" : 4, "timed_out" : false, "_shards" : {    "total" : 5,    "successful" : 5,    "failed" : 0 }, "hits" : {    "total" : 1000,    "max_score" : 1.0,    "hits" : [ {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "4",      "_score" : 1.0,      "fields" : {        "my_calc_field" : "entropic -- accusantium",        "my_calc_field2" : 5.480038242170081      }    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "9",      "_score" : 1.0,      "fields" : {        "my_calc_field" : "frankie -- accusantium",        "my_calc_field2" : 34.79852410178313      }    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "11",      "_score" : 1.0,      "fields" : {        "my_calc_field" : "johansson -- accusamus",        "my_calc_field2" : 11.824173084636591      }    } ] } } How it works... The scripting fields are similar to executing an SQL function on a field during a select operation. In ElasticSearch, after a search phase is executed and the hits to be returned are calculated, if some fields (standard or script) are defined, they are calculated and returned. The script field, which can be defined with all the supported languages, is processed by passing a value to the source of the document and, if some other parameters are defined in the script (in the discount factor example), they are passed to the script function. The script function is a code snippet; it can contain everything that the language allows you to write, but it must be evaluated to a value (or a list of values). See also The Installing additional script plugins recipe in this article to install additional languages for scripting The Sorting using script recipe to have a reference of the extra built-in functions in Groovy scripts Filtering a search via scripting ElasticSearch scripting allows you to extend the traditional filter with custom scripts. Using scripting to create a custom filter is a convenient way to write scripting rules that are not provided by Lucene or ElasticSearch, and to implement business logic that is not available in the query DSL. Getting ready You will need a working ElasticSearch cluster and an index populated with the (chapter_06/populate_aggregations.sh) script, which is available at https://github.com/aparo/ elasticsearch-cookbook-second-edition. How to do it... 
In order to filter a search using a script, perform the following steps: Write a search with a filter that filters out a document with the value of age less than the parameter value: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "filtered": {      "filter": {        "script": {          "script": "doc["age"].value > param1",          "params" : {            "param1" : 80          }        }      },      "query": {        "match_all": {}      }    } } }' In this example, all the documents in which the value of age is greater than param1 are qualified to be returned. If everything works correctly, the result returned by ElasticSearch should be as shown here: { "took" : 30, "timed_out" : false, "_shards" : {    "total" : 5,    "successful" : 5,    "failed" : 0 }, "hits" : {    "total" : 237,    "max_score" : 1.0,    "hits" : [ {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "9",      "_score" : 1.0, "_source" :{ … "age": 83, … }    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "23",      "_score" : 1.0, "_source" : { … "age": 87, … }    }, {      "_index" : "test-index",      "_type" : "test-type",      "_id" : "47",      "_score" : 1.0, "_source" : {…. "age": 98, …}    } ] } } How it works... The script filter is a language script that returns a Boolean value (true/false). For every hit, the script is evaluated, and if it returns true, the hit passes the filter. This type of scripting can only be used as Lucene filters, not as queries, because it doesn't affect the search (the exceptions are constant_score and custom_filters_score). These are the scripting fields: script: This contains the code to be executed params: These are optional parameters to be passed to the script lang (defaults to groovy): This defines the language of the script The script code can be any code in your preferred and supported scripting language that returns a Boolean value. There's more... Other languages are used in the same way as Groovy. For the current example, I have chosen a standard comparison that works in several languages. To execute the same script using the JavaScript language, use the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "filtered": {      "filter": {        "script": {          "script": "doc["age"].value > param1",          "lang":"javascript",          "params" : {            "param1" : 80          }        }      },      "query": {        "match_all": {}      }    } } }' For Python, use the following code: curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?&pretty=true&size=3' -d '{ "query": {    "filtered": {      "filter": {        "script": {          "script": "doc["age"].value > param1",          "lang":"python",          "params" : {            "param1" : 80          }        }      },      "query": {        "match_all": {}      }    } } }' See also The Installing additional script plugins recipe in this article to install additional languages for scripting The Sorting data using script recipe in this article to get a reference of the extra built-in functions in Groovy scripts Summary In this article you have learnt the ways you can use scripting to extend the ElasticSearch functional capabilities using different programming languages. 


How bad is the gender diversity crisis in AI research? Study analysing 1.5 million arXiv papers says it’s “serious”

Fatema Patrawala
18 Jul 2019
9 min read
Yesterday the team at Nesta organization, an innovation firm based out of UK published a research on gender diversity in the AI research workforce. The authors of this research are Juan Mateos Garcis, the Director, Konstantinos Stathoulopoulos, the Principal Researcher and Hannah Owen, the Programme Coordinator at Nesta. https://twitter.com/JMateosGarcia/status/1151517641103872006 They have prepared an analysis purely based on 1.5 million arxiv papers. The team claims that it is the first ever study of gender diversity in AI which is not on any convenience sampling or proprietary database. The team posted on its official blog post, “We conducted a large-scale analysis of gender diversity in AI research using publications from arXiv, a repository with more than 1.5 million preprints widely used by the AI community. We aim to expand the evidence base on gender diversity in AI research and create a baseline with which to interrogate the impact of current and future policies and interventions.  To achieve this, we enriched the ArXiv data with geographical, discipline and gender information in order to study the evolution of gender diversity in various disciplines, countries and institutions as well as examine the semantic differences between AI papers with and without female co-authors.” With this research the team also aims to bring prominent female figures they have identified under the spotlight. Key findings from the research Serious gender diversity crisis in AI research The team found a severe gender diversity gap in AI research with only 13.83% of authors being women. Moreover, in relative terms, the proportion of AI papers co-authored by at least one woman has not improved since the 1990s. Juan Mateos thinks this kind of crisis is a waste of talent and it increases the risk of discriminatory AI systems. https://twitter.com/JMateosGarcia/status/1151517642236276736 Location and research domain are significant drivers of gender diversity Women in the Netherlands, Norway and Denmark are more likely to publish AI papers while those in Japan and Singapore are less likely. In the UK, 26.62% of the AI papers have at least one female co-author, placing the country at the 22nd spot worldwide. The US follows the UK in terms of having at least one female co-authors at 25% and for the unique female author US leads one position above UK. Source: Nesta research report Regarding the research domains, women working in Physics and Education, Computer Ethics and other societal issues and Biology are more likely to publish their work on AI in comparison to those working in Computer Science or Mathematics. Source: Nesta research report Significant gender diversity gap in universities, big tech companies and other research institutions Apart from the University of Washington, every other academic institution and organisation in the dataset has less than 25% female AI researchers. Regarding some of the big tech, only 11.3% of Google’s employees who have published their AI research on arXiv are women, while the proportion is similar for Microsoft (11.95%) and is slightly better for IBM (15.66%). Important semantic differences between AI paper with and without a female co-author When examining the publications in the Machine Learning and Societal topics in the UK in 2012 and 2015, papers involving at least one female co-author tend to be more semantically similar to each other than with those without any female authors. 
Moreover, papers with at least one female co-author tend to be more applied and socially aware, with terms such as fairness, human mobility, mental, health, gender and personality being among the most salient ones. Juan Mateos noted that this is an area which deserves further research. https://twitter.com/JMateosGarcia/status/1151517647361781760   The top 15 women with the most AI publications on arXiv identified Aarti Singh, Associate Professor at the Machine learning department of Carnegie Mellon University Cordelia Schmid, is a part of Google AI team and holds a permanent research position at Inria Grenoble Rhone-Alpes Cynthia Rudin, an associate professor of computer science, electrical and computer engineering, statistical science and mathematics at Duke University Devi Parikh, an Assistant Professor in the School of Interactive Computing at Georgia Tech Karen Livescu, an Associate Professor at Toyota Technical Institute at Chicago Kate Saenko,  an Associate Professor at the Department of Computer at Boston University Kristina Lerman, a Project Leader at the Information Sciences Institute at the University of Southern California Marilyn A. Walker, a Professor at the Department of Computer Science at the University of California Mihaela van der Schaar, is John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Turing Fellow at The Alan Turing Institute in London Petia Radeva, a professor at the Department of Mathematics and Computer Science, Faculty of Mathematics and Computer Science at the Universitat de Barcelona Regina Barzilay is a professor at the Massachusetts Institute of Technology and a member of the MIT Computer Science and Artificial Intelligence Laboratory Svetha Venkatesh, an ARC Australian Laureate Fellow, Alfred Deakin Professor and Director of the Centre for Pattern Recognition and Data Analytics (PRaDA) at Deakin University Xiaodan Liang, an Associate Professor at the School of Intelligent Systems Engineering, Sun Yat-sen University Yonina C. Elda, a Professor of Electrical Engineering, Weizmann Faculty of Mathematics and Computer Science at the University of Israel Zeynep Akata, an Assistant Professor with the University of Amsterdam in the Netherlands There are 5 other women researchers who were not identified in the study. Interviews bites from few women contributors and institutions The research team also interviewed few researchers and institutions identified in their work and they think a system wide reform is needed. When the team discussed the findings with the most cited female researcher Mihaela Van Der Schaar, she did feel that her presence in the field has only started to be recognised, having begun her career in 2003, ‘I think that part of the reason for this is because I am a woman, and the experience of (the few) other women in AI in the same period has been similar.’ she says. Professor Van Der Schaar also described herself and many of her female colleagues as ‘faceless’, she suggested that the work of celebrating leading women in the field could have a positive impact on the representation of women, as well as the disparity in the recognition that these women receive. This suggests that work is needed across the pipeline, not just with early-stage invention in education, but support for those women in the field. 
She also highlighted the importance of open discussion about the challenges women face in the AI sector and that workplace changes such as flexible hours are needed to enable researchers to participate in a fast-paced sector without sacrificing their family life. The team further discussed the findings with the University of Washington’s Eve Riskin, Associate Dean of Diversity and Access in the College of Engineering. Riskin described that much of her female faculty experienced a ‘toxic environment’ and pervasive imposter syndrome. She also emphasized the fact that more research is needed in terms of the career trajectories of the male and female researchers including the recruitment and retention. Some recent examples of exceptional women in AI research and their contribution While these women talk about the diversity gaps in this field recently we have seen works from female researchers like Katie Bouman which gained significant attention. Katie is a post-doctoral fellow at MIT whose algorithm led to an image of a supermassive black hole. But then all the attention became a catalyst for a sexist backlash on social media and YouTube. It set off “what can only be described as a sexist scavenger hunt,” as The Verge described it, in which an apparently small group of vociferous men questioned Bouman’s role in the project. “People began going over her work to see how much she’d really contributed to the project that skyrocketed her to unasked-for fame.” Another incredible example in the field of AI research and ethics is of Meredith Whittaker, an ex-Googler, now a program manager, activist, and co-founder of the AI Now Institute at New York University. Meredith is committed to the AI Now Institute, her AI ethics work, and to organize an accountable tech industry. On Tuesday,  Meredith left Google after facing retaliation from company for organizing last year’s protest of Google Walkout for Real Change demanding the company for structural changes to ensure a safe and conducive work environment for everyone.. Other observations from the research and next steps The research also highlights the fact that women are as capable as men in contributing to technical topics while they tend to contribute more than men to publications with a societal or ethical output. Some of the leading AI researchers in the field shared their opinion on this: Petia Radeva, Professor at the Department of Mathematics and Computer Science at the University of Barcelona, was positive that the increasingly broad domains of application for AI and the potential impact of this technology will attract more women into the sector. Similarly, Van Der Schaar suggests that “publicising the interdisciplinary scope of possibilities and career paths that studying AI can lead to will help to inspire a more diverse group of people to pursue it. In parallel, the industry will benefit from a pipeline of people who are motivated by combining a variety of ideas and applying them across domains.” The research team in future will explore the temporal co-authorship network of AI papers to examine how different the career trajectory of male and female researchers might be. They will survey AI researchers on arXiv and investigate the drivers of the diversity gap in more detail through their innovation mapping methods. They also plan to extend this analysis to identify the representation of other underrepresented groups. 


Visualizing my Social Graph with d3.js

Packt
24 Oct 2013
7 min read
Social Networks Analysis

Social Networks Analysis (SNA) is not new; sociologists have been using it for a long time to study human relationships (sociometry), to find communities, and to simulate how information or a disease spreads through a population. With the rise of social networking sites such as Facebook, Twitter, and LinkedIn, acquiring large amounts of social network data has become much easier. We can use SNA to get insight into customer behavior or unknown communities. It is important to say that this is not a trivial task: we will come across sparse data and a lot of noise (meaningless data), and we need to understand how to distinguish false correlation from causation. A good start is to get to know our graph through visualization and statistical analysis.

Social networking sites give us the opportunity to ask questions that are otherwise too hard to approach, because polling enough people is time-consuming and expensive. In this article, we will obtain our social network's graph from the Facebook (FB) website in order to visualize the relationships between our friends. Finally, we will create an interactive visualization of our graph using D3.js.

Getting ready

The easiest method to get our friends list is by using a third-party application. Netvizz is a Facebook app developed by Bernhard Rieder, which allows exporting social graph data to gdf and tab formats. Netvizz may export information about our friends such as gender, age, locale, posts, and likes. In order to get our social graph from Netvizz, we need to access the link below and give it access to our Facebook profile:

https://apps.facebook.com/netvizz/

As shown in the following screenshot, we will create a gdf file from our personal friend network by clicking on the link named here in Step 2. Then we will download the GDF (Graph Modeling Language) file. Netvizz will give us the number of nodes and edges (links); finally, we will click on the gdf file link, as we can see in the following screenshot.

The output file myFacebookNet.gdf will look like this:

nodedef>name VARCHAR,label VARCHAR,gender VARCHAR,locale VARCHAR,agerank INT
23917067,Jorge,male,en_US,106
23931909,Haruna,female,en_US,105
35702006,Joseph,male,en_US,104
503839109,Damian,male,en_US,103
532735006,Isaac,male,es_LA,102
. . .
edgedef>node1 VARCHAR,node2 VARCHAR
23917067,35702006
23917067,629395837
23917067,747343482
23917067,755605075
23917067,1186286815
. . .

In the following screenshot we may see the visualization of the graph (106 nodes and 279 links). The nodes represent my friends and the links represent how my friends are connected to each other.

Transforming GDF to JSON

In order to work with the graph on the web with d3.js, we need to transform our gdf file into json format. First, we need to import the numpy and json libraries:

import numpy as np
import json

The numpy function genfromtxt will obtain only the ID and name from the nodes.csv file, using the usecols attribute, in the 'object' format:

nodes = np.genfromtxt("nodes.csv",dtype='object',delimiter=',',skip_header=1,usecols=(0,1))

Then, the numpy function genfromtxt will obtain the links, with the source node and target node, from the links.csv file, using the usecols attribute, in the 'object' format:
links = np.genfromtxt("links.csv",dtype='object',delimiter=',',skip_header=1,usecols=(0,1))

The JSON format used in the D3.js force layout graph implemented in this article requires transforming each ID (for example, 100001448673085) into its numerical position in the list of nodes. We therefore look for each appearance of an ID in the links and replace it with its position in the list of nodes:

for n in range(len(nodes)):
    for ls in range(len(links)):
        if nodes[n][0] == links[ls][0]:
            links[ls][0] = n
        if nodes[n][0] == links[ls][1]:
            links[ls][1] = n

Now, we need to create a dictionary, data, to store the JSON file:

data = {}

Next, we need to create a list of nodes with the names of the friends in the format "nodes": [{"name": "X"},{"name": "Y"}, . . .] and add it to the data dictionary:

lst = []
for x in nodes:
    d = {}
    d["name"] = str(x[1]).replace("b'","").replace("'","")
    lst.append(d)
data["nodes"] = lst

Now, we need to create a list of links with the source and target in the format "links": [{"source": 0, "target": 2},{"source": 1, "target": 2}, . . .] and add it to the data dictionary:

lnks = []
for ls in links:
    d = {}
    d["source"] = ls[0]
    d["target"] = ls[1]
    lnks.append(d)
data["links"] = lnks

Finally, we need to create the file newJson.json and write the data dictionary to it with the dumps function of the json library:

with open("newJson.json","w") as f:
    f.write(json.dumps(data))

The file newJson.json will look as follows:

{"nodes": [{"name": "Jorge"},
           {"name": "Haruna"},
           {"name": "Joseph"},
           {"name": "Damian"},
           {"name": "Isaac"},
           . . .],
 "links": [{"source": 0, "target": 2},
           {"source": 0, "target": 12},
           {"source": 0, "target": 20},
           {"source": 0, "target": 23},
           {"source": 0, "target": 31},
           . . .]}

Graph visualization with D3.js

D3.js provides us with the d3.layout.force() function, which uses a force-directed layout algorithm and helps us to visualize our graph. First, we need to define the CSS styles for the nodes, links, and node labels:

<style>
.link {
  fill: none;
  stroke: #666;
  stroke-width: 1.5px;
}
.node circle {
  fill: steelblue;
  stroke: #fff;
  stroke-width: 1.5px;
}
.node text {
  pointer-events: none;
  font: 10px sans-serif;
}
</style>

Then, we need to reference the d3.js library:

<script src="http://d3js.org/d3.v3.min.js"></script>

Then, we need to define the width and height parameters for the svg container and include it in the body tag:

var width = 1100,
    height = 800;

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

Now, we define the properties of the force layout, such as gravity, distance, and size:

var force = d3.layout.force()
    .gravity(.05)
    .distance(150)
    .charge(-100)
    .size([width, height]);

Then, we need to acquire the data of the graph in JSON format and configure the nodes and links:

d3.json("newJson.json", function(error, json) {
  force
      .nodes(json.nodes)
      .links(json.links)
      .start();

For a complete reference on the d3.js force layout implementation, visit https://github.com/mbostock/d3/wiki/Force-Layout.

Then, we define the links as lines from the json data, and the nodes as groups that can be dragged:

  var link = svg.selectAll(".link")
      .data(json.links)
      .enter().append("line")
      .attr("class", "link");

  var node = svg.selectAll(".node")
      .data(json.nodes)
      .enter().append("g")
      .attr("class", "node")
      .call(force.drag);

Now, we define the nodes as circles of size 6 and include the labels of each node:
  node.append("circle")
      .attr("r", 6);

  node.append("text")
      .attr("dx", 12)
      .attr("dy", ".35em")
      .text(function(d) { return d.name });

Finally, with the tick function, we run the force layout simulation step by step:

  force.on("tick", function() {
    link.attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    node.attr("transform", function(d) {
      return "translate(" + d.x + "," + d.y + ")";
    });
  });
});
</script>

In the image below we can see the result of the visualization. In order to run the visualization, we just need to open a command terminal and run the following Python command (or use any other web server):

>> python -m http.server 8000

Then we just need to open a web browser and enter the address http://localhost:8000/ForceGraph.html. In the HTML page we can see our Facebook graph with a gravity effect, and we can interactively drag and drop the nodes.

All the code and datasets for this article can be found in the author's GitHub repository at https://github.com/hmcuesta/PDA_Book/tree/master/Chapter10

Summary
In this article we developed our own social graph visualization tool with D3.js, transforming the data obtained from Netvizz in GDF format into JSON.


Preprocessing the Data

Packt
16 Aug 2016
5 min read
In this article, Sampath Kumar Kanthala, the author of the book Practical Data Analysis, discusses how to obtain, clean, normalize, and transform raw data into a standard format such as CSV or JSON using OpenRefine. In this article we will cover:

Data scrubbing
Statistical methods
Text parsing
Data transformation

Data scrubbing

Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated. The result of the data analysis process depends not only on the algorithms but also on the quality of the data. That is why the step that follows obtaining the data is data scrubbing. In order to avoid dirty data, our dataset should possess the following characteristics:

Correctness
Completeness
Accuracy
Consistency
Uniformity

Dirty data can be detected by applying some simple statistical data validation, and also by parsing texts or deleting duplicate values. Missing or sparse data can lead you to highly misleading results.

Statistical methods

In this method we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous; even if the data type matches but the values are out of range, this can be resolved by setting the values to an average or mean value. Statistical validations can be used to handle missing values, which can be replaced by one or more probable values using interpolation, or by reducing the dataset using decimation. The main tools are listed here (a short Python sketch of mean- and median-based replacement follows this list):

Mean: The value calculated by summing up all values and then dividing by the number of values.
Median: The median is defined as the value where 50% of the values in a range will be below it and 50% of the values above it.
Range constraints: Numbers or dates should fall within a certain range; that is, they have minimum and/or maximum possible values.
Clustering: Usually, when we obtain data directly from the user, some values include ambiguity or refer to the same value with a typo. For example, "Buchanan Deluxe 750ml 12x01" and "Buchanan Deluxe 750ml   12x01.", which differ only by a ".", or "Microsoft" or "MS" instead of "Microsoft Corporation", which all refer to the same company, and all values are valid. In these cases, grouping can help us to get accurate data and eliminate duplicates, enabling a faster identification of unique values.
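As a concrete illustration of the mean and median ideas above, the following minimal Python sketch (standard library only; the readings are made-up numbers) fills in missing values with the mean or the more outlier-robust median of the observed values:

from statistics import mean, median

readings = [10.5, 11.0, None, 10.8, None, 55.0]   # None marks missing values; 55.0 is an outlier
observed = [r for r in readings if r is not None]

fill_mean = mean(observed)      # pulled upward by the 55.0 outlier
fill_median = median(observed)  # more robust to the outlier

cleaned = [r if r is not None else fill_median for r in readings]
print(fill_mean, fill_median, cleaned)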
Text parsing

We perform parsing to help us validate whether a string of data is well formatted and to avoid syntax errors. Text fields such as dates, e-mails, phone numbers, and IP addresses are usually validated with regular expression patterns (Regex is a common abbreviation for "regular expression"). In Python we will use the re module to implement regular expressions, with which we can perform text searches and pattern validations. First, we need to import the re module:

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).

E-mail validation:

myString = 'From: readers@packt.com (readers email)'
result = re.search('([\w.-]+)@([\w.-]+)', myString)
if result:
    print (result.group(0))
    print (result.group(1))
    print (result.group(2))

Output:

>>> readers@packt.com
>>> readers
>>> packt.com

The function search() scans through a string, searching for any location where the Regex matches. The function group() returns the string matched by the Regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_].

IP address validation:

isIP = re.compile('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = re.findall(isIP, myString)
print(result)

Output:

>>> ['192.168.1.254']

The function findall() finds all the substrings where the Regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].

Date format:

myString = "01/04/2001"
isDate = re.match('[0-1][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

Output:

>>> valid

The function match() checks whether the Regex matches at the beginning of the string. The pattern uses the class [0-9] in order to parse the date format. For more information about regular expressions, see http://docs.python.org/3.4/howto/regex.html#regex-howto

Data transformation

Data transformation is usually related to databases and data warehouses, where values from a source format are extracted, transformed, and loaded into a destination format. Extract, Transform, and Load (ETL) obtains data from various data sources, performs some transformation functions depending on our data model, and loads the resulting data into the destination, as sketched in the example below. Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases. Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data. Data loading allows us to load data into a destination format, such as relational databases, text files (JSON, CSV, XML), and NoSQL databases. In statistics, data transformation refers to the application of a mathematical function to the dataset or time series points.
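To make the extract-transform-load flow described above concrete, here is a minimal, illustrative sketch in Python; the file names and column names are placeholders, not part of the original article:

import csv
import json

# extract: read rows from a CSV source
with open("source.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# transform: trim text fields and convert a numeric column
for row in rows:
    row["name"] = row["name"].strip()
    row["amount"] = float(row["amount"])

# load: write the cleaned rows to a JSON destination
with open("destination.json", "w") as dst:
    json.dump(rows, dst, indent=2)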

Kunal Chaudhari
04 Jan 2018
6 min read

Creating reports using SQL Server 2016 Reporting Services

This article is an excerpt from a book authored by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook. This book will help you create cross-browser and cross-platform reports using SQL Server 2016 Reporting Services.

In today's tutorial, we explore the steps to create reports with multi-axis charts in SQL Server 2016. Often you will want to have multiple items plotted on a chart. In this article, we will plot two values over time, in this case the Total Sales Amount (Excluding Tax) and the Total Tax Amount. As you might expect, though, the tax amounts are going to be a small percentage of the sales amounts. By default, this would create a chart with a huge gap in the middle and a Y axis that is quite large and difficult to pinpoint values on. To prevent this, Reporting Services allows us to place a second Y axis on our charts. In this article, we'll explore both adding a second line to our chart and having it plotted on a second Y axis.

Getting ready

First, we'll create a new Reporting Services project to contain it. Name this new project Chapter03. Within the new project, create a Shared Data Source that will connect to the WideWorldImportersDW database. Name the new data source after the database, WideWorldImportersDW.

Next, we'll need data. Our data will come from the sales table, and we will want to sum our totals by year, so we can plot the years across the X axis. For the Y axis, we'll use the totals of two fields: TotalExcludingTax and TaxAmount. Here is the query by which we will accomplish this:

SELECT YEAR([Invoice Date Key]) AS InvoiceYear
      ,SUM([Total Excluding Tax]) AS TotalExcludingTax
      ,SUM([Tax Amount]) AS TotalTaxAmount
FROM [Fact].[Sale]
GROUP BY YEAR([Invoice Date Key])

How to do it…

Right-click on the Reports branch in the Solution Explorer. Go to Add | New Item… from the pop-up menu. On the Add New Item dialog, select Report from the choice of templates in the middle (do not select Report Wizard). At the bottom, name the report Report 03-01 Multi Axis Charts.rdl and click on Add.

Go to the Report Data tool pane. Right-click on Data Sources and then click Add Data Source… from the menu. In the Name: area, enter WideWorldImportersDW. Change the data source option to Use shared data source reference. In the dropdown, select WideWorldImportersDW. Click on OK to close the Data Source Properties window.

Right-click on the Datasets branch and select Add Dataset…. Name the dataset SalesTotalsOverTime. Select the Use a dataset embedded in my report option. Select WideWorldImportersDW in the Data source dropdown. Paste in the query from the Getting ready section of this article. When your window resembles that of the preceding figure, click on OK.

Next, go to the Toolbox pane. Drag and drop a Chart tool onto the report. Select the leftmost Line chart from the Select Chart Type window, and click on OK. Resize the chart to a larger size. (For this demo, the exact size is not important. For your production reports, you can resize as needed using the Properties window, as seen previously.)

Click inside the main chart area to make the Chart Data dialog appear to the right of the chart. Click on the + (plus button) to the right of Values. Select TotalExcludingTax. Click on the plus button again, and now pick TotalTaxAmount. Click on the + (plus button) beside Category Groups, and pick InvoiceYear. Click on Preview. You will note the large gap between the two graphed lines.
In addition, the values for the Total Tax Amount are almost impossible to guess, as shown in the following figure.

Return to the designer, and again click in the chart area to make the Chart Data dialog appear. In the Chart Data dialog, click on the dropdown beside TotalTaxAmount and select Series Properties…. Click on the Axes and Chart Area page, and for Vertical axis, select Secondary. Click on OK to close the Series Properties window. Right-click on the numbers now appearing on the right in the vertical axis area, and select Secondary Vertical Axis Properties in the menu. In the Axis Options, uncheck Always include zero. Click on the Number page. Under Category, select Currency. Change the Decimal places to 0, and place a check in Use 1000 separator. Click on OK to close this window. Now move to the vertical axis on the left-hand side of the chart, right-click, and pick Vertical Axis Properties. Uncheck Always include zero. On the Number page, pick Currency, set Decimal places to 0, and check Use 1000 separator. Click on OK to close. Click on the Preview tab to see the results.

You can now see a chart with a second axis. The monetary amounts are much easier to read. Further, the plotted lines have a similar rise and fall, indicating that the taxes collected matched the sales totals in terms of trending (a short script at the end of this article shows one way to check this directly against the dataset query).

SSRS is capable of plotting multiple lines on a chart. Here we've placed just two fields, but you can add as many as you need. Do realize, though, that the more lines included, the harder the chart can become to read. All that is needed is to put the additional fields into the Values area of the Chart Data window. When these values are of a similar scale, for example sales broken up by state, this works fine. There are times, though, when the scale between plotted values is so great that it distorts the entire chart, leaving one value in a slender line at the top and another at the bottom, with a huge gap in the middle. To fix this, SSRS allows a second Y axis to be included. This will create a scale for the field (or fields) assigned to that axis in the Series Properties window.

To summarize, creating reports with multiple axes is much simpler with SQL Server 2016 Reporting Services. If you liked our post, check out the book SQL Server 2016 Reporting Services Cookbook to learn more about different types of reporting and Power BI integrations.
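Before leaving this recipe, here is a rough way to check the dataset query and the sales-to-tax ratio outside Report Designer. This is my own sketch, not part of the cookbook; it assumes the pyodbc package, a local SQL Server instance, and Windows authentication.

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=WideWorldImportersDW;Trusted_Connection=yes;'
)

query = """
SELECT YEAR([Invoice Date Key]) AS InvoiceYear,
       SUM([Total Excluding Tax]) AS TotalExcludingTax,
       SUM([Tax Amount]) AS TotalTaxAmount
FROM [Fact].[Sale]
GROUP BY YEAR([Invoice Date Key])
ORDER BY InvoiceYear
"""

for year, total_ex_tax, total_tax in conn.cursor().execute(query):
    # The tax line should be a small, fairly steady fraction of sales, which
    # is exactly why the chart needs a secondary axis.
    print(year, total_ex_tax, total_tax, round(float(total_tax) / float(total_ex_tax), 3))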
Packt
01 Dec 2010
9 min read

Animating Graphic Objects using Python

Python 2.6 Graphics Cookbook Over 100 great recipes for creating and animating graphics using Python Create captivating graphics with ease and bring them to life using Python Apply effects to your graphics using powerful Python methods Develop vector as well as raster graphics and combine them to create wonders in the animation world Create interactive GUIs to make your creation of graphics simpler Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to accomplish the task of creation and animation of graphics as efficiently as possible        Precise collisions using floating point numbers Here the simulation flaws caused by the coarseness of integer arithmetic are eliminated by using floating point numbers for all ball position calculations. How to do it... All position, velocity, and gravity variables are made floating point by writing them with explicit decimal points. The result is shown in the following screenshot, showing the bouncing balls with trajectory tracing. from Tkinter import * root = Tk() root.title("Collisions with Floating point") cw = 350 # canvas width ch = 200 # canvas height GRAVITY = 1.5 chart_1 = Canvas(root, width=cw, height=ch, background="black") chart_1.grid(row=0, column=0) cycle_period = 80 # Time between new positions of the ball # (milliseconds). time_scaling = 0.2 # This governs the size of the differential steps # when calculating changes in position. # The parameters determining the dimensions of the ball and it's # position. ball_1 = {'posn_x':25.0, # x position of box containing the # ball (bottom). 'posn_y':180.0, # x position of box containing the # ball (left edge). 'velocity_x':30.0, # amount of x-movement each cycle of # the 'for' loop. 'velocity_y':100.0, # amount of y-movement each cycle of # the 'for' loop. 'ball_width':20.0, # size of ball - width (x-dimension). 'ball_height':20.0, # size of ball - height (y-dimension). 'color':"dark orange", # color of the ball 'coef_restitution':0.90} # proportion of elastic energy # recovered each bounce ball_2 = {'posn_x':cw - 25.0, 'posn_y':300.0, 'velocity_x':-50.0, 'velocity_y':150.0, 'ball_width':30.0, 'ball_height':30.0, 'color':"yellow3", 'coef_restitution':0.90} def detectWallCollision(ball): # Collision detection with the walls of the container if ball['posn_x'] > cw - ball['ball_width']: # Collision # with right-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] # reverse direction. ball['posn_x'] = cw - ball['ball_width'] if ball['posn_x'] < 1: # Collision with left-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = 2 # anti-stick to the wall if ball['posn_y'] < ball['ball_height'] : # Collision # with ceiling. ball['velocity_y'] = -ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ball['ball_height'] if ball['posn_y'] > ch - ball['ball_height']: # Floor # collision. ball['velocity_y'] = - ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ch - ball['ball_height'] def diffEquation(ball): # An approximate set of differential equations of motion # for the balls ball['posn_x'] += ball['velocity_x'] * time_scaling ball['velocity_y'] = ball['velocity_y'] + GRAVITY # a crude # equation incorporating gravity. 
ball['posn_y'] += ball['velocity_y'] * time_scaling chart_1.create_oval( ball['posn_x'], ball['posn_y'], ball['posn_x'] + ball['ball_width'], ball ['posn_y'] + ball['ball_height'], fill= ball['color']) detectWallCollision(ball) # Has the ball collided with # any container wall? for i in range(1,2000): # end the program after 1000 position shifts. diffEquation(ball_1) diffEquation(ball_2) chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 # milliseconds. chart_1.delete(ALL) # This erases everything on the root.mainloop() How it works... Use of precision arithmetic has allowed us to notice simulation behavior that was previously hidden by the sins of integer-only calculations. This is the UNIQUE VALUE OF GRAPHIC SIMULATION AS A DEBUGGING TOOL. If you can represent your ideas in a visual way rather than as lists of numbers you will easily pick up subtle quirks in your code. The human brain is designed to function best in graphical images. It is a direct consequence of being a hunter. A graphic debugging tool... There is another very handy trick in the software debugger's arsenal and that is the visual trace. A trace is some kind of visual trail that shows the history of dynamic behavior. All of this is revealed in the next example. Trajectory tracing and ball-to-ball collisions Now we introduce one of the more difficult behaviors in our simulation of ever increasing complexity – the mid-air collision. The hardest thing when you are debugging a program is to try to hold in your short term memory some recently observed behavior and compare it meaningfully with present behavior. This kind of memory is an imperfect recorder. The way to overcome this is to create a graphic form of memory – some sort of picture that shows accurately what has been happening in the past. In the same way that military cannon aimers use glowing tracer projectiles to adjust their aim, a graphic programmer can use trajectory traces to examine the history of execution. How to do it... In our new code there is a new function called detect_ball_collision (ball_1, ball_2) whose job is to anticipate imminent collisions between the two balls no matter where they are. The collisions will come from any direction and therefore we need to be able to test all possible collision scenarios and examine the behavior of each one and see if it does not work as planned. This can be too difficult unless we create tools to test the outcome. In this recipe, the tool for testing outcomes is a graphic trajectory trace. It is a line that trails behind the path of the ball and shows exactly where it went right since the beginning of the simulation. The result is shown in the following screenshot, showing the bouncing with ball-to-ball collision rebounds. # kinetic_gravity_balls_1.py # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from Tkinter import * import math root = Tk() root.title("Balls bounce off each other") cw = 300 # canvas width ch = 200 # canvas height GRAVITY = 1.5 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 80 # Time between new positions of the ball # (milliseconds). time_scaling = 0.2 # The size of the differential steps # The parameters determining the dimensions of the ball and its # position. 
ball_1 = {'posn_x':25.0, 'posn_y':25.0, 'velocity_x':65.0, 'velocity_y':50.0, 'ball_width':20.0, 'ball_height':20.0, 'color':"SlateBlue1", 'coef_restitution':0.90} ball_2 = {'posn_x':180.0, 'posn_y':ch- 25.0, 'velocity_x':-50.0, 'velocity_y':-70.0, 'ball_width':30.0, 'ball_height':30.0, 'color':"maroon1", 'coef_restitution':0.90} def detect_wall_collision(ball): # detect ball-to-wall collision if ball['posn_x'] > cw - ball['ball_width']: # Right-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = cw - ball['ball_width'] if ball['posn_x'] < 1: # Left-hand wall. ball['velocity_x'] = -ball['velocity_x'] * ball['coef_ restitution'] ball['posn_x'] = 2 if ball['posn_y'] < ball['ball_height'] : # Ceiling. ball['velocity_y'] = -ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ball['ball_height'] if ball['posn_y'] > ch - ball['ball_height'] : # Floor ball['velocity_y'] = - ball['velocity_y'] * ball['coef_ restitution'] ball['posn_y'] = ch - ball['ball_height'] def detect_ball_collision(ball_1, ball_2): #detect ball-to-ball collision # firstly: is there a close approach in the horizontal direction if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25: # secondly: is there also a close approach in the vertical # direction. if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25: ball_1['velocity_x'] = -ball_1['velocity_x'] # reverse # direction. ball_1['velocity_y'] = -ball_1['velocity_y'] ball_2['velocity_x'] = -ball_2['velocity_x'] ball_2['velocity_y'] = -ball_2['velocity_y'] # to avoid internal rebounding inside balls ball_1['posn_x'] += ball_1['velocity_x'] * time_scaling ball_1['posn_y'] += ball_1['velocity_y'] * time_scaling ball_2['posn_x'] += ball_2['velocity_x'] * time_scaling ball_2['posn_y'] += ball_2['velocity_y'] * time_scaling def diff_equation(ball): x_old = ball['posn_x'] y_old = ball['posn_y'] ball['posn_x'] += ball['velocity_x'] * time_scaling ball['velocity_y'] = ball['velocity_y'] + GRAVITY ball['posn_y'] += ball['velocity_y'] * time_scaling chart_1.create_oval( ball['posn_x'], ball['posn_y'], ball['posn_x'] + ball['ball_width'], ball['posn_y'] + ball['ball_height'], fill= ball['color'], tags="ball_tag") chart_1.create_line( x_old, y_old, ball['posn_x'], ball ['posn_y'], fill= ball['color']) detect_wall_collision(ball) # Has the ball # collided with any container wall? for i in range(1,5000): diff_equation(ball_1) diff_equation(ball_2) detect_ball_collision(ball_1, ball_2) chart_1.update() chart_1.after(cycle_period) chart_1.delete("ball_tag") # Erase the balls but # leave the trajectories root.mainloop() How it works... Mid-air ball against ball collisions are done in two steps. In the first step, we test whether the two balls are close to each other inside a vertical strip defined by if math.fabs(ball_1['posn_x'] - ball_2['posn_x']) < 25. In plain English, this asks "Is the horizontal distance between the balls less than 25 pixels?" If the answer is yes, then the region of examination is narrowed down to a small vertical distance less than 25 pixels by the statement if math.fabs(ball_1['posn_y'] - ball_2['posn_y']) < 25. So every time the loop is executed, we sweep the entire canvas to see if the two balls are both inside an area where their bottom-left corners are closer than 25 pixels to each other. If they are that close then we simply cause a rebound off each other by reversing their direction of travel in both the horizontal and vertical directions. There's more... 
Simply reversing the direction is not the mathematically correct way to model colliding balls; certainly, billiard balls do not behave that way. The law of physics that governs colliding spheres demands that momentum be conserved.

Why do we sometimes get Tkinter TclError exceptions?

If we click the close window button (the X in the top right) while Python is paused, then when Python revives and calls on Tcl (Tkinter) to draw something on the canvas, we will get an error message. What probably happens is that the application has already shut down, but Tcl has unfinished business. If we allow the program to run to completion before trying to shut the window, then termination is orderly.
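As noted above, the physically correct fix is to conserve momentum. The following standalone sketch shows a one-dimensional elastic collision for balls of different masses; wiring it into detect_ball_collision() is left as an exercise and is my suggestion rather than the author's code.

# One-dimensional elastic collision: conserves momentum and kinetic energy
# for two balls with masses m1, m2 and incoming velocities u1, u2.
def elastic_collision_1d(m1, u1, m2, u2):
    v1 = ((m1 - m2) * u1 + 2.0 * m2 * u2) / (m1 + m2)
    v2 = ((m2 - m1) * u2 + 2.0 * m1 * u1) / (m1 + m2)
    return v1, v2

# Example: a light ball moving at 30 hits a heavy, stationary ball head on.
v1, v2 = elastic_collision_1d(1.0, 30.0, 3.0, 0.0)
print(v1, v2)                            # -15.0 15.0
print(1.0 * 30.0, 1.0 * v1 + 3.0 * v2)   # momentum before and after: 30.0 30.0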

Packt
25 Oct 2011
5 min read

Oracle BI Publisher 11g: Working with Multiple Data Sources

(For more resources on Oracle 11g, see here.) The Data Model Editor's interface deals with all the components and functionalities needed for the data model to achieve the structure you need. However, the main component is the Data Set. In order to create a data model structure in BIP, you can choose from a variety of data set types, such as: SQL Query MDX Query Oracle BI Analysis View Object Web Service LDAP Query XML file Microsoft Excel file Oracle BI Discoverer HTTP Taking advantage of this variety requires multiple Data Sources of different types to be defined in the BIP. In this article, we will see: How data sources are configured How the data is retrieved from different data sets How data set type characteristics and the links between elements influence the data model structure Administration Let's first see, how you can verify or configure your data sources. You must choose the Administration link found in the upper-right corner of any of the BIP interface pages, as shown in the following screenshot:     The connection to your database can be choosen from the following connection types: Java Database Connectivity (JDBC) Java Naming and Directory Interface (JNDI) Lightweight Directory Access Protocol (LDAP) Online Analytical Processing (OLAP) Available Data Sources To get to your data source, BIP offers two possibilities: YOu can use a connection. In order to use a connection, these are the available connection types: JDBC JNDI LDAP OLAP You can also use a file. In the following sections, the Data Source types&mdashJDBC, JNDI, OLAP Connections, and File&mdashwill be explained in detail. JDBC Connection Let's take the first example. To configure a Data Source to use JDBC, from the Administration page, choose JDBC Connection from the Data Sources types list, as shown in the following screenshot:     You can see the requested parameters for configuring a JDBC connection in the following screenshot: Data Source Name: Enter a name of your choice. Driver Type: Choose a type from the list. The relating parameters are: Database Driver Class: A driver, matching your database type. Connection String: Information containing the computer name on which your database server is running, for example, port, database name, and so on. Username: Enter a database username. Password: Provide the database user's password. The Use System User option allows you to use the operating system's credentials as your credentials. For example, in this case, your MS SQL Database Server uses Windows authentication as the only authentication method. When you have a system administrator in-charge of these configurations, all you have to do is to find which are the available Data Sources and eventually you can check if the connection works. Click on the Test Connection button at the bottom of the page to test the connection:     JNDI Connection JNDI Connection pool is in fact another way to access your JDBC Data Sources. Using a connection pool increases efficiency by maintaining a cache of physical connections that can be reused, allowing multiple clients to share a small number of physical connections. In order to configure a Data Source to use JNDI, from the Administration page, choose JNDI Connection from the Data Sources types list. 
The following screen will appear:     As you can see in the preceding screenshot, on the Add Data Source page you must enter the following parameters: Data Source Name: Enter a name of your choice JNDI Name: This is the JNDI location for the pool set up in your application server, for example, jdbc/BIP10gSource The users having roles included in Allowed Roles list only will be able to create reports using this Data Source. OLAP Connection Use the OLAP Connection to connect to OLAP databases. BI Publisher supports the following OLAP types: Oracle Hyperion Essbase Microsoft SQL Server 2000 Analysis Services Microsoft SQL Server 2005 Analysis Services SAP BW In order to configure a connection to an OLAP database, from the Administration page, choose OLAP Connection from the Data Sources types list. The following screen will appear:     On the Add Data Source page, the following parameters must be entered: Data Source Name: Enter a name of your choice OLAP Type: Choose a type from the list Connection String: Depending on the supported OLAP databases, the connection string format is as follows: Oracle Hyperion Essbase Format: [server name] Microsoft SQL Server 2000 Analysis Services Format: Data Source=[server];Provider=msolap;Initial Catalog=[catalog] Microsoft SQL Server 2005 Analysis Services Format: Data Source=[server];Provider=msolap.3;Initial Catalog=[catalog] SAP BW Format: ASHOST=[server] SYSNR=[system number] CLIENT=[client] LANG=[language] Username and Password: Used for OLAP database authentication File Another example of a data source type is File. In order to gain access to XML or Excel files, you need a File Data Source. In order to set up this kind of Data Source, only one step is required&mdashenter the path to the Directory in which your files reside. You can see in the following screenshot that demo files Data Source points to the default BIP files directory. The file needs to be accessible from the BI Server (not on your local machine):    
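If you want to check the same JDBC details outside BI Publisher, a few lines of Python can mimic the Test Connection button. This is only a sketch and an assumption on my part; it relies on the jaydebeapi package, a copy of the Microsoft JDBC driver jar, and made-up server and login names.

import jaydebeapi

conn = jaydebeapi.connect(
    'com.microsoft.sqlserver.jdbc.SQLServerDriver',          # driver class
    'jdbc:sqlserver://dbserver:1433;databaseName=Sales',     # connection string
    ['bip_user', 'bip_password'],                            # credentials
    'mssql-jdbc.jar',                                        # driver jar on disk
)

cursor = conn.cursor()
cursor.execute('SELECT 1')   # the simplest possible "test connection" query
print('Connection OK:', cursor.fetchone())
conn.close()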

Packt
27 Jun 2011
7 min read

Pentaho Data Integration 4: working with complex data flows

Joining two or more streams based on given conditions There are occasions where you will need to join two datasets. If you are working with databases, you could use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution. Kettle provides the Merge Join step to join data coming from any kind of source. Let's assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs. So, you want to compare both to see the progress. Getting ready To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls has the real starting date, end date, and cost for tasks that have already started. You can download the sample files from here. How to do it... Carry out the following steps: Create a new transformation. Drop two Excel input steps into the canvas. Use one step for reading the budget information (budget.xls file) and the other for reading the costs information (costs.xls file). Under the Fields tab of these steps, click on the Get fields from header row... button in order to populate the grid automatically. Apply the format dd/MM/yyyy to the fields of type Date and $0.00 to the fields with costs. Add a Merge Join step from the Join category, and create a hop from each Excel input step toward this step. The following diagram depicts what you have so far: Configure the Merge Join step, as shown in the following screenshot: If you do a preview on this step, you will obtain the result of the two Excel files merged. In order to have the columns more organized, add a Select values step from the Transform category. In this new step, select the fields in this order: task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost. Doing a preview on the last step, you will obtain the merged data with the columns of both Excel files interspersed, as shown in the following screenshot: How it works... In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input. In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task field. The rows are expected to be sorted in an ascending manner on the specified key fields. There's more... In the example, you set the Join Type to LEFT OUTER JOIN. Let's see explanations of the possible join options: Interspersing new rows between existent rows In most Kettle datasets, all rows share a common meaning; they represent the same kind of entity, for example: In a dataset with sold items, each row has data about one item In a dataset with the mean temperature for a range of days in five different regions, each row has the mean temperature for a different day in one of those regions In a dataset with a list of people ordered by age range (0-10, 11-20, 20-40, and so on), each row has data about one person Sometimes, there is a need of interspersing new rows between your current rows. 
Taking the previous examples, imagine the following situations: In the sold items dataset, every 10 items, you have to insert a row with the running quantity of items and running sold price from the first line until that line. In the temperature's dataset, you have to order the data by region and the last row for each region has to have the average temperature for that region. In the people's dataset, for each age range, you have to insert a header row just before the rows of people in that range. In general, the rows you need to intersperse can have fixed data, subtotals of the numbers in previous rows, header to the rows coming next, and so on. What they have in common is that they have a different structure or meaning compared to the rows in your dataset. Interspersing these rows is not a complicated task, but is a tricky one. In this recipe, you will learn how to do it. Suppose that you have to create a list of products by category. For each category, you have to insert a header row with the category description and the number of products inside that category. The final result should be as follows: Getting ready This recipe uses an outdoor database with the structure shown in Appendix, Data Structures (Download here). As source, you can use a database like this or any other source, for example a text file with the same structure. How to do it... Carry out the following steps: Create a transformation, drag into the canvas a Table Input step, select the connection to the outdoor database, or create it if it doesn't exist. Then enter the following statement: SELECT category , desc_product FROM products p ,categories c WHERE p.id_category = c.id_category ORDER by category Do a preview of this step. You already have the product list! Now, you have to create and intersperse the header rows. In order to create the headers, do the following: From the Statistics category, add a Group by step and fill in the grids, as shown in the following screenshot: From the Scripting category, add a User Defined Java Expression step, and use it to add two fields: The first will be a String named desc_product, with value ("Category: " + category).toUpperCase(). The second will be an Integer field named order with value 1. Use a Select values step to reorder the fields as category, desc_product, qty_product, and order. Do a preview on this step; you should see the following result: Those are the headers. The next step is mixing all the rows in the proper order. Drag an Add constants step into the canvas and a Sort rows step. Link them to the other steps as shown: Use the Add constants to add two Integer fields: qty_prod and order. As Value, leave the first field empty, and type 2 for the second field. Use the Sort rows step for sorting by category, order, and desc_product. Select the last step and do a preview. You should see the rows exactly as shown in the introduction. How it works... When you have to intersperse rows between existing rows, there are just four main tasks to do, as follows: Create a secondary stream that will be used for creating new rows. In this case, the rows with the headers of the categories. In each stream, add a field that will help you intersperse rows in the proper order. In this case, the key field was named order. Before joining the two streams, add, remove, and reorder the fields in each stream to make sure that the output fields in each stream have the same metadata. Join the streams and sort by the fields that you consider appropriate, including the field created earlier. 
In this case, you sorted by category, inside each category by the field named order and finally by the products description. Note that in this case, you created a single secondary stream. You could create more if needed, for example, if you need a header and footer for each category.
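Outside Kettle, the same LEFT OUTER JOIN can be sketched in a few lines of pandas, which is a handy way to double-check what the Merge Join step should return. This is an illustration only, assuming the budget.xls and costs.xls files from the first recipe and the pandas library with Excel support installed.

import pandas as pd

# Both inputs are sorted on the key field, as the Merge Join step expects.
budget = pd.read_excel('budget.xls').sort_values('task')
costs = pd.read_excel('costs.xls').sort_values('task')

# LEFT OUTER JOIN on the task field; overlapping column names from the
# budget side get an ' (est.)' suffix, mirroring the recipe's layout.
merged = budget.merge(costs, on='task', how='left', suffixes=(' (est.)', ''))
print(merged.head())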

Fatema Patrawala
22 Dec 2017
8 min read

How to stream and store tweets in Apache Kafka

[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Ankit Jain titled Mastering Apache Storm. This book explores various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more.[/box] Today, we are going to cover how to stream tweets from Twitter using the twitter streaming API. We are also going to explore how we can store fetched tweets in Kafka for later processing through Storm. Setting up a single node Kafka cluster Following are the steps to set up a single node Kafka cluster:   Download the Kafka 0.9.x binary distribution named kafka_2.10-0.9.0.1.tar.gz from http://apache.claz.org/kafka/0.9.0. or 1/kafka_2.10-0.9.0.1.tgz. Extract the archive to wherever you want to install Kafka with the following command: tar -xvzf kafka_2.10-0.9.0.1.tgz cd kafka_2.10-0.9.0.1   Change the following properties in the $KAFKA_HOME/config/server.properties file: log.dirs=/var/kafka- logszookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181 Here, zoo1, zoo2, and zoo3 represent the hostnames of the ZooKeeper nodes. The following are the definitions of the important properties in the server.properties file: broker.id: This is a unique integer ID for each of the brokers in a Kafka cluster. port: This is the port number for a Kafka broker. Its default value is 9092. If you want to run multiple brokers on a single machine, give a unique port to each broker. host.name: The hostname to which the broker should bind and advertise itself. log.dirs: The name of this property is a bit unfortunate as it represents not the log directory for Kafka, but the directory where Kafka stores the actual data sent to it. This can take a single directory or a comma-separated list of directories to store data. Kafka throughput can be increased by attaching multiple physical disks to the broker node and specifying multiple data directories, each lying on a different disk. It is not much use specifying multiple directories on the same physical disk, as all the I/O will still be happening on the same disk. num.partitions: This represents the default number of partitions for newly created topics. This property can be overridden when creating new topics. A greater number of partitions results in greater parallelism at the cost of a larger number of files. log.retention.hours: Kafka does not delete messages immediately after consumers consume them. It retains them for the number of hours defined by this property so that in the event of any issues the consumers can replay the messages from Kafka. The default value is 168 hours, which is 1 week. zookeeper.connect: This is the comma-separated list of ZooKeeper nodes in hostname:port form.    Start the Kafka server by running the following command: > ./bin/kafka-server-start.sh config/server.properties [2017-04-23 17:44:36,667] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2017-04-23 17:44:36,668] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,668] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,670] INFO [Kafka Server 0], started (kafka.server.KafkaServer) If you get something similar to the preceding three lines on your console, then your Kafka broker is up-and-running and we can proceed to test it. Now we will verify that the Kafka broker is set up correctly by sending and receiving some test messages. 
First, let's create a verification topic for testing by executing the following command:

> bin/kafka-topics.sh --zookeeper zoo1:2181 --replication-factor 1 --partitions 1 --topic verification-topic --create
Created topic "verification-topic".

Now let's verify whether the topic creation was successful by listing all the topics:

> bin/kafka-topics.sh --zookeeper zoo1:2181 --list
verification-topic

The topic is created; let's produce some sample messages for the Kafka cluster. Kafka comes with a command-line producer that we can use to produce messages:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic verification-topic

Write the following messages on your console:

Message 1
Test Message 2
Message 3

Let's consume these messages by starting a new console consumer in a new console window:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic verification-topic --from-beginning
Message 1
Test Message 2
Message 3

Now, if we enter any message on the producer console, it will automatically be consumed by this consumer and displayed on the command line.

Collecting Tweets

We are assuming you already have a Twitter account, and that the consumer key and access token are generated for your application. You can refer to https://bdthemes.com/support/knowledge-base/generate-api-key-consumer-token-access-key-twitter-oauth/ to generate a consumer key and access token. Take the following steps:

1. Create a new Maven project with groupId com.stormadvance and artifactId kafka_producer_twitter.

2. Add the following dependencies to the pom.xml file. We are adding the Kafka and Twitter streaming Maven dependencies to pom.xml to support the Kafka producer and the streaming of tweets from Twitter:

<dependencies>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.9.0.1</version>
    <exclusions>
      <exclusion>
        <groupId>com.sun.jdmk</groupId>
        <artifactId>jmxtools</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jmx</groupId>
        <artifactId>jmxri</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.0-beta9</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>2.0-beta9</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.twitter4j/twitter4j-stream -->
  <dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>4.0.6</version>
  </dependency>
</dependencies>

3. Now, we need to create a class, TwitterData, that contains the code to consume/stream data from Twitter and publish it to the Kafka cluster. We are assuming you already have a running Kafka cluster and a topic, twitterData, created in the Kafka cluster; for information on installing the Kafka cluster and creating a Kafka topic, refer to the previous section. The class contains an instance of the twitter4j.conf.ConfigurationBuilder class; we need to set the access token and consumer keys in the configuration, as shown in the source code.

4. The twitter4j.StatusListener class returns the continuous stream of tweets inside the onStatus() method. We are using the Kafka producer code inside the onStatus() method to publish the tweets to Kafka. The following is the source code for the TwitterData class:
public class TwitterData {

    /** The actual Twitter stream. It's set up to collect raw JSON data */
    private TwitterStream twitterStream;

    static String consumerKeyStr = "r1wFskT3q";
    static String consumerSecretStr = "fBbmp71HKbqalpizIwwwkBpKC";
    static String accessTokenStr = "298FPfE16frABXMcRIn7aUSSnNneMEPrUuZ";
    static String accessTokenSecretStr = "1LMNZZIfrAimpD004QilV1pH3PYTvM";

    public void start() {
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setOAuthConsumerKey(consumerKeyStr);
        cb.setOAuthConsumerSecret(consumerSecretStr);
        cb.setOAuthAccessToken(accessTokenStr);
        cb.setOAuthAccessTokenSecret(accessTokenSecretStr);
        cb.setJSONStoreEnabled(true);
        cb.setIncludeEntitiesEnabled(true);

        // instance of TwitterStreamFactory
        twitterStream = new TwitterStreamFactory(cb.build()).getInstance();

        final Producer<String, String> producer = new KafkaProducer<String, String>(getProducerConfig());

        // topicDetails: create the topic if it does not already exist
        new CreateTopic("127.0.0.1:2181").createTopic("twitterData", 2, 1);

        /** Twitter listener **/
        StatusListener listener = new StatusListener() {
            public void onStatus(Status status) {
                ProducerRecord<String, String> data = new ProducerRecord<String, String>(
                        "twitterData", DataObjectFactory.getRawJSON(status));
                // send the data to kafka
                producer.send(data);
            }

            public void onException(Exception arg0) {
                System.out.println(arg0);
            }

            public void onDeletionNotice(StatusDeletionNotice arg0) {
            }

            public void onScrubGeo(long arg0, long arg1) {
            }

            public void onStallWarning(StallWarning arg0) {
            }

            public void onTrackLimitationNotice(int arg0) {
            }
        };

        /** Bind the listener **/
        twitterStream.addListener(listener);

        /** GOGOGO **/
        twitterStream.sample();
    }

    private Properties getProducerConfig() {
        Properties props = new Properties();

        // List of kafka brokers. Complete list of brokers is not required as
        // the producer will auto discover the rest of the brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("batch.size", 1);

        // Serializer used for sending data to kafka. Since we are sending
        // strings, we are using StringSerializer.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        props.put("producer.type", "sync");
        return props;
    }

    public static void main(String[] args) throws InterruptedException {
        new TwitterData().start();
    }
}

Use valid Kafka properties before executing the TwitterData class. After executing the preceding class, the user will have a real-time stream of Twitter tweets in Kafka. In the next section, we are going to cover how we can use Storm to calculate the sentiments of the collected tweets.

To summarize, we covered how to install a single-node Apache Kafka cluster and how to collect tweets from Twitter and store them in a Kafka cluster. If you enjoyed this post, check out the book Mastering Apache Storm to learn more about the different types of real-time processing techniques used to create distributed applications.
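If you prefer to check the twitterData topic from Python rather than the console scripts, the short sketch below uses the kafka-python package. It is not part of the book's Java code; the broker address and topic name simply follow the ones used earlier in this article.

from kafka import KafkaProducer, KafkaConsumer

# Publish one test message to the topic the TwitterData class writes to.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('twitterData', b'{"text": "hello from kafka-python"}')
producer.flush()

# Read everything currently in the topic and print it.
consumer = KafkaConsumer(
    'twitterData',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value.decode('utf-8'))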
Packt
30 Oct 2014
10 min read

Theming with Highcharts

Besides the charting capabilities offered by Highcharts, theming is yet another strong feature of Highcharts. With its extensive theming API, charts can be customized completely to match the branding of a website or an app. Almost all of the chart elements are customizable through this API. In this article by Bilal Shahid, author of Highcharts Essentials, we will do the following things: (For more resources related to this topic, see here.) Use different fill types and fonts Create a global theme for our charts Use jQuery easing for animations Using Google Fonts with Highcharts Google provides an easy way to include hundreds of high quality web fonts to web pages. These fonts work in all major browsers and are served by Google CDN for lightning fast delivery. These fonts can also be used with Highcharts to further polish the appearance of our charts. This section assumes that you know the basics of using Google Web Fonts. If you are not familiar with them, visit https://developers.google.com/fonts/docs/getting_started. We will style the following example with Google Fonts. We will use the Merriweather family from Google Fonts and link to its style sheet from our web page inside the <head> tag: <link href='http://fonts.googleapis.com/css?family=Merriweather:400italic,700italic' rel='stylesheet' type='text/css'> Having included the style sheet, we can actually use the font family in our code for the labels in yAxis: yAxis: [{ ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 400,      fontStyle: 'italic',      fontSize: '14px',      color: '#ffffff'    } } }, { ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 700,      fontStyle: 'italic',      fontSize: '21px',      color: '#ffffff'    },    ... } }] For the outer axis, we used a font size of 21px with font weight of 700. For the inner axis, we lowered the font size to 14px and used font weight of 400 to compensate for the smaller font size. The following is the modified speedometer: In the next section, we will continue with the same example to include jQuery UI easing in chart animations. Using jQuery UI easing for series animation Animations occurring at the point of initialization of charts can be disabled or customized. The customization requires modifying two properties: animation.duration and animation.easing. The duration property accepts the number of milliseconds for the duration of the animation. The easing property can have various values depending on the framework currently being used. For a standalone jQuery framework, the values can be either linear or swing. Using the jQuery UI framework adds a couple of more options for the easing property to choose from. In order to follow this example, you must include the jQuery UI framework to the page. You can also grab the standalone easing plugin from http://gsgd.co.uk/sandbox/jquery/easing/ and include it inside your <head> tag. We can now modify the series to have a modified animation: plotOptions: { ... series: {    animation: {      duration: 1000,      easing: 'easeOutBounce'    } } } The preceding code will modify the animation property for all the series in the chart to have duration set to 1000 milliseconds and easing to easeOutBounce. Each series can have its own different animation by defining the animation property separately for each series as follows: series: [{ ... animation: {    duration: 500,    easing: 'easeOutBounce' } }, { ... 
animation: {    duration: 1500,    easing: 'easeOutBounce' } }, { ... animation: {      duration: 2500,    easing: 'easeOutBounce' } }] Different animation properties for different series can pair nicely with column and bar charts to produce visually appealing effects. Creating a global theme for our charts A Highcharts theme is a collection of predefined styles that are applied before a chart is instantiated. A theme will be applied to all the charts on the page after the point of its inclusion, given that the styling options have not been modified within the chart instantiation. This provides us with an easy way to apply custom branding to charts without the need to define styles over and over again. In the following example, we will create a basic global theme for our charts. This way, we will get familiar with the fundamentals of Highcharts theming and some API methods. We will define our theme inside a separate JavaScript file to make the code reusable and keep things clean. Our theme will be contained in an options object that will, in turn, contain styling for different Highcharts components. Consider the following code placed in a file named custom-theme.js. This is a basic implementation of a Highcharts custom theme that includes colors and basic font styles along with some other modifications for axes: Highcharts.customTheme = {      colors: ['#1BA6A6', '#12734F', '#F2E85C', '#F27329', '#D95D30', '#2C3949', '#3E7C9B', '#9578BE'],      chart: {        backgroundColor: {            radialGradient: {cx: 0, cy: 1, r: 1},            stops: [                [0, '#ffffff'],                [1, '#f2f2ff']            ]        },        style: {            fontFamily: 'arial, sans-serif',            color: '#333'        }    },    title: {        style: {            color: '#222',            fontSize: '21px',            fontWeight: 'bold'        }    },    subtitle: {        style: {            fontSize: '16px',            fontWeight: 'bold'        }    },    xAxis: {        lineWidth: 1,        lineColor: '#cccccc',        tickWidth: 1,        tickColor: '#cccccc',        labels: {            style: {                fontSize: '12px'            }        }    },    yAxis: {        gridLineWidth: 1,        gridLineColor: '#d9d9d9',        labels: {           style: {                fontSize: '12px'            }        }    },    legend: {        itemStyle: {            color: '#666',            fontSize: '9px'        },        itemHoverStyle:{            color: '#222'        }      } }; Highcharts.setOptions( Highcharts.customTheme ); We start off by modifying the Highcharts object to include an object literal named customTheme that contains styles for our charts. Inside customTheme, the first option we defined is for series colors. We passed an array containing eight colors to be applied to series. In the next part, we defined a radial gradient as a background for our charts and also defined the default font family and text color. The next two object literals contain basic font styles for the title and subtitle components. Then comes the styles for the x and y axes. For the xAxis, we define lineColor and tickColor to be #cccccc with the lineWidth value of 1. The xAxis component also contains the font style for its labels. The y axis gridlines appear parallel to the x axis that we have modified to have the width and color at 1 and #d9d9d9 respectively. Inside the legend component, we defined styles for the normal and mouse hover states. 
These two states are stated by itemStyle and itemHoverStyle respectively. In normal state, the legend will have a color of #666 and font size of 9px. When hovered over, the color will change to #222. In the final part, we set our theme as the default Highcharts theme by using an API method Highcharts.setOptions(), which takes a settings object to be applied to Highcharts; in our case, it is customTheme. The styles that have not been defined in our custom theme will remain the same as the default theme. This allows us to partially customize a predefined theme by introducing another theme containing different styles. In order to make this theme work, include the file custom-theme.js after the highcharts.js file: <script src="js/highcharts.js"></script> <script src="js/custom-theme.js"></script> The output of our custom theme is as follows: We can also tell our theme to include a web font from Google without having the need to include the style sheet manually in the header, as we did in a previous section. For that purpose, Highcharts provides a utility method named Highcharts.createElement(). We can use it as follows by placing the code inside the custom-theme.js file: Highcharts.createElement( 'link', {    href: 'http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,400,300,700',    rel: 'stylesheet',    type: 'text/css' }, null, document.getElementsByTagName( 'head' )[0], null ); The first argument is the name of the tag to be created. The second argument takes an object as tag attributes. The third argument is for CSS styles to be applied to this element. Since, there is no need for CSS styles on a link element, we passed null as its value. The final two arguments are for the parent node and padding, respectively. We can now change the default font family for our charts to 'Open Sans': chart: {    ...    style: {        fontFamily: "'Open Sans', sans-serif",        ...    } } The specified Google web font will now be loaded every time a chart with our custom theme is initialized, hence eliminating the need to manually insert the required font style sheet inside the <head> tag. This screenshot shows a chart with 'Open Sans' Google web font. Summary In this article, you learned about incorporating Google fonts and jQuery UI easing into our chart for enhanced styling. Resources for Article: Further resources on this subject: Integrating with other Frameworks [Article] Highcharts [Article] More Line Charts, Area Charts, and Scatter Plots [Article]

Packt
21 Jan 2011
14 min read

Microsoft SQL Server 2008 High Availability: Understanding Domains, Users, and Security

Microsoft SQL Server 2008 High Availability Minimize downtime, speed up recovery, and achieve the highest level of availability and reliability for SQL server applications by mastering the concepts of database mirroring,log shipping,clustering, and replication  Install various SQL Server High Availability options in a step-by-step manner  A guide to SQL Server High Availability for DBA aspirants, proficient developers and system administrators  Learn the pre and post installation concepts and common issues you come across while working on SQL Server High Availability  Tips to enhance performance with SQL Server High Availability  External references for further study  Windows domains and domain users In the early era of Windows, operating system user were created standalone until Windows NT operating system hit the market. Windows NT, that is, Windows New Technology introduced some great feature to the world—including domains. A domain is a group of computers that run on Windows operating systems. Amongst them is a computer that holds all the information related to user authentication and user database and is called the domain controller (server), whereas every user who is part of this user database on the domain controller is called a domain user. Domain users have access to any resource across the domain and its subdomains with the privilege they have, unlike the standalone user who has access to the resources available to a specific system. With the release of Windows Server 2000, Microsoft released Active Directory (AD), which is now widely used with Windows operating system networks to store, authenticate, and control users who are part of the domain. A Windows domain uses various modes to authenticate users—encrypted passwords, various handshake methods such as PKI, Kerberos, EAP, SSL certificates, NAP, LDAP, and IP Sec policy—and makes it robust authentication. One can choose the authentication method that suits business needs and based on the environment. Let's now see various authentication methods in detail. Public Key Infrastructure (PKI): This is the most common method used to transmit data over insecure channels such as the Internet using digital certificates. It has generally two parts—the public and private keys. These keys are generated by a Certificate Authority, such as, Thawte. Public keys are stored in a directory where they are accessible by all parties. The public key is used by the message sender to send encrypted messages, which then can be decrypted using the private key. Kerberos: This is an authentication method used in client server architecture to authorize the client to use service(s) on a server in a network. In this method, when a client sends a request to use a service to a server, a request goes to the authentication server, which will generate a session key and a random value based on the username. This session key and a random value are then passed to the server, which grants or rejects the request. These sessions are for certain time period, which means for that particular amount of time the client can use the service without having to re-authenticate itself. Extensible Authentication Protocol (EAP): This is an authentication protocol generally used in wireless and point-to-point connections. SSL Certificates: A Secure Socket Layer certificate (SSL) is a digital certificate that is used to identify a website or server that provides a service to clients and sends the data in an encrypted form. 
SSL certificates are typically used by websites such as GMAIL. When we type a URL and press Enter, the web browser sends a request to the web server to identify itself. The web server then sends a copy of its SSL certificate, which is checked by the browser. If the browser trusts the certificate (this is generally done on the basis of the CA and Registration Authority and directory verification), it will send a message back to the server and in reply the web server sends an acknowledgement to the browser to start an encrypted session. Network Access Protection (NAP): This is a new platform introduced by Microsoft with the release of Windows Server 2008. It will provide access to the client, based on the identity of the client, the group it belongs to, and the level compliance it has with the policy defined by the Network Administrators. If the client doesn't have a required compliance level, NAP has mechanisms to bring the client to the compliance level dynamically and allow it to access the network. Lightweight Directory Access Protocol (LDAP): This is a protocol that runs over TCP/IP directly. It is a set of objects, that is, organizational units, printers, groups, and so on. When the client sends a request for a service, it queries the LDAP server to search for availability of information, and based on that information and level of access, it will provide access to the client. IP Security (IPSEC): IP Security is a set of protocols that provides security at the network layer. IP Sec provides two choices: Authentication Header: Here it encapsulates the authentication of the sender in a header of the network packet. Encapsulating Security Payload: Here it supports encryption of both the header and data. Now that we know basic information on Windows domains, domain users, and various authentication methods used with Windows servers, I will walk you through some of the basic and preliminary stuff about SQL Server security! Understanding SQL Server Security Security!! Now-a-days we store various kinds of information into databases and we just want to be sure that they are secured. Security is the most important word to the IT administrator and vital for everybody who has stored their information in a database as he/she needs to make sure that not a single piece of data should be made available to someone who shouldn't have access. Because all the information stored in the databases is vital, everyone wants to prevent unauthorized access to highly confidential data and here is how security implementation in SQL Server comes into the picture. With the release of SQL Server 2000, Microsoft (MS) has introduced some great security features such as authentication roles (fixed server roles and fixed database roles), application roles, various permissions levels, forcing protocol encryption, and so on, which are widely used by administrators to tighten SQL Server security. Basically, SQL Server security has two paradigms: one is SQL Server's own set of security measures and other is to integrate them with the domain. SQL Server has two methods of authentication. Windows authentication In Windows authentication mode, we can integrate domain accounts to authenticate users, and based on the group they are members of and level of access they have, DBAs can provide them access on the particular SQL server box. 
Whenever a user tries to access SQL Server, his/her account is validated by the domain controller first; then, based on the permissions it has, the domain controller allows or rejects the request, and no separate login ID and password are required to authenticate. Once the user is authenticated, SQL Server allows access to the user based on the permissions the user has. These permissions come in the form of roles, including fixed server roles, fixed database roles, and application roles.
Fixed Server Roles: These are security principals that have server-wide scope. Basically, fixed server roles are meant to manage permissions at the server level. We can add SQL logins, domain accounts, and domain groups to these roles. There are different roles that we can assign to a login, domain account, or group; the following table lists them.
Fixed DB Roles: These are the roles that are assigned to a particular login for a particular database; their scope is limited to the database they have permissions on. There are various fixed database roles, including the ones shown in the following table:
Application Role: The application role is a database principal that is widely used to assign user permissions for an application. For example, in a home-grown ERP, some users require only to view the data; we can create a role, add the db_datareader permission to it, and then add all those users who require read-only permission.
Mixed Authentication: In Mixed authentication mode, logins can be authenticated by the Windows domain controller or by SQL Server itself. DBAs can create logins with passwords in SQL Server. With the release of SQL Server 2005, MS introduced password policies for SQL Server logins. Mixed mode authentication is used when a legacy application has to be run and it is not on the domain network. In my opinion, Windows authentication is good because we can use the various handshake methods such as PKI, Kerberos, EAP, SSL, NAP, LDAP, or IPSec to tighten the security.
SQL Server 2005 brought enhancements in its security mechanisms. The most important features amongst them are the password policy, native encryption, the separation of users and schemas, and no longer needing to grant system administrator (SA) rights to run Profiler. These are good things because SA is the super user, and with the power this account has, a user can do anything on the SQL box, including the following:
The user can grant ALTER TRACE permission to users who need to run Profiler
The user can create logins and users
The user can grant or revoke permissions of other users
A schema is an object container that is owned by a user and is transferable to other users. In earlier versions, objects were owned by users, so if a user left the company we could not delete his/her account from the SQL box just because there were objects he/she had created. We first had to change the owner of those objects and only then could we delete the account. With a schema, on the other hand, we can drop the user account straight away because the objects are owned by the schema.
SQL Server 2008 gives you even more control over the configuration of the security mechanisms. It allows you to configure metadata access, execution context, and the auditing of events using DDL triggers, the most powerful feature for auditing any DDL event.
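The concepts above can be tied together with a short T-SQL sketch. This is a hedged illustration only; the domain account, login, user, database, and schema names (CORP\jsmith, app_login, app_user, SalesDB, Sales) are hypothetical, and the role assignments should be adapted to your own environment:
-- Windows authentication: create a login for a (hypothetical) domain account
CREATE LOGIN [CORP\jsmith] FROM WINDOWS;
-- Add the login to a fixed server role (SQL Server 2008 era syntax)
EXEC sp_addsrvrolemember @loginame = N'CORP\jsmith', @rolename = N'dbcreator';
-- Mixed authentication: a SQL Server login subject to the Windows password policy
CREATE LOGIN app_login WITH PASSWORD = N'Str0ng!Passw0rd', CHECK_POLICY = ON;
USE SalesDB;
CREATE USER app_user FOR LOGIN app_login;
-- Fixed database role: read-only access
EXEC sp_addrolemember N'db_datareader', N'app_user';
-- Application role for an application that connects with its own credentials
CREATE APPLICATION ROLE report_app WITH PASSWORD = N'An0ther!Passw0rd';
-- Grant ALTER TRACE (a server-level permission) so Profiler can run without SA rights
USE master;
GRANT ALTER TRACE TO [CORP\jsmith];
-- Schema ownership is transferable, so the original owner can later be dropped
USE SalesDB;
ALTER AUTHORIZATION ON SCHEMA::Sales TO dbo;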
If you wish to know more about what we have seen so far, you can go through the following links:
http://www.microsoft.com/sqlserver/2008/en/us/Security.aspx
http://www.microsoft.com/sqlserver/2005/en/us/security-features.aspx
http://technet.microsoft.com/en-us/library/cc966507.aspx
In this section, we understood the basics of domains, domain users, and SQL Server security. We also learned why security is given so much emphasis these days. In the next section, we will understand the basics of clustering and its components.
What is clustering?
Before starting with SQL Server clustering, let's have a look at clustering in general and at Windows clusters. The word cluster itself is self-descriptive: a bunch or group. When two or more computers are connected to each other by means of a network and share some common resources to provide redundancy or performance improvement, they are known as a cluster of computers. Clustering is usually deployed when a business-critical application is running that needs to be available 24x7, or in other words needs High Availability. These clusters are known as failover clusters, because the primary goal of setting up the cluster is to make business-critical services or processes available 24x7. The Enterprise and Datacenter editions of Microsoft Windows Server support failover clustering. This is achieved by having two identical nodes connected to each other by means of a private network or commonly used resources. In the case of failure of any common resource or service, the first node (active) passes ownership to the other node (passive).
SQL Server clustering is built on top of Windows clustering, which means that before we go about installing SQL Server clustering, we should have Windows clustering installed. Before we start, let's understand the commonly used shared resources of a cluster server. Clusters with 2, 4, 8, 12, or 32 nodes can be built with Windows Server 2008 R2. Clusters are categorized in the following manner:
High-Availability Clusters: This type of cluster is known as a failover cluster. High-availability clusters are implemented when the purpose is to provide highly available services. For implementing a failover or high-availability cluster, one may have up to 16 nodes in a Microsoft cluster. Clustering in Windows operating systems was first introduced with the release of Windows NT 4.0 Enterprise Edition and has been enhanced gradually since. Even though we can have non-identical hardware, we should use identical hardware, because if the node the cluster fails over to has a lower configuration, we might face degraded performance.
Load Balancing: This is the second form of cluster that can be configured. This type of cluster is configured by linking multiple computers with each other and making use of every resource they need for operation. From the user's point of view, these linked servers/nodes appear as separate machines; however, collectively and virtually they act as a single system, the main goal being to balance work by sharing CPU, disk, and every other possible resource among the linked nodes, which is why it is known as a load-balancing cluster. SQL Server doesn't support this form of clustering.
Compute Clusters: When computers are linked together with the purpose of using them for simulations, such as aircraft simulation, they are known as a compute cluster. A well-known example is Beowulf clusters.
Grid Computing: This is one kind of clustering solution, but it is more often used when the nodes are at dispersed locations.
This kind of cluster is also called a supercomputer or HPC. The main applications are scientific research, academic, mathematical, or weather-forecasting workloads where lots of CPUs and disks are required; SETI@home is a well-known example.
If we talk about SQL Server clusters, there are some cool new features added in the latest release, SQL Server 2008, although with the limitation that these features are available only if SQL Server 2008 is used with Windows Server 2008. So, let's have a glance at these features:
Service SID: Service SIDs were introduced with Windows Vista and Windows Server 2008. They enable us to bind permissions directly to Windows services. In the earlier version, SQL Server 2005, we needed a SQL Server service account that was a member of a domain group so that it would have all the required permissions. This is not the case with SQL Server 2008, and we may choose Service SIDs to bypass the need to provision domain groups.
Support for 16 nodes: We may add up to 16 nodes to our SQL Server 2008 cluster with the SQL Server 2008 Enterprise 64-bit edition.
New cluster validation: As part of the installation steps, a new cluster validation step is performed. Prior to adding a node to an existing cluster, this validation step checks whether or not the cluster environment is compatible.
Mount Drives: Drive letters are no longer essential. If we have a cluster configured that has a limited number of drive letters available, we may mount a drive and use it in our cluster environment, provided it is associated with a base drive letter.
Geographically dispersed cluster nodes: This is the super-cool feature introduced with SQL Server 2008, which uses a VLAN to establish connectivity with the other cluster nodes. It breaks the limitation of having all clustered nodes at a single location.
IPv6 Support: SQL Server 2008 natively supports IPv6, which increases the network IP address size to 128 bits from 32 bits.
DHCP Support: Windows Server 2008 clustering introduced the use of DHCP-assigned IP addresses by the cluster services, and this is supported by SQL Server 2008 clusters. It is still recommended to use static IP addresses, because if some of our applications depend on fixed IP addresses, or if the renewal of an IP address from the DHCP server fails, the IP address resources would fail.
iSCSI Support: Windows Server 2008 supports iSCSI as a storage connection; in earlier versions, only SAN and fibre channel were supported.
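Once a clustered instance is up, a quick way to confirm that it is running on a cluster and to see which nodes are available is to query the instance itself. The following is a small, hedged sketch using standard catalog functions and DMVs; it only reports status and changes nothing:
-- Is this instance clustered, and which physical node is it running on right now?
SELECT SERVERPROPERTY('IsClustered')                 AS IsClustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNode;
-- List the nodes that are part of the failover cluster
SELECT NodeName
FROM sys.dm_os_cluster_nodes;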


PL/SQL: Using Collections

Packt
17 May 2012
18 min read
Collections: an overview
A collection is a homogeneous, single-dimensional structure that constitutes an ordered set of elements of a similar type. Being a homogeneous structure, all of its elements are of the same data type. The structure consists of cells identified by a subscript; the elements reside in these cells, so the index serves as their location information. The subscript or cell index thus identifies an element and is used to access it. The structure of a collection type, SPORT, is shown in the following diagram. Note the subscripts and the elements in them. A new element, GOLF, enters at the last empty location and is represented as SPORT[6]:
A collection element can be of any valid SQL data type or a user-defined type. An element of a SQL primitive data type is a scalar value, while an element of a user-defined type is an object type instance. A collection can be used within a PL/SQL program by declaring a PL/SQL variable of the collection type. The local PL/SQL variable can hold instances of its collection type. Besides this, a database column in a table can also be of a schema collection type.
Collections in Oracle are strictly one dimensional. They cannot be realized on two-dimensional coordinates. However, multidimensional arrays can be realized when the collection has an object type or collection type attribute. A collection can be bounded or unbounded. Bounded collections can accommodate a limited number of elements, while unbounded collections have no upper limit for subscripts.
Collections provide an efficient way to organize data in an array or set format while making use of object-oriented features. An instance of a nested table or varray collection type is accessed as an object, while the data is still stored in database columns. Collections can be used for data caching in programs and can boost the performance of SQL operations. On dedicated server connections, a session always uses the User Global Area (UGA), a component of the PGA, for collection operations. In shared server mode, the collection operations are still carried out in the UGA, but the UGA is now part of the System Global Area (SGA), and thus indirectly in the SGA. This is because in shared server connections multiple server processes can serve a session, so the UGA must be allocated out of the SGA.
Categorization
Collections are of two types: persistent and non-persistent. A collection is persistent if it stores the collection structure and elements physically in the database. Conversely, a non-persistent collection is alive for a program only, that is, at most for a session.
Apart from the preceding categories, a collection can be realized in three formats, namely associative array, nested table, or varray. This categorization is purely based on their objective and behavioral properties in a PL/SQL program. The following diagram combines the abstract and physical classification of collections:
We will take a quick tour of these collection types now and discuss them in detail in the coming sections:
Associative array (index-by table): This is the simplest form of non-persistent, unbounded collection. As a non-persistent collection, it cannot be stored in the database; it is available within a PL/SQL block only. The collection structure and data of an associative array cannot be retained once the program has completed. Initially, in the days of Oracle 7, it was known as a PL/SQL table. Later, the Oracle 8 release renamed it the index-by table, as an index is used to identify an element.
Nested table: This is a persistent form of unbounded collection which can be created in the database as well as in a PL/SQL block.
Varray (variable-size array): This is a persistent but bounded form of collection which can be created in the database as well as in PL/SQL. Like a nested table, a varray is a unidimensional homogeneous collection. The collection size and the storage scheme are the factors that differentiate varrays from nested tables. Unlike a nested table, a varray can accommodate only a defined (fixed) number of elements.
Selecting an appropriate collection type
Here are a few guidelines for deciding on the appropriate usage of collection types in programs:
Use of associative arrays is required when:
You have to temporarily cache program data in an array format for lookup purposes.
You need string subscripts for the collection elements. Note that negative subscripts are supported, too.
You have to map hash tables from the client to the database.
Use of nested tables is preferred when:
You have to store data as sets in the database. Database columns of a nested table type can be declared to hold the data persistently.
You have to perform major array operations, such as insertion and deletion, on a large volume of data.
Use of varrays is preferred when:
You have to store a calculated or predefined volume of data in the database. A varray offers limited and defined storage of rows in a collection.
The order of the elements has to be preserved.
Associative arrays
Associative arrays are analogous to conventional arrays or lists and can be defined within a PL/SQL program only. Neither the array structure nor the data can be stored in the database. An associative array can hold elements of a similar type in a key-value structure without any upper bound on the array. Each cell of the array is distinguished by its subscript, index, or cell number. The index can be a number or a string.
Associative arrays were first introduced in the Oracle 7 release as PL/SQL tables, to signify their usage within the scope of a PL/SQL block. The Oracle 8 release renamed the PL/SQL table to index-by table, due to its structure of index-value pairs. The Oracle 10g release recognized the behavior of index-by tables as that of arrays and renamed them associative arrays, due to the association of an index with the array. The following diagram explains the physical lookup structure of an associative array:
Associative arrays follow this syntax for declaration in a PL/SQL declare block:
TYPE [COLL NAME] IS TABLE OF [ELEMENT DATA TYPE] NOT NULL INDEX BY [INDEX DATA TYPE]
In the preceding syntax, the index type signifies the data type of the array subscript. RAW, NUMBER, LONG RAW, ROWID, and CHAR are unsupported index data types. The suitable index types are BINARY_INTEGER, PLS_INTEGER, POSITIVE, NATURAL, SIGNTYPE, or VARCHAR2.
The element's data type can be one of the following: PL/SQL scalar data type: NUMBER (along with its subtypes), VARCHAR2 (and its subtypes), DATE, BLOB, CLOB, or BOOLEAN Inferred data: The data type inherited from a table column, cursor expression or predefined package variable User-defined type: A user defined object type or collection type For illustration, the following are the valid conditions of the associative array in a PL/SQL block: /*Array of CLOB data*/ TYPE clob_t IS TABLE OF CLOB INDEX BY PLS_INTEGER; /*Array of employee ids indexed by the employee names*/ TYPE empno_t IS TABLE OF employees.empno%TYPE NOT NULL INDEX BY employees.ename%type; The following PL/SQL program declares an associative array type in a PL/ SQL block. Note that the subscript of the array is of a string type and it stores the number of days in a quarter. This code demonstrates the declaration of an array and assignment of the element in each cell and printing them. Note that the program uses the FIRST and NEXT collection methods to display the array elements. The collection methods would be covered in detail in the PL/SQL collection methods section: /*Enable the SERVEROUTPUT on to display the output*/ SET SERVEROUTPUT ON /*Start the PL/SQL block*/ DECLARE /*Declare a collection type associative array and its variable*/ TYPE string_asc_arr_t IS TABLE OF NUMBER INDEX BY VARCHAR2(10); l_str string_asc_arr_t; l_idx VARCHAR2(50); BEGIN /*Assign the total count of days in each quarter against each cell*/ l_str ('JAN-MAR') := 90; l_str ('APR-JUN') := 91; l_str ('JUL-SEP') := 92; l_str ('OCT-DEC') := 93; l_idx := l_str.FIRST; WHILE (l_idx IS NOT NULL) LOOP DBMS_OUTPUT.PUT_LINE('Value at index '||l_idx||' is '||l_str(l_ idx)); l_idx := l_str.NEXT(l_idx); END LOOP; END; / Value at index APR-JUN is 91 Value at index JAN-MAR is 90 Value at index JUL-SEP is 92 Value at index OCT-DEC is 93 PL/SQL procedure successfully completed. In the preceding block, note the string indexed array. A string indexed array considerably improves the performance by using indexed organization of array values. In the last block, we noticed the explicit assignment of data. In the following program, we will try to populate the array automatically in the program. The following PL/SQL block declares an associative array to hold the ASCII values of number 1 to 100: /*Enable the SERVEROUTPUT on to display the output*/ SET SERVEROUTPUT ON /*Start the PL/SQL Block*/ DECLARE /*Declare an array of string indexed by numeric subscripts*/ TYPE ASCII_VALUE_T IS TABLE OF VARCHAR2(12) INDEX BY PLS_INTEGER; L_GET_ASCII ASCII_VALUE_T; BEGIN /*Insert the values through a FOR loop*/ FOR I IN 1..100 LOOP L_GET_ASCII(I) := ASCII(I); END LOOP; /*Display the values randomly*/ DBMS_OUTPUT.PUT_LINE(L_GET_ASCII(5)); DBMS_OUTPUT.PUT_LINE(L_GET_ASCII(15)); DBMS_OUTPUT.PUT_LINE(L_GET_ASCII(75)); END; / 53 49 55 PL/SQL procedure successfully completed. The salient features of associative arrays are as follows: An associative array can exist as a sparse or empty collection Being a non-persistent collection, it cannot participate in DML transactions It can be passed as arguments to other local subprograms within the same block Sorting of an associative array depends on the NLS_SORT parameter An associative array declared in package specification behaves as a session-persistent array Nested tables Nested tables are a persistent form of collections which can be created in the database as well as PL/SQL. 
It is an unbounded collection where the index or subscript is implicitly maintained by the Oracle server during data retrieval. Oracle automatically marks the minimum subscript as 1 and relatively handles others. As there is no upper limit defined for a nested table, its size can grow dynamically. Though not an index-value pair structure, a nested table can be accessed like an array in a PL/SQL block. A nested table is initially a dense collection but it might become sparse due to delete operations on the collection cells. Dense collection is the one which is tightly populated. That means, there exists no empty cells between the lower and upper indexes of the collection. Sparse collections can have empty cells between the first and the last cell of the collection. A dense collection may get sparse by performing the "delete" operations. When a nested table is declared in a PL/SQL program, they behave as a one-dimensional array without any index type or upper limit specification. A nested table defined in a database exists as a valid schema object type. It can be either used in a PL/SQL block to declare a PL/SQL variable for temporarily holding program data or a database column of particular nested table type can be included in a table, which can persistently store the data in the database. A nested table type column in a table resembles a table within a table, but Oracle draws an out- of-line storage table to hold the nested table data. This scenario is illustrated in the following diagram: Whenever a database column of nested table type is created in a table (referred to as parent table), Oracle creates a storage table with the same storage options as that of the parent table. The storage table created by Oracle in the same segment carries the name as specified in the NESTED TABLE STORE AS clause during creation of the parent table. Whenever a row is created in the parent table, the following actions are performed by the Oracle server: A unique identifier is generated to distinguish the nested table instances of different parent rows, for the parent row The instance of the nested table is created in the storage table alongside the unique identifier of the parent row The Oracle server takes care of these nested table operations. For the programmer or user, the whole process is hidden and appears as a normal "insert" operation. A nested table definition in PL/SQL follows the following syntax: DECLARE TYPE type_name IS TABLE OF element_type [NOT NULL]; In the preceding syntax, element_type is a primitive data type or a user-defined type, but not as a REF CURSOR type. In a database, a nested table can be defined using the following syntax: CREATE [OR REPLACE] TYPE type_name IS TABLE OF [element_type] [NOT NULL]; / In the preceding syntax, [element_type] can be a SQL supported scalar data type, a database object type, or a REF object type. Unsupported element types are BOOLEAN, LONG, LONG-RAW, NATURAL, NATURALN, POSITIVE, POSITIVEN, REF CURSOR, SIGNTYPE, STRING, PLS_INTEGER, SIMPLE_INTEGER, BINARY_INTEGER and all other non-SQL supported data types. If the size of the element type of a database collection type has to be increased, follow this syntax: ALTER TYPE [type name] MODIFY ELEMENT TYPE [modified element type] [CASCADE | INVALIDATE]; The keywords, CASCADE or INVALIDATE, decide whether the collection modification has to invalidate the dependents or the changes that have to be cascaded across the dependents. 
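As a minimal sketch of the preceding ALTER TYPE syntax, the following widens the element of a hypothetical VARCHAR2 nested table type and cascades the change to its dependents; the type name name_nest_t is illustrative only and is not used elsewhere in this article:
/*Hypothetical collection type with a narrow element*/
CREATE OR REPLACE TYPE name_nest_t AS TABLE OF VARCHAR2(20);
/
/*Widen the element type; CASCADE propagates the change to dependent objects*/
ALTER TYPE name_nest_t MODIFY ELEMENT TYPE VARCHAR2(100) CASCADE;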
The nested table from the database can be dropped using the DROP command, as shown in the following syntax (note that the FORCE keyword drops the type irrespective of its dependents): DROP TYPE [collection name] [FORCE] Nested table collection type as the database object We will go through the following illustration to understand the behavior of a nested table, when created as a database collection type: /*Create the nested table in the database*/ SQL> CREATE TYPE NUM_NEST_T AS TABLE OF NUMBER; / Type created. The nested table type, NUM_NEST_T, is now created in the database. Its metadata information can be queried from the USER_TYPES and USER_COLL_TYPES dictionary views: SELECT type_name, typecode, type_oid FROM USER_TYPES WHERE type_name = 'NUM_NEST_T'; TYPE_NAME TYPECODE TYPE_OID --------------- --------------- -------------------------------- NUM_NEST_T COLLECTION 96DE421E47114638A9F5617CE735731A Note that the TYPECODE value shows the type of the object in the database and differentiates collection types from user-defined object types: SELECT type_name, coll_type, elem_type_name FROM user_coll_types WHERE type_name = 'NUM_NEST_T'; TYPE_NAME COLL_TYPE ELEM_TYPE_NAME --------------- ---------- -------------------- NUM_NEST_T TABLE NUMBER Once the collection type has been successfully created in the database, it can be used to specify the type for a database column in a table. The CREATE TABLE statement in the following code snippet declares a column of the NUM_NEST_T nested table type in the parent table, TAB_USE_NT_COL. The NESTED TABLE [Column] STORE AS [Storage table] clause specifies the storage table for the nested table type column. A separate table for the nested table column, NUM, ensures its out-of-line storage. SQL> CREATE TABLE TAB_USE_NT_COL (ID NUMBER, NUM NUM_NEST_T) NESTED TABLE NUM STORE AS NESTED_NUM_ID; Table created. A nested table collection type in PL/SQL n PL/SQL, a nested table can be declared and defined in the declaration section of the block as a local collection type. As a nested table follows object orientation, the PL/SQL variable of the nested table type has to be necessarily initialized. The Oracle server raises the exception ORA-06531: Reference to uninitialized collection if an uninitialized nested table type variable is encountered during block execution. As the nested table collection type has been declared within the PL/SQL block, its scope, visibility, and life is the execution of the PL/SQL block only. The following PL/SQL block declares a nested table. Observe the scope and visibility of the collection variable. Note that the COUNT method has been used to display the array elements. /*Enable the SERVEROUTPUT to display the results*/ SET SERVEROUTPUT ON /*Start the PL/SQL block*/ DECLARE /*Declare a local nested table collection type*/ TYPE LOC_NUM_NEST_T IS TABLE OF NUMBER; L_LOCAL_NT LOC_NUM_NEST_T := LOC_NUM_NEST_T (10,20,30); BEGIN /*Use FOR loop to parse the array and print the elements*/ FOR I IN 1..L_LOCAL_NT.COUNT LOOP DBMS_OUTPUT.PUT_LINE('Printing '||i||' element: '||L_LOCAL_ NT(I)); END LOOP; END; / Printing 1 element: 10 Printing 2 element: 20 Printing 3 element: 30 PL/SQL procedure successfully completed. Additional features of a nested table In the earlier sections, we saw the operational methodology of a nested table. We will now focus on the nested table's metadata. Furthermore, we will demonstrate a peculiar behavior of the nested table for the "delete" operations. 
Oracle's USER_NESTED_TABLES and USER_NESTED_TABLE_COLS data dictionary views maintain the relationship information of the parent and the nested tables. These dictionary views are populated only when a database of a nested table collection type is included in a table. The USER_NESTED_TABLES static view maintains the information about the mapping of a nested table collection type with its parent table. The structure of the dictionary view is as follows: SQL> desc USER_NESTED_TABLES Name Null? Type ----------------------- -------- --------------- TABLE_NAME VARCHAR2(30) TABLE_TYPE_OWNER VARCHAR2(30) TABLE_TYPE_NAME VARCHAR2(30) PARENT_TABLE_NAME VARCHAR2(30) PARENT_TABLE_COLUMN VARCHAR2(4000) STORAGE_SPEC VARCHAR2(30) RETURN_TYPE VARCHAR2(20) ELEMENT_SUBSTITUTABLE VARCHAR2(25) Let us query the nested table relationship properties for the TAB_USE_NT_COL table from the preceding view: SELECT parent_table_column, table_name, return_type, storage_spec FROM user_nested_tables WHERE parent_table_name='TAB_USE_NT_COL' / PARENT_TAB TABLE_NAME RETURN_TYPE STORAGE_SPEC ---------------------------------------------------------------------- NUM NESTED_NUM_ID VALUE DEFAULT In the preceding view query, RETURN_TYPE specifies the return type of the collection. It can be VALUE (in this case) or LOCATOR. Another column, STORAGE_SPEC, signifies the storage scheme used for the storage of a nested table which can be either USER_SPECIFIED or DEFAULT (in this case). The USER_NESTED_TABLE_COLS view maintains the information about the collection attributes contained in the nested tables: SQL> desc USER_NESTED_TABLE_COLS Name Null? Type ----------------------- -------- --------------- TABLE_NAME NOT NULL VARCHAR2(30) COLUMN_NAME NOT NULL VARCHAR2(30) DATA_TYPE VARCHAR2(106) DATA_TYPE_MOD VARCHAR2(3) DATA_TYPE_OWNER VARCHAR2(30) DATA_LENGTH NOT NULL NUMBER DATA_PRECISION NUMBER DATA_SCALE NUMBER NULLABLE VARCHAR2(1) COLUMN_ID NUMBER DEFAULT_LENGTH NUMBER DATA_DEFAULT LONG NUM_DISTINCT NUMBER LOW_VALUE RAW(32) HIGH_VALUE RAW(32) DENSITY NUMBER NUM_NULLS NUMBER NUM_BUCKETS NUMBER LAST_ANALYZED DATE SAMPLE_SIZE NUMBER CHARACTER_SET_NAME VARCHAR2(44) CHAR_COL_DECL_LENGTH NUMBER GLOBAL_STATS VARCHAR2(3) USER_STATS VARCHAR2(3) AVG_COL_LEN NUMBER CHAR_LENGTH NUMBER CHAR_USED VARCHAR2(1) V80_FMT_IMAGE VARCHAR2(3) DATA_UPGRADED VARCHAR2(3) HIDDEN_COLUMN VARCHAR2(3) VIRTUAL_COLUMN VARCHAR2(3) SEGMENT_COLUMN_ID NUMBER INTERNAL_COLUMN_ID NOT NULL NUMBER HISTOGRAM VARCHAR2(15) QUALIFIED_COL_NAME VARCHAR2(4000) We will now query the nested storage table in the preceding dictionary view to list all its attributes: SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, HIDDEN_COLUMN FROM user_nested_table_cols where table_name='NESTED_NUM_ID' / COLUMN_NAME DATA_TYP DATA_LENGTH HID ------------------------------ ---------- ----------- ----- NESTED_TABLE_ID RAW 16 YES COLUMN_VALUE NUMBER 22 NO We observe that though the nested table had only number elements, there is two- columned information in the view. The COLUMN_VALUE attribute is the default pseudo column of the nested table as there are no "named" attributes in the collection structure. The other attribute, NESTED_TABLE_ID, is a hidden unique 16-byte system generated raw hash code which latently stores the parent row identifier alongside the nested table instance to distinguish the parent row association. If an element is deleted from the nested table, it is rendered as parse. 
This implies that once an index is deleted from the collection structure, the collection doesn't restructure itself by shifting the cells in a forward direction. Let us check out the sparse behavior in the following example. The following PL/SQL block declares a local nested table and initializes it with a constructor. We will delete the first element and print it again. The system raises the NO_DATA_FOUND exception when we query the element at the index 1 in the collection:   /*Enable the SERVEROUTPUT to display the block messages*/ SQL> SET SERVEROUTPUT ON /*Start the PL/SQL block*/ SQL> DECLARE /*Declare the local nested table collection*/ TYPE coll_method_demo_t IS TABLE OF NUMBER; /*Declare a collection variable and initialize it*/ L_ARRAY coll_method_demo_t := coll_method_demo_t (10,20,30,40,50); BEGIN /*Display element at index 1*/ DBMS_OUTPUT.PUT_LINE('Element at index 1 before deletion:'||l_ array(1)); /*Delete the 1st element from the collection*/ L_ARRAY.DELETE(1); /*Display element at index 1*/ DBMS_OUTPUT.PUT_LINE('Element at index 1 after deletion:'||l_ array(1)); END; / Element at index 1 before deletion:10 DECLARE * ERROR at line 1: ORA-01403: no data found ORA-06512: at line 15
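A common way to avoid the exception shown above is to test a cell with the EXISTS collection method before referencing it. The following is a small sketch along the lines of the previous block, reusing the same collection type declaration:
/*Enable the SERVEROUTPUT to display the block messages*/
SET SERVEROUTPUT ON
DECLARE
  /*Declare the local nested table collection and initialize it*/
  TYPE coll_method_demo_t IS TABLE OF NUMBER;
  l_array coll_method_demo_t := coll_method_demo_t(10, 20, 30, 40, 50);
BEGIN
  /*Delete the first element, leaving the collection sparse*/
  l_array.DELETE(1);
  /*Guard the access with EXISTS instead of reading the cell blindly*/
  IF l_array.EXISTS(1) THEN
    DBMS_OUTPUT.PUT_LINE('Element at index 1: ' || l_array(1));
  ELSE
    DBMS_OUTPUT.PUT_LINE('Element at index 1 has been deleted');
  END IF;
END;
/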

Pricing the Double-no-touch option

Packt
10 Mar 2015
19 min read
In this article by Balázs Márkus, coauthor of the book Mastering R for Quantitative Finance, you will learn about pricing and life of Double-no-touch (DNT) option. (For more resources related to this topic, see here.) A Double-no-touch (DNT) option is a binary option that pays a fixed amount of cash at expiry. Unfortunately, the fExoticOptions package does not contain a formula for this option at present. We will show two different ways to price DNTs that incorporate two different pricing approaches. In this section, we will call the function dnt1, and for the second approach, we will use dnt2 as the name for the function. Hui (1996) showed how a one-touch double barrier binary option can be priced. In his terminology, "one-touch" means that a single trade is enough to trigger the knock-out event, and "double barrier" binary means that there are two barriers and this is a binary option. We call this DNT as it is commonly used on the FX markets. This is a good example for the fact that many popular exotic options are running under more than one name. In Haug (2007a), the Hui-formula is already translated into the generalized framework. S, r, b, s, and T have the same meaning. K means the payout (dollar amount) while L and U are the lower and upper barriers. Where Implementing the Hui (1996) function to R starts with a big question mark: what should we do with an infinite sum? How high a number should we substitute as infinity? Interestingly, for practical purposes, small number like 5 or 10 could often play the role of infinity rather well. Hui (1996) states that convergence is fast most of the time. We are a bit skeptical about this since a will be used as an exponent. If b is negative and sigma is small enough, the (S/L)a part in the formula could turn out to be a problem. First, we will try with normal parameters and see how quick the convergence is: dnt1 <- function(S, K, U, L, sigma, T, r, b, N = 20, ploterror = FALSE){    if ( L > S | S > U) return(0)    Z <- log(U/L)    alpha <- -1/2*(2*b/sigma^2 - 1)    beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2    v <- rep(0, N)    for (i in 1:N)        v[i] <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /            (alpha^2+(i*pi/Z)^2)) * sin(i*pi/Z*log(S/L)) *              exp(-1/2 * ((i*pi/Z)^2-beta) * sigma^2*T)    if (ploterror) barplot(v, main = "Formula Error");    sum(v) } print(dnt1(100, 10, 120, 80, 0.1, 0.25, 0.05, 0.03, 20, TRUE)) The following screenshot shows the result of the preceding code: The Formula Error chart shows that after the seventh step, additional steps were not influencing the result. This means that for practical purposes, the infinite sum can be quickly estimated by calculating only the first seven steps. This looks like a very quick convergence indeed. However, this could be pure luck or coincidence. What about decreasing the volatility down to 3 percent? We have to set N as 50 to see the convergence: print(dnt1(100, 10, 120, 80, 0.03, 0.25, 0.05, 0.03, 50, TRUE)) The preceding command gives the following output: Not so impressive? 50 steps are still not that bad. What about decreasing the volatility even lower? At 1 percent, the formula with these parameters simply blows up. First, this looks catastrophic; however, the price of a DNT was already 98.75 percent of the payout when we used 3 percent volatility. 
Logic says that the DNT price should be a monotone-decreasing function of volatility, so we already know that the price of the DNT should be worth at least 98.75 percent if volatility is below 3 percent. Another issue is that if we choose an extreme high U or extreme low L, calculation errors emerge. However, similar to the problem with volatility, common sense helps here too; the price of a DNT should increase if we make U higher or L lower. There is still another trick. Since all the problem comes from the a parameter, we can try setting b as 0, which will make a equal to 0.5. If we also set r to 0, the price of a DNT converges into 100 percent as the volatility drops. Anyway, whenever we substitute an infinite sum by a finite sum, it is always good to know when it will work and when it will not. We made a new code that takes into consideration that convergence is not always quick. The trick is that the function calculates the next step as long as the last step made any significant change. This is still not good for all the parameters as there is no cure for very low volatility, except that we accept the fact that if implied volatilities are below 1 percent, than this is an extreme market situation in which case DNT options should not be priced by this formula: dnt1 <- function(S, K, U, L, sigma, Time, r, b) { if ( L > S | S > U) return(0) Z <- log(U/L) alpha <- -1/2*(2*b/sigma^2 - 1) beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2 p <- 0 i <- a <- 1 while (abs(a) > 0.0001){    a <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) /      (alpha^2 + (i *pi / Z)^2) ) * sin(i * pi / Z * log(S/L)) *        exp(-1/2*((i*pi/Z)^2-beta) * sigma^2 * Time)    p <- p + a    i <- i + 1 } p } Now that we have a nice formula, it is possible to draw some DNT-related charts to get more familiar with this option. Later, we will use a particular AUDUSD DNT option with the following parameters: L equal to 0.9200, U equal to 0.9600, K (payout) equal to USD 1 million, T equal to 0.25 years, volatility equal to 6 percent, r_AUD equal to 2.75 percent, r_USD equal to 0.25 percent, and b equal to -2.5 percent. We will calculate and plot all the possible values of this DNT from 0.9200 to 0.9600; each step will be one pip (0.0001), so we will use 2,000 steps. The following code plots a graph of price of underlying: x <- seq(0.92, 0.96, length = 2000) y <- z <- rep(0, 2000) for (i in 1:2000){    y[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.0250)    z[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.065, 0.25, 0.0025, -0.0250) } matplot(x, cbind(y,z), type = "l", lwd = 2, lty = 1,    main = "Price of a DNT with volatility 6% and 6.5% ", cex.main = 0.8, xlab = "Price of underlying" ) The following output is the result of the preceding code: It can be clearly seen that even a small change in volatility can have a huge impact on the price of a DNT. Looking at this chart is an intuitive way to find that vega must be negative. Interestingly enough even just taking a quick look at this chart can convince us that the absolute value of vega is decreasing if we are getting closer to the barriers. Most end users think that the biggest risk is when the spot is getting close to the trigger. This is because end users really think about binary options in a binary way. As long as the DNT is alive, they focus on the positive outcome. However, for a dynamic hedger, the risk of a DNT is not that interesting when the value of the DNT is already small. 
It is also very interesting that since the T-Bill price is independent of the volatility and since the DNT + DOT = T-Bill equation holds, an increasing volatility will decrease the price of the DNT by the exact same amount just like it will increase the price of the DOT. It is not surprising that the vega of the DOT should be the exact mirror of the DNT. We can use the GetGreeks function to estimate vega, gamma, delta, and theta. For gamma we can use the GetGreeks function in the following way: GetGreeks <- function(FUN, arg, epsilon,...) {    all_args1 <- all_args2 <- list(...)    all_args1[[arg]] <- as.numeric(all_args1[[arg]] + epsilon)    all_args2[[arg]] <- as.numeric(all_args2[[arg]] - epsilon)    (do.call(FUN, all_args1) -        do.call(FUN, all_args2)) / (2 * epsilon) } Gamma <- function(FUN, epsilon, S, ...) {    arg1 <- list(S, ...)    arg2 <- list(S + 2 * epsilon, ...)    arg3 <- list(S - 2 * epsilon, ...)    y1 <- (do.call(FUN, arg2) - do.call(FUN, arg1)) / (2 * epsilon)    y2 <- (do.call(FUN, arg1) - do.call(FUN, arg3)) / (2 * epsilon)  (y1 - y2) / (2 * epsilon) } x = seq(0.9202, 0.9598, length = 200) delta <- vega <- theta <- gamma <- rep(0, 200)   for(i in 1:200){ delta[i] <- GetGreeks(FUN = dnt1, arg = 1, epsilon = 0.0001,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.02, -0.02) vega[i] <-   GetGreeks(FUN = dnt1, arg = 5, epsilon = 0.0005,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) theta[i] <- - GetGreeks(FUN = dnt1, arg = 6, epsilon = 1/365,    x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) gamma[i] <- Gamma(FUN = dnt1, epsilon = 0.0001, S = x[i], K =    1e6, U = 0.96, L = 0.92, sigma = 0.06, Time = 0.5, r = 0.02, b = -0.02) }   windows() plot(x, vega, type = "l", xlab = "S",ylab = "", main = "Vega") The following chart is the result of the preceding code: After having a look at the value chart, the delta of a DNT is also very close to intuitions; if we are coming close to the higher barrier, our delta gets negative, and if we are coming closer to the lower barrier, the delta gets positive as follows: windows() plot(x, delta, type = "l", xlab = "S",ylab = "", main = "Delta") This is really a non-convex situation; if we would like to do a dynamic delta hedge, we will lose money for sure. If the spot price goes up, the delta of the DNT decreases, so we should buy some AUDUSD as a hedge. However, if the spot price goes down, we should sell some AUDUSD. Imagine a scenario where AUDUSD goes up 20 pips in the morning and then goes down 20 pips in the afternoon. For a dynamic hedger, this means buying some AUDUSD after the price moved up and selling this very same amount after the price comes down. The changing of the delta can be described by the gamma as follows: windows() plot(x, gamma, type = "l", xlab = "S",ylab = "", main = "Gamma") Negative gamma means that if the spot goes up, our delta is decreasing, but if the spot goes down, our delta is increasing. This doesn't sound great. For this inconvenient non-convex situation, there is some compensation, that is, the value of theta is positive. If nothing happens, but one day passes, the DNT will automatically worth more. Here, we use theta as minus 1 times the partial derivative, since if (T-t) is the time left, we check how the value changes as t increases by one day: windows() plot(x, theta, type = "l", xlab = "S",ylab = "", main = "Theta") The more negative the gamma, the more positive our theta. This is how time compensates for the potential losses generated by the negative gamma. 
Risk-neutral pricing also implicates that negative gamma should be compensated by a positive theta. This is the main message of the Black-Scholes framework for vanilla options, but this is also true for exotics. See Taleb (1997) and Wilmott (2006). We already introduced the Black-Scholes surface before; now, we can go into more detail. This surface is also a nice interpretation of how theta and delta work. It shows the price of an option for different spot prices and times to maturity, so the slope of this surface is the theta for one direction and delta for the other. The code for this is as follows: BS_surf <- function(S, Time, FUN, ...) { n <- length(S) k <- length(Time) m <- matrix(0, n, k) for (i in 1:n) {    for (j in 1:k) {      l <- list(S = S[i], Time = Time[j], ...)      m[i,j] <- do.call(FUN, l)      } } persp3D(z = m, xlab = "underlying", ylab = "Time",    zlab = "option price", phi = 30, theta = 30, bty = "b2") } BS_surf(seq(0.92,0.96,length = 200), seq(1/365, 1/48, length = 200), dnt1, K = 1000000, U = 0.96, L = 0.92, r = 0.0025, b = -0.0250,    sigma = 0.2) The preceding code gives the following output: We can see what was already suspected; DNT likes when time is passing and the spot is moving to the middle of the (L,U) interval. Another way to price the Double-no-touch option Static replication is always the most elegant way of pricing. The no-arbitrage argument will let us say that if, at some time in the future, two portfolios have the same value for sure, then their price should be equal any time before this. We will show how double-knock-out (DKO) options could be used to build a DNT. We will need to use a trick; the strike price could be the same as one of the barriers. For a DKO call, the strike price should be lower than the upper barrier because if the strike price is not lower than the upper barrier, the DKO call would be knocked out before it could become in-the-money, so in this case, the option would be worthless as nobody can ever exercise it in-the-money. However, we can choose the strike price to be equal to the lower barrier. For a put, the strike price should be higher than the lower barrier, so why not make it equal to the upper barrier. This way, the DKO call and DKO put option will have a very convenient feature; if they are still alive, they will both expiry in-the-money. Now, we are almost done. We just have to add the DKO prices, and we will get a DNT that has a payout of (U-L) dollars. Since DNT prices are linear in the payout, we only have to multiply the result by K*(U-L): dnt2 <- function(S, K, U, L, sigma, T, r, b) {      a <- DoubleBarrierOption("co", S, L, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    z <- a@price    b <- DoubleBarrierOption("po", S, U, L, U, T, r, b, sigma, 0,        0,title = NULL, description = NULL)    y <- b@price    (z + y) / (U - L) * K } Now, we have two functions for which we can compare the results: dnt1(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.59   dnt2(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.45 For a DNT with a USD 1 million contingent payout and an initial market value of over 48,000 dollars, it is very nice to see that the difference in the prices is only 14 cents. Technically, however, having a second pricing function is not a big help since low volatility is also an issue for dnt2. We will use dnt1 for the rest of the article. 
The life of a Double-no-touch option – a simulation How has the DNT price been evolving during the second quarter of 2014? We have the open-high-low-close type time series with five minute frequency for AUDUSD, so we know all the extreme prices: d <- read.table("audusd.csv", colClasses = c("character", rep("numeric",5)), sep = ";", header = TRUE) underlying <- as.vector(t(d[, 2:5])) t <- rep( d[,6], each = 4) n <- length(t) option_price <- rep(0, n)   for (i in 1:n) { option_price[i] <- dnt1(S = underlying[i], K = 1000000,    U = 0.9600, L = 0.9200, sigma = 0.06, T = t[i]/(60*24*365),      r = 0.0025, b = -0.0250) } a <- min(option_price) b <- max(option_price) option_price_transformed = (option_price - a) * 0.03 / (b - a) + 0.92   par(mar = c(6, 3, 3, 5)) matplot(cbind(underlying,option_price_transformed), type = "l",    lty = 1, col = c("grey", "red"),    main = "Price of underlying and DNT",    xaxt = "n", yaxt = "n", ylim = c(0.91,0.97),    ylab = "", xlab = "Remaining time") abline(h = c(0.92, 0.96), col = "green") axis(side = 2, at = pretty(option_price_transformed),    col.axis = "grey", col = "grey") axis(side = 4, at = pretty(option_price_transformed),    labels = round(seq(a/1000,1000,length = 7)), las = 2,    col = "red", col.axis = "red") axis(side = 1, at = seq(1,n, length=6),    labels = round(t[round(seq(1,n, length=6))]/60/24)) The following is the output for the preceding code: The price of a DNT is shown in red on the right axis (divided by 1000), and the actual AUDUSD price is shown in grey on the left axis. The green lines are the barriers of 0.9200 and 0.9600. The chart shows that in 2014 Q2, the AUDUSD currency pair was traded inside the (0.9200; 0.9600) interval; thus, the payout of the DNT would have been USD 1 million. This DNT looks like a very good investment; however, reality is just one trajectory out of an a priori almost infinite set. It could have happened differently. For example, on May 02, 2014, there were still 59 days left until expiry, and AUDUSD was traded at 0.9203, just three pips away from the lower barrier. At this point, the price of this DNT was only USD 5,302 dollars which is shown in the following code: dnt1(0.9203, 1000000, 0.9600, 0.9200, 0.06, 59/365, 0.0025, -0.025) [1] 5302.213 Compare this USD 5,302 to the initial USD 48,564 option price! In the following simulation, we will show some different trajectories. All of them start from the same 0.9266 AUDUSD spot price as it was on the dawn of April 01, and we will see how many of them stayed inside the (0.9200; 0.9600) interval. 
To make it simple, we will simulate geometric Brown motions by using the same 6 percent volatility as we used to price the DNT: library(matrixStats) DNT_sim <- function(S0 = 0.9266, mu = 0, sigma = 0.06, U = 0.96, L = 0.92, N = 5) {    dt <- 5 / (365 * 24 * 60)    t <- seq(0, 0.25, by = dt)    Time <- length(t)      W <- matrix(rnorm((Time - 1) * N), Time - 1, N)    W <- apply(W, 2, cumsum)    W <- sqrt(dt) * rbind(rep(0, N), W)    S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W )    option_price <- matrix(0, Time, N)      for (i in 1:N)        for (j in 1:Time)          option_price[j,i] <- dnt1(S[j,i], K = 1000000, U, L, sigma,              0.25-t[j], r = 0.0025,                b = -0.0250)*(min(S[1:j,i]) > L & max(S[1:j,i]) < U)      survivals <- sum(option_price[Time,] > 0)    dev.new(width = 19, height = 10)      par(mfrow = c(1,2))    matplot(t,S, type = "l", main = "Underlying price",        xlab = paste("Survived", survivals, "from", N), ylab = "")    abline( h = c(U,L), col = "blue")    matplot(t, option_price, type = "l", main = "DNT price",        xlab = "", ylab = "")} set.seed(214) system.time(DNT_sim()) The following is the output for the preceding code: Here, the only surviving trajectory is the red one; in all other cases, the DNT hits either the higher or the lower barrier. The line set.seed(214) grants that this simulation will look the same anytime we run this. One out of five is still not that bad; it would suggest that for an end user or gambler who does no dynamic hedging, this option has an approximate value of 20 percent of the payout (especially since the interest rates are low, the time value of money is not important). However, five trajectories are still too few to jump to such conclusions. We should check the DNT survivorship ratio for a much higher number of trajectories. The ratio of the surviving trajectories could be a good estimator of the a priori real-world survivorship probability of this DNT; thus, the end user value of it. Before increasing N rapidly, we should keep in mind how much time this simulation took. For my computer, it took 50.75 seconds for N = 5, and 153.11 seconds for N = 15. The following is the output for N = 15: Now, 3 out of 15 survived, so the estimated survivorship ratio is still 3/15, which is equal to 20 percent. Looks like this is a very nice product; the price is around 5 percent of the payout, while 20 percent is the estimated survivorship ratio. Just out of curiosity, run the simulation for N equal to 200. This should take about 30 minutes. The following is the output for N = 200: The results are shocking; now, only 12 out of 200 survive, and the ratio is only 6 percent! So to get a better picture, we should run the simulation for a larger N. The movie Whatever Works by Woody Allen (starring Larry David) is 92 minutes long; in simulation time, that is N = 541. For this N = 541, there are only 38 surviving trajectories, resulting in a survivorship ratio of 7 percent. What is the real expected survivorship ratio? Is it 20 percent, 6 percent, or 7 percent? We simply don't know at this point. Mathematicians warn us that the law of large numbers requires large numbers, where large is much more than 541, so it would be advisable to run this simulation for as large an N as time allows. Of course, getting a better computer also helps to do more N during the same time. Anyway, from this point of view, Hui's (1996) relatively fast converging DNT pricing formula gets some respect. 
Summary We started this article by introducing exotic options. In a brief theoretical summary, we explained how exotics are linked together. There are many types of exotics. We showed one possible way of classification that is consistent with the fExoticOptions package. We showed how the Black-Scholes surface (a 3D chart that contains the price of a derivative dependent on time and the underlying price) can be constructed for any pricing function. Resources for Article: Further resources on this subject: What is Quantitative Finance? [article] Learning Option Pricing [article] Derivatives Pricing [article]


Overview of SQL Server Reporting Services 2012 Architecture, Features, and Tools

Packt
08 Aug 2013
15 min read
(For more resources related to this topic, see here.) Structural design of SQL servers and SharePoint environment Depending on the business and the resources available, the various servers may be located in distributed locations and the Web applications may also be run from Web servers in a farm and the same can be true for SharePoint servers. In this article, by the word architecture we mean the way by which the preceding elements are put together to work on a single computer. However, it is important to know that this is just one topology (an arrangement of constituent elements) and in general it can be lot more complicated spanning networks and reaching across boundaries. The Report Server is the centerpiece of the Reporting Services installation. This installation can be deployed in two modes, namely, Native mode or SharePoint Integrated mode. Each mode has a separate engine and an extensible architecture. It consists of a collection of special-purpose extensions that handle authentication, data processing, rendering, and delivery operations. Once deployed in one mode it cannot be changed to the other. It is possible to have two servers each installed in a different mode. We have installed all the necessary elements to explore the RS 2012 features, including Power View and Data Alerts. The next diagram briefly shows the structural design of the environment used in working with the article: Primarily, SQL Server 2012 Enterprise Edition is used, for both Native mode as well as SharePoint Integrated mode. As we see in the previous diagram, Report Server Native mode is on a named instance HI (in some places another named instance Kailua is also used). This server has the Reporting Services databases ReportServer$HI and ReportServer$HITempDB. The associated Report Server handles Jobs, Security, and Shared Schedules. The Native mode architecture described in the next section is taken from the Microsoft documentation. The tools (SSDT, Report Builder, Report Server Configuration, and so on) connect to the Report Server. The associated SQL Server Agent takes care of the jobs such as subscriptions related to Native mode. The SharePoint Server 2010 is a required element with which the Reporting Services add-in helps to create a Reporting Services Service. With the creation of the RS Service in SharePoint, three SQL Server 2012 databases (shown alongside in the diagram) are created in an instance with its Reporting Services installed in SharePoint Integrated mode. The SQL Server 2012 instance NJ is installed in this fashion. These databases are repositories for report content including those related to Power Views and Data Alerts. The data sources(extension .rsds) used in creating Power View reports (extension.rdlx) are stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f database and the alerting related information is stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f_Alerting database. Not shown is an Express database that is used by the SharePoint Server for its content, administration, and so on. RS_ADD-IN allows you to create the service. You will use the Power Shell tool to create and manage the service. In order to create Power View reports, the new feature in SSRS 2012, you start off creating a data source in SharePoint library. Because of the RS Service, you can enable Reporting Services features such as Report Builder; and associate BISM file extensions to support connecting to tabular models created in SSDT deployed to Analysis Services Server. 
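To verify that the databases described above exist on the instance hosting each installation, a simple catalog query can be run against that instance. This is only an informal check, not part of the product's tooling, and the name patterns reflect the default database names used in this article's setup:
-- List the Reporting Services databases (Native mode and SharePoint mode) on this instance
SELECT name, create_date
FROM sys.databases
WHERE name LIKE N'ReportServer%'
   OR name LIKE N'ReportingService[_]%';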
When Reporting Services is installed in SharePoint Integrated mode, SharePoint Web Parts will be available to users, allowing them to connect to RS Native mode servers and work with the reports on those servers from within a SharePoint site.
Native mode
The following schematic, taken from Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms157231.aspx), shows the major components of a Native mode installation:
The image clearly shows the several processors that are called into play before a report is displayed. The following are the elements of this processing:
Processing extensions (data, rendering, report processing, and authentication)
Designing tools (Report Builder, Report Designer)
Display devices (browsers)
Windows components that do the scheduling and delivery through extensions (the Report Server databases, a SQL Server 2012 database, which stores everything connected with reports)
For the Reporting Services 2012 instance enabled in Native mode for this article, the following image shows the ReportServer databases and the Reporting Services server. A similar server, HI, was also installed after a malware attack. The Report Server is implemented as a Microsoft Windows service called the Report Server Service.
SharePoint Integrated mode
In SharePoint mode, a Report Server must run within a SharePoint Server (even in a standalone implementation). Report Server processing, rendering, and management are all carried out by the SharePoint application server running the Reporting Services SharePoint shared service. For this to happen, SharePoint Integrated mode has to be chosen at SQL Server installation time. Access to reports and related operations is in this case through a SharePoint frontend. The following elements are required for SharePoint mode:
SharePoint Foundation 2010 or SharePoint Server 2010
An appropriate version of the Reporting Services add-in for SharePoint products
A SharePoint application server with a Reporting Services shared service instance and at least one Reporting Services service application
The following diagram, taken from Microsoft documentation, illustrates the various parts of a SharePoint Integrated environment of Reporting Services. Note that the alerting Web service and Power View need SharePoint integration. The numbered items and their descriptions shown next are also from the same Microsoft document. Follow the link at the beginning of this section.
Item 1: Web servers or Web Frontends (WFE). The Reporting Services add-in must be installed on each Web server from which you want to utilize Web application features such as viewing reports or a Reporting Services management page for tasks such as managing data sources and subscriptions.
Item 2: The add-in installs URL and SOAP endpoints for clients to communicate with the application servers through the Reporting Services proxy.
Item 3: Application servers running the shared service. Scale-out of report processing is managed as part of the SharePoint farm by adding the service to additional application servers.
Item 4: You can create more than one Reporting Services service application, with different configurations including permissions, e-mail, proxy, and subscriptions.
Item 5: Reports, data sources, and other items are stored in SharePoint content databases.
Item 6: Reporting Services service applications create three databases for the Report Server, temp, and data alerting features.
Configuration settings that apply to all SSRS service applications are stored in the RSReportServer.config file.

When you install Reporting Services in SharePoint Integrated mode, several features that you are used to in Native mode will not be available. Some of them are summarized here from the MSDN site:

URL access will work, but you have to use the SharePoint URL and not the Native mode URL. The Native mode folder hierarchy will not work.
Custom security extensions can be used, but you need to use the special-purpose security extension meant for SharePoint integration.
You cannot use the Reporting Services Configuration Manager (of the Native mode installation). You should use SharePoint Central Administration, as shown in this section (for Reporting Services 2008 and 2008 R2).
Report Manager is not the frontend; in this case you should use the SharePoint application pages.
You cannot use Linked Reports, My Reports, and My Subscriptions in SharePoint mode.
In SharePoint Integrated mode you can work with Data Alerts, which is not possible in a Native mode installation.
Power View is another feature available with SharePoint that is not available in Native mode. To access Power View, the browser needs Silverlight installed.
While reports with the RDL extension are supported in both modes, reports with the RDLX extension are supported only in SharePoint mode.
SharePoint user token credentials, AAM zones for Internet-facing deployments, SharePoint backup and recovery, and ULS log support are available only in SharePoint mode.

For the purposes of discussion and exercises in this article, a standalone server deployment is used, as shown in the next diagram. Remember that various other deployment topologies using more than one computer are possible; for a detailed description, please follow the link http://msdn.microsoft.com/en-us/library/bb510781(v=sql.105).aspx. The standalone deployment is the simplest, in that all the components are installed on a single computer, representative of the installation used for this article. The following diagram, taken from the preceding link, illustrates the elements of the standalone deployment:

Reporting Services configuration

For both modes of installation, information for the Reporting Services components is stored in configuration files and the registry. During setup, the configuration files are copied to the following locations:

Native mode: C:\Program Files\Microsoft SQL Server\MSRS11.MSSQLSERVER
SharePoint Integrated mode: C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\15\WebServices\Reporting

Follow the link http://msdn.microsoft.com/en-us/library/ms155866.aspx for details.

Native mode

The Report Server Windows service is an orchestrated set of applications that run in a single process, using a single account with access to a single Report Server database, with the set of configuration files listed here:

RSReportServer.config: Stores configuration settings for feature areas of the Report Server service: Report Manager, the Report Server Web Service, and background processing. Location: <Installation directory>\Reporting Services\ReportServer
RSSrvPolicy.config: Stores the code access security policies for the server extensions. Location: <Installation directory>\Reporting Services\ReportServer
RSMgrPolicy.config: Stores the code access security policies for Report Manager.
Location: <Installation directory>\Reporting Services\ReportManager
Web.config for the Report Server Web Service: Includes only those settings that are required for ASP.NET. Location: <Installation directory>\Reporting Services\ReportServer
Web.config for Report Manager: Includes only those settings that are required for ASP.NET. Location: <Installation directory>\Reporting Services\ReportManager
ReportingServicesService.exe.config: Stores configuration settings that specify the trace levels and logging options for the Report Server service. Location: <Installation directory>\Reporting Services\ReportServer\Bin
Registry settings: Store configuration state and other settings used to uninstall Reporting Services. If you are troubleshooting an installation or configuration problem, you can view these settings to get information about how the Report Server is configured. Do not modify these settings directly, as this can invalidate your installation. Location: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceID>\Setup and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\Services\ReportServer
RSReportDesigner.config: Stores configuration settings for Report Designer. For more information, follow the link http://msdn.microsoft.com/en-us/library/ms160346.aspx. Location: <drive>:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\PrivateAssemblies
RSPreviewPolicy.config: Stores the code access security policies for the server extensions used during report preview. Location: C:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\PrivateAssemblies

First is the RSReportServer configuration file, which can be found in the installation directory under Reporting Services. The entries in this file control the feature areas of the Report Server service shown in the previous image, namely the Report Server Web Service, Report Manager, and background processing. The RSReportServer configuration file has several sections with which you can modify the following features:

General configuration settings
URL reservations
Authentication
Service
UI
Extensions
MapTileServerConfiguration (the Microsoft Bing Maps SOAP Services that provide a tile background for map report items in a report)
Default configuration file for a Native mode Report Server
Default configuration file for a SharePoint mode Report Server

The three areas previously mentioned (the Report Server Web Service, the Report Server service, and Report Manager) all run in separate application domains, and you can turn off elements that you do not need so as to improve security by reducing the surface area for attacks. Some functionality, such as memory management and process health, works across all three components. For example, for the reporting server Kailua used in this article, the service name is ReportServer$KAILUA. This service has no other dependencies. In fact, you can access the help file for this service when you look at Windows Services in the Control Panel, as shown; in three of the tabbed pages of this window you can access contextual help.

SharePoint Integrated mode

The following table, taken from the Microsoft documentation, describes the configuration files used by a SharePoint mode Report Server. Configuration settings are stored in the SharePoint service application databases.

RSReportServer.config: Stores configuration settings for feature areas of the Report Server service: Report Manager, the Report Server Web Service, and background processing.
Location: <Installation directory>\Reporting Services\ReportServer
RSSrvPolicy.config: Stores the code access security policies for the server extensions. Location: <Installation directory>\Reporting Services\ReportServer
Web.config for the Report Server Web Service: Includes only those settings that are required for ASP.NET. Location: <Installation directory>\Reporting Services\ReportServer
Registry settings: Store configuration state and other settings used to uninstall Reporting Services, and also store information about each Reporting Services service application. Do not modify these settings directly, as this can invalidate your installation. Location: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceID>\Setup (for example, instance ID MSSQL11.MSSQLSERVER) and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\Reporting Services\Service Applications
RSReportDesigner.config: Stores configuration settings for Report Designer. Location: <drive>:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\PrivateAssemblies

Hands-on exercise 3.1 – modifying the configuration file in Native mode

We can make changes to the rsreportserver.config file when changes are required or some tuning has to be done; for example, you may need to accommodate a different e-mail setting, change authentication, and so on. This is an XML file that can be edited in Notepad.exe (you can also use an XML editor or Visual Studio). You need to start Notepad with administrator privileges.

Turn on/off the Report Server Web Service

In this exercise, we will modify the configuration file to turn the Report Server Web Service on or off. Perform the following steps:

1. Start Notepad using Run as Administrator.
2. Open the file (you may use Start | Search for rsreportserver.config), which is located at C:\Program Files\Microsoft SQL Server\MSRS11.KAILUA\Reporting Services\ReportServer\rsreportserver.config.
3. In Edit | Find, type in IsWebServiceEnabled. The element takes one of two values, True or False. If you want to turn the Web service off, change True to False. The default is True. Here is a section of the file reproduced:

<Service>
  <IsSchedulingService>True</IsSchedulingService>
  <IsNotificationService>True</IsNotificationService>
  <IsEventService>True</IsEventService>
  <PollingInterval>10</PollingInterval>
  <WindowsServiceUseFileShareStorage>False</WindowsServiceUseFileShareStorage>
  <MemorySafetyMargin>80</MemorySafetyMargin>
  <MemoryThreshold>90</MemoryThreshold>
  <RecycleTime>720</RecycleTime>
  <MaxAppDomainUnloadTime>30</MaxAppDomainUnloadTime>
  <MaxQueueThreads>0</MaxQueueThreads>
  <UrlRoot>
  </UrlRoot>
  <UnattendedExecutionAccount>
    <UserName></UserName>
    <Password></Password>
    <Domain></Domain>
  </UnattendedExecutionAccount>
  <PolicyLevel>rssrvpolicy.config</PolicyLevel>
  <IsWebServiceEnabled>True</IsWebServiceEnabled>
  <IsReportManagerEnabled>True</IsReportManagerEnabled>
  <FileShareStorageLocation>
    <Path>
    </Path>
  </FileShareStorageLocation>
</Service>

4. Save the file to apply the changes.

Turn on/off the scheduled events and delivery

This changes report processing and delivery. Make the change in the rsreportserver.config file, in the following part of the <Service/> section:

<IsSchedulingService>True</IsSchedulingService>
<IsNotificationService>True</IsNotificationService>
<IsEventService>True</IsEventService>

The default value for all three is True. You can change it to False and save the file to apply the changes. This could also be carried out by modifying a facet in SQL Server Management Studio (SSMS), but presently this is not available.

Turn on/off the Report Manager

Report Manager can be turned off or on by making changes to the configuration file.
Make a change to the following element in the <Service/> section:

<IsReportManagerEnabled>True</IsReportManagerEnabled>

Again, this change can be made using the Reporting Services server's facet. To change it this way, make sure you launch SQL Server Management Studio as Administrator. The use of SSMS via facets is described in the following section.

Hands-on exercise 3.2 – turn the Reporting Service on/off in SSMS

The following are the steps to turn the Reporting Service on or off in SSMS:

1. Connect to the Reporting Services instance KAILUA in SQL Server Management Studio as the Administrator.
2. Choose HODENTEKWIN7\KAILUA under Reporting Services and click on OK.
3. Right-click on HODENTEKWIN7\KAILUA (Report Server 11.0.22180 – HodentekWin7\mysorian).
4. Click on Facets to open the following properties page.
5. Click on the handle, set it to True or False, and click on OK. The default is True.

It should be possible to turn Windows Integrated security on or off by using SQL Server Management Studio; however, the Reporting Services server properties are disabled.
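If you prefer to script the change made in Hands-on exercise 3.1 instead of editing the file in Notepad, the same setting can be flipped with a few lines of PowerShell. This is only a sketch, assuming the KAILUA instance path and the ReportServer$KAILUA service name used in this article; run it from an elevated prompt and keep the backup it makes in case you need to roll back:

# Toggle IsWebServiceEnabled in rsreportserver.config for the KAILUA instance.
# The path and service name are the ones used in this article; adjust for your instance.
$configPath = 'C:\Program Files\Microsoft SQL Server\MSRS11.KAILUA\Reporting Services\ReportServer\rsreportserver.config'

# Keep a backup before touching the file.
Copy-Item $configPath "$configPath.bak"

# Load the XML, locate the element, flip the value, and save the file.
[xml]$config = Get-Content $configPath
$node = $config.SelectSingleNode('//IsWebServiceEnabled')
$node.InnerText = 'False'   # or 'True' to re-enable the Web service
$config.Save($configPath)

# Restart the Report Server service so the change takes effect.
Restart-Service 'ReportServer$KAILUA'

The same approach works for IsReportManagerEnabled, IsSchedulingService, and the other elements of the <Service> section shown earlier.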