Data | Tech News, Tutorials & Expert Insights


06 Feb 2015

30 min read

Structural Equation Modeling and Confirmatory Factor Analysis


Guest Contributor

27 Sep 2018

7 min read

9 recommended blockchain online courses


Blockchain is reshaping the world as we know it. And we are not talking metaphorically because the new technology is really influencing everything from online security and data management to governance and smart contracting. Statistical reports support these claims. According to the study, the blockchain universe grows by over 40% annually, while almost 70% of banks are already experimenting with this technology. IT experts at the Editing AussieWritings.com Services claim that the potential in this field is almost limitless: “Blockchain offers a myriad of practical possibilities, so you definitely want to get acquainted with it more thoroughly.” Developers who are curious about blockchain can turn it into a lucrative career opportunity since it gives them the chance to master the art of cryptography, hierarchical distribution, growth metrics, transparent management, and many more. There were 5,743 mostly full-time job openings calling for blockchain skills in the last 12 months - representing the 320% increase - while the biggest freelancing website Upwork reported more than 6,000% year-over-year growth. In this post, we will recommend our 9 best blockchain online courses. Let’s take a look! Udemy Udemy offers users one of the most comprehensive blockchain learning sources. The target audience is people who have heard a little bit about the latest developments in this field, but want to understand more. This online course can help you to fully understand how the blockchain works, as well as get to grips with all that surrounds it. Udemy breaks down the course into several less complicated units, allowing you to figure out this complex system rather easily. It costs $19.99, but you can probably get it with a 40% discount. The one downside, however, is that content quality in terms of subject scope can vary depending on the instructor, but user reviews are a good way to gauge quality. 
Each tutorial lasts approximately 30 minutes, but it also depends on your own tempo and style of work. Pluralsight Pluralsight is an excellent beginner-level blockchain course. It comes in three versions: Blockchain Fundamentals, Surveying Blockchain Technologies for Enterprise, and Introduction to Bitcoin and Decentralized Technology. Course duration varies from 80 to 200 minutes depending on the package. The price of Pluralsight is $29 a month or $299 a year. Choosing one of these options, you are granted access to the entire library of documents, including course discussions, learning paths, channels, skill assessments, and other similar tools. Packt Publishing Packt Publishing has a wide portfolio of learning products on Blockchain for varying levels of experience in the field from beginners to experts. And what’s even more interesting is that you can choose your learning format from books, ebooks to videos, courses and live courses. Or you could simply subscribe to MAPT, their library to gain access to all products at a reasonable price of $29 monthly and $150 annually. It offers several books and videos on the leading blockchain technology. You can purchase 5 blockchain titles at a discounted rate of $50. Here’s the list of top blockchain courses offered by Packt Publishing: Exploring Blockchain and Crypto-currencies: You will gain the foundational understanding of blockchain and crypto-currencies through various use-cases. Building Blockchain Projects: In this, you will be able to develop real-time practical DApps with Ethereum and JavaScript. Mastering Blockchain - Second Edition: You can learn about cryptography and cryptocurrencies, so you can build highly secure, decentralized applications and conduct trusted in-app transactions. Hands-On Blockchain with Hyperledger: This book will help you leverage the power of Hyperledger Fabric to develop Blockchain-based distributed ledgers with ease. 
Learning Blockchain Application Development [video]: This interactive video will help you learn to build smart contracts and DApps on Ethereum. Create Ethereum and Blockchain Applications using Solidity [video]: This video will help you learn about Ethereum, Solidity, DAO, ICO, Bitcoin, Altcoin, Website Security, Ripple, Litecoin, Smart Contracts, and Apps. Cryptozombies Cryptozombies is an online blockchain course based on gamification elements. The tool teaches you to write smart contracts in Solidity through building your own crypto-collectibles game. It is entirely Ethereum-focused, but you don’t need any previous experience to understand how Solidity works. There is a step-by-step guide that explains even the smallest details, so you can quickly learn to create your own fully functional blockchain-based game. The best thing about Cryptozombies is that you can test it for free and give up in case you don’t like it. Coursera The blockchain is the epicenter of the cryptocurrency world, so it’s necessary to study it if you want to deal with Bitcoin and other digital currencies. Coursera is the leading online resource in the field of virtual currencies, so you might want to check it out. After a course like Blockchain Specialization, you’ll know everything you need to be able to separate fact from fiction when reading claims about Bitcoin and other cryptocurrencies. You’ll have the conceptual foundations you need to engineer secure software that interacts with the Bitcoin network. And you’ll be able to integrate ideas from Bitcoin in your own projects. The course is a 4-part course spanning 4 weeks, but you can take each part separately. The price depends on the level and features you choose. LinkedIn Learning (formerly known as Lynda) LinkedIn Learning (what used to be Lynda) doesn't offer a specific blockchain course, but it does have a wide range of industry-related learning sources. 
A search for ‘blockchain’ will present you with almost 100 relevant video courses. You can find all sorts of lessons here, from beginner to expert levels. Lynda allows you to customize selection according to video duration, authors, software, subjects, etc. You can access the library for $15 a month. B9Lab B9Lab ETH-25 Certified Online Ethereum Developer Course is another course that promotes blockchain technology aimed at the Ethereum platform. It’s a 12-week in-depth learning solution that targets experienced programmers. B9Lab introduces everything there is to know about blockchain and how to build useful applications. Participants are taught about the Ethereum platform, the programming language Solidity, how to use web3 and the Truffle framework, and how to tie everything together. The price is €1450 or about $1700. IBM IBM made a self-paced blockchain course, titled Blockchain Essentials, that lasts over two hours. The video lectures and lab in this course help you learn about blockchain for business and explore key use cases that demonstrate how the technology adds value. You can learn how to leverage blockchain benefits, transform your business with the new technology, and transfer assets. Besides that, you get a nice wrap-up and a quiz to test your knowledge upon completion. IBM’s course is free of charge. Khan Academy Khan Academy is the last, but certainly not the least important online course on our list. It gives users a comprehensive overview of blockchain-powered systems, particularly Bitcoin. Using this platform, you can learn more about cryptocurrency transactions, security, proof of work, etc. As an online education platform, Khan Academy won’t cost you a dime. Blockchain is the groundbreaking technology that opens new boundaries in almost every field of business. It directly influences financial markets, data management, digital security, and a variety of other industries. 
In this post, we presented 9 best blockchain online courses you should try. These sources can teach you everything there is to know about the blockchain basics. Take some time to check them out and you won’t regret it! Author Bio: Olivia is a passionate blogger who writes on topics of digital marketing, career, and self-development. She constantly tries to learn something new and to share this experience on various websites. Connect with her on Facebook and Twitter.


Packt

19 Aug 2011

3 min read

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts


Oracle E-Business Suite 12 Financials Cookbook

Take the hard work out of your daily interactions with E-Business Suite financials by using the 50+ recipes from this cookbook

Introduction

The liquidity of an organization is managed in Oracle Cash Management; this includes the reconciliation of the cashbook to the bank statements, and forecasting future cash requirements. In this article, we will look at how to create bank accounts and cash forecasts. Cash Management integrates with Payables, Receivables, Payroll, Treasury, and General Ledger.

Let's start by looking at the cash management process:

1. The Bank generates statements.
2. The statements are sent to the organization electronically or by post.
3. The Treasury Administrator loads and verifies the bank statement into cash management. The statements can also be manually entered into cash management.
4. The loaded statements are reconciled to the cash book transactions.
5. The results are reviewed, and amended if required.
6. The Treasury Administrator creates the journals for transactions in the General Ledger.

Creating bank accounts

Oracle Cash Management provides us with the functionality to create bank accounts. In this recipe, we will create a bank account for a bank called Shepherds Bank, for one of their branches called Kings Cross branch.

Getting ready

Log in to Oracle E-Business Suite R12 with the username and password assigned to you by the system administrator. If you are working on the Vision demonstration database, you can use OPERATIONS/WELCOME as the USERNAME/PASSWORD.

We also need to create a bank before we can create the bank account. Let's look at how to create a bank and the branch:

1. Select the Cash Management responsibility.
2. Navigate to Setup | Banks | Banks.
3. In the Banks tab, click on the Create button.
4. Select the Create new bank option.
5. In the Country field, enter United States.
6. In the Bank Name field, enter Shepherds Bank.
7. In the Bank Number field, enter JN316.
8. Click on the Finish button.

Let's create the branch and the address:

1. Click the Create Branch icon. The Country and the Bank Name are automatically entered.
2. Click on the Continue button.
3. In the Branch Name field, enter Kings Cross.
4. Select ABA as the Branch Type.
5. Click on the Save and Next button to create the Branch address.
6. In the Branch Address form, click on the Create button.
7. In the Country field, enter United States.
8. In the Address Line 1 field, enter 4234 Red Eagle Road.
9. In the City field, enter Sacred Heart.
10. In the County field, enter Renville.
11. In the State field, enter MN.
12. In the Postal Code field, enter 56285.
13. Ensure that the Status field is Active.
14. Click on the Apply button.
15. Click on the Finish button.


Packt

20 May 2016

28 min read

Visualizations Using CCC



Packt

21 Sep 2015

18 min read

Scraping the Data


In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives.

Analyzing a web page

To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option. The data we are interested in is found in this part of the HTML:

    <table>
    <tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag: </label></td><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr>
    …
    <tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr></table>

This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version.
Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu. This will open a panel showing the surrounding HTML hierarchy of the selected element. When the area attribute of the country is clicked on, the Firebug panel makes it clear that the country area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data.

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html. To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

    >>> import re
    >>> url = 'http://example.webscraping.com/view/United-Kingdom-239'
    >>> html = download(url)
    >>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
    ['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '<a href="/continent/EU">EU</a>', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes.
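The session above relies on a download() helper that is not defined in this excerpt. As a minimal sketch, assuming a hard-coded HTML fragment modelled on the table shown earlier (the SAMPLE_HTML string and the extract_cells name are made up for illustration), the same pattern can be tried in Python 3 without any network access:

```python
import re

# Hypothetical fragment modelled on the country table; not fetched from the site.
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_national_flag__row"><td class="w2p_fl"><label>National Flag: </label></td>'
    '<td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td>'
    '<td class="w2p_fc"></td></tr>'
    '<tr id="places_area__row"><td class="w2p_fl"><label>Area: </label></td>'
    '<td class="w2p_fw">244,820 square kilometres</td><td class="w2p_fc"></td></tr>'
    '<tr id="places_population__row"><td class="w2p_fl"><label>Population: </label></td>'
    '<td class="w2p_fw">62,348,447</td><td class="w2p_fc"></td></tr>'
    '</table>'
)

# Non-greedy capture of everything inside each <td class="w2p_fw"> cell.
CELL_PATTERN = re.compile(r'<td class="w2p_fw">(.*?)</td>')


def extract_cells(html):
    """Return the contents of every w2p_fw cell, in document order."""
    return CELL_PATTERN.findall(html)


cells = extract_cells(SAMPLE_HTML)
print(cells)
```

Only the value cells match, because the label and comment cells use the w2p_fl and w2p_fc classes. The flag cell comes first, so the area sits at index 1, as on the real page.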
To isolate the area, we can select the second element, as follows:

    >>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
    '244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider what happens if the website is updated and the area is no longer the second matched cell. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

    >>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
    ['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra space could be added between the <td> tags, or the area_label could be changed. Here is an improved version that tries to support these various possibilities:

    >>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
    '244,820 square kilometres'

This regular expression is more future-proof but is difficult to construct and is becoming unreadable. Also, there are still other minor layout changes that would break it, such as a title attribute being added to the <td> tag. From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content.
If you do not already have it installed, the latest version can be installed using this command:

    pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML and Beautiful Soup needs to decide what is intended. For example, consider this simple web page of a list with missing attribute quotes and closing tags:

    <ul class=country>
        <li>Area
        <li>Population
    </ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this:

    >>> from bs4 import BeautifulSoup
    >>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
    >>> # parse the HTML
    >>> soup = BeautifulSoup(broken_html, 'html.parser')
    >>> fixed_html = soup.prettify()
    >>> print fixed_html
    <html>
     <body>
      <ul class="country">
       <li>Area</li>
       <li>Population</li>
      </ul>
     </body>
    </html>

Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. (How the markup is repaired depends on the parser passed to BeautifulSoup; output with added <html> and <body> tags like this is typical of the lxml builder, so results with html.parser may differ.) Now, we can navigate to the elements we want using the find() and find_all() methods:

    >>> ul = soup.find('ul', attrs={'class':'country'})
    >>> ul.find('li') # returns just the first match
    <li>Area</li>
    >>> ul.find_all('li') # returns all matches
    [<li>Area</li>, <li>Population</li>]

Beautiful Soup overview

Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup:

    BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter.
    find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes.
    find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None.
    prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks.

For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Now, using these techniques, here is a full example to extract the area from our example country:

    >>> from bs4 import BeautifulSoup
    >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
    >>> html = download(url)
    >>> soup = BeautifulSoup(html)
    >>> # locate the area row
    >>> tr = soup.find(attrs={'id':'places_area__row'})
    >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
    >>> area = td.text # extract the text from this tag
    >>> print area
    244,820 square kilometres

This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes.

Lxml

Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html.

As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

    >>> import lxml.html
    >>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
    >>> tree = lxml.html.fromstring(broken_html) # parse the HTML
    >>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
    >>> print fixed_html
    <ul class="country">
    <li>Area</li>
    <li>Population</li>
    </ul>

As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags.
After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors. Here is an example using the lxml CSS selectors to extract the area data:

    >>> tree = lxml.html.fromstring(html)
    >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
    >>> area = td.text_content()
    >>> print area
    244,820 square kilometres

The key line is the one calling cssselect() with the CSS selector. This line finds a table row element with the places_area__row ID, and then selects the child table data tag with the w2p_fw class.

CSS selectors

CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need:

    Select any tag: *
    Select by tag <a>: a
    Select by class of "link": .link
    Select by tag <a> with class "link": a.link
    Select by tag <a> with ID "home": a#home
    Select by child <span> of tag <a>: a > span
    Select by descendant <span> of tag <a>: a span
    Select by tag <a> with attribute title of "Home": a[title=Home]

The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors. Note that, internally, lxml converts the CSS selectors into an equivalent XPath.

Comparing performance

To help evaluate the trade-offs of the three scraping approaches described in this article, it would help to compare their relative efficiency. Typically, a scraper would extract multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country's web page.
To get started, we need to return to Firebug to check the format of the other country features. Firebug shows that each table row has an ID starting with places_ and ending with __row. Then, the country data is contained within these rows in the same format as the earlier area example. Here are implementations that use this information to extract all of the available country data:

    FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

    import re
    def re_scraper(html):
        results = {}
        for field in FIELDS:
            results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
        return results

    from bs4 import BeautifulSoup
    def bs_scraper(html):
        soup = BeautifulSoup(html, 'html.parser')
        results = {}
        for field in FIELDS:
            results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
        return results

    import lxml.html
    def lxml_scraper(html):
        tree = lxml.html.fromstring(html)
        results = {}
        for field in FIELDS:
            results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
        return results

Scraping results

Now that we have complete implementations for each scraper, we will test their relative performance with this snippet:

    import time
    NUM_ITERATIONS = 1000 # number of times to test each scraper
    html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
    for name, scraper in [('Regular expressions', re_scraper),
                          ('BeautifulSoup', bs_scraper),
                          ('Lxml', lxml_scraper)]:
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == re_scraper:
                re.purge()
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
        # record end time of scrape and output the total
        end = time.time()
        print '%s: %.2f seconds' % (name, end - start)

This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the line calling re.purge(); by default, the regular expression module will cache searches and this cache needs to be cleared to make a fair comparison with the other scraping approaches. Here are the results from this script on my computer:

    $ python performance.py
    Regular expressions: 5.50 seconds
    BeautifulSoup: 42.84 seconds
    Lxml: 7.06 seconds

The results on your computer will quite likely be different because of the different hardware used. However, the relative difference between each approach should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated because lxml and the regular expression module were written in C, while BeautifulSoup is pure Python. An interesting fact is that lxml performed comparably to regular expressions, even though lxml has the additional overhead of having to parse the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is reduced and lxml becomes even more competitive. It really is an amazing module!

Overview

The following table summarizes the advantages and disadvantages of each approach to scraping:

    Scraping approach     Performance   Ease of use   Ease to install
    Regular expressions   Fast          Hard          Easy (built-in module)
    Beautiful Soup        Slow          Easy          Easy (pure Python)
    Lxml                  Fast          Easy          Moderately difficult

If the bottleneck to your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup. Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice.
However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches.

Adding a scrape callback to the link crawler

Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after certain events (in this case, after a web page has been downloaded). This scrape callback will take a url and html as parameters and optionally return a list of further URLs to crawl. Here is the implementation, which is simple in Python:

    def link_crawler(..., scrape_callback=None):
        …
        links = []
        if scrape_callback:
            links.extend(scrape_callback(url, html) or [])
        …

The new code for the scraping callback is the scrape_callback parameter and the if block that calls it. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback. Here is a modified version of the lxml example scraper that can be used for the callback function:

    def scrape_callback(url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS]
            print url, row

This callback function would scrape the country data and print it out.
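Because the body of link_crawler is elided in this excerpt, the callback mechanism can be easier to follow in a toy setting. The following Python 3 sketch is an illustration only: toy_link_crawler, PAGES, and collect_cells are hypothetical names, and the in-memory dictionary stands in for real downloads.

```python
import re

# Hypothetical in-memory "website": URL -> HTML. Stands in for real downloads.
PAGES = {
    '/index': '<a href="/view/A-1">A</a> <a href="/view/B-2">B</a>',
    '/view/A-1': '<td class="w2p_fw">Area A</td>',
    '/view/B-2': '<td class="w2p_fw">Area B</td>',
}


def toy_link_crawler(seed_url, scrape_callback=None):
    """Crawl PAGES breadth-first, calling scrape_callback(url, html) per page."""
    queue, seen = [seed_url], {seed_url}
    while queue:
        url = queue.pop(0)
        html = PAGES.get(url, '')
        # Links found in the page itself.
        links = re.findall(r'href="(.*?)"', html)
        if scrape_callback:
            # The callback may return extra URLs to crawl, or None.
            links.extend(scrape_callback(url, html) or [])
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)


scraped = []


def collect_cells(url, html):
    """Example callback: record the w2p_fw cells of each /view/ page."""
    if re.search('/view/', url):
        scraped.append((url, re.findall(r'<td class="w2p_fw">(.*?)</td>', html)))


toy_link_crawler('/index', scrape_callback=collect_cells)
print(scraped)
```

The `or []` guard is what lets a callback such as collect_cells return nothing at all: a None result is treated as "no extra links".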
Usually, when scraping a website, we want to reuse the data, so we will extend this example to save results to a CSV spreadsheet, as follows:

    import csv
    class ScrapeCallback:
        def __init__(self):
            self.writer = csv.writer(open('countries.csv', 'w'))
            self.fields = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
            self.writer.writerow(self.fields)

        def __call__(self, url, html):
            if re.search('/view/', url):
                tree = lxml.html.fromstring(html)
                row = []
                for field in self.fields:
                    row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
                self.writer.writerow(row)

To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor, and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how the scrape_callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names.

This code shows how to pass this callback to the link crawler:

    link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback())

Now, when the crawler is run with this callback, it will save results to a CSV file that can be viewed in an application such as Excel or LibreOffice. Success! We have completed our first working scraper.

Summary

In this article, we walked through a variety of ways to scrape data from a web page.
Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, and we will use it in future examples.
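The callable-class pattern used for ScrapeCallback can be tried on its own. This is a minimal Python 3 sketch under stated assumptions, not the article's implementation: CsvScrapeCallback is a hypothetical name, it writes to an in-memory buffer instead of countries.csv, and it swaps the lxml lookup for a regular expression so that it runs with the standard library only.

```python
import csv
import io
import re


class CsvScrapeCallback:
    """Callable object that keeps a csv writer alive between calls."""

    def __init__(self, output, fields):
        self.fields = fields
        self.writer = csv.writer(output)
        self.writer.writerow(fields)  # header row, written once

    def __call__(self, url, html):
        if not re.search('/view/', url):
            return
        row = []
        for field in self.fields:
            # Regex lookup keeps the sketch dependency-free; the article uses lxml here.
            pattern = r'<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field
            match = re.search(pattern, html)
            row.append(match.group(1) if match else '')
        self.writer.writerow(row)


buffer = io.StringIO()  # in-memory stand-in for countries.csv
callback = CsvScrapeCallback(buffer, ('area', 'population'))
page = ('<tr id="places_area__row"><td class="w2p_fw">244,820 square kilometres</td></tr>'
        '<tr id="places_population__row"><td class="w2p_fw">62,348,447</td></tr>')
callback('/view/United-Kingdom-239', page)      # same as callback.__call__(...)
callback.__call__('/index', '<html></html>')    # ignored: not a /view/ page
print(buffer.getvalue())
```

Passing the output stream into the constructor, rather than opening a file inside it, is a design choice made here to keep the sketch easy to test; the values containing commas are quoted automatically by the csv module.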


Pravin Dhandre

11 Jan 2018

12 min read

Working with Spark’s graph processing library, GraphFrames



Packt

16 Dec 2014

9 min read

Ridge Regression


In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. Constraining those weights is known as shrinkage. Practically, shrinkage consists of adding a function with model parameters as an argument to the loss function:

$$\hat{w} = \arg\min_{w} \left\{ \mathrm{RSS}(w) + \lambda J(w) \right\}$$

The penalty function is completely independent from the training set {x,y}. The penalty term is usually expressed as a power function of the norm of the model parameters (or weights) $w_d$. For a model of D dimensions, the generic Lp-norm is defined as follows:

$$\|w\|_p = \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$$

Notation: Regularization applies to parameters or weights associated to an observation. In order to be consistent with our notation, $w_0$ being the intercept value, the regularization applies to the parameters $w_1 \ldots w_D$.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning

The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.

The L1 regularization applied to the linear regression is known as the Lasso regularization.
Ridge regression is a linear regression that uses the L2 regularization penalty. You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf).

The differences between the two regularizations are as follows:

Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse datasets, L2 has a smaller estimation error than L1.
Feature selection: L1 is more effective than L2 in reducing the regression weights for features with high values. Therefore, L1 is a reliable feature selection tool.
Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features.
Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty w² is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the sum of |wi| and is therefore not differentiable at zero.

Terminology

Ridge regression is sometimes called penalized least squares regression. The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.
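The contrast between the two penalties can be illustrated with a minimal sketch in plain Scala. It is not taken from the book: the object name Penalties and its methods are hypothetical, and the two shrink functions assume the simplest possible setting, a single weight whose unpenalized least squares estimate is b, so that the penalized objective is (b - w)^2 plus the penalty.

```scala
object Penalties {
  /** Lp-norm of the weights w(1)..w(D); the intercept w(0) is skipped. */
  def lpNorm(w: Array[Double], p: Int): Double =
    math.pow(w.drop(1).map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

  /** L1 (lasso) penalty: lambda times the sum of |w_d|. */
  def l1Penalty(w: Array[Double], lambda: Double): Double =
    lambda * w.drop(1).map(x => math.abs(x)).sum

  /** L2 (ridge) penalty: lambda times the sum of w_d squared. */
  def l2Penalty(w: Array[Double], lambda: Double): Double =
    lambda * w.drop(1).map(x => x * x).sum

  /** Minimizer of (b - w)^2 + lambda * |w|: soft-thresholding. */
  def lassoShrink(b: Double, lambda: Double): Double =
    math.signum(b) * math.max(math.abs(b) - lambda / 2.0, 0.0)

  /** Minimizer of (b - w)^2 + lambda * w^2: proportional shrinkage. */
  def ridgeShrink(b: Double, lambda: Double): Double = b / (1.0 + lambda)
}
```

In this simplified setting, the L1 penalty sets a small weight exactly to zero (which is why it acts as a feature selector), whereas the L2 penalty only scales the weight down and never removes it.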
Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term:

$\hat{w}_{Ridge} = \arg\min_{w} \{ \mathrm{RSS}(w) + \lambda \|w\|_2^2 \}$

The computation of the ridge regression parameters requires solving a system of linear equations similar to that of the linear regression. The closed form of ridge regression in matrix notation is:

$\hat{w}_{Ridge} = (X^T X + \lambda I)^{-1} X^T y$

Here, I is the identity matrix. The system is solved using the QR decomposition.

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signatures as those of its ordinary least squares counterpart. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]], val y: DblVector, val lambda: Double)
    extends AbstractMultipleLinearRegression with PipeOperator[Array[T], Double] {
  private var qr: QRDecomposition = null
  private[this] val model: Option[RegressionModel] = …
  …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. Instantiating the class trains the model.
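The closed form of ridge regression amounts to solving the linear system (XᵀX + λI) w = Xᵀy. The following minimal sketch in plain Scala illustrates it without any library. It is not the book's implementation: the object RidgeSketch and its methods are hypothetical, the system is solved by Gaussian elimination rather than the QR decomposition, and, for brevity, every weight is penalized (no separate intercept).

```scala
object RidgeSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  /** Solves a * x = b by Gaussian elimination with partial pivoting. */
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    // Augmented matrix [a | b], copied so the inputs are not modified
    val m = Array.tabulate(n)(i => a(i).clone() :+ b(i))
    for (col <- 0 until n) {
      // Swap in the row with the largest pivot for numerical stability
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (k <- col to n) m(row)(k) -= f * m(col)(k)
      }
    }
    // Back substitution on the upper-triangular system
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
      x(i) = (m(i)(n) - s) / m(i)(i)
    }
    x
  }

  /** Ridge weights from the normal equations (X^T X + lambda I) w = X^T y. */
  def ridge(x: Mat, y: Vec, lambda: Double): Vec = {
    val d = x.head.length
    val xtx = Array.tabulate(d, d)((i, j) => x.map(r => r(i) * r(j)).sum)
    for (i <- 0 until d) xtx(i)(i) += lambda // add the L2 penalty on the diagonal
    val xty = Array.tabulate(d)(j => x.indices.map(i => x(i)(j) * y(i)).sum)
    solve(xtx, xty)
  }
}
```

With a 2x2 identity matrix as X and y = (2, 4), RidgeSketch.ridge returns the least squares weights (2, 4) for λ = 0 and the shrunk weights (1, 2) for λ = 1. For ill-conditioned problems, a QR-based solver such as the one used in the article is generally preferable to the normal equations.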
The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1)
2. Compute the weights using calculateBeta, defined in the base class (line 2)
3. Return the regression weights from calculateBeta together with the residual sum of squares computed from calculateResiduals

private val model: Option[RegressionModel] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with the lambda factor on the diagonal has to be added to the matrix to be decomposed (line 4).

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x) //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i =>
    xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda)) //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by solving the system of linear equations using substitution on the QR matrices. The following method overrides the calculateBeta function of the base class:

override protected def calculateBeta: RealVector = qr.getSolver().solve(getY())

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider the first test case, a regression on the daily price variation of the Copper ETF (symbol: CU) using the stock's daily volatility and volume as features.
The extraction of observations is implemented in the same way as for the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1
val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
  .drop(1)
  .zip(_price.take(_price.size -1))
  .map( z => z._1 - z._2)) //2
val data = volatility.get
  .zip(volume.get)
  .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size-1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4
regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data (the ETF daily price) and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to be reaching a minimum around λ = 1. The case of λ = 0 corresponds to the least squares regression.

Next, let's plot the RSS value for λ varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf).
As λ increases, overfitting is penalized more heavily, and therefore, the RSS value increases.

The regression weights can be retrieved simply as follows: regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1) - price(t), is plotted as λ = 0. The predicted values for λ = 0.8 are very similar to the original data. The predicted values for λ = 0.8 follow the pattern of the original data with a reduction of the large variations (peaks and troughs). The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved, but the magnitude of the price variation is significantly reduced.

The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Packt

30 Oct 2013

16 min read

Getting Started with Pentaho Data Integration

(For more resources related to this topic, see here.) Pentaho Data Integration and Pentaho BI Suite Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are: Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server. Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project. Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort. Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user. All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services form a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite. Exploring the Pentaho Demo The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. 
It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo: The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org. Pentaho Data Integration Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article. In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect. When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said: We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community. 
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. From that moment, the tool has grown with no pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho: June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included among other changes, enhancements for large-scale environments and multilingual capabilities. February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kind of elements you design in Kettle. May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling. November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely. October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well. April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering. June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon. November 2010: PDI 4.1 is released with many bug fixes. 
August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories. April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features. November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data—the ability of reading, searching, and in general transforming large and complex collections of datasets. 2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability. Using PDI in real-world scenarios Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool, that it is common to see it used for these and for many other purposes. Here you have some examples. Loading data warehouses or datamarts The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area, or business rules. But in every case, no exception, the process involves the following steps: Extracting information from one or different databases, text files, XML files and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules. Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks as converting data types, doing some calculations, filtering irrelevant data, and summarizing. Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed. 
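The three steps above can be sketched in a few lines of plain Scala. This is only an illustration of the extract, transform, and load stages, not Kettle code: Kettle builds such flows graphically, and the Sale record, the "region,amount" line format, and the in-memory target map are all hypothetical.

```scala
object EtlSketch {
  // Hypothetical record type for the rows of the target table
  final case class Sale(region: String, amount: Double)

  /** Extract: parse "region,amount" lines, discarding rows that fail validation. */
  def extract(lines: Seq[String]): Seq[Sale] =
    lines.flatMap { line =>
      line.split(",").map(_.trim) match {
        case Array(region, amount) if region.nonEmpty =>
          // A non-numeric amount yields None, so the row is dropped
          scala.util.Try(amount.toDouble).toOption.map(Sale(region, _))
        case _ => None
      }
    }

  /** Transform: filter out irrelevant rows and summarize the amounts by region. */
  def transform(sales: Seq[Sale]): Map[String, Double] =
    sales.filter(_.amount > 0.0)
      .groupBy(_.region)
      .map { case (region, rows) => region -> rows.map(_.amount).sum }

  /** Load: either overwrite the target or add the batch to what it already holds. */
  def load(target: Map[String, Double], batch: Map[String, Double],
           overwrite: Boolean): Map[String, Double] =
    if (overwrite) batch
    else batch.foldLeft(target) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0.0) + v)
    }
}
```

The overwrite flag mirrors the two loading behaviors described above: replacing the existing information, or adding new information each time the load is executed.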
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application that are not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.

Data cleansing

It's important, and even critical, that data be correct and accurate for the efficiency of the business, to generate trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying whether the data meets certain rules, discarding or correcting the data that doesn't follow the expected pattern, setting default values for missing data, eliminating duplicated information, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming a significant share of the budget, so they decide to migrate to an open source ERP. The company will no longer have to pay for licenses, but in order to change, it will have to migrate the information. Obviously, it is not an option to start from scratch or to type the information in by hand.
Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may also be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Installing PDI

In order to work with PDI, you need to install the software. It's a simple task, so let's do it now.

Time for action – installing PDI

These are the instructions to install PDI, whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don't have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:

Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration.
Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot:
Download the file that matches your platform. The preceding screenshot should help you.
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.
If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:

cd /home/pdi_user/kettle
chmod +x *.sh

In Mac OS, you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS or Data Integration 64-bit.app/Contents/MacOS, depending on your system.

What just happened?

You have installed the tool in just a few minutes. Now, you have all you need to start working.

Launching the PDI graphical designer – Spoon

Now that you've installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's launch Spoon and see what it looks like.

Time for action – starting and customizing Spoon

In this section, you are going to launch the PDI graphical designer and get familiar with its main features.

Start Spoon. If your system is Windows, run Spoon.bat. You can just double-click on the Spoon.bat icon, or on Spoon if your Windows system doesn't show extensions for known file types. Alternatively, open a command window (by selecting Run in the Windows start menu and executing cmd) and run Spoon.bat in the terminal. On other platforms such as Unix, Linux, and so on, open a terminal window and type spoon.sh. If you didn't make spoon.sh executable, you may type sh spoon.sh. Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app or Data Integration 64-bit.app icon.
As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.
A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting.
Eventually, close the window and proceed. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu. Click on Options... from the menu Tools. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot: Select the tab window Look & Feel. Change the Grid size and Preferred Language settings as shown in the following screenshot: Click on the OK button. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead: What just happened? You ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration. In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before. The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon! Spoon Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. Setting preferences in the Options window In the earlier section, you changed some preferences in the Options window. 
There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied. In particular, please take note of the following suggestion about the configuration of the preferred language. If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language. One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them. You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen. Storing transformations and jobs in a repository The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it. As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods: Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose. Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively. It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool. 
By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories or, in other words, to work with the files method? Mainly for two reasons:

Working with files is more natural and practical for most users.
Working with a database repository requires at least minimal database knowledge, and access to a database engine from your computer. Although it would be an advantage for you to meet both preconditions, you may not meet both of them.

There is a third method called File repository, which is a mix of the two above: it's a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the more broadly used. Therefore, throughout this article we will use the files method.

Creating your first transformation

Until now, you've seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It's time to create your first transformation.

Packt

19 Feb 2015

17 min read

Visualize This!

In this article, Michael Phillips, the author of the book TIBCO Spotfire: A Comprehensive Primer, discusses how human beings are fundamentally visual in the way they process information. The invention of writing was as much about visually representing our thoughts to others as it was about record keeping and accountancy. In the modern world, we are bombarded with formalized visual representations of information, from the ubiquitous opinion poll pie chart to clever and sophisticated infographics. The website http://data-art.net/resources/history_of_vis.php provides an informative and entertaining quick history of data visualization. If you want a truly breathtaking demonstration of the power of data visualization, seek out Hans Rosling's The best stats you've ever seen at http://ted.com.

We will spend time getting to know some of Spotfire's data capabilities. It's important that you continue to think about data: how it's structured, how it's related, and where it comes from. Building good visualizations requires visual imagination, but it also requires data literacy. This article is all about getting you to think about the visualization of information and empowering you to use Spotfire to do so. Apart from learning the basic features and properties of the various Spotfire visualization types, there is much more to learn about the seamless interactivity that Spotfire allows you to build into your analyses. We will be taking a close look at only 7 of the 16 visualization types provided by Spotfire, but these 7 are the most commonly used.
We will cover the following topics: Displaying information quickly in tabular form Enriching your visualizations with color categorization Visualizing categorical information using bar charts Dividing a visualization across a trellis grid Key Spotfire concept—marking Visualizing trends using line charts Visualizing proportions using pie charts Visualizing relationships using scatter plots Visualizing hierarchical relationships using treemaps Key Spotfire concept—filters Enhancing tabular presentations using graphical tables Now let's have some fun! Displaying information quickly in tabular form While working through the data examples, we used the Spotfire Table visualization, but now we're going to take a closer look. People will nearly always want to see the "underlying data", the details behind any visualization you create. The Table visualization meets this need. It's very important not to confuse a table in the general data sense with the Spotfire Table visualization; the underlying data table remains immutable and complete in the background. The Table visualization is a highly manipulatable view of the underlying data table and should be treated as a visualization, not a data table. The data used here is BaseballPlayerData.xls There is always more than one way to do the same thing in Spotfire, and this is particularly true for the manipulation of visualizations. Let's start with some very quick manipulations: First, insert a table visualization by going to the Insert menu, selecting New Visualization, and then Table. To move a column, left-click on the column name, hold, and drag it. To sort by a column, left-click on the column name. To sort by more than one column, left-click on the first column name and then press Shift + left-click on the subsequent columns in order of sort precedence. To widen or narrow a column, hover the mouse over the right-hand edge of the column title until you see the cursor change to a two-way arrow, and then click and drag it. 
These and other properties of the Table visualization are also accessed via visualization properties. As you work through the various Spotfire visualizations, you'll notice that some types have more options than others, but there are common trends and an overall consistency in conventions. Visualization properties can be opened in a number of ways: By right-clicking on the visualization, a table in this case, and selecting Properties. By going to the Edit menu and selecting Visualization Properties. By clicking on the Visualization Properties icon, as shown in the following screenshot, in the icon tray below the main menu bar. It's beyond the scope of this book to explore every property and option. The context-sensitive help provided by Spotfire is excellent and explains all the options in glorious detail. I'd like to highlight four important properties of the Table visualization: The General property allows you to change the table visualization title, not the name of the underlying data table. It also allows you to hide the title altogether. The Data property allows you to switch the underlying data table, if you have more than one table loaded into your analysis. The Columns property allows you to hide columns and order the columns you do want to show. The Show/Hide Items property allows you to limit what is shown by a rule you define, such as top five hitters. After clicking on the Add button, you select the relevant column from a dropdown list, choose Rule type (Top), and finally, choose Value for the rule (5). The resulting visualization will only show the rows of data that meet the rule you defined. Enriching your visualizations with color categorization Color is a strong feature in Spotfire and an important visualization tool, often underestimated by report creators. 
It can be seen as merely a nice-to-have customization, but paying attention to color can be the difference between creating a stimulating and intuitive data visualization rather than an uninspiring and even confusing corporate report. Take some pride and care in the visual aesthetics of your analytics creations! Let's take a look at the color properties of the Table visualization. Open the Table visualization properties again, select Colors, and then Add the column Runs. Now, you can play with a color gradient, adding points by clicking on the Add Point button and customizing the colors. It's as easy as left-clicking on any color box and then selecting from a prebuilt palette or going into a full RGB selection dialog by choosing More Colors…. The result is a heatmap type effect for runs scored, with yellow representing low run totals, transitioning to green as the run total approaches the average value in the data, and becoming blue for the highest run totals. Visualizing categorical information using bar charts We saw how the Table visualization is perfect for showing and ordering detailed information. It's quite similar to a spreadsheet. The Bar Chart visualization is very good for visualizing categorical information, that is, where you have categories with supporting hard numbers—sales by region, for example. The region is the category, whereas the sales is the hard number or fact. Bar charts are typically used to show a distribution. Depending on your data or your analytic requirement, the bars can be ordered by value, placed side by side, stacked on top of each other, or arranged vertically or horizontally. There is a special case of the category and value combination and that is where you want to plot the frequencies of a set of numerical values. This type of bar chart is referred to as a histogram, and although it is number against number, it is still, in essence, a distribution plot. 
It is very common in fact to transform the continuous number range in such cases into a set of discrete bins or categories for the plot. For example, you could take some demographic data and plot age as the category and the number of people at that age as the value (the frequency) on a bar chart. The result, for a general population, would approach a bell-shaped curve. Let's create a bar chart using the baseball data. The data we will use is BaseballPlayerData.xls, which you can download from http://www.insidespotfire.com. Create a new page by right-clicking on any page tab and selecting New Page. You can also select New Page from the Insert menu or click on the new page icon in the icon bar below the main menu. Create a Bar Chart visualization by left-clicking on the bar chart icon or by selecting New Visualization and then Bar Chart from the Insert menu. Spotfire will automatically create a default chart, that is, rarely exactly what you want, so the next step is to configure the chart. Two distributions might be interesting to look at: the distribution of home runs across all the teams and the distribution of player salaries across all the teams. The axes are easy to change; simply use the axes selectors. If the bars are vertical, it means that the category—Team, in our case—should be on the horizontal axis, with the value—Home Runs or Salary—on the vertical axis, representing the height of the bars. We're going to pick Home Runs from the vertical axis selector and then an appropriate aggregation dropdown, which is highlighted in red in the screenshot. Sum would be a valid option, but let's go with Avg (Average). Similarly, select Team from the horizontal axis dropdown selector. The vertical, or value, axis must be an aggregation because there is more than one home run value for each category. You must decide if you want a sum, an average, a minimum, and so on. You can modify the visualization properties just as you did for the Table visualization. 
Some of the options are the same; some are specific to the bar chart. We're going to select the Sort bars by value option in the Appearance property. This will order the bars in descending order of value. We're also going to check the option Vertically under Scale labels | Show labels for the Category Axis property. There are two more actions to perform: create an identical bar chart except with average salary as the value axis, and give each bar chart an appropriate title (Visualization Properties|General|Title:). To copy an existing visualization, simply right-click on it and select Duplicate Visualization. We can now compare the distribution of home run average and salary average across all the baseball teams, but there's a better way to do this in a single visualization using color. Close the salary distribution bar chart by left-clicking on X in the upper right-hand corner of the visualization (X appears when you hover the mouse) or right-clicking on the visualization and selecting Close. Now, open the home run bar chart visualization properties, go to the Colors property, and color by Avg(Salary). Select a Gradient color mode, and add a median point by clicking on the Add Point button and selecting Median from the dropdown list of options on the added point. Finally, choose a suitable heat map range of colors; something like blue (min) through pale yellow (median) through red (max). You will still see the distribution of home runs across the baseball teams, but now you will have a superimposed salary heat map. Texas and Cleveland appear to be getting much more bang for their buck than the NY Yankees. Dividing a visualization across a trellis grid Trellising, whereby you divide a series of visualizations into individual panels, is a useful technique when you want to subdivide your analysis. In the example we've been working with, we might, for instance, want to split the visualization by league. 
Open the visualization properties for the home runs distribution bar chart colored by salary and select the Trellis property. Go to Panels and split by League (use the dropdown column selector). Spotfire allows you to build layers of information with even basic visualizations such as the bar chart. In one chart, we see the home run distribution by team, salary distribution by team, and breakdown by league. Key Spotfire concept – marking It's time to introduce one of the most important Spotfire concepts, called marking, which is central to the interactivity that makes Spotfire such a powerful analysis tool. Marking refers to the action of selecting data in a visualization. Every element you see is selectable, or markable, that is, a single row or multiple rows in a table, a single bar or multiple bars in a bar chart. You need to understand two aspects to marking. First, there is the visual effect, or color(s) you see, when you mark (select) visualization elements. Second, there is the behavior that follows marking: what happens to data and the display of data when you mark something. How to change the marking color From Spotfire v5.5 onward, you can choose, on a visualization-by-visualization basis, two distinct visual effects for marking: Use a separate color for marked items: all marked items are uniformly colored with the marking color, and all unmarked items retain their existing color. Keep existing color attributes and fade out unmarked items: all marked items keep their existing color, and all unmarked items also keep their existing color but with a high degree of color fade applied, leaving the marked items strongly highlighted. The second option is not available in versions older than v5.5 but is the default option in Versions 5.5 onward. The setting is made in the visualization's Appearance property by checking or unchecking the option Use separate color for marked items. 
The default color when using a separate color for marked items is dark green, but this can be changed by going to Edit|Document Properties|Markings|Edit. The new option has the advantage of retaining any underlying coloring you defined, but you might not like how the rest of the chart is washed out. Which approach you choose depends on what information you think is critical for your particular situation. When you create a new analysis, a default marking is created and applied to every visualization you create by default. You can change the color of the marking in Document Properties, which is found in the Edit menu. Just open Document Properties, click on the Markings tab, select the marking, click on the Edit button, and change the color. You can also create as many markings as you need, giving them convenient names for reference purposes, but we'll just focus on using one for now.

How to set the marking behavior of a visualization

Marking behavior depends fundamentally on data relationships. The data within a single data table is intrinsically related; the data in separate data tables must be explicitly related before you configure marking behavior for visualizations based on separate datasets. When you mark something in a visualization, five things can happen depending on the data involved and how you configured your visualizations:

Conditions: Two visualizations with the same underlying data table (they can be on different pages in the analysis file) and the same marking scheme applied.
Behavior: Marking data on one visualization will automatically mark the same data on the other.

Conditions: Two visualizations with related underlying data tables and the same marking scheme applied.
Behavior: The same as the previous condition's behavior, but subject to differences in data granularity. For example, marking a baseball team in one visualization will mark all the team's players in another visualization that is based on a more detailed table related by team.

Conditions: Two visualizations with the same or related data tables where one has been configured with data dependency on the marking in the other.
Behavior: Nothing will display in the marking-dependent visualization other than what is marked in the reference visualization.

Conditions: Visualizations with unrelated underlying data tables.
Behavior: No marking interaction will occur, and the visualizations will mark completely independently of one another.

Conditions: Two visualizations with the same underlying data table or related data tables and with different marking schemes applied.
Behavior: Marking data on one visualization will not show on the other because the marking schemes are different.

Here's how we set these behaviors:

Open the visualization properties of the bar chart we have been working with and navigate to the Data property. You'll notice that two settings refer to marking: Marking and Limit data using markings.

Use the dropdown under Marking to select the marking to be used for the visualization. Having no marking is an option. Visualizations with the same marking will display synchronous selection, subject to the data relation conditions described earlier.

The options under Limit data using markings determine how the visualization will be limited to marking elsewhere in the analysis. The default here is no dependency. If you select a marking, then the visualization will only display data selected elsewhere with that marking.

It's not good to have the same marking for Marking and Limit data using markings. If you are using the limit data setting, select no marking, or create a second marking and select it under Marking.

You're possibly a bit confused by now. Fortunately, marking is much harder to describe than to use! Let's build a tangible example. We'll start a new analysis, so close any analysis you have open and create a new one, loading the player-level baseball data (BaseballPlayerData.xls). Add two bar charts and a table.
You can rearrange the layout by left-clicking on the title bar of a visualization, holding, and dragging it. Position the visualizations any way you wish, but you can place the two bar charts side by side, with the table below them spanning both. Save your analysis file at this point and at regular intervals. It's good behavior to save regularly as you build an analysis. It will save you a lot of grief if your PC fails in any way. There is no autosave function in Spotfire.

For the first bar chart, set the following visualization properties:

Property: Value
General | Title: Home Runs
Data | Marking: Marking
Data | Limit data using markings: Nothing checked
Appearance | Orientation: Vertical bars
Appearance | Sort bars by value: Check
Category Axis | Columns: Team
Value Axis | Columns: Avg(Home Runs)
Colors | Columns: Avg(Salary)
Colors | Color mode: Gradient; Add Point for median; Max = strong red; Median = pale yellow; Min = strong blue
Labels | Show labels for: Marked Rows
Labels | Types of labels | Complete bar: Check

For the second bar chart, set the following visualization properties:

Property: Value
General | Title: Roster
Data | Marking: Marking
Data | Limit data using markings: Nothing checked
Appearance | Orientation: Horizontal bars
Appearance | Sort bars by value: Check
Category Axis | Columns: Team
Value Axis | Columns: Count(Player Name)
Colors | Columns: Position
Colors | Color mode: Categorical

For the table, set the following visualization properties:

Property: Value
General | Title: Details
Data | Marking: (None)
Data | Limit data using markings: Check Marking
Columns: Team, Player Name, Games Played, Home Runs, Salary, Position

Now start selecting visualization elements with your mouse. You can click on elements such as bars or segments of bars, or you can click and drag a rectangular block around multiple elements.
When you select a bar on the Home Runs bar chart, the corresponding team bar is automatically selected on the Roster bar chart, and details for all the players on that team are displayed in the Details table. When you select a bar segment on the Roster bar chart, the corresponding team bar is automatically selected on the Home Runs bar chart, and only players in the selected position for the selected team appear in the details. There are some very useful additional functions associated with marking, and you can access these by right-clicking on a marked item. They are Unmark, Invert, Delete, Filter To, and Filter Out. You can also unmark by left-clicking on any blank space in the visualization. Play with this analysis file until you are comfortable with the marking concept and functionality.

Summary

This article is a small taste of the book TIBCO Spotfire: A comprehensive primer. You've seen how the Table visualization is an easy and traditional way to display detailed information in tabular form and how the Bar Chart visualization is excellent for visualizing categorical information, such as distributions. You've learned how to enrich visualizations with color categorization and how to divide a visualization across a trellis grid. You've also been introduced to the key Spotfire concept of marking. Apart from gaining a functional understanding of these Spotfire concepts and techniques, you should have gained some insight into the science and art of data visualization.

Resources for Article: Further resources on this subject: The Spotfire Architecture Overview [article]; Interacting with Data for Dashboards [article]; Setting Up and Managing E-mails and Batch Processing [article]


Packt

30 Jul 2013

6 min read

First steps with R


Obtaining and installing R

R can be obtained from the Comprehensive R Archive Network (CRAN), which is reached through the R website (http://www.r-project.org/). CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of the code and documentation for R. The R website also provides information about R, some technical manuals, the R Journal, and details about the packages developed for R and stored in the CRAN repositories.

The functionality of the R environment can be extended with software libraries, which can be installed and then loaded when needed. These libraries, or packages, are collections of source code and other additional files that, once installed in R, can be loaded into the workspace via a call to the library() function. For example, the following code loads the lattice package:

> library(lattice)

An R installation contains one or more libraries of packages. Some of these packages are part of the basic installation and are loaded automatically as soon as the session is started. Others can be installed from CRAN, the official R repository, or downloaded and installed manually.

Interacting with the console

As soon as you start R, you will see that a workspace is open; you can see a screenshot of the R Console window in the image below. The workspace is the environment in which you are working, where you will load your data and create your variables. The screen prompt > is the R prompt that waits for commands. On the starting screen, you can type any function or command, or you can use R to perform basic calculations. R uses the usual symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^). Parentheses ( ) can be used to specify the order of operations.
R also provides %% for taking the modulus and %/% for integer division. Comments in R start with the character #, so everything after that character up to the end of the line is ignored by R. R has a number of built-in functions, for example, sin(x), cos(x), tan(x) (all in radians), exp(x), log(x), and sqrt(x). Some special constants such as pi are also predefined. The following code shows one of these functions in use:

> exp(2.5)
[1] 12.18249

Understanding R objects

In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer's memory but rather provides a number of specialized data structures called objects. These objects are referred to through symbols or variables.

Vectors

The basic object in R is the vector; even scalars are vectors of length one. Vectors can be thought of as a series of data of the same class. There are six basic vector types (called atomic vectors): logical, integer, real, complex, string (or character), and raw. Integer and real represent numeric objects; logicals are Boolean values, either TRUE or FALSE. Among the atomic vectors, the most common ones are logical, string, and numeric (integer and real). There are several ways to create vectors. For instance, the operator : (colon) is a sequence-generating operator; it creates sequences by incrementing or decrementing by one.

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 5:-6
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5 -6

If the interval between the numbers is not one, you can use the seq() function. Here is an example:

> seq(from=2, to=2.5, by=0.1)
[1] 2.0 2.1 2.2 2.3 2.4 2.5

One of the more important features of R is the possibility of passing entire vectors as arguments to functions, thus avoiding explicit loops.
Most functions in R accept a vector as an argument. For example:

> x <- c(12,10,4,6,9)
> max(x)
[1] 12
> min(x)
[1] 4
> mean(x)
[1] 8.2

Matrices and arrays

In R, the matrix notation is extended to elements of any kind, so, for example, it is possible to have a matrix of character strings. Matrices and arrays are basically vectors with a dimension attribute. The function matrix() may be used to create matrices. By default, this function fills the matrix by column; alternatively, you can tell the function to build the matrix by row:

> matrix(1:9,nrow=3,byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Lists

A list in R is a collection of different objects. One of the main advantages of lists is that the objects contained within a list may be of different types, for example, numeric and character values. To define a list, you simply provide the objects that you want to include as arguments of the function list().

Data frame

A data frame corresponds to a data set; it is basically a special list in which all the elements have the same length. Elements may be of different types in different columns, but within the same column all the elements are of the same type. You can easily create data frames using the function data.frame(), and a specific column can be recalled using the operator $.

Top features you'll want to know about

In addition to basic object creation and manipulation, many more complex tasks can be performed with R, ranging from data manipulation and programming to statistical analysis and the production of very high quality graphs. Some of the most useful features are:

Data input and output
Flow control (for, if…else, while)
Create your own functions
Debugging functions and handling exceptions
Plotting data

Summary

In this article we saw what R is, how to obtain and install it, and how to interact with the console.
We also looked at a few R objects and at the top features you will want to know about.

Resources for Article: Further resources on this subject: Organizing, Clarifying and Communicating the R Data Analyses [Article]; Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]; Graphical Capabilities of R [Article]


Packt

21 Oct 2009

6 min read

Setting up the most Popular Journal Articles in your Personalized Community in Liferay Portal


Personal community is a dynamic feature of Liferay portal. By default, the personal community is a portal-wide setting that affects all of the users. It would be nice to have more features in the personal community, such as showing the most popular journal articles. This article by Jonas Yuan will address how to set up the most popular journal articles in your personalized community and view the counter for other assets.

In a web site, we will have a lot of journal articles (that is, web content) for a given article type. For example, for the article type Article Content, we will have articles talking about a product family. We may want to know how many times the end users read each article. Meanwhile, it would be nice if we could show the most popular articles (for example, the TOP 10 articles) for this given article type.

As shown in the following screenshot, a journal article My EDI Product I is shown via a portlet Ext Web Content Display. Ratings and comments on this article are also exhibited. At the same time, the medium-size image, polls, and related content of this article are listed, too. A view counter for this article is displayed under the ratings. Moreover, the most popular articles are exhibited with article title and number of views under related content. All these articles belong to the article type article-content. That is, the article in the current portlet Ext Web Content Display lists the most popular articles only for the article type article-content.

Of course, you can customize the portlet Web Content Display directly by changing JSP files. For demo purposes, we will implement the view counter in the portlet Ext Web Content Display. Meanwhile, we will implement the most popular articles via VM services and article templates. In addition, we will analyze the view counter for other assets such as Image Gallery images, Document Library documents, Wiki articles, Blog entries, Message Boards threads, and so on.
Adding a view counter in the Web Content Display portlet

First of all, let's add a view counter in the Ext Web Content Display portlet. As the function of view counter for assets (including journal articles) is provided in the model TagsAssetModel of the com.liferay.portlet.tags.model package in the /portal/portal-service/src folder, we could use this feature in this portlet directly. To do so, use the following steps:

Create a folder journal_content in the folder /ext/ext-web/docroot/html/portlet/.

Copy the JSP file view.jsp in the folder /portal/portal-web/docroot/html/portlet/ to the folder /ext/ext-web/docroot/html/portlet/journal_content and open it.

Add the line <%@ page import="com.liferay.portlet.tags.model.TagsAsset" %> after the line <%@ include file="/html/portlet/journal_content/init.jsp" %>, and check the following lines:

JournalArticleDisplay articleDisplay = (JournalArticleDisplay) request.getAttribute(WebKeys.JOURNAL_ARTICLE_DISPLAY);
if (articleDisplay != null) {
    TagsAssetLocalServiceUtil.incrementViewCounter(JournalArticle.class.getName(), articleDisplay.getResourcePrimKey());
}

Then add the following lines after the line <c:if test="<%=enableComments %>"> and save it:

<span class="view-count">
<%
TagsAsset asset = TagsAssetLocalServiceUtil.getAsset(JournalArticle.class.getName(), articleDisplay.getResourcePrimKey());
%>
<c:choose>
    <c:when test="<%= asset.getViewCount() == 1 %>">
        <%= asset.getViewCount() %> <liferay-ui:message key="view" />,
    </c:when>
    <c:when test="<%= asset.getViewCount() > 1 %>">
        <%= asset.getViewCount() %> <liferay-ui:message key="views" />,
    </c:when>
</c:choose>
</span>

The code above shows a way to increase the view counter via the TagsAssetLocalServiceUtil.incrementViewCounter method. This method takes two parameters className and classPK as inputs. For the current journal article, the two parameters are JournalArticle.class.getName() and articleDisplay.getResourcePrimKey().
Then, this code shows a way to display the view count through the TagsAssetLocalServiceUtil.getAsset method. Similarly, this method also takes two parameters, className and classPK, as inputs. This approach would be useful for other assets, as the className parameter could refer to Image Gallery, Document Library, Wiki, Blogs, Message Boards, Bookmark, and so on.

Setting up VM service

We can set up the VM service to exhibit the most popular articles. We can also add the getMostPopularArticles method in the custom velocity tool ExtVelocityToolUtil. To do so, first add the following method in the ExtVelocityToolService interface:

public List<TagsAsset> getMostPopularArticles(String companyId, String groupId, String type, int limit);

And then add an implementation of the getMostPopularArticles method in the ExtVelocityToolServiceImpl class as follows:

public List<TagsAsset> getMostPopularArticles(String companyId, String groupId, String type, int limit) {
    List<TagsAsset> results = Collections.synchronizedList(new ArrayList<TagsAsset>());

    // Subquery: primary keys of journal articles of the wanted type
    // that share company and group with the outer tagsasset row.
    DynamicQuery dq0 = DynamicQueryFactoryUtil.forClass(JournalArticle.class, "journalarticle")
        .setProjection(ProjectionFactoryUtil.property("resourcePrimKey"))
        .add(PropertyFactoryUtil.forName("journalarticle.companyId").eqProperty("tagsasset.companyId"))
        .add(PropertyFactoryUtil.forName("journalarticle.groupId").eqProperty("tagsasset.groupId"))
        .add(PropertyFactoryUtil.forName("journalarticle.type").eq("article-content"));

    // Outer query: assets whose classPK is in the subquery, most viewed first.
    DynamicQuery query = DynamicQueryFactoryUtil.forClass(TagsAsset.class, "tagsasset")
        .add(PropertyFactoryUtil.forName("tagsasset.classPK").in(dq0))
        .addOrder(OrderFactoryUtil.desc("tagsasset.viewCount"));

    try {
        List<Object> assets = TagsAssetLocalServiceUtil.dynamicQuery(query);
        int index = 0;
        for (Object obj : assets) {
            TagsAsset asset = (TagsAsset) obj;
            results.add(asset);
            index++;
            // Stop once the requested number of articles has been collected.
            if (index == limit) break;
        }
    } catch (Exception e) {
        return results;
    }
    return results;
}

The preceding code shows a way to get the most popular articles by company ID, group ID, and article type, with a limit on the number of articles returned. The DynamicQuery API allows us to leverage the existing mapping definitions through access to the Hibernate session. For example, DynamicQuery dq0 selects the journal articles by companyId, groupId, and type; DynamicQuery query selects tagsassets by classPK, which must exist in DynamicQuery dq0; and tagsassets are ordered by viewCount as well.

Finally, add the following method to register the above method in ExtVelocityToolUtil:

public List<TagsAsset> getMostPopularArticles(String companyId, String groupId, String type, int limit) {
    return _extVelocityToolService.getMostPopularArticles(companyId, groupId, type, limit);
}

The code above shows a generic approach to get the TOP 10 articles for any article type. Of course, you can extend this approach to find the TOP 10 assets. This can include Image Gallery images, Document Library documents, Wiki articles, Blog entries, Message Boards threads, Bookmark entries, slideshows, videos, games, video queues, video lists, playlists, and so on. You may want to practice building this TOP 10 assets feature.

Building article template for the most popular journal articles

We have added a view counter on journal articles. We have already built the VM service for the most popular articles too. Now let's build an article template for them.

Setting up the default article type

As mentioned earlier, there is a set of types of journal articles, for example, announcements, blogs, general, news, press-release, updates, article-tout, article-content, and so on. In real cases, only some of these types will require a view counter, for example article-content. Let's configure the default article type for the most popular articles.
We can add the following line at the end of portal-ext.properties:

ext.most_popular_articles.article_type=article-content

The code above shows that the default article type for most_popular_articles is article-content.
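
As a rough illustration of what the VM service and the default article type setting do together, here is a minimal plain-Java sketch with no Liferay dependency. It reads the article type from a properties object, falling back to article-content, and then picks the most viewed entries of that type. MostPopularSketch and AssetView are hypothetical names invented for this example, and java.util.Properties merely stands in for the portal's own property handling; real portal code would query TagsAsset rows through TagsAssetLocalServiceUtil as shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch class; not part of Liferay.
final class MostPopularSketch {

    // Property key taken from the article's portal-ext.properties line.
    static final String TYPE_KEY = "ext.most_popular_articles.article_type";

    // Hypothetical stand-in for a TagsAsset row joined to its article type.
    static final class AssetView {
        final String title;
        final String type;
        final int viewCount;

        AssetView(String title, String type, int viewCount) {
            this.title = title;
            this.type = type;
            this.viewCount = viewCount;
        }
    }

    // Returns the configured article type, or article-content if unset.
    static String defaultArticleType(Properties props) {
        return props.getProperty(TYPE_KEY, "article-content");
    }

    // Keeps assets of the given type, orders them by view count
    // (highest first), and returns at most `limit` of them.
    static List<AssetView> mostPopular(List<AssetView> assets, String type, int limit) {
        List<AssetView> matches = new ArrayList<>();
        for (AssetView asset : assets) {
            if (asset.type.equals(type)) {
                matches.add(asset);
            }
        }
        matches.sort(Comparator.comparingInt((AssetView a) -> a.viewCount).reversed());
        int end = Math.min(Math.max(limit, 0), matches.size());
        return new ArrayList<>(matches.subList(0, end));
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // empty, so the default type applies
        List<AssetView> assets = Arrays.asList(
            new AssetView("My EDI Product I", "article-content", 12),
            new AssetView("Press note", "press-release", 40),
            new AssetView("My EDI Product II", "article-content", 31));
        for (AssetView asset : mostPopular(assets, defaultArticleType(props), 10)) {
            System.out.println(asset.title + ": " + asset.viewCount + " views");
        }
    }
}
```

Unlike the service implementation above, this sketch passes the type through as a parameter, so the same call works for any article type named in the properties file.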


Packt

29 Mar 2010

7 min read

Configuring and Formatting iReport Elements


A complete report is structured by composing a set of sections called bands. Each band has its own configurable height and a particular position in the structure, and is used for a particular purpose. The available bands are: Title, Page Header, Column Header, Detail 1, Column Footer, Page Footer, Last Page Footer, and Summary. A report structured with bands is shown in the following screenshot: Besides the mentioned bands, there are two special bands, which are Background and No Data.

Band: Description
Title: Is the first band of the report and is printed only once. Title can be shown on a new page. You can configure this from the report properties discussed in the previous section of this chapter. Just to review, go to report Properties | More... and check the Title on a new page checkbox.
Page Header: Is printed on each page of the report and is used for setting up the page header.
Column Header: Is printed on each page, if there is a detail band on that page. This band is used for the column heading.
Detail: This band is repeatedly printed for each row in the data source. In the List of Products report, it is printed for each product record.
Column Footer: Is printed on each page if there is a detail band on that page. This band is used for the column footer. If Float column footer in report Properties is checked, then the column footer will be shown just below the last data of the column; otherwise it will be shown at the bottom of the page (above the page footer).
Page Footer: Is printed on each page except the last page, if Last Page Footer is set. If Last Page Footer is not set, then it is printed on the last page also. This band is a good place to insert page numbers.
Last Page Footer: Is printed only on the last page as a page footer.
Summary: Is printed only once at the end of the report. It can be printed on a separate page if it is configured from the report Properties. In the following chapters, we will produce some reports where you will learn about the suitability of this band.
Background: Is used for setting a page background. For example, we may want a watermark image for the report pages.
No Data: When no data is available for the reports, this band is printed if it is set as the When no data option in the report Properties.

Showing/hiding bands and inserting elements

Now, we are going to configure the report bands (setting height, visibility, and so on) and format the report elements.

Select Column Footer from the Report Inspector. You will see the Column Footer - Properties on the right of the designer.

Type 25 in the Band height field. Press Enter. Now you can see the Column Footer band in your report, which was invisible before you set the band height. A band becomes invisible in the report if its height is set to zero. We have already learned how to change the height of a band. We can also make a band invisible using the Print When Expression option. If we write new Boolean(false) in Print When Expression of a band, then that will make the band invisible, even though its height is set to greater than zero. If we write new Boolean(true), then the band will be visible. It is true by default.

Drag a Static Text element from the Palette window and drop it on the Column Footer band.

Double-click on Static Text and type End of Record, replacing the text Static Text.

Select the static text element (End of Record). Go to Format | Position and then choose Center. Now the element has been positioned in the center of the Column Footer band.

In the same way, insert two Line elements. Place one element at the left and another at the right of the static text.

Select both the lines. Go to Format | Position, and then choose Center Vertically. The lines are now positioned in the center of the Column Footer vertically.

Select both the lines and go to Format | Size and then choose Same Width. Now both the lines are equal in width.
Select the static text element (End of Record) and the left line. Now go to Format | Position and choose Join Sides Right. This moves the line to the right, and it is now connected to the static text element.
Repeat the previous step for the right line, this time choosing Join Sides Left. Now the line has moved to the left and is connected to the static text element.
In the same way, change the column headers as you want by double-clicking the labels on the Column Header band. Now, the columns may be Product Code, Name, and Description.

Now your report design should look like the following screenshot:

Preview the report, and you will see the lines and static text (End of Record) at the bottom of the column.

By default, the Column Footer is placed at the bottom of the page. To show the Column Footer just below the table of data, the Float column footer option must be enabled from the report Properties window.

Sizing elements

We can increase or decrease the size of an element by dragging it with the mouse. Sometimes, we need to set the size of an element automatically based on the sizes of other elements. There are various options for setting the size of an element automatically. These options are available in the Format menu (Format | Size).

Size Options: Description
Same Width: Makes the selected elements the same width. The width of the element that you select first is used as the new width of the selected elements.
Same Width (max): The width of the largest of the selected elements is set as the width of all the selected elements.
Same Width (min): The width of the smallest of the selected elements is set as the width of all the selected elements.
Same Height: Makes the selected elements the same height. The height of the element that you select first is used as the new height of the selected elements.
Same Height (max): The height of the largest of the selected elements is set as the height of all the selected elements.
Same Height (min): The height of the smallest of the selected elements is set as the height of all the selected elements.
Same Size: Both the width and the height of the selected elements become the same.

Position: Description
Center Horizontally (band/cell based): The selected element is placed in the center of the band horizontally.
Center Vertically (band/cell based): The selected element is placed in the center of the band vertically.
Center (in band/cell): The selected element is placed in the center of the band both horizontally and vertically.
Center (in background): If the Background band is visible and the element is on the Background band, then it is placed in the center both horizontally and vertically.
Join Left: Joins two elements. For joining, one element is moved to the left.
Join Right: Joins two elements. For joining, one element is moved to the right.
Align to Left Margin: The selected element is joined with the left margin of the report.
Align to Right Margin: The selected element is joined with the right margin of the report.
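
The width-related Size options described above differ only in which width they pick as the target. The following is a minimal sketch of that rule in Python, not iReport code: the Element class and the function names are hypothetical and exist only for illustration, and the list order is assumed to be the order in which the elements were selected.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a report element (not an iReport class)."""
    name: str
    width: int
    height: int


def same_width(selected: List[Element]) -> List[Element]:
    """Same Width: the first-selected element's width is applied to all."""
    if not selected:
        return []
    target = selected[0].width
    return [replace(e, width=target) for e in selected]


def same_width_max(selected: List[Element]) -> List[Element]:
    """Same Width (max): the largest width is applied to all."""
    if not selected:
        return []
    target = max(e.width for e in selected)
    return [replace(e, width=target) for e in selected]


def same_width_min(selected: List[Element]) -> List[Element]:
    """Same Width (min): the smallest width is applied to all."""
    if not selected:
        return []
    target = min(e.width for e in selected)
    return [replace(e, width=target) for e in selected]


# Two lines selected in this order: the left line first, then the right line.
lines = [Element("left_line", 120, 1), Element("right_line", 80, 1)]
first_wins = same_width(lines)
widest_wins = same_width_max(lines)
narrowest_wins = same_width_min(lines)
```

The height options follow the same pattern with height in place of width, so the sketch leaves them out.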

Packt

17 Jul 2010

3 min read

Use of Stylesheets for Report Designing using BIRT

Stylesheets

BIRT, being a web-based reporting environment, takes a page from general web development toolkits by importing stylesheets. However, BIRT stylesheets function slightly differently from regular stylesheets in a web development environment.

We are going to add to the Customer Orders report we have been working with, and create some styles that will be used in this report.

Open Customer Order.rptDesign.
Right-click on the getCustomerInformation dataset and choose Insert into Layout.
Modify the table visually to look like the next figure.
Create a new dataset called getCustomerOrders using the following query: //insert code 1
Link the dataset parameter to rprmCustomerID.
Save the dataset, right-click on it, and select Insert into Layout.
Select the first ORDERNUMBER column.
Under the Property Editor, select Advanced.
In the Property Editor, go to the Suppress duplicates option, and change it to true. This will prevent the OrderNumber data item from repeating the value it displays down the page.
In the Outline, right-click on Styles and choose New Style….
In the Pre-Defined Style drop-down, choose table-header. A predefined style is tied to an element type that is already defined in the BIRT report. Selecting a predefined style affects every element of that type within the report. In this case, for every table in the report, the table header will have this style applied.
Under the Font section, apply the following settings:
Font: Sans-Serif
Font Color: White
Size: Large
Weight: Bold
Under the Background section, set the Background Color to Black.
Click OK.

Now, when we run the report, we can see that the header line is formatted with a black background and white font.

Custom stylesheets

In the example we just saw, we didn't have to apply this style to any element; it was automatically applied to the header of the order details table because it was a predefined style.
This would be the case for any table that had the header row populated with something, and the same is true for any of the predefined styles in BIRT. So next, let's look at a custom-defined style and apply it to our customer information table.

Right-click on the Styles section under the Outline tab and create a new style.
Under the Custom Style textbox, enter CustomerHeaderInfo.
Under the Font section, enter the following information:
Font: Sans-Serif
Color: White
Size: Large
Weight: Bold
Under the Background section, set the Background Color to Gray.
Under the Box section, enter 1 point for all sections.
Under the Border section, enter the following information:
Style (All): Solid
Color (All): White
Width (All): Thin
Click OK and then click Save.
Select the table which contains the customer information.
Select the first column.
Under the Property Editor, in the list box for the Styles, select CustomerHeaderInfo.

The preview report will look like the following screenshot:

Right-click on the Styles section, and create a new custom style called CustomerHeaderData.
Under Box, enter 1 point for all fields.
Under Border, enter the following information:
Style – Top: Solid
Style – Bottom: Solid
Color (All): Gray
Click OK.
Select the Customer Information table.
Select the second column.
Right-click on the column selector and select Style | Apply Style | CustomerHeaderData.

The finished report should look something like the next screenshot:
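
The difference between the two kinds of style can be summarized in a few lines of Python. This is only an illustrative model, not BIRT's API: the function effective_style and the dictionary layout are hypothetical, and the rule that an explicitly applied style overrides a predefined one is an assumption of the sketch rather than something stated in the steps above.

```python
from typing import Dict, Optional

StyleSheet = Dict[str, Dict[str, str]]


def effective_style(element_type: str,
                    styles: StyleSheet,
                    applied_style: Optional[str] = None) -> Dict[str, str]:
    """Return the style properties an element would pick up in this model."""
    # A predefined style is keyed by element type, so it applies automatically
    # to every element of that type without being selected.
    resolved = dict(styles.get(element_type, {}))
    # A custom style applies only when it is explicitly selected for the element.
    # Assumption: explicitly applied properties win over predefined ones.
    if applied_style is not None:
        resolved.update(styles[applied_style])
    return resolved


styles: StyleSheet = {
    # Predefined style from the walkthrough: matches every table header.
    "table-header": {"font": "Sans-Serif", "color": "White",
                     "size": "Large", "weight": "Bold", "background": "Black"},
    # Custom style from the walkthrough: must be applied by name.
    "CustomerHeaderInfo": {"font": "Sans-Serif", "color": "White",
                           "size": "Large", "weight": "Bold", "background": "Gray"},
}

header = effective_style("table-header", styles)
plain_column = effective_style("column", styles)
styled_column = effective_style("column", styles, "CustomerHeaderInfo")
```

In this model, the order details header is styled without any action, while the customer information column stays unstyled until CustomerHeaderInfo is selected for it.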

Packt

14 Jul 2011

13 min read

Integrating Kettle and the Pentaho Suite

Savia Lobo

18 Mar 2019

9 min read

The U.S. DoD wants to dominate Russia and China in Artificial Intelligence. Last week gave us a glimpse into that vision.

In a hearing on March 12, the Subcommittee on Emerging Threats and Capabilities received testimony on Artificial Intelligence Initiatives within the Department of Defense (DoD). The panel included Peter Highnam, Deputy Director of the Defense Advanced Research Projects Agency (DARPA); Michael Brown, Director of the DoD Defense Innovation Unit (DIU); and Lieutenant General John Shanahan, Director of the Joint Artificial Intelligence Center (JAIC). The panel broadly testified to senators that AI will significantly transform DoD’s capabilities and that it is critical the U.S. remain competitive with China and Russia in developing AI applications.

Dr. Peter T. Highnam on DARPA’s achievements and future goals

Dr. Peter T. Highnam, Deputy Director of DARPA, talked about DARPA’s significant role in developing AI technologies that have produced game-changing capabilities for the Department of Defense and beyond. In his testimony, he said, “DARPA’s AI Next effort is simply a continuing part of its historic investment in the exploration and advancement of AI technologies.”

Dr. Highnam described successive waves of AI technologies. The first wave, which began nearly 70 years ago, emphasized handcrafted knowledge: computer scientists constructed so-called expert systems that captured rules the system could then apply to situations of interest. However, handcrafting rules was costly and time-consuming. The second wave brought in machine learning, which applies statistical and probabilistic methods to large data sets to create generalized representations that can be applied to future samples. However, this requires training deep learning (artificial) neural networks on a variety of classification and prediction tasks, which is possible only when adequate historical data is available. Therein lies the rub: collecting, labelling, and vetting the data on which to train is itself prohibitively costly and time-consuming.
He says, “DARPA envisions a future in which machines are more than just tools that execute human programmed rules or generalize from human-curated data sets. Rather, the machines DARPA envisions will function more as colleagues than as tools.”

Towards this end, DARPA is focusing its investments on a “third wave” of AI technologies that brings forth machines that can reason in context. Incorporating these technologies in military systems that collaborate with warfighters will facilitate better decisions in complex, time-critical battlefield environments; enable a shared understanding of massive, incomplete, and contradictory information; and empower unmanned systems to perform critical missions safely and with high degrees of autonomy.

DARPA’s “AI Next” campaign, announced in September 2018 and worth more than $2 billion, includes providing robust foundations for second wave technologies, aggressively applying second wave AI technologies in appropriate systems, and exploring and creating third wave AI science and technologies. DARPA’s third wave research efforts will forge new theories and methods that make it possible for machines to adapt contextually to changing situations, advancing computers from tools to true collaborative partners. Furthermore, the agency will be fearless about exploring these new technologies and their capabilities, which is DARPA’s core function, pushing critical frontiers ahead of the nation’s adversaries.

To know more about this in detail, read Dr. Peter T. Highnam’s complete statement.

Michael Brown on the Defense Innovation Unit’s (DIU) efforts in Artificial Intelligence

Michael Brown, Director of the Defense Innovation Unit, started his talk by highlighting how heavily China and Russia are investing to become dominant in AI.
“By 2025, China will aim to achieve major breakthroughs in AI and increase its domestic market to reach $59.6 billion (RMB 400 billion). To achieve these targets, China’s National Development and Reform Commission (China’s industrial policy-making agency) funded the creation of a national AI laboratory, and Chinese local governments have pledged more than $7 billion in AI funding”, Brown said in his statement. He said that Chinese firms are in a way leveraging U.S. talent by setting up research institutes in the U.S., investing in U.S. AI-related startups and firms, recruiting U.S.-based talent, and forming commercial and academic partnerships.

Brown said that DIU will engage with DARPA and JAIC, and will also make its commercial knowledge and relationships with potential vendors available to any of the Services and Service Labs. DIU also anticipates that, through its close partnership with the JAIC, it will be at the leading edge of the Department’s National Mission Initiatives (NMIs), proving that commercial technology can be applied to critical national security challenges via accelerated prototypes that lay the groundwork for future scaling through JAIC.

“DIU looks to bring in key elements of AI development pursued by the commercial sector, which relies heavily on continuous feedback loops, vigorous experimentation using data, and iterative development, all to achieve the measurable outcome, mission impact”, Brown mentions.

DIU’s AI portfolio team combines deep commercial-sector experience in AI, machine learning, and data science with military operators. The team has prioritized projects that address three major impact areas, or use cases, that employ AI technology:

Computer vision

The DIU is prototyping computer vision algorithms in humanitarian assistance and disaster recovery scenarios.
“This use of AI holds the potential to automate post-disaster assessments and accelerate search and rescue efforts on a global scale”, Brown said in his statement.

Large dataset analytics and predictions

DIU is prototyping predictive maintenance applications for Air Force and Army platforms. DIU plans to partner with JAIC to scale this solution across multiple aircraft platforms, as well as ground vehicles, beginning with DIU’s complementary predictive maintenance project focusing on the Army’s Bradley Fighting Vehicle. Brown says this is one of DIU’s highest priority projects for FY19, given its enormous potential for impact on readiness and reducing costs.

Strategic reasoning

DIU is prototyping an application from Project VOLTRON that leverages AI to reason about high-level strategic questions, map probabilistic chains of events, and develop alternative strategies. This will make DoD-owned systems more resilient to cyber attacks and inform program offices of configuration errors faster and with fewer errors than humans.

To know more about what DIU plans in partnership with DARPA and JAIC, read Michael Brown’s complete testimony.

Lieutenant General Jack Shanahan on making JAIC “AI-Ready”

Lieutenant General Jack Shanahan, Director of the Joint Artificial Intelligence Center, touched upon how the JAIC is partnering with the Under Secretary of Defense (USD) for Research & Engineering (R&E), the role of the Military Services, the Department’s initial focus areas for AI delivery, and how JAIC is supporting whole-of-government efforts in AI.

“To derive maximum value from AI application throughout the Department, JAIC will operate across an end-to-end lifecycle of problem identification, prototyping, integration, scaling, transition, and sustainment.
Emphasizing commerciality to the maximum extent practicable, JAIC will partner with the Services and other components across the Joint Force to systematically identify, prioritize, and select new AI mission initiatives”, Shanahan mentions in his testimony.

The AI capability delivery efforts that go through this lifecycle will fall into two categories: National Mission Initiatives (NMI) and Component Mission Initiatives (CMI).

An NMI is an operational or business reform joint challenge, typically identified from the National Defense Strategy’s key operational problems and requiring multi-service innovation, coordination, and the parallel introduction of new technology and new operating concepts.

A CMI, on the other hand, is a component-level challenge that can be solved through AI. JAIC will work closely with individual components on CMIs to help identify, shape, and accelerate their Component-specific AI deployments through: funding support; usage of common foundational tools, libraries, and cloud infrastructure; application of best practices; partnerships with industry and academia; and so on. The Component will be responsible for identifying and implementing the organizational structure required to accomplish its project in coordination and partnership with the JAIC.

The following are some examples of early NMIs that JAIC is pursuing to deliver mission impact at speed, demonstrate the proof of concept for the JAIC operational model, enable rapid learning and iterative process refinement, and build its library of reusable tools while validating JAIC’s enterprise cloud architecture.

Perception

Improve the speed, completeness, and accuracy of Intelligence, Surveillance, Reconnaissance (ISR) Processing, Exploitation, and Dissemination (PED). Shanahan says Project Maven’s efforts are included here.
Predictive Maintenance (PMx)

Provide computational tools to decision-makers to help them better forecast, diagnose, and manage maintenance issues to increase availability, improve operational effectiveness, and ensure safety, at a reduced cost.

Humanitarian Assistance/Disaster Relief (HA/DR)

Reduce the time associated with search and discovery, resource allocation decisions, and executing rescue and relief operations to save lives and livelihoods during disaster operations. Here, JAIC plans to apply lessons learned and reusable tools from Project Maven to field AI capabilities in support of federal responses to events such as wildfires and hurricanes, where DoD plays a supporting role.

Cyber Sensemaking

Detect and deter advanced adversarial cyber actors who infiltrate and operate within the DoD Information Network (DoDIN) to increase DoDIN security, safeguard sensitive information, and allow warfighters and engineers to focus on strategic analysis and response.

Shanahan states, “Under the DoD CIO’s authorities and as delineated in the JAIC establishment memo, JAIC will coordinate all DoD AI-related projects above $15 million annually.”

“It does mean that we will start to ensure, for example, that they begin to leverage common tools and libraries, manage data using best practices, reflect a common governance framework, adhere to rigorous testing and evaluation methodologies, share lessons learned, and comply with architectural principles and standards that enable scale”, he further added.

To know more about this in detail, read Lieutenant General Jack Shanahan’s complete testimony.

To know more about this news in detail, watch the entire hearing on 'Artificial Intelligence Initiatives within the Department of Defense'.

So, you want to learn artificial intelligence. Here’s how you do it.
What can happen when artificial intelligence decides on your loan request
Mozilla partners with Ubisoft to Clever-Commit its code, an artificial intelligence assisted assistant

How-To Tutorials - Data

Structural Equation Modeling and Confirmatory Factor Analysis

9 recommended blockchain online courses

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts

Visualizations Using CCC

Scraping the Data

Working with Spark’s graph processing library, GraphFrames

Ridge Regression

Getting Started with Pentaho Data Integration

Visualize This!

First steps with R

Trending Topics

Setting up the most Popular Journal Articles in your Personalized Community in Liferay Portal

Configuring and Formatting iReport Elements

Use of Stylesheets for Report Designing using BIRT

Integrating Kettle and the Pentaho Suite

The U.S. DoD wants to dominate Russia and China in Artificial Intelligence. Last week gave us a glimpse into that vision.
