Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-oracle-using-metadata-service-share-xml-artifacts
Packt
23 Jan 2013
11 min read
Save for later

Oracle: Using the Metadata Service to Share XML Artifacts

Packt
23 Jan 2013
11 min read
(For more resources related to this topic, see here.) The WSDL of a web service is made up of the following XML artifacts: WSDL Definition: It defines the various operations that constitute a service, their input and output parameters, and the protocols (bindings) they support. XML Schema Definition (XSD): It is either embedded within the WSDL definition or referenced as a standalone component; this defines the XML elements and types that constitute the input and output parameters. To better facilitate the exchange of data between services, as well as achieve better interoperability and re-usability, it is good practice to de?ne a common set of XML Schemas, often referred to as the canonical data model, which can be referenced by multiple services (or WSDL De?nitions). This means, we will need to share the same XML schema across multiple composites. While typically a service (or WSDL) will only be implemented by a single composite, it will often be invoked by multiple composites; so the corresponding WSDL will be shared across multiple composites. Within JDeveloper, the default behavior, when referencing a predefined schema or WSDL, is for it to add a copy of the file to our SOA project. However, if we have several composites, each referencing their own local copy of the same WSDL or XML schema, then every time that we need to change either the schema or WSDL, we will be required to update every copy. This can be a time-consuming and error-prone approach; a better approach is to have a single copy of each WSDL and schema that is referenced by all composites. The SOA infrastructure incorporates a Metadata Service (MDS), which allows us to create a library of XML artifacts that we can share across SOA composites. MDS supports two types of repositories: File-based repository: This is quicker and easier to set up, and so is typically used as the design-time MDS by JDeveloper. Database repository: It is installed as part of the SOA infrastructure. This is used at runtime by the SOA infrastructure. As you move projects from one environment to another (for example, from test to production), you must typically modify several environment-specific values embedded within your composites, such as the location of a schema or the endpoint of a referenced web service. By placing all this information within the XML artifacts deployed to MDS, you can make your composites completely agnostic of the environment they are to be deployed to. The other advantage of placing all your referenced artifacts in MDS is that it removes any direct dependencies between composites, which means that they can be deployed and started in any order (once you have deployed the artifacts to MDS). In addition, an SOA composite leverages many other XML artifacts, such as fault policies, XSLT Transformations, EDLs for event EDN event definitions, and Schematrons for validation, each of which may need to be shared across multiple composites. These can also be shared between composites by placing them in MDS. Defining a project structure Before placing all our XML artifacts into MDS, we need to define a standard file structure for our XML library. This allows us to ensure that if any XML artifact within our XML library needs to reference another XML artifact (for example a WSDL importing a schema), it can do so via a relative reference; in other words, the XML artifact doesn't include any reference to MDS and is portable. This has a number of benefits, including: OSB compatibility; the same schemas and WSDLs can be deployed to the Oracle Service Bus without modification Third-party tool compatibility; often we will use a variety of tools that have no knowledge of MDS to create/edit XML schemas, WSDLs, and so on (for example XML Spy, Oxygen) In this article, we will assume that we have defined the following directory structure under our <src> directory. Under the xmllib folder, we have defined multiple <solution> directories, where a solution (or project) is made up of one or more related composite applications. This allows each solution to maintain its XML artifacts independently. However, it is also likely that there will be a number of XML artifacts that need to be shared between different solutions (for example, the canonical data model for the organization), which in this example would go under <core>. Where we have XML artifacts shared between multiple solutions, appropriate governance is required to manage the changes to these artifacts. For the purpose of this article, the directory structure is over simpli?ed. In reality, a more comprehensive structure should be de?ned as part of the naming and deployment standards for your SOA Reference Architecture. The other consideration here is versioning; over time it is likely that multiple versions of the same schema, WSDL and so on, will require to be deployed side by side. To support this, we typically recommend appending the version number to the filename. We would also recommend that you place this under some form of version control, as it makes it far simpler to ensure that everyone is using an up-to-date version of the XML library. For the purpose of this article, we will assume that you are using Subversion. Creating a file-based MDS repository for JDeveloper Before we can reference this with JDeveloper, we need to define a connection to the file-based MDS. Getting ready By default, a file-based repository is installed with JDeveloper and sits under the directory structure: <JDeveloper Home>/jdeveloper/integration/seed This already contains the subdirectory soa, which is reserved for, and contains, artifacts used by the SOA infrastructure For artifacts that we wish to share across our applications in JDeveloper, we should create the subdirectory apps (under the seed directory); this is critical, as when we deploy the artifacts to the SOA infrastructure, they will be placed in the apps namespace We need to ensure that the content of the apps directory always contains the latest version of our XML library; as these are stored under Subversion, we simply need to check out the right portion of the Subversion project structure. How to do it... First, we need to create and populate our file-based repository. Navigate to the seed directory, and right-click and select SVN Checkout..., this will launch the Subversion Checkout window. For URL of repository, ensure that you specify the path to the apps subdirectory. For Checkout directory, specify the full pathname of the seed directory and append /apps at the end. Leave the other default values, as shown in the following screenshot, and then click on OK: Subversion will check out a working copy of the apps subfolder within Subversion into the seed directory. Before we can reference our XML library with JDeveloper, we need to define a connection to the file-based MDS. Within JDeveloper, from the File menu select New to launch the Gallery, and under Categories select General | Connections | SOA-MDS Connection from the Items list. This will launch the MDS Connection Wizard. Enter File Based MDS for Connection Name and select a Connection Type of File Based MDS. We then need to specify the MDS root folder on our local filesystem; this will be the directory that contains the apps directory, namely: <JDeveloper Home>jdeveloperintegrationseed Click on Test Connection; the Status box should be updated to Success!. Click on OK. This will create a file-based MDS connection in JDeveloper. Browse the File Based MDS connection in JDeveloper. Within JDeveloper, open the Resource Palette and expand SOA-MDS. This should contain the File Based MDS connection that we just created. Expand all the nodes down to the xsd directory, as shown in the following screenshot: If you double-click on one of the schema files, it will open in JDeveloper (in read-only mode). There's more... Once the apps directory has been checked out, it will contain a snapshot of the MDS artifacts at the point in time that you created the checkpoint. Over time, the artifacts in MDS will be modified or new ones will be created. It is important that you ensure that your local version of MDS is updated with the current version. To do this, navigate to the seed directory, right-click on apps, and select SVN Update. Creating Mediator using a WSDL in MDS In this recipe, we will show how we can create Mediator using an interface definition from a WSDL held in MDS. This approach enables us to separate the implementation of a service (a composite) from the definition of its contract (WSDL). Getting ready Make sure you have created a file-based MDS repository for JDeveloper, as described in the first recipe. Create an SOA application with a project containing an empty composite. How to do it... Drag Mediator from SOA Component Palette onto your composite. This will launch the Create Mediator wizard; specify an appropriate name (EmployeeOnBoarding in the following example), and for the Template select Interface Definition from WSDL Click on the Find Existing WSDLs icon (circled in the previous screenshot); this will launch the SOA Resource Browser. Select Resource Palette from the drop-down list (circled in the following screenshot). Select the WSDL that you wish to import and click on OK. This will return you to the Create Mediator wizard window; ensure that the Port Type is populated and click on OK. This will create Mediator based on the specified WSDL within our composite. How it works... When we import the WSDL in this fashion, JDeveloper doesn't actually make a copy of the schema; rather within the componentType file, it sets the wsdlLocation attribute to reference the location of the WSDL in MDS (as highlighted in the following screenshot). For WSDLs in MDS, the wsdlLocation attribute uses the following format: oramds:/apps/<wsdl name> Where oramds indicates that it is located in MDS, apps indicates that it is in the application namespace and <wsdl name> is the full pathname of the WSDL in MDS. The wsdlLocation doesn't specify the physical location of the WSDL; rather it is relative to MDS, which is specific to the environment in which the composite is deployed. This means that when the composite is open in JDeveloper, it will reference the WSDL in the file-based MDS, and when deployed to the SOA infrastructure, it will reference the WSDL deployed to the MDS database repository, which is installed as part of the SOA infrastructure. There's more... This method can be used equally well to create a BPEL process based on the WSDL from within the Create BPEL Process wizard; for Template select Base on a WSDL and follow the same steps. This approach works well with Contract First Design as it enables the contract for a composite to be designed first, and when ready for implementation, be checked into Subversion. The SOA developer can then perform a Subversion update on their file-based MDS repository, and then use the WSDL to implement the composite Creating Mediator that subscribes to EDL in MDS In this recipe, we will show how we can create Mediator that subscribes to an EDN event whose EDL is defined in MDS. This approach enables us to separate the definition of an event from the implementation of a composite that either subscribes to, or publishes, the event. Getting ready Make sure you have created a file-based MDS repository for JDeveloper, as described in the initial recipe. Create an SOA application with a project containing an empty composite. How to do it... Drag Mediator from SOA Component Palette onto your composite. This will launch the Create Mediator wizard; specify an appropriate name for it (UserRegistration in the following example), and for the Template select Subscribe to Events. Click on the Subscribe to new event icon (circled in the previous screenshot); this will launch the Event Chooser window. Click on the Browse for Event Definition (edl) files icon (circled in the previous screenshot); this will launch SOA Resource Browser. Select Resource Palette from the drop-down list. Select the EDL that you wish to import and click on OK. This will return you to the Event Chooser window; ensure that the required event is selected and click on OK. This will return you to the Create Mediator window; ensure that the required event is configured as needed, and click on OK. This will create an event subscription based on the EDL specified within our composite. How it works... When we reference an EDL in MDS, JDeveloper doesn't actually make a copy of the EDL; rather within the composite.xml file, it creates an import statement to reference the location of the EDL in MDS. There's more... This approach can be used equally well to subscribe to an event within a BPEL process or publish an event using either Mediator or BPEL.
Read more
  • 0
  • 0
  • 2433

article-image-scraping-web-page
Packt
20 Jun 2017
11 min read
Save for later

Scraping a Web Page

Packt
20 Jun 2017
11 min read
In this article by Katharine Jarmul author of the book Python Web Scraping - Second Edition we can look at some example as suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own, however this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book. In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share the data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want and therefore, we need to learn about web scraping techniques. (For more resources related to this topic, see here.) Three approaches to scrape a web page Now that we understand the structure of this web page we will investigate three different approaches to scraping its data, first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module. Regular expressions If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at (https://docs.python.org/3/howto/regex.html). Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python. To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows: >>> import re >>> from advanced_link_crawler import download >>> url = 'http://example.webscraping.com/view/UnitedKingdom-239' >>> html = download(url) >>> re.findall(r'(.*?)', html) ['<'img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2}) |([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', 'IE '] This result shows that thetag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows: >>> re.findall('(.*?)', html)[1]'244,820 square kilometres' This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parentelement, which has an ID, so it ought to be unique: >>> re.findall(' Area: (.*?) ', html) ['244,820 square kilometres'] This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression. For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities: >>> re.findall('''.*?<tds*class=["']w2p_fw["']>(.*?) ''', html) ['244,820 square kilometres'] This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions such as. Beautiful Soup Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command: pip install beautifulsoup4 The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags: <ul class=country> <li>Area <li>Population </ul> If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this: >>> from bs4 import BeautifulSoup >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> # parse the HTML >>> soup = BeautifulSoup(broken_html, 'html.parser') >>> fixed_html = soup.prettify() >>> print(fixed_html) <ul class="country"> <li> Area <li> Population </li> </li> </ul> We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily there are more options for parsers. We can install LXML or we can also use html5lib. To install html5lib, simply use pip: pip install html5lib Now, we can repeat this code, changing only the parser like so: >>> soup = BeautifulSoup(broken_html, 'html5lib') >>> fixed_html = soup.prettify() >>> print(fixed_html) <html> <head> </head> <body> <ul class="country"> <li> Area </li> <li> Population </li> </ul> </body> </html>  Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods: >>> ul = soup.find('ul', attrs={'class':'country'}) >>> ul.find('li') # returns just the first match <li>Area</li> >>> ul.find_all('li') # returns all matches [<li>Area</li>, <li>Population</li>] For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. Lxml Lxml is a Python library built on top of the libxml2 XML parsing library written in C, which helps make it faster than Beautiful Soup but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so:  https://anaconda.org/anaconda/lxml. If you are unfamiliar with Anaconda, it is a package and environment manager primarily focused on open data science packages built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python. As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML: >>> from lxml.html import fromstring, tostring >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> tree = fromstring(broken_html) # parse the HTML >>> fixed_html = tostring(tree, pretty_print=True) >>> print(fixed_html) <ul class="country"> <li>Area</li> <li>Population</li> </ul> As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not requirements for standard XML and so are unnecessary for lxml to insert. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors or use in front-end web application development. We will compare performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library like so: pip install cssselect Now we can use the lxml CSS selectors to extract the area data from the example page: >>> tree = fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print(area) 244,820 square kilometres By using the cssselect method on our tree, we can utilize CSS syntax to select a table row element with the places_area__row ID, and then the child table data tag with the w2p_fw class. Since cssselect returns a list, we then index the first result and call the text_content method, which will iterate over all child elements and return concatenated text of each element. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples. Summary We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples. Resources for Article: Further resources on this subject: Web scraping with Python (Part 2) [article] Scraping the Web with Python - Quick Start [article] Scraping the Data [article]
Read more
  • 0
  • 0
  • 2431

article-image-administrating-mysql-server
Packt
09 Feb 2012
10 min read
Save for later

Administrating the MySQL Server

Packt
09 Feb 2012
10 min read
(For more resources on MySQL, see here.) Managing users and their privileges The Privileges sub-page (visible only if we are logged in as a privileged user) contains dialogs to manage MySQL user accounts. It also contains dialogs to manage privileges on global, database, and table levels. This sub-page is hierarchical. When editing a user's privileges, we can see the global privileges as well as the database-specific privileges. Then, when viewing database-specific privileges for a user, we can view and edit this user's privileges for any table within this database. The user overview The first page displayed when we enter the Privileges sub-page is called User overview. This shows all user accounts and a summary of their global privileges, as shown in the following screenshot: (Move the mouse over the image to enlarge.) From this page, we can: Edit a user's privileges, via the Edit Privileges link for this user Export a user's privileges definition, via the Export link for this user Use the checkboxes to remove users, via the Remove selected users dialog Access the page where the Add a new User dialog is available The displayed users' list has columns with the following characteristics: Column Characteristic User The user account we are defining. Host The machine name or IP address, from which this user account will be connecting to the MySQL server. A % value here indicates all hosts. Password Contains Yes if a password is defined and No if it isn't. The password itself cannot be seen from phpMyAdmin's interface or by directly looking at the mysql.user table, as it is encrypted with a one-way hashing algorithm. Global privileges A list of the user's global privileges. Grant Contains Yes if the user can grant his/her privileges to others. Action Contains a link to edit this user's privileges or export them. Exporting privileges This feature can be useful when we need to create a user with the same password and privileges on another MySQL server. Clicking on Export for user marc produces the following panel: Then it's only a matter of selecting these GRANT statements and pasting them in the SQL box of another phpMyAdmin window, where we have logged in on another MySQL server. Privileges reload At the bottom of User overview page, this message is displayed: Note: phpMyAdmin gets the users' privileges directly from MySQL's privilege tables. The content of these tables may differ from the privileges the server uses, if they have been changed manually. In this case, you should reload the privileges before you continue. Here, the text reload the privileges is clickable. The effective privileges (the ones against which the server bases its access decisions) are the privileges that are located in the server's memory. Privilege modifications that are made from the User overview page are made both in memory and on disk in the mysql database. Modifications made directly to the mysql database do not have immediate effect. The reload the privileges operation reads the privileges from the database and makes them effective in memory. Adding a user The Add a new User link opens a dialog for user account creation. First, we see the panel where we will describe the account itself, as shown in the following screenshot: The second part of the Add a new User dialog is where we will specify the user's global privileges, which apply to the server as a whole (see the Assigning global privileges section of this article), as shown in the following screenshot: Entering the username The User name menu offers two choices. We can choose Use text field: and enter a username in the box, or we can choose Any user to create an anonymous user (the blank user). More details about the anonymous user are available at http://dev.mysql.com/doc/refman/5.5/en/connection-access.html. Let us choose Use text field: and enter bill. Assigning a host value By default, this menu is set to Any host, with % as the host value. The Local choice means localhost. The Use host table choice (which creates a blank value in the host field) means to look in the mysql.host table for database-specific privileges. Choosing Use text field: allows us to enter the exact host value we want. Let us choose Local. Setting passwords Even though it's possible to create a user without a password (by selecting the No password option), it's best to have a password. We have to enter it twice (as we cannot see what is entered) to confirm the intended password. A secure password should have more than eight characters, and should contain a mixture of uppercase and lowercase characters, digits, and special characters. Therefore, it's recommended to have phpMyAdmin generate a password—this is possible in JavaScript-enabled browsers. In the Generate password dialog, clicking on Generate button enters a random password (in clear text) on the screen and fills the Password and Re-type input fields with the generated password. At this point, we should note the password so that we can pass it on to the user. Understanding rights for database creation A frequent convention is to assign a user the rights to a database having the same name as this user. To accomplish this, the Database for user section offers the Create database with same name and grant all privileges radio button. Selecting this checkbox automates the process by creating both the database (if it does not already exist) and assigning the corresponding rights. Please note that, with this method, each user would be limited to one database (user bill, database bill). Another possibility is to allow users to create databases that have the same prefix as their usernames. Therefore, the other choice Grant all privileges on wildcard name (username_%) performs this function by assigning a wildcard privilege. With this in place, user bill could create the databases bill_test, bill_2, bill_payroll, and so on; phpMyAdmin does not pre-create the databases in this case. Assigning global privileges Global privileges determine the user's access to all databases. Hence, these are sometimes known as superuser privileges. A normal user should not have any of these privileges unless there is a good reason for this. Moreover, should a user account that has global privileges become compromised, the damage could be far greater. If we are really creating a superuser, we will select every global privilege that he or she needs. These privileges are further divided into Data, Structure, and Administration groups. In our example, bill will not have any global privileges. Limiting the resources used We can limit the resources used by this user on this server (for example, the maximum queries per hour). Zero means no limit. We will not impose any resources limits on bill. The following screenshot shows the status of the screen just before hitting Create user to create this user's definition (with the remaining fields being set to default): Editing a user profile The page used to edit a user's profile appears whenever we click on Edit Privileges for a user in the User overview page. Let us try it for our newly created user bill. There are four sections on this page, each with its own Go button. Hence, each section is operated independently and has a distinct purpose. Editing global privileges The section for editing the user's privileges has the same look as the Add a new User dialog, and is used to view and to change global privileges. Assigning database-specific privileges In this section, we define the databases to which our user has access, and his or her exact privileges on these databases. As shown in the previous screenshot, we see None because we haven't defined any privileges yet. There are two ways of defining database privileges. First, we can choose one of the existing databases from the drop-down menu as shown in the following screenshot: This assigns privileges only for the chosen database. Secondly, we can also choose Use text field: and enter a database name. We could enter a non-existent database name, so that the user can create it later (provided we give him/her the CREATE privilege in the next panel). We can also use special characters, such as the underscore and the percent sign, for wildcards. For example, entering bill here would enable him to create a bill database, and entering bill% would enable him to create a database with any name that starts with bill. For our example, we will enter bill and click on Go. The next screen is used to set bill's privileges on the bill database, and create table-specific privileges. To learn more about the meaning of a specific privilege, we can hover the mouse over a privilege name (which is always in English), and an explanation about this privilege appears in the current language. We give SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, INDEX, and DROP privileges to bill on this database. We then click on Go. After the privileges have been assigned, the interface stays at the same place, so that we can refine these privileges further. We cannot assign table-specific privileges for the moment, as the database does not yet exist. To go back to the general privileges page of bill, click on the 'bill'@'localhost' title. This brings us back to the following, familiar page except for a change in one section: We see the existing privileges (we could click on Edit Privileges link to edit or on Revoke link to revoke them) on the bill database for user bill, and we can add privileges for bill on another database. We can also see that bill has no table-specific privilege on the bill database. Changing the password The Change password dialog is part of the Edit user page, and we can use it either to change bill's password or to remove it. Removing the password will enable bill to log in without a password. The dialog offers a choice of password hashing options, and it's recommended to keep the default of MySQL 4.1+ hashing. For more details about hashing, please visit http://dev.mysql.com/doc/refman/5.1/en/password-hashing.html. Changing login information or copying a user This dialog can be used to change the user's login information, or to copy his or her login information to a new user. For example, suppose that Bill calls and tells us that he prefers the login name billy instead of bill. We just have to add a y to the username, and then select delete the old one from the user tables radio button, as shown in the following screenshot: After clicking on Go, bill no longer exists in the mysql database. Also, all of his privileges, including the privileges on the bill database, will have been transferred to the new user—billy. However, the user definition of bill will still exist in memory, and hence it's still effective. If we had chosen the delete the old one from the user tables and reload the privileges afterwards option instead, the user definition of bill would immediately have ceased to be valid. Alternatively, we could have created another user based on bill, by making use of the keep the old one choice. We can transfer the password to the new user by choosing Do not change the password option, or change it by entering a new password twice. The revoke all active privileges… option immediately terminates the effective current privileges for this user, even if he or she is currently logged in. Removing a user Removing a user is done from the User overview section of the Privileges page. We select the user to be removed. Then (in Remove selected users) we can select the Drop the databases that have the same names as the users option to remove any databases that are named after the users we are deleting. A click on Go effectively removes the selected users.
Read more
  • 0
  • 0
  • 2430

article-image-oracle-11g-streams-rules-part-1
Packt
05 Feb 2010
7 min read
Save for later

Oracle 11g Streams: RULES (Part 1)

Packt
05 Feb 2010
7 min read
Streams is all about the rules; literally. The action context that a Streams process takes is governed by the rule conditions. When you create a rule, Oracle generates system conditions, and evaluation contexts, that are used to evaluate each LCR to determine if the action context for the process should be accomplished. We have already addressed a number of these system conditions during our TAG discussion; for instance INCLUDE_TAGGED_LCR=FALSE generates a system evaluation for theLCR$_ROW_RECORD_TYPE :dml.is_null_tag='Y' subprogram. For more information on LCR Types, reference Oracle Database PL/SQL Packages and Types Reference manual. You can control what system evaluations are included in the rule by the parameter values you specify, as well as add user-defined evaluations with the AND_CONDITION parameter. There is a lot going on under the calm surface water of rules. Understanding how this activity flows together will help you become more advanced in creating rules to manipulate your Streams throughout your current environment. So, let's grab our snorkels and masks, and stick our heads under the surface and take a look. Rule components Rules have three components: conditions, evaluation context, and action context. These components coordinate with the "when", "what", and "how" of the LCR being processed. The conditions tell the Streams process "when" the LCR should be processed, the evaluation context defines "what" data/information the Streams process uses to process the LCR, and the action context tells the Streams process "how" to handle the LCR. Rule conditions The rule condition is essentially the "where clause". The conditions are evaluated against the properties of the LCR and return either TRUE or FALSE The conditions can contain compound expressions and operators (AND, OR, NOT, and so on).The final evaluation returned from the condition (TRUE or FALSE) is the final result of all the compound expressions. An example of a system-generated condition would be that of our good friend :dml.is_null_tag = 'Y' (generated by the INCLUDE_TAGGED_LCR=FALSE parameter of the DBMS_STREAMS_ADM.ADD_*_RULE procedures). On rule creation, the condition is passed in as a string (so make sure to escape any single quotes within the string). ':dml.get_object_owner() = ''OE'' and :dml.get_tag() =HEXTORAW(''22'')' It is important to remember that you want to keep your rule conditions as simple as possible. Complex rule conditions can have a significant impact on performance. The rule condition created by our Sub-Setting example is an example of a complex rule as it includes a PL/SQL call to a function. Also, rule conditions that contain NOT, or != can also impact performance. Rule Evaluation Context The rule evaluation context defines data external to the LCR properties that can be referenced in the rule conditions. This is comparable to the SQL statement from clause. This reference is a database object that contains the external data. The evaluation context provides the rule conditions with the necessary information for interpreting and evaluating the conditions that reference external data. If the evaluation context references objects, the rule owner must have the proper privileges to reference the object (select and execute) as the rule condition is evaluated in the schema of the evaluation context owner. Information contained in an Evaluation Context might include table aliases used in the condition, variable names and types, and/or a function to use to evaluate the rules to which the evaluation context is assigned. Evaluation Context structure can get a bit confusing. To get a better feel of it, you may want to start by looking at the following database views: DBA/ALL/USER_EVALUATION_CONTEXT_TABLES: table alias used DBA/ALL/USER_EVALUATION_CONTEXT_VARS: variable types used DBA/ALL/USER_EVALUATION_CONTEXTS: functions used Streams system created rules (created using DBMS_STREAMS_ADM) will create rules using the standard Oracle-supplied SYS.STREAMS$_EVALUATION_CONTEXT rule evaluation context. This evaluation context is composed of a variable_types> list for the :dml and :ddl variables, and the evaluation function SYS.DBMS_STREAMS_INTERNAL.EVALUATION_CONTEXT_FUNCTION as seen in the previous DBA views. You can create your own evaluation context using the DBMS_RULE_ADM.CREATE_EVALUATION_CONTEXT procedure: DBMS_RULE_ADM.CREATE_EVALUATION_CONTEXT(evaluation_context_name IN VARCHAR2,table_aliases IN SYS.RE$TABLE_ALIAS_LIST DEFAULT NULL,variable_types IN SYS.RE$VARIABLE_TYPE_LIST DEFAULT NULL,evaluation_function IN VARCHAR2 DEFAULT NULL,evaluation_context_comment IN VARCHAR2 DEFAULT NULL); If you create a custom Evaluation Context that uses the SYS.DBMS_STREAMS_INTERNAL.EVALUATION_CONTEXT_FUNCTION, it must include the same variables and types as in the SYS.STREAMS$_EVALUATION_CONTEXT (a.k.a. :dml and :ddl). Variable_types> can be defined using SYS.RE$VARIABLE_TYPE_LIST, which in turn accepts individual variable types defined using SYS.RE$VARIABLE_TYPE. Similarly, if you create a custom function to use as the evaluation function, it must have the following signature: FUNCTION evaluation_function_name(rule_set_name IN VARCHAR2,evaluation_context IN VARCHAR2,event_context IN SYS.RE$NV_LIST DEFAULT NULL,table_values IN SYS.RE$TABLE_VALUE_LIST DEFAULT NULL,column_values IN SYS.RE$COLUMN_VALUE_LIST DEFAULT NULL,variable_values IN SYS.RE$VARIABLE_VALUE_LIST DEFAULT NULL,attribute_values IN SYS.RE$ATTRIBUTE_VALUE_LIST DEFAULT NULL,stop_on_first_hit IN BOOLEAN DEFAULT FALSE,simple_rules_only IN BOOLEAN DEFAULT FALSE,true_rules OUT SYS.RE$RULE_HIT_LIST,maybe_rules OUT SYS.RE$RULE_HIT_LIST);RETURN BINARY_INTEGER; Where the returned BINARY_INTEGER value must be one of the following: DBMS_RULE_ADM.EVALUATION_SUCCESSDBMS_RULE_ADM.EVALUATION_CONTINUEDBMS_RULE_ADM.EVALUATION_FAILURE For more information on creating custom Evaluation Contexts and evaluation functions and Rule Types, refer to the Oracle Database PL/SQL Packages and Types Reference manual, and The Oracle Streams Extended Examples manual. Once an Evaluation Context is created it can be assigned to a rule or a rule set using the evaluation_context parameter of the appropriate DBMS_RULE_ADM procedure. The Evaluation Context for a Rule can be different than the Evaluation Context for a Rule Set to which the Rule might be assigned. The bottom line is that a Rule must be able to associate itself with an Evaluation Context at some level. We will revisit this concept as we discuss Rule Creation a little later on this section. Action Context The rule action context is just that, the action information that the rule evaluation engine returns to the client application, to be acted upon by the client application, when the rule evaluates to true. This is not the action itself, but values to be used by the action code that are specific to the rule. The action context is of the SYS.RE$NV_LIST type, which contains an array of name-value pairs and is associated to a rule condition. A rule condition can only have one action context. The action context itself is optional and can contain zero to many name-value pairs. The SYS.RE$NV_LIST has the following construct: TYPE SYS.RE$NV_LIST AS OBJECT(actx_list SYS.RE$NV_ARRAY); Subprograms are: ADD_PAIR (name IN VARCHAR2,value IN ANYDATA);GET_ALL_NAMES ()RETURN SYS.RE$NAME_ARRAY;GET_VALUE (name IN VARCHAR2)RETURN ANYDATA;REMOVE_PAIR (name IN VARCHAR2); For more information on creating and populating Action Contexts types, refer to the Oracle Database PL/SQL Packages and Types Reference manual. For more information on Rule components refer to the Oracle Streams Concepts and Administration manual.
Read more
  • 0
  • 0
  • 2418

article-image-inbuilt-data-types-python
Packt
22 Jun 2017
4 min read
Save for later

Inbuilt Data Types in Python

Packt
22 Jun 2017
4 min read
This article by Benjamin Baka, author of the book Python Data Structures and Algorithm, explains the inbuilt data types in Python. Python data types can be divided into 3 categories, numeric, sequence and mapping. There is also the None object that represents a Null, or absence of a value. It should not be forgotten either that other objects such as classes, files and exceptions can also properly be considered types, however they will not be considered here. (For more resources related to this topic, see here.) Every value in Python has a data type. Unlike many programming languages, in Python you do not need to explicitly declare the type of a variable. Python keeps track of object types internally. Python inbuilt data types are outlined in the following table: Category Name Description None None The null object Numeric int Integer   float Floating point number   complex Complex number   bool Boolean (True, False) Sequences str String of characters   list List of arbitrary objects   Tuple Group of arbitrary items   range Creates a range of integers. Mapping dict Dictionary of key – value pairs   set Mutable, unordered collection of unique items   frozenset Immutable set None type The None type is immutable and has one value, None. It is used to represent the absence of a value. It is returned by objects that do not explicitly return a value and evaluates to False in Boolean expressions. It is often used as the default value in optional arguments to allow the function to detect if the caller has passed a value. Numeric Types All numeric types, apart from bool, are signed and they are all immutable. Booleans have two possible values, True and False. These values are mapped to 1 and 0 respectively. The integer type, int, represents whole numbers of unlimited range. Floating point numbers are represented by the native double precision floating point representation of the machine. Complex numbers are represented by two floating point numbers. they are assigned using the j operator to signify the imaginary part of the complex number. For example : a = 2+3j We can access the real and imaginary parts by a.real and a.imag respectively. Representation error It should be noted that the native double precision representation of floating point numbers leads to some unexpected results. For example, consider the following: In[14]: 1-0.9 Out[14]: 0.09999999999998 In [15]: 1-0.9 == 0.1 Out[15]: False This is a result of the fact that most decimal fractions are not exactly representable as a binary fraction, which is how most underlying hardware represents floating point numbers. For algorithms or applications where this may be an issue Python provides a decimalmodule. This module allows for the exact representation of decimal numbers and facilitates greater control properties such as rounding behaviour, number of significant digits and precision. It defines two objects, a Decimal type, representing decimal numbers and a Context type, representing various computational parameters such as precision, rounding and error handling.  An example of its usage can be seen in the following: In [1]: import decimal In[2]: x = decimal.Decimal(3.14); y=decimal.Decimal(2.74) In[3]: x*y Out[3]: Decimal (‘8.60360000000001010036498883’) In[4]: decimal.getcontext().prec = 4 In[5]: x * y Out[5]: Decimal(‘8.604’) Here we have created a global context and set the precision to 4. The Decimal object can be treated pretty much as you would treat an int or a float. They are subject to all the same mathematical operations and can be used as dictionary keys, placed in sets and so on. In addition, Decimal objects also have several methods for mathematical operations such as natural exponents x.exp(), natural logarithms, x.ln() and base 10 logarithms, x.log10().  Python also has a fractions module that implements a rational number type. The following shows several ways to create fractions: In [62]: import fractions In [63]: fractions Fraction(3,4) #creates the fraction ¾ Out[63]: Fraction(3,4) In [64]: fraction Fraction(0,5) #creates a fraction from a float Out[64]: Fraction(1,2) In [65]: fraction Fraction(“.25”) #creates a fraction from a string Out[65]: Fraction(1,4) It is also worth mentioning here the NumPy extension. This has types for mathematical objects such as arrays, vectors and matrixes and capabilities for linear algebra, calculation of Fourier transforms, eigenvectors, logical operations and much more. Summary We have looked at the built in data types and some internal Python modules, most notable the collections module. There are a number of external libraries such as the SciPy stack, and, likewise.  Resources for Article: Further resources on this subject: Python Data Structures [article] Getting Started with Python Packages [article] An Introduction to Python Lists and Dictionaries [article]
Read more
  • 0
  • 0
  • 2412

article-image-getting-started-deep-learning
Packt
07 Mar 2016
12 min read
Save for later

Getting Started with Deep Learning

Packt
07 Mar 2016
12 min read
In this article by Joshua F. Wiley, author of the book, R Deep Learning Essentials, we will discuss deep learning, a powerful multilayered architecture for pattern recognition, signal detection, classification, and prediction. Although deep learning is not new, it has gained popularity in the past decade due to the advances in the computational capacity and new ways of efficient training models, as well as the availability of ever growing amount of data. In this article, you will learn what deep learning is. What is deep learning? To understand what deep learning is, perhaps it is easiest to start with what is meant by regular machine learning. In general terms, machine learning is devoted to developing and using algorithms that learn from raw data in order to make predictions. Prediction is a very general term. For example, predictions from machine learning may include predicting how much money a customer will spend at a given company, or whether a particular credit card purchase is fraudulent. Predictions also encompass more general pattern recognition, such as what letters are present in a given image, or whether a picture is of a horse, dog, person, face, building, and so on. Deep learning is a branch of machine learning where a multi-layered (deep) architecture is used to map the relations between inputs or observed features and the outcome. This deep architecture makes deep learning particularly suitable for handling a large number of variables and allows deep learning to generate features as part of the overall learning algorithm, rather than feature creation being a separate step. Deep learning has proven particularly effective in the fields of image recognition (including handwriting as well as photo or object classification) and natural language processing, such as recognizing speech. There are many types of machine learning algorithms. In this article, we are primarily going to focus on neural networks as these have been particularly popular in deep learning. However, this focus does not mean that it is the only technique available in machine learning or even deep learning, nor that other techniques are not valuable or even better suited, depending on the specific task. Conceptual overview of neural networks As their name suggests, neural networks draw their inspiration from neural processes and neurons in the body. Neural networks contain a series of neurons, or nodes, which are interconnected and process input. The connections between neurons are weighted, with these weights based on the function being used and learned from the data. Activation in one set of neurons and the weights (adaptively learned from the data) may then feed into other neurons, and the activation of some final neuron(s) is the prediction. To make this process more concrete, an example from human visual perception may be helpful. The term grandmother cell is used to refer to the concept that somewhere in the brain there is a cell or neuron that responds specifically to a complex and specific object, such as your grandmother. Such specificity would require thousands of cells to represent every unique entity or object we encounter. Instead, it is thought that visual perception occurs by building up more basic pieces into complex representations. For example, the following is a picture of a square: Figure 1 Rather than our visual system having cells neurons that are activated only upon seeing the gestalt, or entirety, of a square, we can have cells that recognize horizontal and vertical lines, as shown in the following: Figure 2 In this hypothetical case, there may be two neurons, one which is activated when it senses horizontal lines and another that is activated when it senses vertical lines. Finally, a higher-order process recognizes that it is seeing a square when both the lower order neurons are activated simultaneously. Neural networks share some of these same concepts, with inputs being processed by a first layer of neurons that may go on to trigger another layer. Neural networks are sometimes shown as graphical models. In Figure 3, Inputs are data represented as squares. These may be pixels in an image or different aspects of sounds, or something else. The next layer of Hidden neurons is neurons that recognize basic features, such as horizontal lines, vertical lines, or curved lines. Finally, the output may be a neuron that is activated by the simultaneous activation of two of the hidden neurons. In this article, observed data or features are depicted as squares, and unobserved or hidden layers as circles: Figure 3 Neural networks are used to refer to a broad class of models and algorithms. Hidden neurons are generated based on some combination of the observed data, similar to a basis expansion in other statistical techniques; however, rather than choosing the form of the expansion, the weights used to create the hidden neurons are learned from the data. Neural networks can involve a variety of activation function(s), which are transformations of the weighted raw data inputs to create the hidden neurons. A common choice for activation functions is the sigmoid function:  and the hyperbolic tangent function . Finally, radial basis functions are sometimes used as they are efficient function approximators. Although there are a variety of these, the Gaussian form is common: . In a shallow neural network such as is shown in Figure 3, with only a single hidden layer, from the hidden units to the outputs is essentially a standard regression or classification problem. The hidden units can be denoted by, h, the outputs by, Y. Different outputs can be denoted by subscripts i = 1, …, k and may represent different possible classifications, such as (in our case) a circle or square. The paths from each hidden unit to each output are the weights and for the ith output are denoted by wi. These weights are also learned from the data, just like the weights used to create the hidden layer. For classification, it is common to use a final transformation, the softmax function, which is   as this ensures that the estimates are positive (using the exponential function) and that the probability of being in any given class sums to one. For linear regression, the identity function, which returns its input, is commonly used. Confusion may arise as to why there are paths between every hidden unit and output as well as every input and hidden unit. These are commonly drawn to represent that a priori any of these relations are allowed to exist. The weights must then be learned from the data, with zero or near zero weights essentially equating to dropping unnecessary relations. This only scratches the surface of the conceptual and practical aspects of neural networks. For a slightly more in-depth introduction to neural networks, see Chapter 11 of The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009) also freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/. Next, we will turn to a brief introduction to deep neural networks. Deep neural networks Perhaps the simplest, if not the most informative, definition of a deep neural network (DNN) is that it is a neural network with multiple hidden layers. Although a relatively simple conceptual extension of neural networks, such deep architecture provides valuable advances in terms of the capability of the models and new challenges in training them. Using multiple hidden layers allows a more sophisticated build-up from simple elements to more complex ones. When discussing neural networks, we considered the outputs to be whether the object was a circle or a square. In a deep neural network, many circles and squares could be combined to form other more advanced shapes. One can consider two complexity aspects of a model's architecture. One is how wide or narrow it is—that is, how many neurons in a given layer. The second is how deep it is, or how many layers of neurons there are. For data that truly has such deep architectures, a DNN can fit it more accurately with fewer parameters than a neural network (NN), because more layers (each with fewer neurons) can be a more efficient and accurate representation; for example, because the shallow NN cannot build more advanced shapes from basic pieces, in order to provide equal accuracy to the DNN it must represent each unique object. Again considering pattern recognition in images, if we are trying to train a model for text recognition the raw data may be pixels from an image. The first layer of neurons could be trained to capture different letters of the alphabet, and then another layer could recognize sets of these letters as words. The advantage is that the second layer does not have to directly learn from the pixels, which are noisy and complex. In contrast, a shallow architecture may require far more parameters, as each hidden neuron would have to be capable of going directly from pixels in an image to a complete word, and many words may overlap, creating redundancy in the model. One of the challenges in training deep neural networks is how to efficiently learn the weights. The models are often complex and local minima abound making the optimization problem a challenging one. One of the major advancements came in 2006, when it was shown that Deep Belief Networks (DBNs) could be trained one layer at a time (Refer A Fast Learning Algorithm for Deep Belief Nets, by Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, (2006) at http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf). A DBN is a type of DNN where multiple hidden layers and connections between (but not within) layers (that is, a neuron in layer 1 may be connected to a neuron in layer 2, but may not be connected to another neuron in layer 1). This is the essentially the same definition of a Restricted Boltzmann Machine (RBM)—an example is diagrammed in Figure 4, except that a RBM typically has one input layer and one hidden layer: Figure 4 The restriction of no connections within a layer is valuable as it allows for much faster training algorithms to be used, such as the contrastive divergence algorithm. If several RBMs are stacked together, they can form a DBN. Essentially, the DBN can then be trained as a series of RBMs. The first RBM layer is trained and used to transform raw data into hidden neurons, which are then treated as a new set of inputs in a second RBM, and the process is repeated until all layers have been trained. The benefits of the realization that DBNs could be trained one layer at a time extend beyond just DBNs, however. DBNs are sometimes used as a pre-training stage for a deep neural network. This allows the comparatively fast, greedy layer-by-layer training to be used to provide good initial estimates, which are then refined in the deep neural network using other, slower, training algorithms such as back propagation. So far we have been primarily focused on feed-forward neural networks, where the results from one layer and neuron feed forward to the next. Before closing this section, two specific kinds of deep neural networks that have grown in popularity are worth mentioning. The first is a Recurrent Neural Network (RNN) where neurons send feedback signals to each other. These feedback loops allow RNNs to work well with sequences. A recent example of an application of RNNs was to automatically generate click-bait such as One trick to great hair salons don't want you to know or Top 10 reasons to visit Los Angeles: #6 will shock you!. RNNs work well for such jobs as they can be seeded from a large initial pool of a few words (even just trending search terms or names) and then predict/generate what the next word should be. This process can be repeated a few times until a short phrase is generated, the click-bait. This example is drawn from a blog post by Lars Eidnes, available at http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/. The second type is a Convolutional Neural Network (CNN). CNNs are most commonly used in image recognition. CNNs work by having each neuron respond to overlapping subregions of an image. The benefits of CNNs are that they require comparatively minimal pre-processing yet still do not require too many parameters through weight sharing (for example, across subregions of an image). This is particularly valuable for images as they are often not consistent. For example, imagine ten different people taking a picture of the same desk. Some may be closer or farther away or at positions resulting in essentially the same image having different heights, widths, and the amount of image captured around the focal object. As for neural networks, this description only provides the briefest of overviews as to what DNNs are and some of the use cases to which they can be applied. Summary This article presented a brief introduction to NNs and DNNs. Using multiple hidden layers, DNNs have been a revolution in machine learning by providing a powerful unsupervised learning and feature-extraction component that can be standalone or integrated as part of a supervised model. There are many applications of such models and they are increasingly used by large-scale companies such as Google, Microsoft, and Facebook. Examples of tasks for deep learning are image recognition (for example, automatically tagging faces or identifying keywords for an image), voice recognition, and text translation (for example, to go from English to Spanish, or vice versa). Work is being done on text recognition, such as sentiment analysis to try to identify whether a sentence or paragraph is generally positive or negative, which is particularly useful to evaluate perceptions about a product or service. Imagine being able to scrape reviews and social media for any mention of your product and analyze whether it was being discussed more favorably than the previous month or year or not! Resources for Article:   Further resources on this subject: Dealing with a Mess [article] Design with Spring AOP [article] Probability of R? [article]
Read more
  • 0
  • 0
  • 2402
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-documents-and-collections-data-modeling-mongodb
Packt
22 Jun 2015
12 min read
Save for later

Documents and Collections in Data Modeling with MongoDB

Packt
22 Jun 2015
12 min read
In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB. (For more resources related to this topic, see here.) Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing updates to it with any impact during the application's lifecycle. A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to previously know how we will use the data in order to model a NoSQL database. It is a good idea not to plan the data format to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable. Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security turned database designers distant of the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests among them and database administrators, especially regarding data models. The NoSQL databases practically bring the need for an approximation between database professionals and the applications, and also the need for an approximation between developers and databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words common from the application developer's point of view, and add them to your vocabulary. This article will cover the following: Introducing your documents and collections The document's characteristics and structure Introducing documents and collections MongoDB has the document as a basic unity of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model and a document is a record in this table. And finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is: {    "_id": 123456,    "firstName": "John",    "lastName": "Clay",    "age": 25,    "address": {      "streetAddress": "131 GEN. Almério de Moura Street",      "city": "Rio de Janeiro",      "state": "RJ",      "postalCode": "20921060"    },    "phoneNumber":[      {          "type": "home",          "number": "+5521 2222-3333"      },      {          "type": "mobile",          "number": "+5521 9888-7777"      }    ] } Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection: {    "_id": "123456",    "username": "johnclay",    "age": 25,    "friends":[      {"username": "joelsant"},      {"username": "adilsonbat"}    ],    "active": true,    "gender": "male" } We can also have: {    "_id": "654321",    "username": "santymonty",    "age": 25,    "active": true,    "gender": "male",    "eyeColor": "brown" } In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to: Define what data can be read, written, and/or updated in queries Define which fields will be updated Create indexes Configure replication Query the information from the database Before we go deep into the technical details of documents, let's explore their structure. JSON JSON is a text format for the open-standard representation of data and that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404 The JSON Data Interchange Standard where the JSON format is fully described. JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations. As the name suggests, JSON arises from the JavaScript language. It came about as a solution for object state transfers between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all the most popular programming languages such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification are based on two data structures: A set or group of key/value pairs A value ordered list So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern: {    "key" : "value" } In relation to the value ordered list, a collection is represented as follows: ["value1", "value2", "value3"] In the JSON specification, a value can be: A string delimited with " " A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E Boolean values (true or false) A null value Another object Another value ordered array The following diagram shows us the JSON value structure: Here is an example of JSON code that describes a person: {    "name" : "Han",    "lastname" : "Solo",    "position" : "Captain of the Millenium Falcon",    "species" : "human",    "gender":"male",    "height" : 1.8 } BSON BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/. If we compare BSON to the other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it's lightweight—a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence. The types of data representation in BSON are: String UTF-8 (string) Integer 32-bit (int32) Integer 64-bit (int64) Floating point (double) Document (document) Array (document) Binary data (binary) Boolean false (x00 or byte 0000 0000) Boolean true (x01 or byte 0000 0001) UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp Null value () Regular expression (cstring) JavaScript code (string) JavaScript code w/scope (code_w_s) Min key()—the special type that compares a lower value than all other possible BSON element values Max key()—the special type that compares a higher value than all other possible BSON element values ObjectId (byte*12) Characteristics of documents Before we go into detail about how we must model documents, we need a better understanding of some of its characteristics. These characteristics can determine your decision about how the document must be modeled. The document size We must keep in mind that the maximum length for a BSON document is 16 MB. According to BSON specifications, this length is ideal for data transfers through the Web and to avoid the excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing it into parts, or chunks. Each chunk is a new document with 255 K of size. Names and values for a field in a document There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are: The _id field is reserved for a primary key You cannot start the name using the character $ The name cannot have a null character, or (.) Additionally, documents that have indexed fields must respect the size limit for an indexed field. The values cannot exceed the maximum size of 1,024 bytes. The document primary key As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document, even when, during an insertion, it is not the first field to be inserted. In these cases, MongoDB moves it to the first position. Also, by definition, it is in this field that a unique index will be created. The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an indication of the _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document as long as it is unique. There is another option, that is, generating an auto-incremental value based on a support collection or on an optimistic loop. Support collections In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value. There is a collection called system.js that can keep the JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method: db.counters.insert(    {      _id: "userid",      seq: 0    } )   function getNextSequence(name) {    var ret = db.counters.findAndModify(          {            query: { _id: name },            update: { $inc: { seq: 1 } },            new: true          }    );    return ret.seq; }   db.users.insert(    {      _id: getNextSequence("userid"),      name: "Sarah C."    } ) The optimistic loop The generation of the _id field by an optimistic loop is done by incrementing each iteration and, after that, attempting to insert it in a new document: function insertDocument(doc, targetCollection) {    while (1) {        var cursor = targetCollection.find( {},         { _id: 1 } ).sort( { _id: -1 } ).limit(1);        var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;        doc._id = seq;        var results = targetCollection.insert(doc);        if( results.hasWriteError() ) {            if( results.writeError.code == 11000 /* dup key */ )                continue;            else                print( "unexpected error inserting data: " +                 tojson( results ) );        }        break;    } } In this function, the iteration does the following: Searches in targetCollection for the maximum value for _id. Settles the next value for _id. Sets the value on the document to be inserted. Inserts the document. In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends. The points demonstrated here are the basics to understanding all the possibilities and approaches that this tool can offer. But, although we can use auto-incrementing fields for MongoDB, we must avoid using them because this tool does not scale for a huge data mass. Summary In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] About MongoDB [article] Creating a RESTful API [article]
Read more
  • 0
  • 0
  • 2397

article-image-highlights-greenplum
Packt
30 Oct 2013
10 min read
Save for later

Highlights of Greenplum

Packt
30 Oct 2013
10 min read
(For more resources related to this topic, see here.) Big Data analytics – platform requirements Organizations are striving towards becoming more data driven and leverage data to gain the competitive advantage. It is inevitable that any current business intelligence infrastructure needs to be upgraded to include Big Data technologies and analytics needs to be embedded into every core business process. The following diagram depicts a matrix that connects requirements from low storage/cost to high storage/cost information management systems and analytics applications. The following section lists all the capabilities that an integrated platform for Big Data analytics should have: A data integration platform that can integrate data from any source, of any type, and highly voluminous in nature. This includes efficient data extraction, data cleansing, transformation, and loading capabilities. A data storage platform that can hold structured, unstructured, and semistructured data with a capability to slice and dice data to any degree, discarding the format. In short, while we store data, we should be able to use the best suited platform for a given data format (for example: structured data to use relational store, semi-structured data to use NoSQL store, and unstructured data to use a file store) and still be able to join data across platforms to run analytics. Support for running standard analytics functions and standard analytical tools on data that has characteristics described previously. Modular and elastically scalable hardware that wouldn't force changes to architecture/design with growing needs to handle bigger data and more complex processing requirements. A centralized management and monitoring system. Highly available and fault tolerant platform that can repair itself in times of any hardware failure seamlessly. Support for advanced visualizations to communicate insights in an effective way. A collaboration platform that can help end users perform the functions of loading, exploring, and visualizing data, and other workflow aspects as an end-to-end process. Core components The following figure depicts core software components of Greenplum UAP: In this section, we will take a brief look at what each component is and take a deep dive into their functions in the sections to follow. Greenplum Database Greenplum Database is a shared nothing, massively parallel processing solution built to support next generation data warehousing and Big Data analytics processing. It stores and analyzes voluminous structured data. It comes in a software-only version that works on commodity servers (this being its unique selling point) and additionally also is available as an appliance (DCA) that can take advantage of large clusters of powerful servers, storage, and switches. GPDB (Greenplum Database) comes with a parallel query optimizer that uses a cost-based algorithm to evaluate and select optimal query plans. Its high-speed interconnection supports continuous pipelining for data processing. In its new distribution under Pivotal, Greenplum Database is called Pivotal (Greenplum) Database. Shared nothing, massive parallel processing (MPP) systems, and elastic scalability Until now, our applications have been benchmarked for certain performance and the core hardware and its architecture determined its readiness for further scalability that came at a cost, be it in terms of changes to the design or hardware augmentation. With growing data volumes, scalability and total cost of ownership is becoming a big challenge and the need for elastic scalability has become prime. This section compares shared disk, shared memory, and shared nothing data architectures and introduces the concept of massive parallel processing. Greenplum Database and HD components implement shared nothing data architecture with master/worker paradigm demonstrating massive parallel processing capabilities. Shared disk data architecture Have a look at the following figure which gives an idea about shared disk data architecture: Shared disk data architecture refers to an architecture where there is a data disk that holds all the data and each node in the cluster accesses this data for processing. Any data operations can be performed by any node at a given point in time and in case two nodes attempt persisting/writing a tuple at the same time, to ensure consistency, a disk-based lock or intended lock communication is passed on thus affecting the performance. Further with increase in the number of nodes, contention at the database level increases. These architectures are write limited as there is a need to handle the locks across the nodes in the cluster. Even in case of the reads, partitioning should be implemented effectively to avoid complete table scans. Shared memory data architecture Have a look at the following figure which gives an idea about shared memory data architecture: In memory, data grids come under the shared memory data architecture category. In this architecture paradigm, data is held in memory that is accessible to all the nodes within the cluster. The major advantage with this architecture is that there would be no disk I/O involved and data access is very quick. This advantage comes with an additional need for loading and synchronizing data in memory with the underlying data store. The memory layer seen in the following figure can be distributed and local to the compute nodes or can exist as data node. Shared nothing data architecture Though an old paradigm, shared nothing data architecture is gaining traction in the context of Big Data. Here the data is distributed across the nodes in the cluster and every processor operates on the data local to itself. The location where data resides is referred to as data node and where the processing logic resides is called compute node. It can happen that both nodes, compute and data, are physically one. These nodes within the cluster are connected using high-speed interconnects. The following figure depicts two aspects of the architecture, the one on the left represents data and computes decoupled processes and the other to the right represents data and computes processes co-located: One of the most important aspects of shared nothing data architecture is the fact that there will not be any contention or locks that would need to be addressed. Data is distributed across the nodes within the cluster using a distribution plan that is defined as a part of the schema definition. Additionally, for higher query efficiency, partitioning can be done at the node level. Any requirement for a distributed lock would bring in complexity and an efficient distribution and partitioning strategy would be a key success factor. Reads are usually the most efficient relative to shared disk databases. Again, the efficiency is determined by the distribution policy, if a query needs to join data across the nodes in the cluster, users would see a temporary redistribution step that would bring required data elements together into another node before the query result is returned. Shared nothing data architecture thus supports massive parallel processing capabilities. Some of the features of shared nothing data architecture are as follows: It can scale extremely well on general purpose systems It provides automatic parallelization in loading and querying any database It has optimized I/O and can scan and process nodes in parallel It supports linear scalability, also referred to as elastic scalability, by adding a new node to the cluster, additional storage, and processing capability, both in terms of load performance and query performance is gained The Greenplum high-availability architecture In addition to primary Greenplum system components, we can also optionally deploy redundant components for high availability and avoiding single point of failure. The following components need to be implemented for data redundancy: Mirror segment instances: A mirror segment always resides on a different host than its primary segment. Mirroring provides you with a replica of the database contained in a segment. This may be useful in the event of disk/hardware failure. The metadata regarding the replica is stored on the master server in system catalog tables. Standby master host: For a fully redundant Greenplum Database system, a mirror of the Greenplum master can be deployed. A backup Greenplum master host serves as a warm standby in cases when the primary master host becomes unavailable. The standby master host is synchronized periodically and kept up-to-date using transaction replication log process that runs on the standby master to keep the master host and standby in sync. In the event of master host failure the standby master is activated and constructed using the transaction logs. Dual interconnect switches: A highly available interconnect can be achieved by deploying redundant network interfaces on all Greenplum hosts and a dual Gigabit Ethernet. The default configuration is to have one network interface per primary segment instance on a segment host (both the interconnects are by default 10Gig in DCA). External tables External tables in Greenplum refer to those database tables that help Greenplum Database access data from a source that is outside of the database. We can have different external tables for different formats. Greenplum supports fast, parallel, as well as nonparallel data loading and unloading. The external tables act as an interfacing point to external data source and give an impression of a local data source to the accessing function. File-based data sources are supported by external tables. The following file formats can be loaded onto external tables: Regular file-based source (supports Text, CSV, and XML data formats): file:// or gpfdist:// protocol Web-based file source (supports Text, CSV, OS commands, and scripts): http:// protocol Hadoop-based file source (supports Text and custom/user-defined formats): gphdfs:// protocol Following is the syntax for the creation and deletion of readable and writable external tables: To create a read-only external table: CREATE EXTERNAL (WEB) TABLE LOCATION (<<file paths>>) | EXECUTE '<<query>>' FORMAT '<<Format name for example: 'TEXT'>>' (DELIMITER, '<<name the delimiter>>'); To create a writable external table: CREATE WRITABLE EXTERNAL (WEB) TABLE LOCATION (<<file paths>>) | EXECUTE '<<query>>' FORMAT '<<Format name for example: 'TEXT'>>' (DELIMITER, '<<name the delimiter>>'); To drop an external table: DROP EXTERNAL (WEB) TABLE; Following are the examples on using file:// and gphdfs:// protocol: CREATE EXTERNAL TABLE test_load_file ( id int, name text, date date, description text ) LOCATION ( 'file://filehost:6781/data/folder1/*', 'file://filehost:6781/data/folder2/*' 'file://filehost:6781/data/folder3/*.csv' ) FORMAT 'CSV' (HEADER); In the preceding example, data is loaded from three different file server locations; also, as you can see, the wild card notation for each of the locations can be different. Now, in case where the files are located on HDFS, the following notation needs to be used (in the following example, the file is '|' delimited): CREATE EXTERNAL TABLE test_load_file ( id int, name text, date date, description text ) LOCATION ( 'gphdfs://hdfshost:8081/data/filename.txt' ) FORMAT 'TEXT' (DELIMITER '|'); Summary In this article, we have learned about Greenplum UAP and also Greenplum Database. This article also gives information about the core components of Greenplum UAP. Resources for Article: Further resources on this subject: Making Big Data Work for Hadoop and Solr [Article] Big Data Analysis [Article] Core Data iOS: Designing a Data Model and Building Data Objects [Article]
Read more
  • 0
  • 0
  • 2388

article-image-getting-started-oracle-information-integration
Packt
02 Aug 2011
14 min read
Save for later

Getting Started with Oracle Information Integration

Packt
02 Aug 2011
14 min read
  Oracle Information Integration, Migration, and Consolidation: RAW The definitive book and eBook guide to Oracle Information Integration and Migration in a heterogeneous world         Read more about this book       (For more resources on Oracle, see here.) Why consider information integration? The useful life of pre-relational mainframe database management system engines is coming to an end because of a diminishing application and skills base, and increasing costs.—Gartner Group During the last 30 years, many companies have deployed mission critical applications running various aspects of their business on the legacy systems. Most of these environments have been built around a proprietary database management system running on the mainframe. According to Gartner Group, the installed base of mainframe, Sybase, and some open source databases has been shrinking. There is vendor sponsored market research that shows mainframe database management systems are growing, which, according to Gartner, is due primarily to increased prices from the vendors, currency conversions, and mainframe CPU replacements. Over the last few years, many companies have been migrating mission critical applications off the mainframe onto open standard Relational Database Management Systems (RDBMS) such as Oracle for the following reasons: Reducing skill base: Students and new entrants to the job market are being trained on RDBMS like Oracle and not on the legacy database management systems. Legacy personnel are retiring, and those that are not are moving into expensive consulting positions to arbitrage the demand. Lack of flexibility to meet business requirements: The world of business is constantly changing and new business requirements like compliance and outsourcing require application changes. Changing the behavior, structure, access, interface or size of old databases is very hard and often not possible, limiting the ability of the IT department to meet the needs of the business. Most applications on the aging platforms are 10 to 30 years old and are long past their original usable lifetime. Lack of Independent Software Vendor (ISV)applications: With most ISVs focusing on the larger market, it is very difficult to find applications, infrastructure, and tools for legacy platforms. This requires every application to be custom coded on the closed environment by scarce in-house experts or by expensive outside consultants. Total Cost of Ownership (TCO): As the user base for proprietary systems decreases, hardware, spare parts, and vendor support costs have been increasing. Adding to this are the high costs of changing legacy applications, paid either as consulting fees for a replacement for diminishing numbers of mainframe trained experts or increased salaries for existing personnel. All leading to a very high TCO which doesn't even take into account the opportunity cost to the business of having inflexible systems. Business challenges in data integration and migration Once the decision has been taken to migrate away from a legacy environment, the primary business challenge is business continuity. Since many of these applications are mission critical, running various aspects of the business, the migration strategy has to ensure continuity to the new application—and in the event of failure, rollback to the mainframe application. This approach requires data in the existing application to be synchronized with data on the new application. Making the challenge of data migration more complicated is the fact that legacy applications tend to be interdependent, but the need from a risk mitigation standpoint is to move applications one at a time. A follow-on challenge is prioritizing the order in which applications are to be moved off the mainframe, and ensuring that the order meets both the business needs and minimizes the risk in the migration process. Once a specific application is being migrated, the next challenge is to decide which business processes will be migrated to the new application. Many companies have business processes that are present, because that's the way their systems work. When migrating an application off the mainframe, many business processes do not need to migrate. Even among the business processes that need to be migrated, some of these business processes will need to be moved as-is and some of them will have to be changed. Many companies utilize the opportunity afforded by a migration to redo the business processes they have had to live with for many years. Data is the foundation of the modernization process. You can move the application, business logic, and work flow, but without a clean migration of the data the business requirements will not be met. A clean data migration involves: Data that is organized in a usable format by all modern tools Data that is optimized for an Oracle database Data that is easy to maintain Technical challenges of information integration The technical challenges with any information integration all stem from the fact that the application accesses heterogeneous data (VSAM, IMS, IDMS, ADABAS, DB2, MSSQL, and so on) that can even be in a non-relational hierarchical format. Some of the technical problems include: The flexible file definition feature used in COBOL applications in the existing system will have data files with multi-record formats and multi-record types in the same dataset—neither of which exist in RDBMS. Looping data structure and substructure or relative offset record organization such as a linked list, which are difficult to map into a relational table. Data and referential integrity is managed by the Oracle database engine. However, legacy applications already have this integrity built in. One question is whether to use Oracle to handle this integrity and remove the logic from the application. Finally, creating an Oracle schema to maximize performance, which includes mapping non-oracle keys to Oracle primary and secondary keys; especially when legacy data is organized in order of key value which can affect the performance on an Oracle RDBMS. There are also differences in how some engines process transactions, rollbacks, and record locking. General approaches to information integration and migration There are several technical approaches to consider when doing any kind of integration or migration activity. In this section, we will look at a methodology or approach for both data integration and data migration. Data integration Clearly, given this range of requirements, there are a variety of different integration strategies, including the following: Consolidated: A consolidated data integration solution moves all data into a single database and manages it in a central location. There are some considerations that need to be known regarding the differences between non- Oracle and Oracle mechanics. Transaction processing is an example. Some engines use implicit commits and some manage character sets differently than Oracle does, this has an impact on sort order. Federated: A federated data integration solution leaves data in the individual data source where it is normally maintained and updated, and simply consolidates it on the fly as needed. In this case, multiple data sources will appear to be integrated into a single virtual database, masking the number and different kinds of databases behind the consolidated view. These solutions can work bidirectionally. Shared: A shared data integration solution actually moves data and events from one or more source databases to a consolidated resource, or queue, created to serve one or more new applications. Data can be maintained and exchanged using technologies such as replication, message queuing, transportable table spaces, and FTP. Oracle has extensive support for consolidated data integration and while there are many obvious benefits to the consolidated solution, it is not practical for any organization that must deal with legacy systems or integrate with data it does not own. Therefore, we will not discuss this type any further, but instead concentrate on federated and shared solutions. Data migration Over 80 percent of migration projects fail or overrun their original budgets/ timelines, according to a study by the Standish Group. In most cases, this is because of a lack of understanding of some of the unique challenges of a migration project. The top five challenges of a migration project are: Little migration expertise to draw from: Migration is not an industry-recognized area of expertise with an established body of knowledge and practices, nor have most companies built up any internal competency to draw from. Insufficient understanding of data and source systems: The required data is spread across multiple source systems, not in the right format, of poor quality, only accessible through vaguely understood interfaces, and sometimes missing altogether. Continuously evolving target system: The target system is often under development at the time of data migration, and the requirements often change during the project. Complex target data validations: Many target systems have restrictions, constraints, and thresholds on the validity, integrity, and quality of the data to be loaded. Repeated synchronization after the initial migration: Migration is not a one-time effort. Old systems are usually kept alive after new systems launch and synchronization is required between the old and new systems during this handoff period. Also, long after the migration is completed, companies often have to prove the migration was complete and accurate to various government, judicial, and regulatory bodies. Most migration projects fail because of an inappropriate migration methodology, because the migration problem is thought of as a four stage process: Analyze the source data Extract/transform the data into the target formats Validate and cleanse the data Load the data into the target However, because of the migration challenges discussed previously, this four stage project methodology often fails miserably. The challenge begins during the initial analysis of the source data when most of the assumptions about the data are proved wrong. Since there is never enough time planned for analysis, any mapping specification from the mainframe to Oracle is effectively an intelligent guess. Based on the initial mapping specification, extractions, and transformations developed run into changing target data requirements, requiring additional analysis and changes to the mapping specification. Validating the data according to various integrity and quality constraints will typically pose a challenge. If the validation fails, the project goes back to further analysis and then further rounds of extractions and transformations. When the data is finally ready to be loaded into Oracle, unexpected data scenarios will often break the loading process and send the project back for more analysis, more extractions and transformations, and more validations. Approaching migration as a four stage process means continually going back to earlier stages due to the five challenges of data migration. The biggest problem with migration project methodology is that it does not support the iterative nature of migrations. Further complicating the issue is that the technology used for data migration often consists of general-purpose tools repurposed for each of the four project stages. These tools are usually non-integrated and only serve to make difficult processes more difficult on top of a poor methodology. The ideal model for successfully managing a data migration project is not based on multiple independent tools. Thus, a cohesive method enables you to cycle or spiral your way through the migration process—analyzing the data, extracting and transforming the data, validating the data, and loading it into targets, and repeating the same process until the migration is successfully completed. This approach enables target-driven analysis, validating assumptions, refining designs, and applying best practices as the project progresses. This agile methodology uses the same four stages of analyze, extract/transform, validate and load. However, the four stages are not only iterated, but also interconnected with one another. An iterative approach is best achieved through a unified toolset, or platform, that leverages automation and provides functionality which spans all four stages. In an iterative process, there is a big difference between using a different tool for each stage and one unified toolset across all four stages. In one unified toolset, the results of one stage can be easily carried into the next, enabling faster, more frequent and ultimately less iteration which is the key to success in a migration project. A single platform not only unifies the development team across the project phases, but also unifies the separate teams that may be handling each different source system in a multi-source migration project. Architectures: federated versus shared Federated data integration can be very complicated. This is especially the case for distributed environments where several heterogeneous remote databases are to be synchronized using two-phase commit. Solutions that provide federated data integration access and maintain the data in the place wherever it resides (such as in a mainframe data store associated with legacy applications). Data access is done 'transparently' for example, the user (or application) interacts with a single virtual or federated relational database under the control of the primary RDBMS, such as Oracle. This data integration software is working with the primary RDBMS 'under the covers' to transform and translate schemas, data dictionaries, and dialects of SQL; ensure transactional consistency across remote foreign databases (using two-phase commit); and make the collection of disparate, heterogeneous, distributed data sources appear as one unified database. The integration software carrying out these complex tasks needs to be tightly integrated with the primary RDBMS in order to benefit from built-in functions and effective query optimization. The RDBMS must also provide all the other important RDBMS functions, including effective query optimization. Data sharing integration Data sharing-based integration involves the sharing of data, transactions, and events among various applications in an organization. It can be accomplished within seconds or overnight, depending on the requirement. It may be done in incremental steps, over time, as individual one-off implementations are required. If one-off tools are used to implement data sharing, eventually the variety of data-sharing approaches employed begin to conflict, and the IT department becomes overwhelmed with an unmanageable maintenance, which increases the total cost of ownership. What is needed is a comprehensive, unified approach that relies on a standard set of services to capture, stage, and consume the information being shared. Such an environment needs to include a rules-based engine, support for popular development languages, and comply with open standards. GUI-based tools should be available for ease of development and the inherent capabilities should be modular enough to satisfy a wide variety of possible implementation scenarios. The data-sharing form of data integration can be applied to achieve near real-time data sharing. While it does not guarantee the level of synchronization inherent with a federated data integration approach (for example, if updates are performed using two-phase commit), it also doesn't incur the corresponding performance overhead. Availability is improved because there are multiple copies of the data. Considerations when choosing an integration approach There is a range in the complexity of data integration projects from relatively straightforward (for example, integrating data from two merging companies that used the same Oracle applications) to extremely complex projects such as long-range geographical data replication and multiple database platforms. For each project, the following factors can be assessed to estimate the complexity level. Pretend you are a systems integrator such as EDS trying to size a data integration effort as you prepare a project proposal. Potential for conflicts: Is the data source updated by more than one application? If so, the potential exists for each application to simultaneously update the same data. Latency: What is the required synchronization level for the data integration process? Can it be an overnight batch operation like a typical data warehouse? Must it be synchronous, and with two-phase commit? Or, can it be quasi-real-time, where a two or three second lag is tolerable, permitting an asynchronous solution? Transaction volumes and data growth trajectory: What are the expected average and peak transaction rates and data processing throughput that will be required? Access patterns: How frequently is the data accessed and from where? Data source size: Some data sources of such volume that back up, and unavailability becomes extremely important. Application and data source variety: Are we trying to integrate two ostensibly similar databases following the merger of two companies that both use the same application, or did they each have different applications? Are there multiple data sources that are all relational databases? Or are we integrating data from legacy system files with relational databases and realtime external data feeds? Data quality: The probability that data quality adds to overall project complexity increases as the variety of data sources increases. One point of this discussion is that the requirements of data integration projects will vary widely. Therefore, the platform used to address these issues must be a rich superset of the features and functions that will be applied to any one project.
Read more
  • 0
  • 0
  • 2387

article-image-more-line-charts-area-charts-and-scatter-plots
Packt
26 Aug 2014
13 min read
Save for later

More Line Charts, Area Charts, and Scatter Plots

Packt
26 Aug 2014
13 min read
In this article by Scott Gottreu, the author of Learning jqPlot, we'll learn how to import data from remote sources. We will discuss what area charts, stacked area charts, and scatter plots are. Then we will learn how to implement these newly learned charts. We will also learn about trend lines. (For more resources related to this topic, see here.) Working with remote data sources We return from lunch and decide to start on our line chart showing social media conversions. With this chart, we want to pull the data in from other sources. You start to look for some internal data sources, coming across one that returns the data as an object. We can see an excerpt of data returned by the data source. We will need to parse the object and create data arrays for jqPlot: { "twitter":[ ["2012-11-01",289],...["2012-11-30",225] ], "facebook":[ ["2012-11-01",27],...["2012-11-30",48] ] } We solve this issue using a data renderer to pull our data and then format it properly for jqPlot. We can pass a function as a variable to jqPlot and when it is time to render the data, it will call this new function. We start by creating the function to receive our data and then format it. We name it remoteDataSource. jqPlot will pass the following three parameters to our function: url: This is the URL of our data source. plot: The jqPlot object we create is passed by reference, which means we could modify the object from within remoteDataSource. However, it is best to treat it as a read-only object. options: We can pass any type of option in the dataRendererOptions option when we create our jqPlot object. For now, we will not be passing in any options: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var remoteDataSource = function(url, plot, options) { Next we create a new array to hold our formatted data. Then, we use the $.ajax method in jQuery to pull in our data. We set the async option to false. If we don't, the function will continue to run before getting the data and we'll have an empty chart: var data = new Array; $.ajax({ async: false, We set the url option to the url variable that jqPlot passed in. We also set the data type to json: url: url, dataType:"json", success: function(remoteData) { Then we will take the twitter object in our JSON and make that the first element of our data array and make facebook the second element. We then return the whole array back to jqPlot to finish rendering our chart: data.push(remoteData.twitter); data.push(remoteData.facebook); } }); return data; }; With our previous charts, after the id attribute, we would have passed in a data array. This time, instead of passing in a data array, we pass in a URL. Then, within the options, we declare the dataRenderer option and set remoteDataSource as the value. Now when our chart is created, it will call our renderer and pass in all the three parameters we discussed earlier: var socialPlot = $.jqplot ('socialMedia', "./data/social_shares.json", { title:'Social Media Shares', dataRenderer: remoteDataSource, We create labels for both our data series and enable the legend: series:[ { label: 'Twitter' }, { label: 'Facebook' } ], legend: { show: true, placement: 'outsideGrid' }, We enable DateAxisRenderer for the x axis and set min to 0 on the y axis, so jqPlot will not extend the axis below zero: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Days in November' }, yaxis: { min:0, label: 'Number of Shares' } } }); }); </script> <div id="socialMedia" style="width:600px;"></div> If you are running the code samples from your filesystem in Chrome, you will get an error message similar to this: No 'Access-Control-Allow-Origin' header is present on the requested resource. The security settings do not allow AJAX requests to be run against files on the filesystem. It is better to use a local web server such as MAMP, WAMP, or XAMPP. This way, we avoid the access control issues. Further information about cross-site HTTP requests can be found at the Mozilla Developer Network at https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS. We load this new chart in our browser and can see the result. We are likely to run into cross-domain issues when trying to access remote sources that do not allow cross-domain requests. The common practice to overcome this hurdle would be to use the JSONP data type in our AJAX call. jQuery will only run JSONP calls asynchronously. This keeps your web page from hanging if a remote source stops responding. However, because jqPlot requires all the data from the remote source before continuing, we can't use cross-domain sources with our data renderers. We start to think of ways we can use external APIs to pull in data from all kinds of sources. We make a note to contact the server guys to write some scripts to pull from the external APIs we want and pass along the data to our charts. By doing it in this way, we won't have to implement OAuth (OAuth is a standard framework used for authentication), http://oauth.net/2, in our web app or worry about which sources allow cross-domain access. Adding to the project's scope As we continue thinking up new ways to work with this data, Calvin stops by. "Hey guys, I've shown your work to a few of the regional vice-presidents and they love it." Your reply is that all of this is simply an experiment and was not designed for public consumption. Calvin holds up his hands as if to hold our concerns at bay. "Don't worry, they know it's all in beta. They did have a couple of ideas. Can you insert in the expenses with the revenue and profit reports? They also want to see those same charts but formatted differently." He continues, "One VP mentioned that maybe we could have one of those charts where everything under the line is filled in. Oh, and they would like to see these by Wednesday ahead of the meeting." With that, Calvin turns around and makes his customary abrupt exit. Adding a fill between two lines We talk through Calvin's comments. Adding in expenses won't be too much of an issue. We could simply add the expense line to one of our existing reports but that will likely not be what they want. Visually, the gap on our chart between profit and revenue should be the total amount of expenses. You mention that we could fill in the gap between the two lines. We decide to give this a try: We leave the plugins and the data arrays alone. We pass an empty array into our data array as a placeholder for our expenses. Next, we update our title. After this, we add a new series object and label it Expenses: ... var rev_profit = $.jqplot ('revPrfChart', [revenue, profit, [] ], { title:'Monthly Revenue & Profit with Highlighted Expenses', series:[ { label: 'Revenue' }, { label: 'Profit' }, { label: 'Expenses' } ], legend: { show: true, placement: 'outsideGrid' }, To fill in the gap between the two lines, we use the fillBetween option. The only two required options are series1 and series2. These require the positions of the two data series in the data array. So in our chart, series1 would be 0 and series2 would be 1. The other three optional settings are: baseSeries, color, and fill. The baseSeries option tells jqPlot to place the fill on a layer beneath the given series. It will default to 0. If you pick a series above zero, then the fill will hide any series below the fill layer: fillBetween: { series1: 0, series2: 1, We want to assign a different value to color because it will default to the color of the first data series option. The color option will accept either a hexadecimal value or the rgba option, which allows us to change the opacity of the fill. Even though the fill option defaults to true, we explicitly set it. This option also gives us the ability to turn off the fill after the chart is rendered: color: "rgba(232, 44, 12, 0.5)", fill: true }, The settings for the rest of the chart remain unchanged: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Months' }, yaxis:{ label: 'Totals Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="revPrfChart" style="width:600px;"></div> We switch back to our web browser and load the new page. We see the result of our efforts in the following screenshot. This chart layout works but we think Calvin and the others will want something else. We decide we need to make an area chart. Understanding area and stacked area charts Area charts come in two varieties. The default type of area chart is simply a modification of a line chart. Everything from the data point on the y axis all the way to zero is shaded. In the event your numbers are negative, then the data above the line up to zero is shaded in. Each data series you have is laid upon the others. Area charts are best to use when we want to compare similar elements, for example, sales by each division in our company or revenue among product categories. The other variation of an area chart is the stacked area chart. The chart starts off being built in the same way as a normal area chart. The first line is plotted and shaded below the line to zero. The difference occurs with the remaining lines. We simply stack them. To understand what happens, consider this analogy. Each shaded line represents a wall built to the height given in the data series. Instead of building one wall behind another, we stack them on top of each other. What can be hard to understand is the y axis. It now denotes a cumulative total, not the individual data points. For example, if the first y value of a line is 4 and the first y value on the second line is 5, then the second point will be plotted at 9 on our y axis. Consider this more complicated example: if the y value in our first line is 2, 7 for our second line, and 4 for the third line, then the y value for our third line will be plotted at 13. That's why we need to compare similar elements. Creating an area chart We grab the quarterly report with the divisional profits we created this morning. We will extend the data to a year and plot the divisional profits as an area chart: We remove the data arrays for revenue and the overall profit array. We also add data to the three arrays containing the divisional profits: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var electronics = [["2011-11-20", 123487.87], ...]; var media = [["2011-11-20", 66449.15], ...]; var nerd_corral = [["2011-11-20", 2112.55], ...]; var div_profit = $.jqplot ('division_profit', [ media, nerd_corral, electronics ], { title:'12 Month Divisional Profits', Under seriesDefaults, we assign true to fill and fillToZero. Without setting fillToZero to true, the fill would continue to the bottom of the chart. With the option set, the fill will extend downward to zero on the y axis for positive values and stop. For negative data points, the fill will extend upward to zero: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Media & Software' }, { label: 'Nerd Corral' }, { label: 'Electronics' } ], legend: { show: true, placement: 'outsideGrid' }, For our x axis, we set numberTicks to 6. The rest of our options we leave unchanged: axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 6, tickOptions: { formatString: "%B" } }, yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="division_profit" style="width:600px;"></div> We review the results of our changes in our browser. We notice something is wrong: only the Electronics series, shown in brown, is showing. This goes back to how area charts are built. Revisiting our wall analogy, we have built a taller wall in front of our other two walls. We need to order our data series from largest to smallest: We move the Electronics series to be the first one in our data array: var div_profit = $.jqplot ('division_profit', [ electronics, media, nerd_corral ], It's also hard to see where some of the lines go when they move underneath another layer. Thankfully, jqPlot has a fillAlpha option. We pass in a percentage in the form of a decimal and jqPlot will change the opacity of our fill area: ... seriesDefaults: { fill: true, fillToZero: true, fillAlpha: .6 }, ... We reload our chart in our web browser and can see the updated changes. Creating a stacked area chart with revenue Calvin stops by while we're taking a break. "Hey guys, I had a VP call and they want to see revenue broken down by division. Can we do that?" We tell him we can. "Great" he says, before turning away and leaving. We discuss this new request and realize this would be a great chance to use a stacked area chart. We dig around and find the divisional revenue numbers Calvin wanted. We can reuse the chart we just created and just change out the data and some options. We use the same variable names for our divisional data and plug in revenue numbers instead of profit. We use a new variable name for our chart object and a new id attribute for our div. We update our title and add the stackSeries option and set it to true: var div_revenue = $.jqplot ( 'division_revenue' , [electronics, media, nerd_corral], { title: '12 Month Divisional Revenue', stackSeries: true, We leave our series' options alone and the only option we change on our x axis is set numberTicks back to 3: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Electronics' }, { label: 'Media & Software' }, { label: 'Nerd Corral' } ], legend: { show: true, placement: 'outsideGrid' }, axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 3, tickOptions: { formatString: "%B" } }, We finish our changes by updating the ID of our div container: yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id=" division_revenue " style="width:600px;"></div> With our changes complete, we load this new chart in our browser. As we can see in the following screenshot, we have a chart with each of the data series stacked on top of each other. Because of the nature of a stacked chart, the individual data points are no longer decipherable; however, with the visualization, this is less of an issue. We decide that this is a good place to stop for the day. We'll start on scatterplots and trend lines tomorrow morning. As we begin gathering our things, Calvin stops by on his way out and we show him our recent work. "This is amazing. You guys are making great progress." We tell him we're going to move on to trend lines tomorrow. "Oh, good," Calvin says. "I've had requests to show trending data for our revenue and profit. Someone else mentioned they would love to see trending data of shares on Twitter for our daily deals site. But, like you said, that can wait till tomorrow. Come on, I'll walk with you two."
Read more
  • 0
  • 0
  • 2386
article-image-different-ir-algorithms-you-will-learn
Packt
04 Jan 2016
23 min read
Save for later

Different IR Algorithms

Packt
04 Jan 2016
23 min read
In this article written by Sudipta Mukherjee, author of the book F# for Machine Learning, we learn about how information overload is almost a passé term; however, it is still valid. Information retrieval is a big arena and most of it is far from being solved. However, that being said, we have come a long way and the results produced by some of the state of the art information retrieval algorithms are really impressive. You may not know that you are using information retrieval but whenever you search for some documents on your PC or on the internet, you are actually using the produce of some information retrieval algorithm at the background. So as the metaphor goes, finding the needle (read information/insight) in a haystack (read your data archive on your PC or on the web) is the key to successful business. Distance based: Two documents are matched based on their proximity, calculated by several distance metric on the vector representation of the document Set based: Two documents are matched based on their proximity, calculated by several set based / fuzzy set based, metric based on the bag of words (BoW) model of the document (For more resources related to this topic, see here.) What are some interesting things that you can do? You will learn how the same algorithm can find similar biscuits and identify author of digital documents from the words authors use. You will also learn how IR distance metrics can be used to group color images. Information retrieval using tf-idf Whenever you type some search term in your "Windows" search box, some documents appear matching your search term. There is a common, well-known, easy-to-implement algorithm that makes it possible to rank the documents based on the search term. Basically, the algorithm allows the developers to assign some kind of score to each document in the result set. That score can be seen as a score of confidence that the system has on how much the user would like that result. The score that this algorithm attaches with each document is a product of two different scores. The first one is called term frequency (tf) and the other one is called inverse document frequency (idf). Their product is referred to as tf-idf or "term frequency inverse document frequency". Tf is the number of times some term occurs in a given document. Idf is the ratio between the total number of documents scanned and the number of documents in which a given search term is found. However, this ration is not used as is. Log of this ration is used as idf, as shown next. The following is a term frequency and inverse term frequency example for the word "example": This is the same as: Idf is normally calculated with the following formula: The following is the code that demonstrates how to find tf-idf score for the given search terms; in this case, "example". Sentences are fabricated to match up the desired number of count of the word "example" in document 2; in this case, "sentence2". Here  denotes the set of all the documents and  denotes a second document. let sentence1 = "this is a a sample" let sentence2 = "this is another another example example example" let word = "example" let numberOfDocs = 2. let tf1 = sentence1.Split ' ' |> Array.filter ( fun t -> t = word) |> Array.length let tf2 = sentence2.Split ' ' |> Array.filter ( fun t -> t = word) |> Array.length let docs = [|sentence1;sentence2|] let foundIn = docs |> Array.map ( fun t -> t.Split ' '                                             |> Array.filter ( fun z -> z = word))                                             |> Array.filter ( fun m -> m |> Array.length <> 0)                                             |> Array.length let idf =  Operators.log10 ( numberOfDocs / float foundIn) let pr1 = float  tf1 * idf let pr2 = float tf2 * idf printfn "%f %f" pr1 pr2 This produces the following output: 0.0 ­ 0.903090 This means that the second document is more closely related with the word "example" than the first one. In this case, this is one of the extreme cases where the word doesn't appear at all in one document and appears three times in the other. However, with the same word occurring multiple times in both the documents, you will get different scores for each document. You can think of these scores as the confidence scores for association of the word and the document. More the score, more is the confidence that the document is bound to have something related to that word. Measures of similarity In the following section, you will create a framework for finding several distance measures. A distance between two probability distribution functions (pdf) is an important way to know how close two entities are. One way to generate the pdfs from histogram is to normalize the histogram. Generating a PDF from a histogram A histogram holds the number of times a value occurred. For example, a text can be represented as a histogram where the histogram values represent the number of times a word appears in the text. For gray images, it can be the number of times each gray scale appears in the image. In the following section, you will build a few modules that hold several distance metrics. A distance metric is a measure of how similar two objects are. It can also be called a measure of proximity. The following metrics use the notation of  and  to denote the ith value of either PDF. Let's say we have a histogram denoted by , and there are  elements in the histogram. Then a rough estimate of  is . The following function does this transformation from histogram to pdf: let toPdf (histogram:int list)=         histogram |> List.map (fun t -> float t / float histogram.Length ) There are a couple of important assumptions made. First, we assume that the number of bins is equal to the number of elements. So the histogram and the pdf will have the same number of elements. That's not exactly correct in mathematical sense. All the elements of a pdf should add up to 1. The following implementation of histToPdf guarantees that but it's not a good choice as it is not normalization. So resist the temptation to use this one. let histToPdf (histogram:int list)=         let sum = histogram |> List.sum         histogram |> List.map (fun t -> float t / float sum) Generating a histogram from a list of elements is simple. Following is the function that takes a list and returns the histogram. F# has the function already built in; it is called countBy. let listToHist (l : int list)=         l |> List.countBy (fun t -> t) Next is an example of how using these two functions, a list of integer can be transformed into a pdf. The following method takes a list of integers and returns the probability distribution associated: let listToPdf (aList : int list)=         aList |> List.countBy (fun t -> t)               |> List.map snd               |> toPdf Here is how you can use it: let list = [1;1;1;2;3;4;5;1;2] let pdf = list |> listToPdf I have captured the following in the F# interactive and got the following histogram from the preceding list: val it : (int * int) list = [(1, 4); (2, 2); (3, 1); (4, 1); (5, 1)] If you just project the second element for this histogram and store it in an int list, then you can represent the histogram in a list. So for this example, the histogram can be represented as: let hist = [4;2;1;1;1] Distance metrics are classified among several families based on their structural similarity. The sections following show how to work on these metrics using F# and uses histograms as int list as input.  and  denote the vectors that represent the entity being compared. For example, for the document retrieval system, these numbers might indicate the number of times a given word occurred in each of the documents that are being compared.  denotes the i element of the vector represented by  and  denotes the ith element of the vector represented by . Some literature call these vectors the profile vectors. Minkowski family As you can see, when two pdfs are almost the same then this family of distance metrics tends to be zero, and when they are further apart, they lead to positive infinity. So if the distance metric between two pdfs is close to zero then we can conclude that they are similar and if the distance is more, then we can conclude otherwise. All these formulae of these metrics are special cases of what's known as Minkowski distance. Euclidean distance The following code implements Euclidean distance. //Euclidean distance let euclidean (p:int list)(q:int list) =       List.zip p q |> List.sumBy (fun t -> float (fst t - snd t) ** 2.) City block distance The following code implements City block distance. let cityBlock (p:int list)(q:int list) =     List.zip p q |> List.sumBy (fun t -> float (fst t  - snd t)) Chebyshev distance The following code implements Chebyshev distance. let chebyshev(p:int list)(q:int list) =         List.zip p q |> List.map( fun t -> abs (fst t - snd t)) |> List.max L1 family This family of distances relies on normalization to keep the values within limit. All these metrics are of the form A/B where A is primarily a measure of proximity between the two pdfs P and Q. Most of the time A is calculated based on the absolute distance. For example, the numerator of the Sørensen distance is the City Block distance while the bottom is a normalization component that is obtained by adding each element of the two participating pdfs. Sørensen The following code implements Sørensendistance. let sorensen  (p:int list)(q:int list) =         let zipped = List.zip (p |> toPdf ) (q|>toPdf)         let numerator = zipped  |> List.sumBy (fun t -> float (fst t - snd t))         let denominator = zipped |> List.sumBy (fun t -> float (fst t + snd t))numerator / denominator Gower distance The following code implements Gowerdistance. There could be division by zero if the collection q is empty. let gower(p:int list)(q:int list)=         //I love this. Free flowing fluid conversion         //rather than cramping abs and fst t - snd t in a         //single line      let numerator = List.zip (p|>toPdf) (q |> toPdf)                          |> List.map (fun t -> fst t - snd t)                          |> List.map (fun z -> abs z)                          |> List.map float                          |> List.sum                          |> float      let denominator = float p.Length      numerator / denominator Soergel The following code implements Soergeldistance. let soergel (p:int list)(q:int list) =         let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator =  zipped |> List.sumBy(fun t -> abs( fst t - snd t))         let denominator = zipped |> List.sumBy(fun t -> max (fst t ) (snd t))         float numerator / float denominator kulczynski d The following code implements Kulczynski ddistance. let kulczynski_d (p:int list)(q:int list) =         let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator =  zipped |> List.sumBy(fun t -> abs( fst t - snd t))         let denominator = zipped |> List.sumBy(fun t -> min (fst t ) (snd t))         float numerator / float denominator kulczynski s The following code implements Kulczynski sdistance. let kulczynski_s (p:int list)(q:int list) = / kulczynski_d p q Canberra distance The following code implements Canberradistance. let canberra (p:int list)(q:int list) =     let zipped = List.zip (p|>toPdf) (q |> toPdf)     let numerator =  zipped |> List.sumBy(fun t -> abs( fst t - snd t))     let denominator = zipped |> List.sumBy(fun t -> fst t + snd t)     float numerator / float denominator Intersection family This family of distances tries to find the overlap between two participating pdfs. Intersection The following code implements Intersectiondistance. let intersection(p:int list ) (q: int list) =         List.zip (p|>toPdf) (q |> toPdf)             |> List.map (fun t -> min (fst t) (snd t)) |> List.sum Wave Hedges The following code implements Wave Hedgesdistance. let waveHedges (p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy ( fun t -> 1.0 - float( min (fst t) (snd t))                                      / float (max (fst t) (snd t))) Czekanowski distance The following code implements Czekanowski distance. let czekanowski(p:int list)(q:int list) =         let zipped  = List.zip (p|>toPdf) (q |> toPdf)         let numerator = 2. * float (zipped |> List.sumBy (fun t -> min (fst t) (snd t)))         let denominator =  zipped |> List.sumBy (fun t -> fst t + snd t)         numerator / float denominator Motyka The following code implements Motyka distance.  let motyka(p:int list)(q:int list)=        let zipped = List.zip (p|>toPdf) (q |> toPdf)        let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))        let denominator = zipped |> List.sumBy (fun t -> fst t + snd t)        float numerator / float denominator Ruzicka The following code implements Ruzicka distance. let ruzicka (p:int list) (q:int list) =         let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator = zipped |> List.sumBy (fun t -> min (fst t) (snd t))         let denominator = zipped |> List.sumBy (fun t -> max (fst t) (snd t))         float numerator / float denominator Inner Product family Distances belonging to this family are calculated by some product of pairwise elements from both the participating pdfs. Then this product is normalized with a value also calculated from the pairwise elements. Innerproduct The following code implements Inner-productdistance. let innerProduct(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> fst t * snd t) Harmonic mean The following code implements Harmonicdistance. let harmonicMean(p:int list)(q:int list)=        2. * (List.zip (p|>toPdf) (q |> toPdf)           |> List.sumBy (fun t -> ( fst t * snd t )/(fst t + snd t))) Cosine similarity The following code implements Cosine Similarity distance measure. let cosineSimilarity(p:int list)(q:int list)=     let zipped = List.zip p q     let prod  (x,y) = float x *  float y     let numerator = zipped |> List.sumBy prod     let denominator  =  sqrt ( p|> List.map sqr |> List.sum |> float) *                         sqrt ( q|> List.map sqr |> List.sum |> float)     numerator / denominator Kumar Hassebrook The following code implements Kumar Hassebrook distance measure.   let kumarHassebrook (p:int list) (q:int list) =         let sqr x = x * x         let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator = zipped |> List.sumBy prod         let denominator =  (p |> List.sumBy sqr) +                            (q |> List.sumBy sqr) - numerator           numerator / denominator Dice coefficient The following code implements Dice coefficient.   let dicePoint(p:int list)(q:int list)=             let zipped = List.zip (p|>toPdf) (q |> toPdf)         let numerator = zipped |> List.sumBy (fun t -> fst t * snd t)         let denominator  =  (p |> List.sumBy sqr)  +                             (q |> List.sumBy sqr)           float numerator / float denominator Fidelity family or squared-chord family This family of distances uses square root as an instrument to keep the distance within limit. Sometimes other functions, such as log, are also used. Fidelity The following code implements FidelityDistance measure.   let fidelity(p:int list)(q:int list)=               List.zip (p|>toPdf) (q |> toPdf)|> List.map  prod                                         |> List.map sqrt                                         |> List.sum Bhattacharya The following code implements Bhattacharya distance measure. let bhattacharya(p:int list)(q:int list)=         -log (fidelity p q) Hellinger The following code implements Hellingerdistance measure. let hellinger(p:int list)(q:int list)=             let product = List.zip (p|>toPdf) (q |> toPdf)                          |> List.map prod |> List.sumBy sqrt        let right = 1. -  product        2. * right |> abs//taking this off will result in NaN                   |> float                   |> sqrt Matusita The following code implements Matusita distance measure. let matusita(p:int list)(q:int list)=         let value =2. - 2. *( List.zip (p|>toPdf) (q |> toPdf)                             |> List.map prod                             |> List.sumBy sqrt)         value |> abs |> sqrt Squared Chord The following code implements Squared Chord distance measure. let squarredChord(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> sqrt (fst t ) - sqrt (snd t)) Squared L2 family This is almost the same as the L1 family, just that it got rid of the expensive square root operation and relies on the squares instead. However, that should not be an issue. Sometimes the squares can be quite large so a normalization scheme is provided by dividing the result of the squared sum with another square sum, as done in "Divergence". Squared Euclidean The following code implements Squared Euclidean distance measure. For most purpose this can be used instead of Euclidean distance as it is computationally cheap and performs as well. let squaredEuclidean (p:int list)(q:int list)=        List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t-> float (fst t - snd t) ** 2.0) Squared Chi The following code implements Squared Chi distance measure. let squaredChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / (fst t + snd t)) Pearson's Chi The following code implements Pearson’s Chidistance measure. let pearsonsChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / snd t) Neyman's Chi The following code implements Neyman’s Chi distance measure. let neymanChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)            |> List.sumBy (fun t -> (fst t - snd t ) ** 2.0 / fst t) Probabilistic Symmetric Chi The following code implements Probabilistic Symmetric Chidistance measure. let probabilisticSymmetricChi(p:int list)(q:int list)=        2.0 * squaredChi p q Divergence The following code implements Divergence measure. This metric is useful when the elements of the collections have elements in different order of magnitude. This normalization will make the distance properly adjusted for several kinds of usages. let divergence(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> (fst t - snd t) ** 2. / (fst t + snd t) ** 2.) Clark The following code implements Clark’s distance measure. let clark(p:int list)(q:int list)=               sqrt( List.zip (p|>toPdf) (q |> toPdf)         |> List.map (fun t ->  float( abs( fst t - snd t))                                / (float (fst t + snd t)))           |> List.sumBy (fun t -> t * t)) Additive Symmetric Chi The following code implements Additive Symmetric Chi distance measure. let additiveSymmetricChi(p:int list)(q:int list)=         List.zip (p|>toPdf) (q |> toPdf)             |> List.sumBy (fun t -> (fst t - snd t ) ** 2. * (fst t + snd t)/ (fst t * snd t))  Summary Congratulations!! you learnt how different similarity measures work and when to use which one, to find the closest match. Edmund Burke said that, "It's the nature of every greatness not to be exact", and I can't agree more. Most of the time the users aren't really sure what they are looking for. So providing a binary answer of yes or no, or found or not found is not that useful. Striking the middle ground by attaching a confidence score to each result is the key. The techniques that you learnt will prove to be useful when we deal with recommender system and anomaly detection, because both these fields rely heavily on the IR techniques. Resources for Article:   Further resources on this subject: Learning Option Pricing [article] Working with Windows Phone Controls [article] Simplifying Parallelism Complexity in C# [article]
Read more
  • 0
  • 0
  • 2369

article-image-going-viral
Packt
19 Mar 2014
14 min read
Save for later

Going Viral

Packt
19 Mar 2014
14 min read
(For more resources related to this topic, see here.) Social media mining using sentiment analysis People are highly opinionated. We hold opinions about everything from international politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions through written language. Practically speaking, this field allows us to measure, and thus harness, opinions. Up until the last 40 years or so, opinion mining hardly existed. This is because opinions were elicited in surveys rather than in text documents, computers were not powerful enough to store or sort a large amount of information, and algorithms did not exist to extract opinion information from written language. The explosion of sentiment-laden content on the Internet, the increase in computing power, and advances in data mining techniques have turned social data mining into a thriving academic field and crucial commercial domain. Professor Richard Hamming famously pushes researchers to ask themselves, "What are the important problems in my field?" Researchers in the broad area of natural language processing (NLP) cannot help but list sentiment analysis as one such pressing problem. Sentiment analysis is not only a prominent and challenging research area, but also a powerful tool currently being employed in almost every business and social domain. This prominence is due, at least in part, to the centrality of opinions as both measures and causes of human behavior. This article is an introduction to social data mining. For us, social data refers to data generated by people or by their interactions. More specifically, social data for the purposes of this article will usually refer to data in text form produced by people for other people's consumption. Data mining is a set of tools and techniques used to describe and make inferences about data. We approach social data mining with a potent mix of applied statistics and social science theory. As for tools, we utilize and provide an introduction to the statistical programming language R. The article covers important topics and latest developments in the field of social data mining with many references and resources for continued learning. We hope it will be of interest to an audience with a wide array of substantive interests from fields such as marketing, sociology, politics, and sales. We have striven to make it accessible enough to be useful for beginners while simultaneously directing researchers and practitioners already active in the field towards resources for further learning. Code and additional material will be available online at http://socialmediaminingr.com as well as on the authors' GitHub account, https://github.com/SocialMediaMininginR. The state of communication The state of communication section describes the fundamentally altered modes of social communication fostered by the Internet. The interconnected, social, rapid, and public exchange of information detailed here underlies the power of social data mining. Now more than ever before, information can go viral, a phrase first cited as early as 2004. By changing the manner in which we connect with each other, the Internet changed the way we interact—communication is now bi-directional and many-to-many. Networks are now self-organized, and information travels along every dimension, varying systematically depending on direction and purpose. This new economy with ideas as currency has impacted nearly every person. More than ever, people rely on context and information before making decisions or purchases, and by extension, more and more on peer effects and interactions rather than centralized sources. The traditional modes of communication are represented mainly by radio and television, which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years for television to reach an audience of 50 million, but the Internet did it in just four years (Gallup). Not only has the nature of communication changed, but also its scale. There were 50 pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The WWW is the largest, most widely used source of information, with nearly 2.4 billion users (Wikipedia). 70 percent of these users use it daily to both contribute and receive information in order to learn about the world around them and to influence that same world—constantly organizing information around pieces that reflect their desires. In today's connected world, many of us are members of at least one, if not more, social networking service. The influence and reach of social media enterprises such as Facebook is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly active users of their mobile products (Facebook key facts). Twitter has more than 200 million (Twitter blog) active users. As communication tools, they offer a global reach to huge multinational audiences, delivering messages almost instantaneously. Connectedness and social media have altered the way we organize our communications. Today we have dramatically more friends and more friends of friends, and we can communicate with these higher order connections faster and more frequently than ever before. It is difficult to ignore the abundance of mimicry (that is, copying or reposting) and repeated social interactions in our social networks. This mimicry is a result of virtual social interactions organized into reaffirming or oppositional feedback loops. We self-organize these interactions via (often preferential) attachments that form organic, shifting networks. There is little question of whether or not social media has already impacted your life and changed the manner in which you communicate. Our beliefs and perceptions of reality, as well as the choices we make, are largely conditioned by our neighbors in virtual and physical networks. When we need to make a decision, we seek out for opinions of others—more and more of those opinions are provided by virtual networks. Information bounce is the resonance of content within and between social networks often powered by social media such as customer reviews, forums, blogs, microblogs, and other user-generated content. This notion represents a significant change when compared to how information has traveled throughout history; individuals no longer need to exclusively rely on close ties within their physical social networks. Social media has both made our close ties closer and the number of weak ties exponentially greater. Beyond our denser and larger social networks is a general eagerness to incorporate information from other networks with similar interests and desires. The increased access to networks of various types has, in fact, conditioned us to seek even more information; after all, ignoring available information would constitute irrational behavior. These fundamental changes to the nature and scope of communication are crucial due to the importance of ideas in today's economic and social interactions. Today, and in the future, ideas will be of central importance, especially those ideas that bounce and go viral. Ideas that go viral are those that resonate and spur on social movements, which may have political and social purposes or reshape businesses and allow companies such as Nike and Apple to produce outsized returns on capital. This article introduces readers to the tools necessary to measure ideas and opinions derived from social data at scale. Along the way, we'll describe strategies for dealing with Big Data. What is Big Data? People create 2.5 quintillion bytes (2.5 * 1018) of data, or nearly 2.3 million Terabytes of data every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of data-sets themselves. Both factors create challenges and opportunities for data scientists. This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals to name a few. This data is not only big but is growing at an increasing rate. The data used in this article, namely, Twitter data, is no exception. Twitter was launched in March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days. The size and scope of Big Data helps us overcome some of the hurdles caused by its low density. For instance, even though each unique piece of social data may have little applicability to our particular task, these small bits of information quickly become useful as we aggregate them across thousands or millions of people. Like the proverbial bundle of sticks—none of which could support inferences alone—when tied together, these small bits of information can be a powerful tool for understanding the opinions of the online populace. The sheer scope of Big Data has other benefits as well. The size and coverage of many social data-sets creates coverage overlaps in time, space, and topic. This allows analysts to cross-refer socially generated sets against one another or against small-scale data-sets designed to examine niche questions. This type of cross-coverage can generate consilience (Osborne)—the principle that states evidence from independent, unrelated sources can converge to form strong conclusions. That is, when multiple sources of evidence are in agreement, the conclusion can be very strong even when none of the individual sources of evidence are very strong on their own. A crucial characteristic of socially generated data is that it is opinionated. This point underpins the usefulness of big social data for sentiment analysis, and is novel. For the first time in history, interested parties can put their fingers to the pulse of the masses because the masses are frequently opining about what is important to them. They opine with and for each other and anyone else who cares to listen. In sum, opinionated data is the great enabler of opinion-based research. Human sensors and honest signals Opinion data generated by humans in real time presents tremendous opportunities. However, big social data will only prove useful to the extent that it is valid. This section tackles the extent to which socially generated data can be used to accurately measure individual and/or group-level opinions head-on. One potential indicator of the validity of socially generated data is the extent of its consumption for factual content. Online media has expanded significantly over the past 20 years. For example, online news is displacing print and broadcast. More and more Americans distrust mainstream media, with a majority (60 percent) now having little to no faith in traditional media to report news fully, accurately, and fairly. Instead, people are increasingly turning to the Internet to research, connect, and share opinions and views. This was especially evident during the 2012 election where social media played a large role in information transmission (Gallup). Politics is not the only realm affected by social Big Data. People are increasingly relying on the opinions of others to inform about their consumption preferences. Let's have a look at this: 91 percent of people report having gone into a store because of an online experience 89 percent of consumers conduct research using search engines 62 percent of consumers end up making a purchase in a store after researching it online 72 percent of consumers trust online reviews as much as personal recommendations 78 percent of consumers say that posts made by companies on social media influence their purchases If individuals are willing to use social data as a touchstone for decision making in their own lives, perhaps this is prima facie evidence of its validity. Other Big Data thinkers point out that much of what people do online constitutes their genuine actions and intentions. The breadcrumbs left from when people execute online transactions, send messages, or spend time on web pages constitute what Alex Petland of MIT calls honest signals. These signals are honest insofar as they are actions taken by people with no subtext or secondary intent. Specifically, he writes the following: "Those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy." To paraphrase, Petland finds some web-based data to be valid measures of people's attitudes when that data is without subtext or secondary intent; what he calls data exhaust. In other words, actions are harder to fake than words. He cautions against taking people's online statements at face value, because they may be nothing more than cheap talk. Anthony Stefanidis of George Mason University also advocates for the use of social data mining. He favorably speaks about its reliability, noting that its size inherently creates a preponderance of evidence. This article takes neither the strong position of Pentland and honest signals nor Stefanidis and preponderance of evidence. Instead, we advocate a blended approach of curiosity and creativity as well as some healthy skepticism. Generally, we follow the attitude of Charles Handy (The Empty Raincoat, 1994), who described the steps to measurement during the Vietnam War as follows: "The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide." The social web may not consist of perfect data, but its value is tremendous if used properly and analyzed with care. 40 years ago, a social science study containing millions of observations was unheard of due to the time and cost associated with collecting that much information. The most successful efforts in social data mining will be by those who "measure (all) what is measurable, and make measurable (all) what is not so" (Rasinkinski, 2008). Ultimately, we feel that the size and scope of big social data, the fact that some of it is comprised of honest signals, and the fact that some of it can be validated with other data, lends it validity. In another sense, the "proof is in the pudding". Businesses, governments, and organizations are already using social media mining to good effect; thus, the data being mined must be at least moderately useful. Another defining characteristic of big social data is the speed with which it is generated, especially when considered against traditional media channels. Social media platforms such as Twitter, but also the web generally, spread news in near-instant bursts. From the perspective of social media mining, this speed may be a blessing or a curse. On the one hand, analysts can keep up with the very fast-moving trends and patterns, if necessary. On the other hand, fast-moving information is subject to mistakes or even abuse. Following the tragic bombings in Boston, Massachusetts (April 15, 2013), Twitter was instrumental in citizen reporting and provided insight into the events as they unfolded. Law enforcement asked for and received help from general public, facilitated by social media. For example, Reddit saw an overall peak in traffic when reports came in that the second suspect was captured. Google Analytics reports that there were about 272,000 users on the site with 85,000 in the news update thread alone. This was the only time in Reddit's history other than Obama AMA that a thread beat the front page in the ratings (Reddit). The downside of this fast-paced, highly visible public search is that masses can be incorrect. This is exactly what happened; users began to look at the details and photos posted and pieced together their own investigation—as it turned out, the information was incorrect. This was a charged event and created an atmosphere that ultimately undermined the good intentions of many. Other efforts such as those by governments (Wikipedia) and companies (Forbes) to post messages favorable to their position is less than well intentioned. Overall, we should be skeptical of tactical (that is, very real time) uses of social media. Summary In this article, we introduced readers to the concepts of social media, sentiment analysis, and Big Data. We described how social media has changed the nature of interpersonal communication and the opportunities it presents for analysts of social data. This article also made a case for the use of quantitative approaches to measure all that is measurable, and make the one which is not so measurable. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [Article] Managing your social channels [Article] Social Media for Wordpress: VIP Memberships [Article]
Read more
  • 0
  • 0
  • 2362

article-image-applied-modeling
Packt
19 Dec 2013
15 min read
Save for later

Applied Modeling

Packt
19 Dec 2013
15 min read
(For more resources related to this topic, see here.) This article examines the foundations of tabular modeling (at least from the modelers point of view). In order to provide a familiar model which can be used later, this article will progressively build the model. Each recipe is intended to demonstrate a particular technique; however, they need to be followed in order so that the final model is completed. Grouping by binning and sorting with ranks Often, we want to provide descriptive information in our data, based on values that are derived from downstream activities. For example, this could arise if we wish to include a field in the Customer table that shows the value of the previous sales for that customer or a bin grouping that defines the customer into banded sales. This value can then be used to rank each customer according to their relative importance in relation to other customers. This type of value adding activity is usually troublesome and time intensive in a traditional data mart design as the customer data can be fully updated only once the sales data has been loaded. While this process may seem relatively straightforward, it is a recursive problem as the customer data must be loaded before the sales data (since customers must exist before the sales can be assigned to them), but the full view of the customer is reliant on loading the sales data. In a standard dimensional (star schema) modeling approach, including this type of information for dimensions requires a three-step process: The dimension (reseller customer data) is updated for known fields and attributes. This load excludes information that is derived (such as sales). Then, the sales data (referred to as fact data) is loaded in data warehouse. This ensures that the data mart is in a current state and all the sales transaction data is up-to-date. Information relating to any new and changed stores can be loaded correctly. The dimension data which relies on other fact data is updated based on the current state of the data mart. Since the tabular model is less confined by the traditional star schema requirements of the fact and dimension tables (in fact, the tabular model does not explicitly identify facts or dimensions), the inclusion and processing of these descriptive attributes can be built directly into the model. The calculation of a simple measure such as a historic sales value may be included in OLAP modeling through calculated columns in the data source view. However, this is restrictive and limited to simple calculations (such as total sales or n period sales). Other manipulations (such as ranking and binning) are a lot more flexible in tabular modeling (as we will see). This recipe examines how to manipulate a dimensional table in order to provide a richer end user experience. Specifically, we will do the following: Introduce a calculated field to calculate the historical sales for the customer Determine the rank of the customer based on that field Create a discretization bin for the customer based on their sales Create an ordered hierarchy based on their discretization bins Getting ready Continuing with the scenario that was discussed in the Introduction section of the article, the purpose of this article is to identify each reseller's (customer's) historic sales and then rank them accordingly. We then discretize the Resellers table (customer) based on this. This problem is further complicated by the consideration that a sale occurs in the country of origin (the sales data in the Reseller Sales table will appear in any currency). In order to provide a concise recipe, we break the entire process into two distinct steps: Conversion of sales (which manages the ranking of Resellers based on a unified sales value) Classification of sales (which manages the manipulation of sales values based on discretized bins to format those bins) How to do it… Firstly, we need to provide a common measure to compare the sales value of Resellers. Convert sales to a uniform currency using the following steps: Open a new workbook and launch the PowerPivot Window. Import the text files Reseller Sales.txt, Currency Conversion.txt, and Resellers.txt. The source data folder for this article includes the base schema.ini file that correctly transforms all data upon import. When importing the data, you should be prompted that the schema.ini file exists and will override the import settings. If this does not occur, ensure that the schema.ini file exists in the same directory as your source data. The prompt should look like the following screenshot: Although it is not mandatory, it is recommended that connection managers are labeled according to a standard. In this model, I have used the convention type_table_name where type refers to the connection type (.txt) and table_name refers to the name of the table. Connections can be edited using the Existing Connections button in the Design tab. Create a relationship between the Customer ID field in the Resellers table and the Customer ID field in the Reseller Sales table. Add a new field (also called a calculated column) in the Resellers Sales table to show the gross value of the sale amount in USD. Add usd_gross_sales and hide it from client tools using the following code: = [Quantity Ordered] *[Price Per Unit] *LOOKUPVALUE ( 'Currency Conversion'[AVG Rate] ,'Currency Conversion'[Date] ,[Order dt] ,'Currency Conversion'[Currency ID] ,[Currency ID] ) Add a new measure in the Resellers Sales table to show sales (in USD). Add USD Gross Sales as: USD Gross Sales := SUM ( [usd_gross_sales] ) Add a new calculated column to the Resellers table to show the USD Sales Total value. The formula for the field should be: = 'Reseller Sales' [USD Gross Sales] Add a Sales Rank field in the Resellers table to show the order for each resellers USD Sales Total. The formula for Sales Rank is: =RANKX(Resellers, [USD Sales Total]) Hide the USD Sales Total and Sales Rank fields from client tools. Now that all the entries in the Resellers table show their sales value in a uniform currency, we can determine a grouping for the Reseller table. In this case, we are going to group them into 100,000 dollar bands. Add a new field to show each value in the USD Sales Total column of Resellers rounded down to the nearest 100,000 dollars. Hide it from client tools. Now, add Round Down Amount as: =ROUNDDOWN([USD Sales Total],-5) Add a new field to show the rank of Round Down Amount in descending order and hide it from client tools. Add Round Down Order as: =RANKX(Resellers,[Round Down Amount],,FALSE(), DENSE) Add a new field in the Resellers table to show the 100,000 dollars group that the reseller belongs to. Since we know what the lower bound of the sales bin is, we can also infer that the upper bin is the rounded up 100,000 dollars sales group. Add Sales Group as follows: =IF([Round Down Amount]=0 || ISBLANK([Round Down Amount]) , "Sales Under 100K" , FORMAT([Round Down Amount], "$#,K") & " - " & FORMAT(ROUNDUP([USD Sales Total],-5),"$#,K") ) Set the Sort By Column of the Sales Group field to the Round Down Order column. Note that the Round Down Order column should display in a descending order (that is, entries in the Resellers table with high sales values should appear first). Create a hierarchy on the Resellers table which shows the Sales Group field and the Customer Name column as levels. Title the hierarchy as Customer By Sales Group. Add a new measure titled Number of Resellers to the Resellers table: Number of Resellers:=COUNTROWS(Resellers) Create a pivot table that shows the Customer By Sales Group hierarchy on the rows and Number of Resellers as values. If you created the usd_gross_sales field in the Reseller Sales table, it can also be added as an implicit measure to verify values. Expand the first bin of Sales Group. The pivot should look like the following screenshot: How it works… This recipe has included various steps, which add descriptive information to the Resellers table. This includes obtaining data from a separate table (the USD sales) and then manipulating that field within the Resellers table. In order to provide a clearer definition of how this process works, we will break the explanation into several subsections. This includes the sales data retrieval, the use of rank functions, and finally, the discretization of sales. The next section deals with the process of converting sales data to a single currency. The starting point for this recipe is to determine a common currency sales value for each reseller (or customer). While the inclusion of the calculated column USD Sales Total in the Resellers table should be relatively straightforward, it is complicated by the consideration that the sales data is stored in multiple currencies. Therefore, the first step needs to include a currency conversion to determine the USD sales value for each line. This is simply the local value multiplied by the daily exchange rate. The LOOKUPVALUE function is used to return the row-by-row exchange rate. Now that we have the usd_gross_sales value for each sales line, we define a measure that calculates its sum in whatever filter context it is applied in. Including it in the Reseller Sales table makes sense (since, it relates to sales data), but what is interesting is how the filter context is applied when it is used as a field in the Resellers table. Here, the row filter context that exists in the Resellers table (after all, each row refers to a reseller) applies a restriction to the sales data. This shows the sales value for each reseller. For this recipe to work correctly, it is not necessary to include the calculated field usd_gross_sales in Reseller Sales. We simply need to define a calculation, which shows the gross sales value in USD and then use the row filter context in the Resellers table to restrict sales to the reseller in question (that is, the reseller in the row). It is obvious that the exchange rate should be applied on a daily basis because the value can change every day. We could use an X function in the USD Gross Sales measure to achieve exactly the same outcome. Our formula will be: SUMX ( 'Reseller Sales' , 'Reseller Sales'[Quantity Ordered] * 'Reseller Sales'[Price Per Unit] * LOOKUPVALUE ( 'Currency Conversion'[AVG Rate] , 'Currency Conversion'[Date] ,'Reseller Sales'[Order dt] ,'Currency Conversion'[Currency ID] ,'Reseller Sales'[Currency ID] ) ) Furthermore, if we wanted to, we could completely eliminate the USD Gross Sales measure from the model. To do this, we could wrap the entire formula (the previous definition on USD Gross Sales) into the CALCULATE statement in the Resellers table's field definition of USD Gross Sales. This forces the calculation to occur at the current row context. Why have we included the additional fields and measures in Reseller Sales? This is a modeling choice. It makes the model easier to understand because it is more modular. This would otherwise require two calculations (one into a default currency and the second into a selected currency) and the field usd_gross_sales is used in that calculation. Now that sales are converted to a uniform currency, we can determine the importance by rank. RANKX is used to rank the rows in the Resellers table based on the USD Gross Sales field. The simplest implementation of RANKX is demonstrated within the Sales Rank field. Here, the function simply returns a rank based on the value according to the supplied measure (which is of course USD Gross Sales). However, the RANKX function provides a lot of versatility and follows the syntax: RANKX(<table> , <expression>[, <value>[, <order>[, <ties>]]] ) After the initial implementation of RANKX in its simplest form, the arguments of particular interest are the <order> and <ties> arguments. These can be used to specify the sort order (whether the rank is to be applied from highest to lowest or lowest to highest) and the function behavior when duplicate values are encountered. This may be best demonstrated with an example. To do this, we will examine the operation of rank in relation to Round Down Amount. When a simple RANKX function is applied, the function sorts the columns in an ascending order and returns the position of a row based on the sorted order of the value and the number of prior rows within the table. This includes rows attributable to duplicate values. This is shown in the following screenshot where the Simple column is defined as RANKX(Resellers,[Round Down Amount]). Note, the data is sorted by Round Down Amount and the first four tied values have a RANKX value of 1. This is the behavior we expect since all rows have the same value. For the next value (700000), RANKX returns 5 because this is the fifth element in the sequence. When the DENSE argument is specified, the value returned after a tie is the next sequential number in the list. In the preceding screenshot, this is shown through the DENSE column. The formula for the field DENSE is: RANKX(Resellers, [Round Down Amount],,,DENSE)) Finally, we can specify the sort order that is used by the function (the default is ascending) with the help of <order> argument of the function. If we wish to sort (and rank) from lowest to highest, we could use the formula as shown in the INVERSE DENSE column. The INVERSE DENSE column uses the following calculation: RANKX(Resellers, [Round Down Amount],,TRUE,DENSE) After having specified the Sales Group field sort by column as Round Down Order, we may ask why we did not also sort the Customer Name column by their respective values in the Sales Rank column? Trying to define a sort by column in this way would cause an error as it is not a one-to-one relationship between these two fields. That is, each customer does not have a unique value for the sales rank. Let's have a look at this in more detail. If we filter the Resellers table to show the blank USD Sales Total rows (click on the drop-down arrow in the USD Sales Total column and check the BLANKS checkbox), we see that the values of the Sales Rank column for all the rows is the same. In the following screenshot, we can see the value 636 repeated for all the rows: Allowing the client tool visibility to the USD Sales Total and Sales Rank fields will not provide an intuitive browsing attribute for most client tools. For this reason, it is not recommended to expose these attributes to users. Hiding them will still allow the data to be queried directly Discretizing sales By discretizing the Resellers table, we firstly make a decision to group each reseller into bands of 100,000 intervals. Since, we have already calculated the USD Gross Sales value for each customer, our problem is reduced by determining which bin each customer belongs to. This is very easily achieved as we can derive the lower and upper bound for the Resellers table. That is, the lower bound will be a rounded down amount of their sales and the upper bound will be the rounded up value (that is rounded nearest to the 100,000 interval). Finally, we must ensure that the ordering of the bins is correct so that the bins appear from the highest value resellers to the lowest. For convenience, these steps are broken down through the creation of additional columns but they need not be—we could incorporate the steps into a single formula (mind you, it would be hard to read). Additionally, we have provided a unique name for the first bin by testing for 0 sales. This may not be required. The rounding is done with the ROUNDDOWN and ROUNDUP functions. These functions simply return the number moved by the number of digits offset. The following is the syntax for ROUNDDOWN: ROUNDDOWN(<number>, <num_digits>) Since we are interested only in the INTEGER values (that is, values to the left of the decimal place), we must specify <num_digits> as -5. The display value of the bin is controlled through the FORMAT function, which returns the text equivalent of a value according to the provided format string. The syntax for FORMAT is: FORMAT(<value>, <format_string>) There's more... In presenting a USD Gross Sales value for the Resellers table, we may not be interested in all the historic data. A typical variation on this theme is to determine the current worth by showing the recent history (or summing recent sales). This requirement can be easily implemented into the preceding method by swapping USD Gross Sales with recent sales. To determine this amount, we need to filter the data used in the SUM function. For example, to determine the last 30 days' sales for a reseller, we will use the following code: SUMX( FILTER('Reseller Sales' , 'Reseller Sales'[Order dt]> (MAX('Reseller Sales'[Order dt])-30) ) , USD SALES EXPRESSION )
Read more
  • 0
  • 0
  • 2362
article-image-how-to-perform-data-exploration-with-rethinkdb
Vijin Boricha
14 Feb 2018
5 min read
Save for later

How to perform Data exploration with RethinkDB

Vijin Boricha
14 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box] In this article, we will learn to do data exploration in RethinkDB with the help of few use case. Executing data exploration use cases We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string. In order to do this, we need to perform the following operation: Traverse through each document in the database Split the ctc string into two parts, containing $ and the other value Update the ctc value in the document with a new data type and value Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows: var rethinkdb = require('rethinkdb'); var connection = null; rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) { if(err) { throw new Error('Connection error'); } connection = conn; rethinkdb.db("company").table("employees") .run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.each(function(err,data) { data.ctc = parseInt(data.ctc.split("$")[1]); rethinkdb.db("company").table("employees") .get(data.id) .update({ctc : data.ctc}) .run(connection,function(err,response) { if(err) { throw new Error(err); } console.log(response); }); }); }); }); As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document: After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure: This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record. Finding duplicate elements We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature). Here is the query for the same: r.db("company").table('employees').without('id').distinct().count() As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates: This implies that our records contain no duplicate documents. Finding the list of countries We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. Here is the query: r.db("company").table('employees')("country").distinct() As shown in this image, we have 124 countries in our records: Finding the top 10 employees with the highest salary In this use case, we need to evaluate all the records and find the top 10 employees with the highest to lowest pay. Here is the query for the same: r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10) Here we are using orderBy, which by default orders the record in ascending order. To get the highest pay at the first document, we need to use descending ordering; we did it using the desc() ReQL command. As shown in the following image, the query returns 10 rows: You can modify the same query by just by limiting the number of users to one to get thehighest-paid employee. Displaying employee records with a specific name and location To extract such records from our table, we need to again perform a filter on the "first_name" and "country" fields. Here is the query to return those records: r.db("company").table('employees').filter({"first_name" : "John","country" : "Sweden"}) We are just performing a basic filter and comparing both fields. ReQL queries are really easy for solving such queries due to their chaining feature. After executing the preceding query, we show the following output: To summarize, we looked over a few use cases where we had to perform alteration and filtering of records in order to meet exploration task, like stripping the $ sign from ctc, or converting base 256 ip addresses into base 10 values and then performing a query on them. We also covered a general use case in order to get a practical feel of ReQL. If you are interested to learn about RethinkDB Query Language,  Extending RethinkDB, and more you may check out this book Mastering RethinkDB.  
Read more
  • 0
  • 0
  • 2354

article-image-choosing-styles-various-graph-elements-r
Packt
25 Jan 2011
4 min read
Save for later

Choosing Styles of Various Graph Elements in R

Packt
25 Jan 2011
4 min read
  R Graph Cookbook Detailed hands-on recipes for creating the most useful types of graphs in R – starting from the simplest versions to more advanced applications Learn to draw any type of graph or visual data representation in R Filled with practical tips and techniques for creating any type of graph you need; not just theoretical explanations All examples are accompanied with the corresponding graph images, so you know what the results look like Each recipe is independent and contains the complete explanation and code to perform the task as efficiently as possible    Choosing plotting point symbol styles and sizes In this recipe, we will see how we can adjust the styling of plotting symbols, which is useful and necessary when we plot more than one set of points representing different groups of data on the same graph. Getting ready All you need to try out this recipe is to run R and type the recipe at the command prompt. You can also choose to save the recipe as a script so that you can use it again later on. We will also use the cityrain.csv example data file. Please read the file into R as follows: rain<-read.csv("cityrain.csv") The code file can be downloaded from here. How to do it... The plotting symbol and size can be set using the pch and cex arguments: plot(rnorm(100),pch=19,cex=2)   How it works... The pch argument stands for plotting character (symbol). It can take numerical values (usually between 0 and 25) as well as single character values. Each numerical value represents a different symbol. For example, 1 represents circles, 2 represents triangles, 3 represents plus signs, and so on. If we set the value of pch to a character such as "*" or "£" in inverted commas, then the data points are drawn as that character instead of the default circles. The size of the plotting symbol is controlled by the cex argument, which takes numerical values starting at 0 giving the amount by which plotting symbols should be magnified relative to the default. Note that cex takes relative values (the default is 1). So, the absolute size may vary depending on the defaults of the graphic device in use. For example, the size of plotting symbols with the same cex value may be different for a graph saved as a PNG file versus a graph saved as a PDF. There’s more... The most common use of pch and cex is when we don’t want to use color to distinguish between different groups of data points. This is often the case in scientific journals which do not accept color images. For example, let’s plot the city rainfall data as a set of points instead of lines: plot(rain$Tokyo, ylim=c(0,250), main="Monthly Rainfall in major cities", xlab="Month of Year", ylab="Rainfall (mm)", pch=1) points(rain$NewYork,pch=2) points(rain$London,pch=3) points(rain$Berlin,pch=4) legend("top", legend=c("Tokyo","New York","London","Berlin"), ncol=4, cex=0.8, bty="n", pch=1:4) Choosing line styles and width Similar to plotting point symbols, R provides simple ways to adjust the style of lines in graphs. Getting ready All you need to try out this recipe is to run R and type the recipe at the command prompt. You can also choose to save the recipe as a script so that you can use it again later on. We will again use the cityrain.csv data file. How to do it... Line styles can be set by using the lty and lwd arguments (for line type and width respectively) in the plot(), lines(), and par() commands. Let’s take our rainfall example and apply different line styles keeping the color the same: plot(rain$Tokyo, ylim=c(0,250), main="Monthly Rainfall in major cities", xlab="Month of Year", ylab="Rainfall (mm)", type="l", lty=1, lwd=2) lines(rain$NewYork,lty=2,lwd=2) lines(rain$London,lty=3,lwd=2) lines(rain$Berlin,lty=4,lwd=2) legend("top", legend=c("Tokyo","New York","London","Berlin"), ncol=4, cex=0.8, bty="n", lty=1:4, lwd=2)   How it works... Both line type and width can be set with numerical values as shown in the previous example. Line type number values correspond to types of lines: 0: blank 1: solid (default) 2: dashed 3: dotted 4: dotdash 5: longdash 6: twodash We can also use the character strings instead of numbers, for example, lty="dashed" instead of lty=2. The line width argument lwd takes positive numerical values. The default value is 1. In the example we used a value of 2, thus making the lines thicker than default.
Read more
  • 0
  • 0
  • 2347
Modal Close icon
Modal Close icon