How-To Tutorials - Data

1229 Articles
Packt
16 Jul 2013
18 min read

DPM Non-aware Windows Workload Protection

Protecting DFS with DPM

DFS stands for Distributed File System. It was introduced in Windows Server 2003, and is a set of services, available as a role on Windows Server operating systems, that allow you to group file shares held in different locations (different servers) under one folder known as the DFS root. The actual locations of the file shares are transparent to the end user. DFS is also often used for redundancy of file shares.

For more information on DFS, see:

- Windows Server 2008: http://technet.microsoft.com/en-us/library/cc753479%28v=ws.10%29.aspx
- Windows Server 2008 R2 and Windows Server 2012: http://technet.microsoft.com/en-us/library/cc732006.aspx

Before DFS can be protected it is important to know how it is structured. DFS consists of both data and configuration information:

- The configuration for DFS is stored in the registry of each server, and in either the DFS tree during standalone DFS deployments, or in Active Directory when domain-based DFS is deployed.
- DFS data is stored on each server in the DFS tree. The data consists of the multiple shares that make up the DFS root.

Protecting DFS with DPM is fairly straightforward. It is recommended to protect the actual file shares directly on each of the servers in the DFS root. When you have a standalone DFS deployment you should protect the system state on the servers in the DFS root, and when you have a domain-based DFS deployment we recommend that you protect Active Directory on the domain controller that hosts the DFS root. If you are using DFS replication it is also recommended to protect the shadow copy components on servers that host the replication data, in addition to the previously mentioned items. These methods allow you to restore DFS by restoring the data and either the system state or Active Directory, depending on your deployment type.

Another option is to use the DfsUtil tool to export/import your DFS configuration.
This is a command-line utility that comes with Windows Server and can export the namespace configuration to a file. The configuration can then be imported back into a DFS server to restore a DFS namespace. DPM can be set up to protect the DFS export. You would still need to protect the actual data directly.

An example of using the DfsUtil tool would be:

- Run DfsUtil root export \\domainname\rootname dfsrootname.xml to export the DFS configuration to an XML file.
- Run DfsUtil root import to import the DFS configuration back in.

For more information on the DfsUtil tool, visit the following URL: http://blogs.technet.com/b/josebda/archive/2009/05/01/using-the-windows-server-2008-dfsutil-exe-command-line-to-manage-dfs-namespaces.aspx

That covers the backing up of DFS with DPM.

Protecting Dynamics CRM with DPM

Microsoft Dynamics CRM is Microsoft's customer relationship management (CRM) software. Microsoft Dynamics CRM Version 1.0 was released in 2003. It then progressed to Version 4.0, and the latest one is 2011. CRM is a part of the Microsoft Dynamics product family. In this section we will cover protecting Versions 4.0 and 2011.

Note that when protecting Microsoft Dynamics CRM on either Version 4.0 or 2011, you should keep a note of your update-rollup level somewhere safe, so that you can install CRM back to that level in the event of a restore. You will need to restore the CRM database, and this could lead to an error if CRM is not at the correct update level.

To protect Microsoft Dynamics CRM 4.0, back up the following components:

- Microsoft CRM Server database: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are the following:
    - The configuration database: MSCRM_CONFIG
    - The organization database: OrganizationName_MSCRM
- Microsoft CRM Server program files: By default, these files will be located at C:\Program Files\Microsoft CRM.
- Microsoft CRM website: By default the CRM website files are located in the C:\Inetpub\wwwroot directory. The web.config file can be protected. It only needs protecting if it has been changed from the default settings.
- Microsoft CRM registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM key.
- Microsoft CRM customizations: To protect customizations or any third-party add-ons you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM include the following:

- System state of your domain controller.
- Exchange server if the CRM's e-mail router is used.

To protect Microsoft Dynamics CRM 2011, back up the following components:

- Microsoft CRM 2011 databases: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are:
    - The configuration database: MSCRM_CONFIG
    - The organization database: OrganizationName_MSCRM
- Microsoft CRM 2011 program files: By default, these files will be located at C:\Program Files\Microsoft CRM.
- Microsoft CRM 2011 website: By default the CRM website files are located in the C:\Program Files\Microsoft CRM\CRMWeb directory. The web.config file can be protected. It only needs protecting if it has been changed from the default settings.
- Microsoft CRM 2011 registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM subkey.
- Microsoft CRM 2011 customizations: To protect customizations or any third-party add-ons you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM 2011 include:

- System state of your domain controller.
- Exchange server if the CRM's e-mail router is used.
- SharePoint if CRM and SharePoint integration is in use.

Note that for both CRM 4.0 and CRM 2011, you could have more than one OrganizationName_MSCRM database if you have more than one organization in CRM. Be sure to protect all of the OrganizationName_MSCRM databases that may exist.
That wraps up the Microsoft Dynamics CRM protection for both 4.0 and 2011. You simply need to configure protection of the mentioned components with DPM. Now let's look at what it will take to protect another product from the Dynamics family.

Protecting Dynamics GP with DPM

Dynamics GP is Microsoft's ERP and accounting software package for mid-market businesses. GP has standard accounting functions, but it can do more, such as Sales Order Processing, Order Management, Inventory Management, and Demand Planner for forecasting, which makes it usable as a full-blown ERP. GP was once known as Great Plains Software before its acquisition by Microsoft. The most recent versions of GP are Microsoft Dynamics GP 10.0 and Dynamics GP 2010 R2.

GP holds your organization's financial data. If you use it as an ERP solution, it holds even more critical data, and losing this data could be devastating to an organization. GP does have a built-in backup utility, but it does not cover all bases. In fact, the built-in backup process only backs up the SQL database, and does not cover items like:

- Customized forms
- Reports
- Financial statement formats
- The sysdata folder

These are the GP components you should protect with DPM:

- SQL administrative databases: Master, TempDB, and Model
- Microsoft Dynamics GP system database (DYNAMICS)
- Each of your company databases
- The msdb database, if you use SQL Server Agent to schedule automatic tasks
- forms.dic (for customized forms), which can be found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010
- reports.dic (for reports), which can be found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010

Backing up these components with DPM should be sufficient protection in the event a restore is needed.

Protecting TMG 2010 with DPM

Threat Management Gateway (TMG) is a part of the Forefront product family. The predecessor to TMG is Internet Security and Acceleration Server (ISA Server).
TMG is fundamentally a firewall, but a very powerful one with features such as VPN, web caching, reverse proxy, advanced stateful packet inspection, WAN failover, malware protection, routing, load balancing, and much more.

There have been several threads on the Microsoft DPM TechNet forums asking about DPM protecting TMG, which sparked the inclusion of this section in the book. TMG is a critical part of a network and should have high priority for backup, right up there with your other critical business applications. In many environments, if TMG is down, a good number of users cannot access certain business applications, which causes downtime. Let's take a look at how and what to protect for TMG.

The first step is to allow DPM traffic on TMG so that the agent can communicate with DPM. You will need to install the DPM agent on TMG and then start protecting it from there. Follow these steps to protect your TMG server:

- On the TMG server, go to Start | All Programs | Microsoft TMG Server.
- Open the TMG Server Management MMC.
- Expand Arrays and then the TMG Server computer, then click on Firewall Policy.
- On the View menu, click on Show System Policy Rules.
- Right-click on the Allow remote management from selected computers using MMC system policy rule. Select Edit System Policy.
- In the System Policy Editor dialog box, click to clear the Enable this configuration group checkbox, and then click on OK.
- Click on Apply to update the firewall configuration, and then click on OK.
- Right-click on the Allow RPC from TMG server to trusted servers system policy rule. Select Edit System Policy.
- In the System Policy Editor dialog box, click to clear the Enforce strict RPC compliance checkbox, and then click on OK.
- Click on Apply to update the firewall configuration, and then click on OK.
- On the View menu, click on Hide System Policy Rules.
- Right-click on Firewall Policy. Select New and then Access Rule.
- In the New Access Rule Wizard window, type a name in the Access rule name box. Click on Next.
- Check the Allow checkbox and then click on Next.
- In the This rule applies to list, select All outbound traffic from the drop-down menu and click on Next.
- On the Access Rule Sources page, click on Add.
- In the Add Network Entities dialog window, click on New and select Computer from the drop-down list. Now type the name of your DPM server and type the DPM server's IP address in the Computer IP Address field. Click on OK when you are done.
- You will then see your DPM server listed under the Computers folder in the Add Network Entities window. Select it and click on Add. This will bring the DPM computer into your access rule wizard. Click on Next.
- In the Add Rule Destinations window click on Add. The Add Network Entities window will come up again. In this window expand the Networks folder, and then select Local Host and click on Add. Now click on Next.
- Your rule should have both the DPM server and Local Host listed for both incoming and outgoing. Click on Next, leave the default All Users entry in the This rule applies to requests from the following user sets box, and click on Next again. Click on Finish.
- Right-click on the new rule (DPM2010 in this example), and then click on Move Up.
- Right-click on the new rule, and select Properties.
- In the rule name properties dialog box (DPM2010 Properties), click on the Protocols tab, then click on Filtering. Now select Configure RPC Protocol.
- In the Configure RPC protocol policy dialog box, check the Enforce strict RPC compliance checkbox, and then click on OK twice.
- Click on Apply to update the firewall policy, and then click on OK.

Now you will need to attach the DPM agent for the TMG server. Follow these steps to complete this task:

- Open the DPM Administrator Console.
- Click on the Management tab on the navigation bar.
- Now click on the Agents tab.
- On the Actions pane, click on Install.
- The Protection Agent Install Wizard window should now pop up. Choose the Attach agents checkbox. Choose Computer on trusted domain, and click on Next.
- Select the TMG server from the list, click on Add, and then click on Next.
- Enter credentials for the domain account. The account that is used here needs to have administrative rights on the computer you are going to protect. Click on Next to continue.
- You will receive a warning that DPM cannot tell if the TMG server is clustered or not. Click on OK for this.
- On the next screen click on Attach to continue.

Next you have to install the agent on the TMG firewall and point it to the correct DPM server. Follow these steps to complete this task:

- From the TMG server that you will be protecting, access the DPM server over the network and copy the folder containing the agent installer down to the local machine. Use this path: \\DPMSERVERNAME\%systemdrive%\program files\Microsoft DPM\DPM\ProtectionAgents\RA\3.0\3.0.7696.0\i386.
- Then, from the local folder on the protected computer, run dpmra.msi to install the agent.
- Open a command prompt (make sure you have elevated privileges), change directory to C:\Program Files\Microsoft Data Protection Manager\DPM\bin, and then run the following:

    SetDpmServer.exe -dpmServerName <serverName> -userName <userName>

  The following is an example of the previous command:

    SetDpmServer.exe -dpmServerName buchdpm

- Now restart the TMG server.
- Once your TMG server comes back, check the Windows services to make sure that the DPMRA service is set to automatic, and then start it.

That is it for configuring DPM to start protecting TMG, but there are a few more things that we still need to cover on this topic. With TMG backup you can choose to back up certain components of TMG, depending on your recovery needs. With DPM you can back up the TMG hard drive, the TMG logs that are stored in SQL, TMG's system state, or a bare metal recovery (BMR) image of TMG.
The following is the list of components you should back up depending on your circumstances. What can be included in a TMG server backup:

- TMG configuration settings (exported through TMG)
- TMG firewall settings (exported through TMG)
- TMG logfiles (stored in SQL databases)
- TMG install directory (only needed if you have custom forms for things such as an Outlook Web Access login screen)
- TMG server system state
- TMG BMR

None of the previous components are required for protection of TMG. In fact, protecting the SQL logfiles tends to cause more issues than it solves. These SQL log databases change so often that DPM will raise an error when the old SQL databases are no longer found under protection. The logfiles are not required to restore your TMG.

For a standard TMG restore, you will need to reinstall TMG, reconfigure NIC settings, import any certificates, and restore the TMG configuration and firewall settings. For more information on backing up TMG 2010, visit the following page: http://technet.microsoft.com/en-us/library/cc984454.aspx.

DPM cannot back up the TMG configuration and firewall settings natively. The export needs to be scripted, scheduled through Windows Task Scheduler, and written to the local hard drive. DPM can then back up the exported TMG settings .XML file from there.

You can find the TMG server's export script at http://msdn.microsoft.com/en-us/library/ms812627.aspx. Place this script into a .VBS file, and then set up a scheduled task to run this file. This automates the export of your TMG server settings.

There is another way to back up the entire TMG server. This is a new type of protection, specific to TMG 2010. This protection is BMR, and it is available because TMG is now installed on top of Windows Server 2008 and Windows Server 2008 R2. Protecting the BMR of your TMG gives you the ability to restore your entire TMG in the event that it fails, configuration and firewall settings included.
BMR will also bring back certificates and NIC settings. Note that a BMR of TMG restored on a virtual machine can't use its NIC settings. They only apply when restoring to the same hardware.

That covers how to protect TMG with DPM. As you can see, there are some improvements through BMR, and if you do not employ BMR protection you can still automate the process of protecting TMG.

How to protect IIS

Internet Information Services (IIS) is Microsoft's web server platform. It is included for free with Windows Server operating systems. Its modular nature makes it scalable for different organizations' web server needs. The latest version is IIS 8. It can be used for more than standard web hosting, for example as an FTP server or for media delivery.

Knowing what to protect when it comes to IIS will come in handy in almost any environment you may work in. Backing up IIS is one thing, but you also need to understand the websites or web applications you are running, so that you know how to back them up too. In this section, we are going to look at the protection of IIS.

To protect IIS, you should back up the following components:

- IIS configuration files
- Website or web application data
- SSL certificates
- Registry (only needed if the website or web application required modifications of the registry)
- Metabase

The IIS configuration files are located in the %systemdrive%\windows\system32\inetsrv\config directory (and subdirectories).

The website or web application files are typically found in C:\inetpub\wwwroot. This is the default location, but the website or web application files can be located anywhere on an IIS server.

To export SSL certificates directly from IIS, follow these steps:

- Open the Microsoft IIS 7 console.
- In the left-hand pane, select the server name.
- In the center pane click on the Server Certificates icon.
- Right-click on the certificate you wish to export and select Export.
- Enter a file path, name the certificate file, and give it a password.
- Click on OK and your certificate will be exported as a .pfx file in the path you specified.

The metabase is an internal database that holds IIS configuration data. It is made up of two files: MBSchema.xml and MetaBase.xml. These can be found in %SystemRoot%\system32\inetsrv.

A good thing to know is that if you protect the system state of a server, the IIS configuration will be included in this backup. This does not include the website or web application files, so you will still need to protect these in addition to a system state backup.

That covers the items you will need to protect IIS with DPM backup.

Protecting Lync 2010 with DPM

Lync 2010 is Microsoft's Unified Communication platform, complete with IM, presence, conferencing, enterprise video and voice, and more. Lync was formerly known as Office Communicator. Lync is quickly becoming an integral part of business communications. With Lync being a critical application to organizations, it is important to ensure this platform is backed up.

Lync is a massive product with many moving parts. We are not going to cover all of Lync's architecture, as this would need its own book. We are going to focus on what should be backed up to ensure protection of your Lync deployment. Overall, we want to protect Lync's settings and configuration data. The majority of this data is stored in the Lync Central Management store.

The following are the components that need to be protected in order to back up Lync:

- Settings and configuration data
    - Topology configuration (Xds.mdf)
    - Location information (Lis.mdf)
    - Response group configuration (RgsConfig.mdf)
- Data stored in databases
    - User data (Rtc.mdf)
    - Archiving data (LcsLog.mdf)
    - Monitoring data (csCDR.mdf and QoeMetrics.mdf)
- File stores
    - Lync server file store
    - Archiving file store

These stores will be file shares on the Lync server, named in the format \\lyncservername\sharename.
To track down these file shares if you don't know where they are, go to the Lync Topology Builder and look in the File stores node.

Note that the files named Meeting.Active should not be backed up. These files are in use and locked while a meeting takes place.

Other components are as follows:

- Active Directory (user SIP data, a pointer to the Central Management store, and objects for Response Group and Conferencing Attendant)
- Certification authority (CA) and certificates (if you use an internal CA)
- Microsoft Exchange and Exchange Unified Messaging (UM), if you are using UM with your Exchange
- Domain Name System (DNS) records and IP addresses
- IIS on Lync Server
- DHCP configuration
- Group Chat (if used)
- XMPP gateways, if you are using an XMPP gateway
- Public switched telephone network (PSTN) gateway configuration, if your Lync is connected to one
- Firewall and load balancer configurations (if used)

Summary

Now that we have had a chance to look at several Microsoft workloads that are used in organizations today and how to protect them with DPM, you should have a good understanding of what it takes to back them up. These workloads included Lync 2010, IIS, CRM, GP, DFS, and TMG. Note that there are many more Microsoft workloads that DPM cannot protect natively, which we were unable to cover in this article.

Packt
06 Sep 2013
7 min read

Out-of-process distributed caching

Getting ready

Out-of-process caching is a way of moving your caching needs into a different JVM and/or infrastructure. Ehcache provides a convenient deployable WAR file that works on most web containers and standalone servers, and whose mission is to provide an easy API to a distributed cache. At the time of writing, you can download it from http://sourceforge.net/projects/ehcache/files/ehcache-server/, or you can include it in your Maven POM, in which case it will be delivered as a WAR file.

The cache server requires no special configuration on the Tomcat container. However, if you are running GlassFish, Jetty, WebLogic, or any other application server (or servlet container), there are minimal configuration changes to make. Please refer to the Ehcache cache server documentation for details on these.

While using the RESTful interface, it is important to note that you have three ways to set the MIME type for exchanging data back and forth with the cache server, namely text/xml, application/json, and application/x-java-serialized-object. You can use any programming language to invoke the web service interface and cache your objects (except with application/x-java-serialized-object, which requires Java serialization).

Refer to the recipe8 project directory within the source code bundle for a fully working sample of this recipe content and further information related to this topic.

How to do it...

Add the Ehcache and Ehcache cache server dependencies to your pom.xml file:

    <dependency>
      <groupId>net.sf.ehcache</groupId>
      <artifactId>ehcache-server</artifactId>
      <version>1.0.0</version>
      <type>war</type>
    </dependency>
    <dependency>
      <groupId>net.sf.ehcache</groupId>
      <artifactId>ehcache</artifactId>
      <version>2.6.0</version>
      <type>pom</type>
    </dependency>

Edit ehcache.xml in the cache server to hold your cache setup (the cache name is very important). You can find this file here: ${CACHE_SERVER}/WEB-INF/classes/ehcache.xml.
    <?xml version="1.0" encoding="UTF-8"?>
    <ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="ehcache.xsd"
             updateCheck="true" monitoring="autodetect" dynamicConfig="true">
      <!-- Set cache eternal (of course not to do in production) -->
      <cache name="remoteCache"
             maxElementsInMemory="10000"
             eternal="true"
             diskPersistent="true"
             overflowToDisk="true"/>
      ...

Disable the SOAP interface in the cache server web.xml file (since we are going to use RESTful). You can find this file here: ${CACHE_SERVER}/WEB-INF/web.xml.

    <?xml version="1.0" encoding="UTF-8"?>
    <web-app xmlns="http://java.sun.com/xml/ns/javaee"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
             version="2.5">
      ...
      <!-- SOAP Servlet
           Comment out (or remove) to disable SOAP Web Services
      <servlet>
        <servlet-name>EhcacheWebServiceEndpoint</servlet-name>
        <servlet-class>com.sun.xml.ws.transport.http.servlet.WSServlet</servlet-class>
        <load-on-startup>1</load-on-startup>
      </servlet>
      <servlet-mapping>
        <servlet-name>EhcacheWebServiceEndpoint</servlet-name>
        <url-pattern>/soap/EhcacheWebServiceEndpoint</url-pattern>
      </servlet-mapping>
      <session-config>
        <session-timeout>60</session-timeout>
      </session-config>
      <listener>
        <listener-class>com.sun.xml.ws.transport.http.servlet.WSServletContextListener</listener-class>
      </listener>
      -->
      ...

Make your objects-to-be-cached serializable:

    import java.io.Serializable;

    public final class Item implements Serializable {

Invoke the RESTful (or SOAP) interface to save/retrieve/delete cached objects:

    ...
    public void saveItemInCache(String key, Serializable item) {
        //sample URL: http://localhost:8080/ehcache/rest/cacheName/{id}
        //here cacheName is the cache name you set up in the cache-server ehcache.xml
        String url = CACHE_SERVER_URL + "cacheName" + "/" + key;
        //initialize Apache HTTP Client
        client = new DefaultHttpClient();
        //create Cache Element to be sent
        Element element = new Element(key, item);
        //serialize object to be sent to EhCache Server
        byte[] itemToByteArray = SerializationUtils.serialize(element);
        //create PUT request
        HttpPut putRequest = new HttpPut(url);
        //set header to read java-serialized-object mime type
        putRequest.setHeader("Content-Type", "application/x-java-serialized-object");
        ...

How it works...

The Ehcache cache server utility is a versatile tool that lets us distribute cache engines in a very flexible way. It exposes a very simple API via RESTful or SOAP-based web services.

We start by editing the ehcache.xml configuration file within the cache server application, adding a cache that we would like to use for our cached objects:

    ...
    <!-- Set cache eternal (of course not to do in production) -->
    <cache name="remoteCache"
           maxElementsInMemory="10000"
           eternal="true"
           diskPersistent="true"
           overflowToDisk="true"/>
    ...

The cache name defined here is very important because it becomes the endpoint of the RESTful URL pattern that the cache server will identify and use.

Then, we need to edit the web.xml file within the cache server application (located in ${CACHE_SERVER}/WEB-INF/) in order to comment out (or completely remove) the service definitions that we are not going to use (that is, SOAP if you are using RESTful, or vice versa).

    <!-- SOAP Servlet
         Comment out to disable SOAP Web Services
    <servlet>
      <servlet-name>EhcacheWebServiceEndpoint</servlet-name>
      <servlet-class>com.sun.xml.ws.transport.http.servlet.WSServlet</servlet-class>
      <load-on-startup>1</load-on-startup>
    </servlet>
    ...
In order to cache an object (especially a Java object), we need to make it serializable simply by implementing the Serializable interface (this is not a requirement for MIME types other than application/x-java-serialized-object).

    import java.io.Serializable;

    public final class Item implements Serializable {

Finally, we invoke the RESTful endpoint from our code to store, retrieve, or delete the object in the cache layer.

    //sample URL: http://localhost:8080/ehcache/rest/cacheName/{id}
    //here cacheName is the cache name you set up in the cache-server ehcache.xml
    String url = CACHE_SERVER_URL + "cacheName" + "/" + key;
    //set header to read json mime type
    putRequest.setHeader("Content-Type", "application/json");

It is important to note here that the cacheName URL parameter represents the cache name you defined in the ehcache.xml configuration file in the cache server application. You have defined your cache name as follows:

    <!-- Set cache eternal (of course not to do in production) -->
    <cache name="remoteCache"
           maxElementsInMemory="10000"
           ...

Now, your URL would be something like this:

    //sample URL: http://localhost:8080/ehcache/rest/remoteCache/{id}

Here, id is just the key value you assign to the cached object.

Finally, you just use any HTTP/SOAP client library (or the default Java Net API classes) to invoke the web service. In the case of RESTful services, you need to be aware that the HTTP method sent determines whether you are storing, updating, retrieving, or deleting a cached item. The methods are as follows:

- GET /{cache}/{element}: This retrieves an object by its key from the out-of-process (O-O-P) cache layer.
- PUT /{cache}/{element}: This stores an item in the O-O-P cache layer.
- DELETE /{cache}/{element}: This deletes an item from the O-O-P cache layer.
- HEAD /{cache}/{element}: This retrieves metadata (cache configuration values) from the O-O-P cache layer.
- OPTIONS /{cache}/{element}: This returns the WADL describing operations.
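The mapping from HTTP methods to cache operations can be sketched with the JDK's own java.net.http client (Java 11 or later) in place of Apache HttpClient. This is a minimal sketch under stated assumptions, not the recipe's code: CacheRequests is a hypothetical helper, the base URL and the remoteCache name follow the sample URL in this recipe, keys are assumed to be URL-safe, and the value is serialized directly with java.io rather than wrapped in an Ehcache Element as the recipe does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper (not part of Ehcache): builds REST requests for one cache.
final class CacheRequests {
    private final String baseUrl;   // assumed, e.g. http://localhost:8080/ehcache/rest
    private final String cacheName; // must match the cache name in the server's ehcache.xml

    CacheRequests(String baseUrl, String cacheName) {
        this.baseUrl = baseUrl;
        this.cacheName = cacheName;
    }

    // Builds {base}/{cache}/{key}; the key is assumed to need no URL encoding.
    URI elementUri(String key) {
        return URI.create(baseUrl + "/" + cacheName + "/" + key);
    }

    // GET retrieves the element stored under the key.
    HttpRequest get(String key) {
        return HttpRequest.newBuilder(elementUri(key)).GET().build();
    }

    // DELETE removes the element stored under the key.
    HttpRequest delete(String key) {
        return HttpRequest.newBuilder(elementUri(key)).DELETE().build();
    }

    // PUT stores the Java-serialized value under the key.
    HttpRequest put(String key, Serializable value) {
        return HttpRequest.newBuilder(elementUri(key))
                .header("Content-Type", "application/x-java-serialized-object")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(serialize(value)))
                .build();
    }

    // Standard Java serialization to a byte array.
    static byte[] serialize(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CacheRequests requests =
                new CacheRequests("http://localhost:8080/ehcache/rest", "remoteCache");
        HttpRequest put = requests.put("42", "cached value");
        System.out.println(put.method() + " " + put.uri());
    }
}
```

To actually send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray()). Building the requests separately from sending them keeps the URL and header logic easy to check without a running cache server.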
To change the context, edit the file ${CACHE_SERVER}/META-INF/context.xml and set your desired context name.

As for security, look for the file ${CACHE_SERVER}/WEB-INF/server_security_config.xml_rename_to_activate and open it to read the instructions.

Summary

This article provided details on implementing distributed caching using the Ehcache server, and also explained in brief what out-of-process caching is.

Packt
04 Oct 2013
7 min read

CQL for client applications

Using the Thrift API

The Thrift library is based on the Thrift RPC protocol. High-level clients built over it have been a standard way of building an application for a long time. In this section, we'll explain how to write a client application using CQL as the query language and Thrift as the Java API.

When we start Cassandra, by default it listens to Thrift clients (the start_rpc: true property in the CASSANDRA_HOME/conf/cassandra.yaml file enables this). Let's build a small program that connects to Cassandra using the Thrift API, and runs CQL 3 queries for reading/writing data in the UserProfiles table we created for the facebook application. The program can be built by performing the following steps:

Adding the Thrift library: Put apache-cassandra-thrift-1.2.x.jar (which is to be found in the CASSANDRA_HOME/lib folder) on your classpath. If your Java project is mavenized, you need to insert the following entry in pom.xml under the dependencies section (the version will vary depending upon your Cassandra server installation):

    <dependency>
      <groupId>org.apache.cassandra</groupId>
      <artifactId>cassandra-thrift</artifactId>
      <version>1.2.5</version>
    </dependency>

Connecting to the Cassandra server: To connect to the server on a given host and port, you need to open an org.apache.thrift.transport.TTransport to the Cassandra node and create an instance of org.apache.cassandra.thrift.Cassandra.Client as follows:

    TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
    TProtocol protocol = new TBinaryProtocol(transport);
    Cassandra.Client client = new Cassandra.Client(protocol);
    transport.open();
    client.set_cql_version("3.0.0");

The default CQL version for Thrift is 2.0.0. You must set it to 3.0.0 if you are writing CQL 3 queries and don't want to see any version-related errors.
After you are done with the transport, close it gracefully (usually at the end of read/write operations) as follows:

    transport.close();

Creating a schema: The executeQuery() utility method accepts a CQL 3 query as a String and runs it:

    CqlResult executeQuery(String query) throws Exception {
        return client.execute_cql3_query(
                ByteBuffer.wrap(query.getBytes("UTF-8")),
                Compression.NONE, ConsistencyLevel.ONE);
    }

Now, create the keyspace and the table by directly executing CQL 3 queries:

    //Create keyspace
    executeQuery("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':3};");
    executeQuery("USE facebook;");
    //Create table
    executeQuery("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data: A couple of records can be inserted as follows:

    executeQuery("USE facebook;");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Executing the SELECT query returns a CqlResult, over which we can iterate easily to fetch records:

    CqlResult result = executeQuery("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (CqlRow row : result.getRows()) {
        System.out.println(row.getKey());
    }

Using the Datastax Java driver

The Datastax Java driver is based on the Cassandra binary protocol that was introduced in Cassandra 1.2, and works only with CQL 3. The Cassandra binary protocol is specifically made for Cassandra, in contrast to Thrift, which is a generic framework and has many limitations.
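One practical difference shows up in the code: the Thrift executeQuery() helper has to encode each query string into a UTF-8 ByteBuffer by hand before calling execute_cql3_query, whereas the driver's Session.execute(), used in the steps that follow, takes the query String directly. The encoding step on its own can be sketched with the standard library. CqlBytes is a hypothetical helper introduced here for illustration, not part of either client library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the manual encoding step the Thrift example performs per query.
final class CqlBytes {
    private CqlBytes() {}

    // Encodes a CQL statement as UTF-8, the form handed to execute_cql3_query.
    static ByteBuffer encode(String cql) {
        return ByteBuffer.wrap(cql.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes a copy of the buffer (e.g. for logging) without moving its position.
    static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Note the explicit + and the trailing space between the concatenated parts.
        String cql = "SELECT * FROM facebook.UserProfiles "
                + "WHERE email_id = 'john.smith@example.com';";
        ByteBuffer wire = encode(cql);
        System.out.println(wire.remaining() + " bytes: " + decode(wire));
    }
}
```

Using StandardCharsets.UTF_8 instead of the "UTF-8" charset name avoids the checked UnsupportedEncodingException that forces the original helper to declare throws Exception for the encoding step.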
Now, we are going to write a Java program that uses the DataStax Java driver for reading/writing data into Cassandra, by performing the following steps:

Downloading the driver library: The driver library JAR file must be in your classpath in order to build an application using it. If you have a Maven-based Java project, insert the following entry into the pom.xml file under the dependencies section:

    <dependency>
      <groupId>com.datastax.cassandra</groupId>
      <artifactId>cassandra-driver-core</artifactId>
      <version>1.0.1</version>
    </dependency>

The driver project is hosted on GitHub (https://github.com/datastax/java-driver). It makes sense to check there and download the latest version.

Configuring Cassandra to listen to native clients: In newer versions of Cassandra, this is enabled by default and Cassandra listens to clients using the binary protocol. Earlier Cassandra installations may require enabling it. All you have to do is check and enable the start_native_transport property in the CASSANDRA_HOME/conf/cassandra.yaml file by inserting/uncommenting the following line:

    start_native_transport: true

The port that Cassandra uses for listening to native clients is determined by the native_transport_port property. Cassandra can listen to both Thrift and native clients simultaneously. If you want to disable Thrift, just set the start_rpc property to false in CASSANDRA_HOME/conf/cassandra.yaml.

Connecting to Cassandra: The com.datastax.driver.core.Cluster class is the entry point for clients to connect to the Cassandra cluster:

    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

After you are done using it (usually when the application shuts down), close it gracefully:

    cluster.shutdown();

Creating a session: An object of com.datastax.driver.core.Session allows you to execute a CQL 3 statement.
The following line creates a Session instance:

    Session session = cluster.connect();

Creating a schema: Before reading/writing data, let's create a keyspace and a table similar to UserProfiles in the facebook application we built earlier:

    // Create keyspace
    session.execute("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':1};");
    session.execute("USE facebook");
    // Create table
    session.execute("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data: We can insert a couple of records as follows:

    session.execute("USE facebook");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Finding and printing records: A SELECT query returns an instance of com.datastax.driver.core.ResultSet. You can fetch individual rows by iterating over it using the com.datastax.driver.core.Row object:

    ResultSet results = session.execute("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (Row row : results) {
        System.out.println("Email: " + row.getString("email_id")
            + "\tName: " + row.getString("name")
            + "\tAge: " + row.getInt("age"));
    }

Deleting records: We can delete a record as follows:

    session.execute("DELETE FROM facebook.UserProfiles WHERE email_id='john.smith@example.com';");

Using high-level clients

In addition to the libraries based on the Thrift and binary protocols, some high-level clients are built to ease development and provide additional services, such as connection pooling, load balancing, failover, secondary indexing, and so on.
Some of them are listed here:

Astyanax (https://github.com/Netflix/astyanax): Astyanax is a high-level Java client for Cassandra. It allows you to run both simple and prepared CQL queries.
Hector (https://github.com/hector-client/hector): Hector is a high-level client for Cassandra. At the time of writing this book, it supported CQL 2 only (not CQL 3).
Kundera (https://github.com/impetus-opensource/Kundera): Kundera is a JPA 2.0-based object-datastore mapping library for Cassandra and many other NoSQL datastores. CQL 3 queries are run with Kundera using native queries as described in the JPA specification.

Summary

In this article, we learned three ways of running CQL queries from Java: the Thrift API, the DataStax Java driver, and high-level clients.

Resources for Article:

Further resources on this subject:
Quick start – Creating your first Java application [Article]
Apache Cassandra: Libraries and Applications [Article]
Getting Started with Apache Cassandra [Article]

Aaron Lazar
15 Feb 2018
7 min read

How to make efficient data-driven decisions

This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.

Today we'll learn how to make data-driven decisions with the help of a few examples. The growing demand for data is a key challenge. Decision support teams, such as institutional research and business intelligence, often cannot make the right decisions on how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving a decision, in reality the goal is to make the right decision at the right time. In other words, the goal is decision support, not data support. This can be achieved through advanced use of data management and analytics.

Data value chain for making decisions

The diagram in figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data value chain leading to actual decisions, which is the goal. The value chain starts with the data discovery stage, which consists of several steps such as collecting and annotating data, preparing it, and then organizing it in a logical order with the desired flow. Then comes data integration, which establishes a common representation of the data. Since the target is to make the right decision, it is important to keep appropriate provenance of the data (that is, where it comes from) for future reference.

Now that your data is integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualizing it before deciding which actions to take on the basis of the interpreted results. However, is this enough to make the right decision?
Probably not! The reason is that it lacks enough analytics, which is what eventually helps us make the decision with actionable insight. Predictive analytics comes in here to fill that gap. Let's see how with an example in the following section.

From disaster to decision – Titanic survival example

Here is the challenge, Titanic: Machine Learning from Disaster, from Kaggle (https://www.kaggle.com/c/titanic):

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"

Before going deeper, we need to know about the data on the passengers travelling on the Titanic during the disaster, so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. Table 1 here shows the metadata about the Titanic survival dataset:

A snapshot of the dataset can be seen as follows:

The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is a must.
First, we need to import the necessary packages and libraries:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

Now read the dataset and create a pandas DataFrame:

    df = pd.read_csv('/home/asif/titanic_data.csv')

Before drawing the distribution of the dataset, let's specify the parameters for the graph:

    fig = plt.figure(figsize=(18,6), dpi=1600)
    alpha = alpha_scatterplot = 0.2
    alpha_bar_chart = 0.55
    fig = plt.figure()
    ax = fig.add_subplot(111)

Draw a bar diagram showing who survived versus who did not:

    ax1 = plt.subplot2grid((2,3),(0,0))
    ax1.set_xlim(-1, 2)
    df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    plt.title("Survival distribution: 1 = survived")

Plot a graph showing survival by age:

    plt.subplot2grid((2,3),(0,1))
    plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot)
    plt.ylabel("Age")
    plt.grid(True, which='major', axis='y')
    plt.title("Survival by Age: 1 = survived")

Plot a graph showing the distribution of the passenger classes:

    ax3 = plt.subplot2grid((2,3),(0,2))
    df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
    ax3.set_ylim(-1, len(df.Pclass.value_counts()))
    plt.title("Class dist. of the passengers")

Plot kernel density estimates of the passengers' age within each class:

    plt.subplot2grid((2,3),(1,0), colspan=2)
    df.Age[df.Pclass == 1].plot(kind='kde')
    df.Age[df.Pclass == 2].plot(kind='kde')
    df.Age[df.Pclass == 3].plot(kind='kde')
    plt.xlabel("Age")
    plt.title("Age dist. within class")
    plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')

Plot a graph showing passengers per boarding location:

    ax5 = plt.subplot2grid((2,3),(1,2))
    df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    ax5.set_xlim(-1, len(df.Embarked.value_counts()))
    plt.title("Passengers per boarding location")

Finally, we show all the subplots together:

    plt.show()

The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location:

However, to execute the preceding code, you need to install several packages such as matplotlib, pandas, and scipy. They are listed as follows:

Installing pandas: Pandas is a Python package for data manipulation. It can be installed as follows:

    $ sudo pip3 install pandas
    # For Python 2.7, use the following:
    $ sudo pip install pandas

Installing matplotlib: Matplotlib, used in the preceding code, is a plotting library. It can be installed as follows:

    $ sudo apt-get install python-matplotlib   # for Python 2.7
    $ sudo apt-get install python3-matplotlib  # for Python 3.x

Installing scipy: Scipy is a Python package for scientific computing. Installing blas, lapack, and gfortran is a prerequisite for this one. Execute the following commands in your terminal:

    $ sudo apt-get install libblas-dev liblapack-dev
    $ sudo apt-get install gfortran
    $ sudo pip3 install scipy  # for Python 3.x
    $ sudo pip install scipy   # for Python 2.7

For Mac, use the following commands to install the above modules:

    $ sudo easy_install pip
    $ sudo pip install matplotlib
    $ sudo pip install scipy

Note that libblas-dev, liblapack-dev, and gfortran are system packages rather than Python packages, so they cannot be installed with pip.

For Windows, I am assuming that Python 2.7 is already installed at C:\Python27. Open the command prompt and type the following commands, providing the package name accordingly:

    C:\Users\admin-karim>cd C:\Python27
    C:\Python27>python -m pip install <package_name>
For Python 3, issue the following commands:

    C:\Users\admin-karim>cd C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts
    C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts>python3 -m pip install <package_name>

Well, we have seen the data. Now it's your turn to do some analytics on top of it, say, predicting what kinds of people survived the disaster. We have plenty of information about the passengers, but how could we do the predictive modeling so that we can draw some fairly straightforward conclusions from this data? For example, being a woman, being in 1st class, and being a child were all factors that could boost a passenger's chances of survival during this disaster. In a brute-force approach (for example, using if/else statements with some sort of weighted scoring system), you could write a program to predict whether a given passenger would survive the disaster. However, does writing such a program in Python make much sense? Naturally, it would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, each passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in, so that you can build a program that learns from sample data to predict whether a given passenger would survive.

If you found this post useful and would like to explore more, head over to grab the book, Predictive Analytics with TensorFlow, written by Md. Rezaul Karim.
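To make the brute-force idea from this article concrete, here is a minimal sketch of an if/else predictor with a weighted score. The weights, the age cut-off of 16, the threshold, and the function names are all made up for illustration; they are not taken from the book or fitted to the dataset.

```python
def survival_score(sex, pclass, age):
    """Hand-written weighted score; every weight here is a guess."""
    score = 0
    if sex == "female":
        score += 5  # hypothetical weight for being a woman
    if pclass == 1:
        score += 3  # hypothetical weight for travelling 1st class
    if age is not None and age < 16:
        score += 2  # hypothetical weight for being a child
    return score


def predict_survived(sex, pclass, age, threshold=5):
    """Predict survival when the score reaches a hand-picked threshold."""
    return survival_score(sex, pclass, age) >= threshold


if __name__ == "__main__":
    print(predict_survived("female", 3, 30))
    print(predict_survived("male", 1, 40))
```

Every rule, weight, and threshold has to be chosen and re-tuned by hand, which is exactly the maintenance burden that a model learned from the data is meant to remove.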

Packt
19 Jan 2017
7 min read

Clustering Model with Spark

In this article by Manpreet Singh Ghotra and Rajdeep Dua, coauthors of the book Machine Learning with Spark, Second Edition, we will analyze the case where we do not have labeled data available. Supervised learning methods are those where the training data is labeled with the true outcome that we would like to predict (for example, a rating for recommendations and class assignment for classification or a real target variable in the case of regression). (For more resources related to this topic, see here.) In unsupervised learning, the model is not supervised with the true target label. The unsupervised case is very common in practice, since obtaining labeled training data can be very difficult or expensive in many real-world scenarios (for example, having humans label training data with class labels for classification). However, we would still like to learn some underlying structure in the data and use these to make predictions. This is where unsupervised learning approaches can be useful. Unsupervised learning models are also often combined with supervised models, for example, applying unsupervised techniques to create new input features for supervised models. Clustering models are, in many ways, the unsupervised equivalent of classification models. With classification, we would try to learn a model that would predict which class a given training example belonged to. The model is essentially a mapping from a set of features to the class. In clustering, we would like to segment the data in such a way that each training example is assigned to a segment called a cluster. The clusters act much like classes, except that the true class assignments are unknown. 
Clustering models have many use cases that are the same as classification; these include the following:

Segmenting users or customers into different groups based on behavior characteristics and metadata
Grouping content on a website or products in a retail business
Finding clusters of similar genes
Segmenting communities in ecology
Creating image segments for use in image analysis applications such as object detection

Types of clustering models

There are many different forms of clustering models available, ranging from simple to extremely complex ones. The Spark ML library currently provides K-means clustering, which is among the simplest approaches available. However, it is often very effective, and its simplicity makes it relatively easy to understand and scalable.

K-means clustering

K-means attempts to partition a set of data points into K distinct clusters (where K is an input parameter for the model). More formally, K-means tries to find clusters so as to minimize the sum of squared errors (or distances) within each cluster. This objective function is known as the within cluster sum of squared errors (WCSS). It is the sum, over each cluster, of the squared errors between each point and the cluster center. Starting with a set of K initial cluster centers (each computed as the mean vector of all data points in the cluster), the standard method for K-means iterates between two steps:

Assign each data point to the cluster that minimizes the WCSS. The sum of squares is equivalent to the squared Euclidean distance; therefore, this equates to assigning each point to the closest cluster center as measured by the Euclidean distance metric.
Compute the new cluster centers based on the cluster assignments from the first step.

The algorithm proceeds until either a maximum number of iterations has been reached or convergence has been achieved.
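The two alternating steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not Spark's implementation: the function names (assign, update, wcss, kmeans), the tuple-based points, and the choice to start from K randomly sampled points (or from caller-supplied initial_centers) are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Step 1: label each point with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def update(points, labels, centers):
    """Step 2: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster simply keeps its previous center.
            new_centers.append(old_center)
            continue
        new_centers.append(
            tuple(sum(coords) / len(members) for coords in zip(*members))
        )
    return new_centers


def wcss(points, labels, centers):
    """Within cluster sum of squared errors for a given assignment."""
    return sum(squared_distance(p, centers[k]) for p, k in zip(points, labels))


def kmeans(points, k, max_iterations=100, initial_centers=None, seed=0):
    """Alternate the two steps until assignments stop changing."""
    rng = random.Random(seed)
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = assign(points, centers)
        if new_labels == labels:  # no assignment changed, so we stop
            break
        labels = new_labels
        centers = update(points, labels, centers)
    return labels, centers
```

The loop ends either when max_iterations is exhausted or when an assignment pass leaves every label unchanged, mirroring the two stopping conditions described above.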
Convergence means that the cluster assignments no longer change during the first step; therefore, the value of the WCSS objective function does not change either. For more details, refer to Spark's documentation on clustering at http://spark.apache.org/docs/latest/mllib-clustering.html or refer to http://en.wikipedia.org/wiki/K-means_clustering.

To illustrate the basics of K-means, we will use a simple dataset. We have five classes, which are shown in the following figure:

Multiclass dataset

However, assume that we don't actually know the true classes. If we use K-means with five clusters, then after the first step, the model's cluster assignments might look like this:

Cluster assignments after the first K-means iteration

We can see that K-means has already picked out the centers of each cluster fairly well. After the next iteration, the assignments might look like those shown in the following figure:

Cluster assignments after the second K-means iteration

Things are starting to stabilize, but the overall cluster assignments are broadly the same as they were after the first iteration. Once the model has converged, the final assignments could look like this:

Final cluster assignments for K-means

As we can see, the model has done a decent job of separating the five clusters. The leftmost three are fairly accurate (with a few incorrect points). However, the two clusters in the bottom-right corner are less accurate. This illustrates the following:

The iterative nature of K-means
The model's dependency on the method of initially selecting the cluster centers (here, we will use a random approach)
How the final cluster assignments can be very good for well-separated data but can be poor for data that is more difficult

Initialization methods

The standard initialization method for K-means, usually simply referred to as the random method, starts by randomly assigning each data point to a cluster before proceeding with the first update step.
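As a rough sketch of the random method just described, the following plain-Python function gives every point a random cluster label and then computes each cluster's mean as its initial center. The function name random_partition_init is hypothetical, and the round-robin-then-shuffle trick is an assumption added here so that no cluster starts empty; a textbook version would draw each label independently.

```python
import random


def random_partition_init(points, k, seed=0):
    """Randomly assign points to k clusters and return labels and means."""
    rng = random.Random(seed)
    # Round-robin labels guarantee every cluster gets at least one point
    # (assuming len(points) >= k); shuffling makes the assignment random.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centers = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        # The initial center is the mean vector of the cluster's members.
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return labels, centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (4.0, 4.0), (6.0, 6.0)]
    print(random_partition_init(pts, 3))
```

Because each initial center is an average over a random subset of the data, the centers tend to start close to the overall mean, which is one reason the outcome depends so much on the initialization.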
Spark ML provides a parallel variant for this initialization method, called K-means++, which is the default initialization method used. Refer to http://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods and http://en.wikipedia.org/wiki/K-means%2B%2B for more information. The results of using K-means++ are shown here. Note that this time, the difficult bottom-right points have been mostly correctly clustered. Final cluster assignments for K-means++ Variants There are many other variants of K-means; they focus on initialization methods or the core model. One of the more common variants is fuzzy K-means. This model does not assign each point to one cluster as K-means does (a so-called hard assignment). Instead, it is a soft version of K-means, where each point can belong to many clusters and is represented by the relative membership to each cluster. So, for K clusters, each point is represented as a K-dimensional membership vector, with each entry in this vector indicating the membership proportion in each cluster. Mixture models A mixture model is essentially an extension of the idea behind fuzzy K-means; however, it makes an assumption that there is an underlying probability distribution that generates the data. For example, we might assume that the data points are drawn from a set of K-independent Gaussian (normal) probability distributions. The cluster assignments are also soft, so each point is represented by K membership weights in each of the K underlying probability distributions. Refer to http://en.wikipedia.org/wiki/Mixture_model for further details and for a mathematical treatment of mixture models. Hierarchical clustering Hierarchical clustering is a structured clustering approach that results in a multilevel hierarchy of clusters where each cluster might contain many subclusters (or child clusters). Each child cluster is, thus, linked to the parent cluster. This form of clustering is often also called tree clustering. 
Agglomerative clustering is a bottom-up approach where we have the following: Each data point begins in its own cluster The similarity (or distance) between each pair of clusters is evaluated The pair of clusters that are most similar are found; this pair is then merged to form a new cluster The process is repeated until only one top-level cluster remains Divisive clustering is a top-down approach that works in reverse, starting with one cluster, and at each stage, splitting a cluster into two, until all data points are allocated to their own bottom-level cluster. You can find more information at http://en.wikipedia.org/wiki/Hierarchical_clustering. Summary In this article, we explored a new class of model that learns structure from unlabeled data—unsupervised learning. You learned about various clustering models like the K-means model, mixture models, and the hierarchical clustering model. We also considered a simple dataset to illustrate the basics of K-means. Resources for Article: Further resources on this subject: Spark for Beginners [article] Setting up Spark [article] Holistic View on Spark [article]

Amarabha Banerjee
08 Feb 2018
5 min read

Explaining Data Exploration in under a minute

This article is taken from the book Machine Learning with R, written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.

Today, we shall explore different data exploration techniques and a real-world example of using these techniques.

Introduction

Data exploration is the process of finding insightful information in data. To find insights, various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, six steps are commonly involved in the exploration process. They are as follows:

Asking the right questions: Asking the right questions helps in understanding the objective and the target information sought from the data. Questions can be asked such as "What are my expected findings after the exploration is finished?" or "What kind of information can I extract through the exploration?"

Data collection: Once the right questions have been asked, the target of the exploration is clear. Data may come from various sources such as files, databases, the internet, and so on, and it arrives in unorganized and diverse formats. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most analysis and visualization tools or applications expect data to be in a certain format to generate results, so the raw data is of no use to them as it is.

Data munging: The raw data collected needs to be converted into the format required by the tools to be used. In this phase, raw data is passed through various processes such as parsing, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform raw data into a format that the analysis and visualization tools understand. Once the data is compatible with the tools, they are used to generate the different results.
Basic exploratory data analysis: Once the data munging is done and the data is formatted for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques to do this. Most analysis tools allow statistical functions to be performed on the data, and visualization tools help in viewing the data in different ways. By combining basic statistical operations with visualization, the data can be understood in a better way.

Advanced exploratory data analysis: Once the basic analysis is done, it's time to look at an advanced stage of analysis. In this stage, various prediction models are built on the basis of the requirements. Machine learning algorithms are utilized to train the models and generate inferences. The models are also tuned to ensure their correctness and effectiveness.

Model assessment: When the models are made, they are evaluated to find the best one among them. The major factor in deciding the best model is how closely it can predict the values. Models are also tuned here to increase accuracy and effectiveness. Various plots and graphs are used to inspect the models' predictions.

Real world example - using Air Quality Dataset

The airquality dataset comes bundled with R. It contains daily New York air quality measurements recorded for five months, from May to September 1973. To view all the available datasets, use the data() function; it displays all the datasets available with the R installation.

How to do it

Perform the following steps to see all the datasets in R and to inspect airquality:

    > data()
    > str(airquality)

Output

    'data.frame': 153 obs. of 6 variables:
     $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : int 5 5 5 5 5 5 5 5 5 5 ...
     $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...

    > head(airquality)

Output

      Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6

How it works

The str command displays the structure of the dataset. As you can see, it contains observations of the ozone, solar radiation, wind, and temperature attributes recorded each day for five months. Using the head function, you can see the first few lines of the actual data. The dataset is very basic, and it is enough to start processing and analyzing data at a very basic level.

The Kaggle website hosts many diverse kinds of datasets. Apart from datasets, it also holds many data science competitions to solve real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www.kaggle.com/. Many competitions are organized by large corporate bodies, government agencies, or academia, and many of them have prize money associated with them. The following screenshot shows competitions and prize money.

You can simply create an account and start participating in competitions by submitting code and output, which will then be assessed. The assessment or evaluation criteria are available on the detail page of each competition. By participating and using https://www.kaggle.com/, one gains experience in solving real-world problems. It gives you a taste of what data scientists do. The jobs page lists various jobs for data scientists and analysts, and you can apply if a profile is suitable or matches your interests.

If you liked our post, be sure to check out Machine Learning with R, which contains more useful machine learning techniques with R.
Packt
28 Oct 2013
12 min read

Low-Level Index Control

(For more resources related to this topic, see here.)

Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all users of this great full-text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 shipped with additional similarity models, which basically allow us to use a different scoring formula for our documents. In this section, we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch.

Setting per-field similarity

Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields in our mappings. For example, let's assume that we use the following simple mapping to index blog posts (stored in the posts_no_similarity.json file):

    {
      "mappings" : {
        "post" : {
          "properties" : {
            "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
            "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
            "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
          }
        }
      }
    }

What we would like to do is use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the name of the chosen similarity as its value. Our changed mappings (stored in the posts_similarity.json file) would look as follows:

    {
      "mappings" : {
        "post" : {
          "properties" : {
            "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
            "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" },
            "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" }
          }
        }
      }
    }

And that's all; nothing more is needed.
After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields. In the case of the divergence from randomness and information-based similarity models, we need to configure some additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section.

Default codec properties

When using the default codec, we are allowed to configure the following properties:

min_block_size: It specifies the minimum block size the Lucene term dictionary uses to encode blocks. It defaults to 25.
max_block_size: It specifies the maximum block size the Lucene term dictionary uses to encode blocks. It defaults to 48.

Direct codec properties

The direct codec allows us to configure the following properties:

min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8.
low_freq_cutoff: The codec will use a single array object to hold postings and positions that have a document frequency lower than this value. It defaults to 32.

Memory codec properties

By using the memory codec, we are allowed to alter the following properties:

pack_fst: It is a Boolean option that defaults to false and specifies whether the memory structure that holds the postings should be packed into the FST. Packing into the FST will reduce the memory needed to hold the data.
acceptable_overhead_ratio: It is a compression ratio of the internal structure, specified as a float value, which defaults to 0.2. When using the 0 value, there will be no additional memory overhead, but the returned implementation may be slow. When using the 0.5 value, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead.
Pulsing codec properties

When using the pulsing codec we are allowed to use the same properties as with the default codec and, in addition to them, one more property:

  • freq_cut_off: It defaults to 1. It is the document frequency at which the postings list will be written into the term dictionary. Terms with a document frequency equal to or less than the value of freq_cut_off will be processed this way.

Bloom filter-based codec properties

If we want to configure a bloom filter-based codec, we can use the bloom_filter type and set the following properties:

  • delegate: It specifies the name of the codec we want to wrap with the bloom filter.
  • fpp: It is a value between 0 and 1.0 which specifies the desired false positive probability. We are allowed to set multiple probabilities depending on the number of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that the fpp value of 0.01 will be used when the number of documents per segment is larger than 10,000 and the value of 0.03 will be used when the number of documents per segment is larger than one million.

For example, we could configure our custom bloom filter-based codec to wrap a direct posting format as shown in the following code (stored in the posts_bloom_custom.json file):

{
  "settings" : {
    "index" : {
      "codec" : {
        "postings_format" : {
          "custom_bloom" : {
            "type" : "bloom_filter",
            "delegate" : "direct",
            "fpp" : "10k=0.03, 1m=0.05"
          }
        }
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" },
        "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
      }
    }
  }
}

NRT, flush, refresh, and transaction log

In an ideal search solution, when new data is indexed it is instantly available for searching. At first glance this is exactly how ElasticSearch works, even in multiserver environments.
But this is not the truth (or at least not the whole truth) and we will show you why. Let's index an example document to the newly created index by using the following command:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now, we will replace this document and immediately try to find it. In order to do this, we'll use the following command chain:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty

The preceding command will probably result in a response very similar to the following:

{"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "test",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : { "title": "test" }
    } ]
  }
}

The first line starts with the response to the indexing command (the first command). As you can see, everything is correct, so the second command, the search query, should return the document with the title field set to test2. However, as you can see, it returned the first document. What happened? Before we give you the answer to that question, we should take a step back and discuss how the underlying Apache Lucene library makes newly indexed documents available for searching.

Updating index and committing changes

The segments are independent indices, which means that queries that run in parallel to indexing should, from time to time, add newly created segments to the set of segments that are used for searching. Apache Lucene does that by creating subsequent (because of the write-once nature of the index) segments_N files, which list the segments in the index. This process is called committing. Lucene can do this in a secure way: we are sure that either all changes or none of them hit the index.
If a failure happens, we can be sure that the index will be in a consistent state. Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command in Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. The Lucene library uses an abstraction class called Searcher to access the index. After a commit operation, the Searcher object should be reopened in order to be able to see the newly created segments. This whole process is called refresh. For performance reasons ElasticSearch tries to postpone costly refreshes, and by default a refresh is not performed after indexing a single document (or a batch of them); instead, the Searcher is refreshed every second. This happens quite often, but sometimes applications require the refresh operation to be performed more often than once every second. When this happens, either the requirements should be verified or you can consider using another technology. If required, it is possible to force a refresh by using the ElasticSearch API. For example, in our example we can add the following command:

curl -XGET localhost:9200/test/_refresh

If we add the preceding command before the search, ElasticSearch will respond as we had expected.

Changing the default refresh time

The time between automatic Searcher refreshes can be changed by using the index.refresh_interval parameter, either in the ElasticSearch configuration file or by using the update settings API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }'

The preceding command will change the automatic refresh to be done every 5 minutes. Please remember that data indexed between refreshes won't be visible to queries. As we said, the refresh operation is costly when it comes to resources. The longer the period between refreshes is, the faster your indexing will be.
If you are planning a very intensive indexing procedure and you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.

The transaction log

Apache Lucene can guarantee index consistency and all-or-nothing indexing, which is great. But this fact cannot assure us that there will be no data loss when a failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handles available to create new index files). Another problem is that frequent commits are costly in terms of performance (as you recall, a single commit will trigger a new segment creation and this can trigger the segments to merge). ElasticSearch solves those issues by implementing a transaction log. The transaction log holds all uncommitted transactions and, from time to time, ElasticSearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks happen automatically, so the user may not be aware of the fact that a commit was triggered at a particular moment. In ElasticSearch, the moment when the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing.

Please note the difference between the flush and refresh operations. In most cases refresh is exactly what you want: it is all about making new data available for searching. The flush operation, on the other hand, is used to make sure that all the data is correctly stored in the index and the transaction log can be cleared. In addition to automatic flushing, it can be forced manually using the flush API.
For example, we can flush all the data stored in the transaction log for all indices by running the following command:

curl -XGET localhost:9200/_flush

Or we can run the flush command for a particular index, which in our case is the one called library:

curl -XGET localhost:9200/library/_flush
curl -XGET localhost:9200/library/_refresh

In the second example we used it together with the refresh, which opens a new searcher after flushing the data.

The transaction log configuration

If the default behavior of the transaction log is not enough, ElasticSearch allows us to configure how the transaction log is handled. The following parameters can be set in the elasticsearch.yml file, as well as by using the index settings update API, to control transaction log behavior:

  • index.translog.flush_threshold_period: It defaults to 30 minutes (30m). It controls the time after which a flush will be forced automatically, even if no new data was written to the transaction log. In some cases this can cause a lot of I/O operations, so sometimes it's better to flush more often with less data stored in the log.
  • index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000.
  • index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than the parameter, the flush operation will be performed. It defaults to 200 MB.
  • index.translog.disable_flush: This option disables automatic flush. By default flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during the import of a large number of documents.

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of the index shards.
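To make the interplay of these four parameters easier to follow, here is a minimal sketch of the decision they describe. It is not ElasticSearch code: the shouldFlush function, the TRANSLOG_DEFAULTS object, and the shape of the translog argument are hypothetical, the exact comparison operators are a simplification, and "200 MB" is assumed to mean 200 * 1024 * 1024 bytes.

```javascript
// Defaults as quoted in the text. Reading "200 MB" as 200 * 1024 * 1024 bytes
// is an assumption made for this sketch.
const TRANSLOG_DEFAULTS = {
  flushThresholdOps: 5000,                    // index.translog.flush_threshold_ops
  flushThresholdSizeBytes: 200 * 1024 * 1024, // index.translog.flush_threshold_size
  flushThresholdPeriodMs: 30 * 60 * 1000,     // index.translog.flush_threshold_period
  disableFlush: false                         // index.translog.disable_flush
};

// Hypothetical decision function: should one shard's transaction log be flushed?
// translog = { operations, sizeBytes, msSinceLastFlush }
function shouldFlush(translog, settings = TRANSLOG_DEFAULTS) {
  if (settings.disableFlush) {
    return false; // automatic flushing is switched off, for example during a bulk import
  }
  return (
    translog.operations > settings.flushThresholdOps ||
    translog.sizeBytes >= settings.flushThresholdSizeBytes ||
    translog.msSinceLastFlush >= settings.flushThresholdPeriodMs
  );
}

const duringImport = Object.assign({}, TRANSLOG_DEFAULTS, { disableFlush: true });

console.log(shouldFlush({ operations: 12, sizeBytes: 4096, msSinceLastFlush: 1000 }));
console.log(shouldFlush({ operations: 5001, sizeBytes: 4096, msSinceLastFlush: 1000 }));
console.log(shouldFlush({ operations: 900000, sizeBytes: 4096, msSinceLastFlush: 1000 }, duringImport));
```

Because the settings apply per shard, a function like this would be evaluated once for each shard's transaction log rather than once per index.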
Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using the Settings Update API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }'

The preceding command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn flushing back on when the import is done.

Near Real Time GET

The transaction log gives us one more feature for free: the real-time GET operation, which makes it possible to return the latest version of a document, including versions that have not been committed yet. The real-time GET operation fetches data from the index, but first it checks if a newer version of that document is available in the transaction log. If there is a version that has not been flushed yet, the data from the index is ignored and the newer version of the document, the one from the transaction log, is returned. In order to see how it works, you can replace the search operation in our example with the following command:

curl -XGET localhost:9200/test/test/1?pretty

ElasticSearch should return a result similar to the following:

{
  "_index" : "test",
  "_type" : "test",
  "_id" : "1",
  "_version" : 2,
  "exists" : true,
  "_source" : { "title": "test2" }
}

If you look at the result, you will see that it is just as we expected, and no trick with refresh was required to obtain the newest version of the document.
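The difference between search and real-time GET can be modeled with a small toy, shown below. This is only a sketch under strong simplifications: ToyShard, its two maps, and its methods are hypothetical names, and a real shard involves Lucene segments, commits, and a Searcher rather than two in-memory maps.

```javascript
// Toy model of one shard (hypothetical, heavily simplified):
// - translog holds operations that the Searcher may not see yet
// - searchable holds what the current Searcher can see
class ToyShard {
  constructor() {
    this.translog = new Map();
    this.searchable = new Map();
  }

  // Index or replace a document; it goes to the translog first.
  index(id, source) {
    const previous = this.translog.get(id) || this.searchable.get(id);
    const version = previous ? previous.version + 1 : 1;
    this.translog.set(id, { id, version, source });
    return version;
  }

  // Refresh: make everything indexed so far visible to searches.
  refresh() {
    for (const [id, doc] of this.translog) {
      this.searchable.set(id, doc);
    }
  }

  // Search sees only what the last refresh exposed.
  search() {
    return Array.from(this.searchable.values());
  }

  // Real-time GET: check the translog first, then fall back to the index.
  get(id) {
    return this.translog.get(id) || this.searchable.get(id) || null;
  }
}

const shard = new ToyShard();
shard.index('1', { title: 'test' });
shard.refresh();                      // the periodic refresh has run
shard.index('1', { title: 'test2' }); // replaced, but not refreshed yet

console.log(shard.search()); // still the document with title "test"
console.log(shard.get('1')); // already the document with title "test2"
```

In this model the search result lags behind until refresh() is called again, while get() always reflects the latest indexed version, which mirrors the behavior observed with the curl commands above.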

Packt
24 Sep 2013
6 min read

Introducing QlikView elements

People

People are the only active element of data visualization, and as such, they are the most important. We briefly describe the roles of several people that participate in our project, but we mainly focus on the person who is going to analyze and visualize the data. After the meeting, we get together with our colleague, Samantha, who is the analyst that supports the sales and executive teams. She currently manages a series of highly personalized Excel workbooks that she creates from standard reports generated within the customer invoice and project management system. Her audience ranges from the CEO down to sales managers. She is not a pushover, but she is open to trying new techniques, especially given that the sponsor of this project is the CEO of QDataViz, Inc. As a data discovery user, Samantha possesses the following traits:

Ownership: She has a stake in the project's success or failure. She, along with the company, stands to grow as a result of this project, and most importantly, she is aware of this opportunity.
Driven: She is focused on grasping what we teach her and is self-motivated to continue learning after the project is finished. The cause of her drive is unimportant as long as she remains honest.
Honest: She understands that data is a passive element that is open to diverse interpretations by different people. She resists basing her arguments on deceptive visualization techniques or data omission.
Flexible: She does not endanger her job and company results by following every technological fad or whimsical idea. However, she realizes that technology does change and that a new approach can foment breakthroughs.
Analytical: She loves finding anomalies in the data and being the reason that action is taken to improve QDataViz, Inc. As a means to achieve what she loves, she understands how to apply functions and methods to manipulate data.
Knowledgeable: She is familiar with the company's data, and she understands the indicators needed to analyze its performance. Additionally, she serves as a data source and gives context to analysis.
Team player: She respects the roles of her colleagues and holds them accountable. In turn, she demands respect and is also obliged to meet her responsibilities.

Data

Our next meeting involves Samantha and Ivan, our Information Technology (IT) Director. While Ivan explains the data available in the customer invoice and project management system's well-defined databases, Samantha adds that she has vital data in Microsoft Excel that is missing from those databases. One Excel file contains the sales budget and another contains an additional customer grouping; both files are necessary to present information to the CEO. We take advantage of this discussion to highlight the following characteristics that make data easy to analyze.

Reliable: Ivan is going to document the origin of the tables and fields, which increases Samantha's confidence in the data. He is also going to perform basic data cleansing and eliminate duplicate records whose only difference is a period, two transposed letters, or an abbreviation. Once the system is operational, Ivan will consider the impact any change in the customer invoice and project management system may have on the data. He will also verify that the data is continually updated while Samantha helps confirm the data's validity.
Detailed: Ivan will preserve as much detail as possible. If he is unable to handle large volumes of data as a whole, he will segment the detailed data by month and reduce the detail of a year's data in a consistent fashion. Conversely, he will consider adding detail by prorating payments between the products of paid invoices in order to maintain a consistent level of detail between invoices and payments.
Formal: An Excel file as a data source is a short-term solution.
While Ivan respects its temporary use to allow for a quick, first release of the data visualization project, he takes responsibility for finding a more stable medium to long-term solution. In the span of a few months, he will consider modifying the invoice system, investing in additional software, or creating a simple portal to upload Excel files to a database.
Flexible: Ivan will not prevent progress solely for bureaucratic reasons. Samantha respects that Ivan's goal is to make data more standardized, secure, and recoverable. However, Ivan knows that if he does not move as quickly as the business does, he will become irrelevant as Samantha and others create their own black market of company data.
Referential: Ivan is going to make available manifold perspectives of QDataViz, Inc. He will maintain history, budgets, and forecasts by customers, salespersons, divisions, states, and projects. Additionally, he will support segmenting these dimensions into multiple groups, subgroups, classes, and types.

Tools

We continue our meeting with Ivan and Samantha, but we now change our focus to what tool we will use to foster great data visualization and analysis. We create the following list of basic features we hope for from this tool:

Fast and easy implementation: We should be able to learn the tool quickly and be able to deliver a first version of our data visualization project within a matter of weeks. In this fashion, we start receiving a return on our investment within a short period of time.
Business empowerment: Samantha should be able to continue her analysis with little help from us. Also, her audience should be able to easily perform their own lightweight analysis and follow up on the decisions made.
Enterprise-ready: Ivan should be able to maintain hundreds or thousands of users and data volumes that exceed 100 million rows. He should also be able to restrict access to certain data to certain users.
Finally, he needs to have the confidence that the tool will remain available even if a server fails.

Based on these expectations, we talk about data discovery tools, which are increasingly becoming part of the architecture of many organizations. Samantha can use these tools for self-service data analysis. In other words, she can create her own data visualizations without having to depend on pre-built graphs or reports. At the same time, Ivan can be reassured that the tool does not interfere with his goal of providing an enterprise solution that offers scalability, security, and high availability. The data discovery tool we are going to use is QlikView.

[Diagram: the overall architecture we will build and where this article focuses its attention]

Summary

In this article, we learned about people, data, and tools, which are essential parts of creating great data visualization and analysis.

Packt
29 Dec 2014
11 min read

Creating a Map

In this article by Thomas Newton and Oscar Villarreal, authors of the book Learning D3.js Mapping, we will cover the following topics through a series of experiments:

  • Foundation – creating your basic map
  • Experiment 1 – adjusting the bounding box
  • Experiment 2 – creating choropleths
  • Experiment 3 – adding click events to our visualization

Foundation – creating your basic map

In this section, we will walk through the basics of creating a standard map, with a step-by-step explanation of the code. The width and height can be anything you want. Depending on where your map will be visualized (cellphones, tablets, or desktops), you might want to consider providing a different width and height:

var height = 600;
var width = 900;

The next variable defines a projection algorithm that allows you to go from a cartographic space (latitude and longitude) to a Cartesian space (x,y), basically a mapping of latitude and longitude to coordinates. You can think of a projection as a way to map the three-dimensional globe to a flat plane. There are many kinds of projections, but geo.mercator is normally the default value you will use:

var projection = d3.geo.mercator();
var mexico = void 0;

If you were making a map of the USA, you could use a better projection called albersUsa, which positions Alaska and Hawaii better. With a geo.mercator projection, Alaska would render at a size rivaling that of the rest of the US. The albersUsa projection grabs Alaska, makes it smaller, and puts it at the bottom of the visualization.

[Screenshot: the USA rendered with geo.mercator]
[Screenshot: the USA rendered with geo.albersUsa]

The D3 library currently contains nine built-in projection algorithms. An overview of each one can be viewed at https://github.com/mbostock/d3/wiki/Geo-Projections. Next, we will assign the projection to our geo.path function.
This is a special D3 function that will map the JSON-formatted geographic data into SVG paths. The data format that the geo.path function requires is named GeoJSON:

var path = d3.geo.path().projection(projection);
var svg = d3.select("#map")
    .append("svg")
    .attr("width", width)
    .attr("height", height);

Including the dataset

The necessary data has been provided for you within the data folder with the filename geo-data.json:

d3.json('geo-data.json', function(data) {
console.log('mexico', data);

We get the data from an AJAX call. After the data has been collected, we want to draw only those parts of the data that we are interested in. In addition, we want to automatically scale the map to fit the defined height and width of our visualization. If you look at the console, you'll see that "mexico" has an objects property. Nested inside the objects property is MEX_adm1. This stands for the administrative areas of Mexico. It is important to understand the geographic data you are using, because other data sources might have different names for the administrative areas property.

[Screenshot: the console output showing the MEX_adm1 property]

Notice that the MEX_adm1 property contains a geometries array with 32 elements. Each of these elements represents a state in Mexico. Use this data to draw the D3 visualization:

var states = topojson.feature(data, data.objects.MEX_adm1);

Here, we pass all of the administrative areas to the topojson.feature function in order to extract and create an array of GeoJSON objects. The preceding states variable now contains the features property. This features array is a list of 32 GeoJSON elements, each representing the geographic boundaries of a state in Mexico. We will set an initial scale and translation to 1 and 0,0 respectively:

// Setup the scale and translate
projection.scale(1).translate([0, 0]);

The following algorithm is quite useful.
The bounding box is a spherical box that returns a two-dimensional array of min/max coordinates, inclusive of the geographic data passed:

var b = path.bounds(states);

To quote the D3 documentation: "The bounding box is represented by a two-dimensional array: [[left, bottom], [right, top]], where left is the minimum longitude, bottom is the minimum latitude, right is maximum longitude, and top is the maximum latitude." This is very helpful if you want to programmatically set the scale and translation of the map. In this case, we want the entire country to fit in our height and width, so we determine the bounding box of every state in the country of Mexico. The scale is calculated by taking the longest geographic edge of our bounding box and dividing it by the number of pixels of this edge in the visualization:

var s = .95 / Math.max((b[1][0] - b[0][0]) / width, (b[1][1] - b[0][1]) / height);

This can be calculated by first computing the scale of the width, then the scale of the height, and, finally, taking the larger of the two. All of the logic is compressed into the single line given earlier.

[Image: the three steps of the scale calculation]

The value .95 adjusts the scale, because we are giving the map a bit of a breather on the edges in order to not have the paths intersect the edges of the SVG container item, basically reducing the scale by 5 percent. Now, we have an accurate scale of our map, given our set width and height.

var t = [(width - s * (b[1][0] + b[0][0])) / 2, (height - s * (b[1][1] + b[0][1])) / 2];

When we scale in SVG, it scales all the attributes (even x and y). In order to return the map to the center of the screen, we will use the translate function. The translate function receives an array with two parameters: the amount to translate in x, and the amount to translate in y. We will calculate x by adding the left and right edges of the bounding box and multiplying the sum by the scale. The result is then subtracted from the width of the SVG element, and the difference is divided by 2.
Our y translation is calculated similarly, but using the sum of the top and bottom edges of the bounding box, multiplied by the scale, subtracted from the height, and divided by 2. Finally, we will reset the projection to use our new scale and translation:

projection.scale(s).translate(t);

Here, we will create a map variable that will group all of the following SVG elements into a <g> SVG tag. This will allow us to apply styles and better contain all of the path elements that follow:

var map = svg.append('g').attr('class', 'boundary');

Finally, we are back to the classic D3 enter, update, and exit pattern. We have our data, the list of Mexico states, and we will join this data to the path SVG element:

mexico = map.selectAll('path').data(states.features);

//Enter
mexico.enter()
    .append('path')
    .attr('d', path);

The enter section and the corresponding path functions are executed on every data element in the array. As a refresher, each element in the array represents a state in Mexico. The path function has been set up to correctly draw the outline of each state as well as scale and translate it to fit in our SVG container. Congratulations! You have created your first map!

Experiment 1 – adjusting the bounding box

Now that we have our foundation, let's start with our first experiment. For this experiment, we will manually zoom in to a state of Mexico using what we learned in the previous section. We will modify one line of code:

var b = path.bounds(states.features[5]);

Here, we are telling the calculation to create a boundary based on the sixth element of the features array instead of every state in the country of Mexico.
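The scale and translate arithmetic that any bounding box feeds into can be isolated in a plain function, which makes it easy to try with made-up numbers. The sketch below is not a D3 API: fitBounds and its padding parameter are hypothetical names, and it simply repeats the two formulas from the foundation section without requiring D3 or a browser.

```javascript
// Hypothetical helper that mirrors the two formulas from the text.
// b is a bounding box in the form [[x0, y0], [x1, y1]], as returned by
// path.bounds() while the projection has scale 1 and translate [0, 0].
function fitBounds(b, width, height, padding = 0.95) {
  const boxWidth = b[1][0] - b[0][0];
  const boxHeight = b[1][1] - b[0][1];

  // The larger ratio belongs to the edge that limits how far we can zoom.
  const scale = padding / Math.max(boxWidth / width, boxHeight / height);

  // Center the scaled box inside the width x height container.
  const translate = [
    (width - scale * (b[1][0] + b[0][0])) / 2,
    (height - scale * (b[1][1] + b[0][1])) / 2
  ];

  return { scale: scale, translate: translate };
}

// Example with an invented bounding box that is twice as wide as it is tall.
const fit = fitBounds([[0, 0], [2, 1]], 900, 600);
console.log(fit.scale, fit.translate);

// The scaled box keeps the same margin on the left and on the right.
const left = fit.scale * 0 + fit.translate[0];
const right = 900 - (fit.scale * 2 + fit.translate[0]);
console.log(left, right);
```

Passing the bounds of a single feature instead of the whole collection only changes the input box; the same two formulas then zoom the map in on that feature.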
The boundaries data will now run through the rest of the scaling and translation algorithms to adjust the map.

[Screenshot: the map zoomed in to a single state]

We have basically reduced the min/max of the boundary box to include the geographic coordinates for one state in Mexico, and D3 has scaled and translated this information for us automatically.

[Screenshot: the bounding box of the selected state]

This can be very useful in situations where you might not have the data that you need in isolation from the surrounding areas. Hence, you can always zoom in to your geography of interest and isolate it from the rest.

Experiment 2 – creating choropleths

One of the most common uses of D3.js maps is to make choropleths. This visualization gives you the ability to discern between regions by giving them different colors. Normally, this color is associated with some other value, for instance, levels of influenza or a company's sales. Choropleths are very easy to make in D3.js. In this experiment, we will create a quick choropleth based on the index value of the state in the array of all the states. We will only need to modify two lines of code in the update section of our D3 code. Right after the enter section, add the following two lines:

//Update
var color = d3.scale.linear().domain([0,33]).range(['red', 'yellow']);
mexico.attr('fill', function(d,i) {return color(i)});

The color variable uses another valuable D3 function named scale. Scales are extremely powerful when creating visualizations in D3; much more detail on scales can be found at https://github.com/mbostock/d3/wiki/Scales. For now, let's describe what this scale defines. Here, we created a new function called color. This color function looks for any number between 0 and 33 in an input domain. D3 linearly maps these input values to a color between red and yellow in the output range. D3 has included the capability to automatically map colors in a linear range to a gradient.
This means that executing the new function, color, with 0 will return the color red, color(15) will return an orange color, and color(33) will return yellow. Now, in the update section, we will set the fill property of the path to the new color function. This will provide a linear scale of colors and use the index value i to determine what color should be returned. If the color was determined by a different value of the datum, for instance, d.sales, then you would have a choropleth where the colors actually represent sales.

[Screenshot: the choropleth rendered by the preceding code]

Experiment 3 – adding click events to our visualization

We've seen how to make a map and set different colors to the different regions of this map. Next, we will add a little bit of interactivity. This will serve as a simple reference for binding click events to maps. First, we need a quick reference to each state in the country. To accomplish this, we will create a new function called geoID right below the mexico variable:

var height = 600;
var width = 900;
var projection = d3.geo.mercator();
var mexico = void 0;

var geoID = function(d) {
    return "c" + d.properties.ID_1;
};

This function takes in a state data element and generates a new selectable ID based on the ID_1 property found in the data. The ID_1 property contains a unique numeric value for every state in the array. If we insert this as an id attribute into the DOM, then we create a quick and easy way to select each state in the country. Following the geoID function, we create another function called click:

var click = function(d) {
    mexico.attr('fill-opacity', 0.2); // Another update!
    d3.select('#' + geoID(d)).attr('fill-opacity', 1);
};

This method makes it easy to separate what the click is doing. The click method receives the datum and changes the fill opacity value of all the states to 0.2.
This is done so that when you click on one state and then on another, the previous state does not maintain the clicked style. Notice that the function call is iterating through all the elements of the DOM, using the D3 update pattern. After making all the states transparent, we will set a fill-opacity of 1 for the given clicked item. This removes all the transparent styling from the selected state. Notice that we are reusing the geoID function that we created earlier to quickly find the state element in the DOM. Next, let's update the enter method to bind our new click method to every new DOM element that enter appends:

//Enter
mexico.enter()
    .append('path')
    .attr('d', path)
    .attr('id', geoID)
    .on("click", click);

We also added an attribute called id; this inserts the results of the geoID function into the id attribute. Again, this makes it very easy to find the clicked state. Run the code and click on any of the states. You will see its color turn a little brighter than the surrounding states.

[Screenshot: the map with one state highlighted after a click]

Summary

You learned how to build many different kinds of maps that cover different kinds of needs. Choropleths and data visualizations on maps are some of the most common geographic-based data representations that you will come across.

Kristen Hardwick
30 Jun 2014
5 min read

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, "Making the Most of Your Hadoop Data Lake", we will address another important factor in improving manageability: optimized file formats.

Using a smarter file for your data: RCFile

RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database.

The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries.
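The horizontal-then-vertical partitioning described above can be pictured with a small in-memory sketch. This is only an illustration of the layout idea, not the RCFile format itself: toRowGroups and readColumn are hypothetical names, compression and the on-disk encoding are left out, and every row is assumed to have the same columns.

```javascript
// Hypothetical illustration of the row-group layout (no compression, no file I/O).
// Step 1: split the rows horizontally into groups of rowsPerGroup rows.
// Step 2: inside each group, store the values column by column.
function toRowGroups(rows, rowsPerGroup) {
  const groups = [];
  for (let start = 0; start < rows.length; start += rowsPerGroup) {
    const slice = rows.slice(start, start + rowsPerGroup);
    const columns = {};
    for (const row of slice) {
      for (const [name, value] of Object.entries(row)) {
        if (!columns[name]) {
          columns[name] = [];
        }
        columns[name].push(value);
      }
    }
    groups.push({ rowCount: slice.length, columns: columns });
  }
  return groups;
}

// Reading one column touches only that column's values in each row group.
function readColumn(groups, name) {
  return groups.flatMap(function (group) {
    return group.columns[name];
  });
}

const rows = [
  { id: 1, name: 'a', city: 'Baltimore' },
  { id: 2, name: 'b', city: 'Huntsville' },
  { id: 3, name: 'c', city: 'Baltimore' },
  { id: 4, name: 'd', city: 'Huntsville' },
  { id: 5, name: 'e', city: 'Baltimore' }
];

const groups = toRowGroups(rows, 2);
console.log(JSON.stringify(groups[0]));
console.log(readColumn(groups, 'city'));
```

Each row lives entirely inside one group, which corresponds to the guarantee that a row is never split across blocks, while a query that needs only the city column can skip the id and name arrays in every group.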
The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system is able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns.

RCFiles can be used natively in both Hive and Pig with very little configuration.

In Hive:

    CREATE TABLE … STORED AS RCFILE;
    ALTER TABLE … SET FILEFORMAT RCFILE;
    SET hive.default.fileformat=RCFile;

In Pig:

    register …/piggybank.jar;
    a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…);

The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorage class available through Twitter’s Elephant Bird open source library.

Hortonworks’ ORCFile and Cloudera’s Parquet formats

RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. Newer formats improve on RCFile, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar (ORC) file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This additional metadata lets queries take advantage of an improvement over the original RCFile: predicate pushdown.
That technique allows Hive to evaluate the WHERE clause during the record gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique evaluates the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular WHERE clause is, the more potential there is for row groups and columns to be filtered out as irrelevant to the final result.

Cloudera’s Parquet format is typically used in conjunction with Impala, but just like RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below.

In Hive:

    CREATE TABLE … STORED AS ORC;
    ALTER TABLE … SET FILEFORMAT ORC;
    SET hive.default.fileformat=Orc;

Conclusion

This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information, combined with the compression techniques described in the previous post (part 1), provides some guidelines that can be used to ensure that users can make the most of the Hadoop Data Lake.

About the author

Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
Packt
29 May 2013
6 min read

Optimizing Programs

(For more resources related to this topic, see here.)

Using transaction SAT to find problem areas

In this recipe, we will see the steps required to analyze the execution of any report, transaction, or function module using the transaction SAT.

Getting ready

For this recipe, we will analyze the runtime of a standard program RIBELF00 (Display Document Flow Program). The program selection screen contains a number of fields. We will execute the program on the order number (aufnr) and see the behavior.

How to do it...

For carrying out runtime analysis using transaction SAT, proceed as follows:

1. Call transaction SAT. The initial screen appears.
2. Enter a suitable name for the variant (in our case, YPERF_VARIANT) and click the Create button below it. This will take you to the Variant creation screen.
3. On the Duration and Type tab, switch on Aggregation by choosing the Per Call Position radio button. Then, click on the Statements tab.
4. On the Statements tab, make sure that the Read Operations and Change Operations checkboxes under Internal Tables, and the Open SQL checkbox under Database Access, are checked.
5. Save your variant and come back to the main screen of SAT.
6. Make sure that within Data Formatting on the initial screen of SAT, the checkbox for Determine Names of Internal Tables is selected.
7. Next, enter the name of the program that is to be traced in the field provided (in our case, it is RIBELF00). Then click the button that runs the program under measurement.
8. The selection screen of the program appears. We will enter an order number range and execute the program.
9. Once the program output is generated, click on the Back key to come back to the program selection screen. Click on the Back key once again to generate the evaluation results.

How it works...

We carried out the execution of the program through the transaction SAT and the evaluation results were generated. On the left are the Trace Results (in tree form) listing the statements/events with the most runtime.
These are like a summary report of the entire measurement of the program. They are listed in descending order of the Net time in microseconds and the percentage of the total time. For example, in our case, the OPEN CURSOR event takes 68 percent of the total runtime of the program.

Selecting the Hit List tab shows the top time-consuming components of the program. In this example, the access of database tables AFRU and VBAK takes most of the time. Double-clicking any item in the Trace Results window on the left-hand side displays (in the Hit List area in the right-hand pane) details of the contained items along with the execution time of each item. From the Hit List window, double-clicking a particular item takes us to the relevant line in the program code. For example, when we double-click the Open Cursor VBAK line, it takes us to the corresponding program code.

We have carried out the analysis with Aggregation switched on. With Aggregation switched on, multiple calls of a particular line of code are shown as one single entry. Because of this, the results are less detailed and easier to read, since the hit list and the call hierarchy in the results are much simpler. Also, by default, the names of the internal tables used are not shown in the results. In order for the internal table names to appear in the evaluation result, the Determine Names of Internal Tables checkbox must be checked.

As a general recommendation, the runtime analysis should be carried out several times for best results. This is because the database measurement time can depend on a variety of factors, such as system load, network performance, and so on.

Creation of secondary indexes in database tables

Very often, the cause of a long-running report is a full scan of a database table specified within the code, mainly because no suitable index exists.
In this recipe, we will see the steps required to create a new secondary index on a database table for performance improvement. Creating indexes lets you optimize standard reports as well as your own reports. In this recipe, we will create a secondary index on a test table ZST9_VBAK (that is simply a copy of VBAK).

How to do it...

For creating a secondary index, proceed as follows:

1. Call transaction SE11. Enter the name of the table in the field provided, in our case, ZST9_VBAK. Then click the Display button. This will take you to the Display Table screen.
2. Next, choose the menu path Goto | Indexes. This will display all indexes that currently exist for the table.
3. Click the Create button and then choose the option Create Extension Index. A dialog box appears.
4. Enter a three-character name for the index. Then, press Enter. This will take you to the extension index maintenance screen.
5. On the top part, enter the short description in the Short Description field provided.
6. We will create a non-unique index, so the Non-unique index radio button is selected (on the middle part of the screen).
7. On the lower part of the screen, specify the field names to be used in the index. In our case, we use MANDT and AUFNR.
8. Then, activate your index using the keys Ctrl + F3. The index will be created in the database, and an appropriate creation message is shown below Status.

How it works...

This will create the index on the database. Since we created an extension index, the index will not be overwritten by SAP during an upgrade. Now any report that accesses the ZST9_VBAK table specifying MANDT and AUFNR in the WHERE clause will take advantage of an index scan using our new secondary index.

There's more...

SAP recommends that the index first be created in the development system and then transported to the quality system and to the production system. Secondary indexes are not automatically generated on target systems after being transported.
We should check the status on the Activation Log in the target systems, and use the Database Utility to manually activate the index in question.

A secondary index should preferably consist of fields that overlap as little as possible with the fields of other indexes. Too many redundant secondary indexes (that is, too many common fields across several indexes) on a table have a negative impact on performance. An example would be a table with 10 secondary indexes that share more than three fields. In addition, tables that are rarely modified (and very often read) are the ideal candidates for secondary indexes.

See also

http://help.sap.com/saphelp_erp2005/helpdata/EN/85/685a41cdbf8047e10000000a1550b0/content.htm
http://help.sap.com/saphelp_nw04/helpdata/en/cf/21eb2d446011d189700000e8322d00/frameset.htm
http://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_second.html
http://forums.sdn.sap.com/thread.jspa?threadID=1469347
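The effect of a secondary index on a WHERE clause, as described in this recipe, can be observed outside SAP with a small, hedged experiment. The sketch below uses Python's built-in sqlite3 module, not the ABAP Dictionary or an SAP database; the table is a three-column stand-in for ZST9_VBAK, the index name zst9_vbak_z01 is hypothetical, and the test data is made up. It asks SQLite for its query plan before and after the index on (mandt, aufnr) exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for ZST9_VBAK; the real table has many more columns.
conn.execute("CREATE TABLE zst9_vbak (mandt TEXT, vbeln TEXT, aufnr TEXT)")
conn.executemany(
    "INSERT INTO zst9_vbak VALUES (?, ?, ?)",
    [("100", str(n), str(n % 50)) for n in range(1000)],
)

query = "SELECT vbeln FROM zst9_vbak WHERE mandt = '100' AND aufnr = '7'"


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(query)
# Hypothetical index name; the fields mirror the MANDT/AUFNR index of the recipe.
conn.execute("CREATE INDEX zst9_vbak_z01 ON zst9_vbak (mandt, aufnr)")
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

On typical SQLite builds the first plan reports a scan of the whole table, while the second reports a search that uses the new index. The exact wording of the plan text varies between SQLite versions, and other databases choose their plans differently.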

Packt
11 Oct 2013
10 min read

Administrating Solr

(For more resources related to this topic, see here.)

Query nesting

You might come across situations wherein you need to nest a query within another query in order to search for a specific keyword or phrase. Let us imagine that you want to run a query using the standard request handler, but you need to embed a query that is parsed by the dismax query parser inside it. We will show you how to do it.

Our example data looks like this:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">Reviewed solrcook book</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">Some book reviewed</field>
      </doc>
      <doc>
        <field name="id">3</field>
        <field name="title">Another reviewed little book</field>
      </doc>
    </add>

Here, we are going to use the standard query parser to support Lucene query syntax, but we would like to boost phrases using the dismax query parser. At first it seems to be impossible to achieve, but don't worry, we will handle it. Let us suppose that we want to find books having the words reviewed and book in their title field and we would like to boost the reviewed book phrase by 10.
Here we go with the query:

    http://localhost:8080/solr/select?q=reviewed+AND+book+AND+_query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=reviewed+book

The results of the preceding query should look like:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
          <str name="fl">*,score</str>
          <str name="qq">book reviewed</str>
          <str name="q">book AND reviewed AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"</str>
        </lst>
      </lst>
      <result name="response" numFound="3" start="0" maxScore="0.77966106">
        <doc>
          <float name="score">0.77966106</float>
          <str name="id">2</str>
          <str name="title">Some book reviewed</str>
        </doc>
        <doc>
          <float name="score">0.07087828</float>
          <str name="id">1</str>
          <str name="title">Reviewed solrcook book</str>
        </doc>
        <doc>
          <float name="score">0.07087828</float>
          <str name="id">3</str>
          <str name="title">Another reviewed little book</str>
        </doc>
      </result>
    </response>

Let us focus on the query. The q parameter is built of two parts connected together with the AND operator. The first one, reviewed+AND+book, is just a usual query with a logical operator AND defined. The second part of the query starts with a strange-looking expression, _query_. This expression tells Solr that another query should be made that will affect the results list. We then see the expression stating that Solr should use the dismax query parser (the !dismax part) along with the parameters that will be passed to the parser (qf and pf). The v parameter is an abbreviation for value and it is used to pass the query text to the nested parser; here v=$qq refers to the qq parameter (in our case, reviewed+book is being passed to the dismax query parser). And that's it! We get the search results we expected.
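When such a request is built from code rather than typed into a browser, the braces, quotes, and dollar sign of the nested part need URL encoding. The following is a minimal sketch in Python using only the standard library; the helper name nested_dismax_url is hypothetical, the host and path are copied from the example above, and no request is actually sent to Solr.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def nested_dismax_url(base_url, main_query, phrase, field, boost):
    """Build a select URL whose q parameter embeds a dismax sub-query."""
    nested = '_query_:"{!dismax qf=%s pf=%s^%d v=$qq}"' % (field, field, boost)
    params = {
        "q": "%s AND %s" % (main_query, nested),
        "qq": phrase,  # dereferenced by v=$qq inside the nested query
    }
    # urlencode percent-escapes the special characters and turns spaces into '+'.
    return base_url + "?" + urlencode(params)


url = nested_dismax_url(
    "http://localhost:8080/solr/select",
    "reviewed AND book", "reviewed book", "title", 10)
print(url)

# Decoding the URL again shows the parameters as Solr would receive them.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
print(decoded["qq"][0])
```

Keeping the phrase in a separate qq parameter, as the article's query does, means the nested expression itself never has to contain user input.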
Stats.jsp

From the admin interface, when you click on the Statistics link, you receive a web page of information about the specific index. This information is actually served to the browser as XML linked to an embedded XSL stylesheet, which is then transformed into HTML in the browser. This means that if you perform a GET request on stats.jsp, you will get the XML back, as follows:

    curl http://localhost:8080/solr/mbartists/admin/stats.jsp

If you open the downloaded file, you will see all the data as XML. The following code is an extract of the available statistics for the cache that stores individual documents and for the standard request handler, with the metrics you might wish to monitor:

    <entry>
      <name>documentCache</name>
      <class>org.apache.solr.search.LRUCache</class>
      <version>1.0</version>
      <description>LRU Cache(maxSize=512, initialSize=512)</description>
      <stats>
        <stat name="lookups">3251</stat>
        <stat name="hits">3101</stat>
        <stat name="hitratio">0.95</stat>
        <stat name="inserts">160</stat>
        <stat name="evictions">0</stat>
        <stat name="size">160</stat>
        <stat name="warmupTime">0</stat>
        <stat name="cumulative_lookups">3251</stat>
        <stat name="cumulative_hits">3101</stat>
        <stat name="cumulative_hitratio">0.95</stat>
        <stat name="cumulative_inserts">150</stat>
        <stat name="cumulative_evictions">0</stat>
      </stats>
    </entry>
    <entry>
      <name>standard</name>
      <class>org.apache.solr.handler.component.SearchHandler</class>
      <version>$Revision: 1052938 $</version>
      <description>Search using components: org.apache.solr.handler.component.QueryComponent, org.apache.solr.handler.component.FacetComponent</description>
      <stats>
        <stat name="handlerStart">1298759020886</stat>
        <stat name="requests">359</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">9122</stat>
        <stat name="avgTimePerRequest">25.409472</stat>
        <stat name="avgRequestsPerSecond">0.446995</stat>
      </stats>
    </entry>

The method of integrating with a monitoring system varies from system to system. As an example, you may explore ./examples/8/check_solr.rb for a simple Ruby script that queries the core and checks whether the average hit ratio and the average time per request are above a defined threshold.

    ./check_solr.rb -w 13 -c 20 -imtracks
    CRITICAL - Average Time per request more than 20 milliseconds old: 39.5

In the previous example, we have defined 20 milliseconds as the critical threshold, and the average time to serve a request is 39.5 milliseconds (which is far greater than the threshold we had set).

Ping status

Ping status is the outcome from PingRequestHandler, which is primarily used for reporting SolrCore health to a load balancer; that is, this handler has been designed to be used as the endpoint for an HTTP load balancer to use while checking the "health" or "up status" of a Solr server. In simpler terms, ping status denotes the availability of your Solr server (uptime and downtime) for the defined duration. Additionally, it should be configured with some defaults indicating a request that should be executed. If the request succeeds, then the PingRequestHandler will respond with a simple OK status. If the request fails, then the PingRequestHandler will respond with the corresponding HTTP error code. Clients (such as load balancers) can be configured to poll the PingRequestHandler, monitoring for these types of responses (or for a simple connection failure), to know if there is a problem with the Solr server.

A PingRequestHandler configuration looks something like the following:

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="qt">/search</str><!-- handler to delegate to -->
        <str name="q">some test query</str>
      </lst>
    </requestHandler>

You may also try a more advanced option, which is to configure the handler with a healthcheckFile that can be used to enable/disable the PingRequestHandler.
It would look something like the following:

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <!-- relative paths are resolved against the data dir -->
      <str name="healthcheckFile">server-enabled.txt</str>
      <lst name="invariants">
        <str name="qt">/search</str><!-- handler to delegate to -->
        <str name="q">some test query</str>
      </lst>
    </requestHandler>

A couple of points that you should know when selecting the healthcheckFile option are:

- If the health check file exists, the handler will execute the query and return the status as described previously.
- If the health check file does not exist, the handler will return an HTTP error even though the server is working fine and the query would have succeeded.

This health check file feature can be used as a way to indicate to some load balancers that the server should be "removed from rotation" for maintenance, upgrades, or whatever reason you may wish.

Business rules

You might come across situations wherein your customer, who is running an e-store consisting of different types of products such as jewelry, electronic gadgets, automotive products, and so on, defines a business need that must be flexible enough to cope with changes in the search results based on the search keyword. For instance, imagine a customer requirement wherein you need to add facets such as Brand, Model, Lens, Zoom, Flash, Dimension, Display, Battery, Price, and so on whenever a user searches for the "Camera" keyword. So far the requirement is easy and can be achieved in a simple way. Now let us add some complexity to our requirement, wherein facets such as Year, Make, Model, VIN, Mileage, Price, and so on should get automatically added when the user searches for the keyword "Bike". Worried about how to handle such a complex requirement? This is where business rules come into play. There are a number of rule engines (both proprietary and open source) in the market, such as Drools, JRules, and so on, which can be plugged into Solr.
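Before turning to a real rule engine, the "Camera" and "Bike" requirement above can be sketched as a tiny condition/action table in plain Python. This is neither Drools nor a Solr API; the names FACET_RULES and facets_for are hypothetical, and the exact keyword matching is an assumption made to keep the example short.

```python
# Each rule pairs a condition on the search keyword with the facets to add.
FACET_RULES = [
    (lambda kw: kw == "camera",
     ["Brand", "Model", "Lens", "Zoom", "Flash",
      "Dimension", "Display", "Battery", "Price"]),
    (lambda kw: kw == "bike",
     ["Year", "Make", "Model", "VIN", "Mileage", "Price"]),
]


def facets_for(keyword):
    """Return the facets contributed by every rule whose condition holds."""
    kw = keyword.strip().lower()
    facets = []
    for condition, rule_facets in FACET_RULES:
        if condition(kw):                 # the "when" part of the rule
            for facet in rule_facets:     # the "then" part of the rule
                if facet not in facets:
                    facets.append(facet)
    return facets


print(facets_for("Camera"))
print(facets_for("Bike"))
print(facets_for("Shoes"))
```

A dedicated rule engine adds what this sketch lacks: rules that live outside the application code, can be changed without redeploying, and can be matched against richer objects than a single keyword.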
Drools

Now let us understand how Drools functions. It injects the rules into working memory, and then it evaluates which custom rules should be triggered based on the conditions stated in the working memory. It is based on if-then clauses, which enable the rules coder to define what condition must be true (using the if or when clause), and what action/event should be triggered when the defined condition is met, that is, true (using the then clause). Drools conditions are nothing but any Java object that the application wishes to inject as input. A business rule is more or less in the following format:

    rule "ruleName"
    when
        // CONDITION
    then
        // ACTION
    end

We will now show you how to write an example rule in Drools:

    rule "WelcomeLucidWorks"
    no-loop
    when
        $respBuilder : ResponseBuilder();
    then
        $respBuilder.rsp.add("welcome", "lucidworks");
    end

The given code snippet checks for a ResponseBuilder object (one of the prime objects that help in processing search requests in a SearchComponent) in the working memory and then adds a key-value pair to that ResponseBuilder (in our case, welcome and lucidworks).

Summary

In this article, we saw how to nest a query within another query, learned about stats.jsp and how to use ping status, and looked at what business rules are, how and when they prove to be important for us, and how to write a custom rule using Drools.

Resources for Article:

Further resources on this subject:
Getting Started with Apache Solr [Article]
Making Big Data Work for Hadoop and Solr [Article]
Apache Solr Configuration [Article]

Packt
13 May 2013
7 min read

Move Further with NumPy Modules

(For more resources related to this topic, see here.)

Linear algebra

Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things.

Time for action – inverting matrices

The inverse of a matrix A in linear algebra is the matrix A^-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as A * A^-1 = I. The inv function in the numpy.linalg package can do this for us. Let's invert an example matrix. To invert matrices, perform the following steps:

We will create the example matrix with the mat function.

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A

The A matrix is printed as follows:

    A
    [[ 0  1  2]
     [ 1  0  3]
     [ 4 -3  8]]

Now, we can see the inv function in action, using which we will invert the matrix.

    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse

The inverse matrix is shown as follows:

    inverse of A
    [[-4.5  7.  -1.5]
     [-2.   4.  -1. ]
     [ 1.5 -2.   0.5]]

If the matrix is singular or not square, a LinAlgError exception is raised. If you want, you can check the result manually. This is left as an exercise for the reader.

Let's check what we get when we multiply the original matrix with the result of the inv function:

    print "Check\n", A * inverse

The result is the identity matrix, as expected.

    Check
    [[ 1.  0.  0.]
     [ 0.  1.  0.]
     [ 0.  0.  1.]]

What just happened?

We calculated the inverse of a matrix with the inv function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix.

    import numpy as np

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A

    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse

    print "Check\n", A * inverse

Solving linear systems

A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations.
The numpy.linalg function, solve, solves systems of linear equations of the form Ax = b; here A is a matrix, b can be a 1D or 2D array, and x is an unknown variable. We will see the dot function in action. This function returns the dot product of two floating-point arrays.

Time for action – solving a linear system

Let's solve an example of a linear system. To solve a linear system, perform the following steps:

Let's create the matrices A and b.

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A
    b = np.array([0, 8, -9])
    print "b\n", b

The matrices A and b are shown as follows:

    A
    [[ 1 -2  1]
     [ 0  2 -8]
     [-4  5  9]]
    b
    [ 0  8 -9]

Solve this linear system by calling the solve function.

    x = np.linalg.solve(A, b)
    print "Solution", x

The following is the solution of the linear system:

    Solution [ 29.  16.   3.]

Check whether the solution is correct with the dot function.

    print "Check\n", np.dot(A , x)

The result is as expected:

    Check
    [[ 0.  8. -9.]]

What just happened?

We solved a linear system using the solve function from the NumPy linalg module and checked the solution with the dot function.

    import numpy as np

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A

    b = np.array([0, 8, -9])
    print "b\n", b

    x = np.linalg.solve(A, b)
    print "Solution", x

    print "Check\n", np.dot(A , x)

Finding eigenvalues and eigenvectors

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues. The eigvals function in the numpy.linalg package calculates eigenvalues. The eig function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors

Let's calculate the eigenvalues of a matrix. Perform the following steps to do so:

Create a matrix as follows:

    A = np.mat("3 -2;1 0")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 3 -2]
     [ 1  0]]

Calculate eigenvalues by calling the eigvals function.
    print "Eigenvalues", np.linalg.eigvals(A)

The eigenvalues of the matrix are as follows:

    Eigenvalues [ 2.  1.]

Determine eigenvalues and eigenvectors with the eig function. This function returns a tuple, where the first element contains eigenvalues and the second element contains the corresponding eigenvectors, arranged column-wise.

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

The eigenvalues and eigenvectors will be shown as follows:

    First tuple of eig [ 2.  1.]
    Second tuple of eig
    [[ 0.89442719  0.70710678]
     [ 0.4472136   0.70710678]]

Check the result with the dot function by calculating the right- and left-hand sides of the eigenvalues equation Ax = ax.

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

The output is as follows:

    Left [[ 1.78885438]
     [ 0.89442719]]
    Right [[ 1.78885438]
     [ 0.89442719]]

    Left [[ 0.70710678]
     [ 0.70710678]]
    Right [[ 0.70710678]
     [ 0.70710678]]

What just happened?

We found the eigenvalues and eigenvectors of a matrix with the eigvals and eig functions of the numpy.linalg module. We checked the result using the dot function.

    import numpy as np

    A = np.mat("3 -2;1 0")
    print "A\n", A

    print "Eigenvalues", np.linalg.eigvals(A)

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

Singular value decomposition

Singular value decomposition is a type of factorization that decomposes a matrix into a product of three matrices. The singular value decomposition is a generalization of the previously discussed eigenvalue decomposition. The svd function in the numpy.linalg package can perform this decomposition.
This function returns three matrices, U, Sigma, and V, such that U and V are orthogonal and Sigma contains the singular values of the input matrix. The decomposition of a matrix M is written M = U Σ V*. The asterisk denotes the Hermitian conjugate or the conjugate transpose.

Time for action – decomposing a matrix

It's time to decompose a matrix with the singular value decomposition. In order to decompose a matrix, perform the following steps:

First, create a matrix as follows:

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 4 11 14]
     [ 8  7 -2]]

Decompose the matrix with the svd function.

    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V

The result is a tuple containing the two orthogonal matrices U and V on the left- and right-hand sides and the singular values of the middle matrix.

    U
    [[-0.9486833  -0.31622777]
     [-0.31622777  0.9486833 ]]
    Sigma
    [ 18.97366596   9.48683298]
    V
    [[-0.33333333 -0.66666667 -0.66666667]
     [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix; we only have the diagonal values. The other values are all 0. We can form the middle matrix with the diag function. Multiply the three matrices. This is shown as follows:

    print "Product\n", U * np.diag(Sigma) * V

The product of the three matrices looks like the following:

    Product
    [[  4.  11.  14.]
     [  8.   7.  -2.]]

What just happened?

We decomposed a matrix and checked the result by matrix multiplication. We used the svd function from the NumPy linalg module.

    import numpy as np

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A

    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V

    print "Product\n", U * np.diag(Sigma) * V

Pseudoinverse

The Moore-Penrose pseudoinverse of a matrix can be computed with the pinv function of the numpy.linalg module (visit http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudoinverse is calculated using the singular value decomposition.
The inv function only accepts square matrices; the pinv function does not have this restriction.
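As a hedged illustration of the pseudoinverse, the sketch below applies pinv to the non-square 2 x 3 matrix from the decomposition example and rebuilds the same result by hand from the singular value decomposition. Unlike the article's listings, it is written for Python 3 and uses plain arrays with the @ operator instead of np.mat; the variable names are assumptions, and the hand-built formula is only valid here because all singular values of this matrix are nonzero.

```python
import numpy as np

# The 2 x 3 matrix from the decomposition example; inv would reject it.
A = np.array([[4.0, 11.0, 14.0],
              [8.0, 7.0, -2.0]])

# Pseudoinverse computed by NumPy.
A_pinv = np.linalg.pinv(A)

# The same matrix assembled from the reduced SVD: V * diag(1/s) * U^T.
# svd returns the third factor already transposed, so Vh.T recovers V.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
A_pinv_svd = Vh.T @ np.diag(1.0 / s) @ U.T

print(A_pinv.shape)                     # shape of the pseudoinverse
print(np.allclose(A_pinv, A_pinv_svd))  # both constructions agree
print(np.allclose(A @ A_pinv @ A, A))   # defining property of the pseudoinverse
```

The last check, A * pinv(A) * A = A, is one of the defining Moore-Penrose conditions and holds even when a matrix has no ordinary inverse.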
Packt
25 Aug 2014
13 min read

Report Data Filtering

In this article, written by Yoav Yahav, author of the book SAP BusinessObjects Reporting Cookbook, we will cover the following recipes:

- Applying a simple filter
- Working with the filter bar
- Using input controls
- Working with an element link

(For more resources related to this topic, see here.)

Filtering data can be done in several ways. We can filter the results at the query level when there is a requirement to use a mandatory filter or set of filters that will fetch only specific types of rows that correspond to the business question; otherwise, the report won't be accurate or useful. The other level of filtering is performed at the report level. This level of filtering interacts with the data that was retrieved by the user and enables us to eliminate irrelevant rows.

The main question that arises when using a report-level filter is why we shouldn't implement the filters at the query level. There are several reasons:

- We need to compare and analyze just a part of the entire data that the query retrieved (for example, filtering the first quarter's data out of the current year's entire dataset)
- We need to view the data separately, for example, each tab can be filtered by a different value (for example, each report tab can display a different region's data)
- We need to filter measure objects that are different from the aggregative level of the query; for example, we have retrieved a well-detailed query displaying sales of various products at the customer level, but we also need to display only the products that had income of more than one million dollars in another report tab
- The business user requires interactive functionality from the filter: a drop-down box, a checklist, a spinner, or a slider; these are capabilities that can't be provided by a query filter
- We need to perform additional calculations on a variable in the report and apply a filter to it

In this article, we will explore the different types of filters that can be applied in
reports: simple ones, interactive ones, and filters that can combine interactivity and a custom look and feel adjusted by the business user.

Applying a simple filter

The first type of filter is a basic one that enables us to implement quick and simple filter logic, which is similar to the way we build it on the query panel.

Getting ready

We have created a query that retrieves a dataset displaying the Net Sales by Product, Line, and Year. Using a simple filter, we would like to filter only the year 2008 records as well as the Sports Line.

How to do it...

Perform the following steps to apply a simple filter:

We will navigate to the Analysis toolbar, and in the Filters tab, click on the Filter icon and choose Add Filter.

In the Report Filter window, we will be able to add filters, edit them, and apply them on a specific table or on the entire report tab.

By clicking on the Add filter icon located in the top-left corner, we will be able to add a condition. Clicking on this button will open the list of existing objects in the report; by choosing the Year object, we will add our first filter.

After we choose the Year object, a filter condition structure will appear in the top-right corner of the window, enabling us to pick an operator and a value similar to the way we establish query filters.

We will add a second filter as well, using the Add filter button and adding the Line object to the filter area. The AND operator will appear between the two filters, establishing an intersection relationship between them. This operator can be easily changed to the OR operator by clicking on it.
The table will be affected accordingly and will display only the year 2008 and the Sports Line records, as shown: In order to edit the filter, we can either access it through the Analysis toolbar or mark one of the filtered columns, enabling us to get an easier edit using the toolbar or the right-click menu, as shown in the following screenshot: How it works... The report filter simply corresponds to the values defined in the filtered set of conditions that are established by simple and easy use of the filter panel. The filters can be applied on any type of data display, table, or chart. Like the query filters, the report filters use the logic of the operators AND/OR as well and can be used by clicking on the operator name. In order to view the filters that have been applied to the report tabs and tables, you can navigate to the document structure and filter's left-hand side panel and click on the Filter button. There's more... Filters can be applied on a specific table in the report or on the entire report. In order to switch between these options, when you create the filter, you need to mark the report area. To create a report-level filter or a specific column in a table, you need to filter a specific table in the report tab. Working with the filter bar Another great functionality that filters can provide us is interaction with the report data. There are cases when we are required to perform quick filtering as well as switch dynamically between values as we need to analyze different filtered datasets. Working with the filter bar can address these requirements simply and easily. Getting ready We want to perform a quick dynamic filtering on our existing table by adding the Country dimension object to the filter bar. How to do it.... 
Perform the following steps:

1. By navigating to the Analysis toolbar and then to the Interact tab, we will click on the Filter Bar icon.
2. By doing so, a gray filter pane area will appear under the formula bar with a guiding message saying Drop objects here to add report filters.
3. In order to create a new filter, we can either drag an object directly to the filter pane area from the Available Objects pane or use the quick add icon located in the filter bar to the left of the message. In our scenario, we will drag the Country dimension object from the Available Objects pane directly to the filter bar.
4. Adding the Country object to the filter bar creates a quick drop-down list filter, enabling us to filter the table data by choosing any country value.

This filter enables us to quickly create filtered sets of data using the drop-down list, with an interactive component that doesn't have to be in the table.

How it works...

The filter bar is an interactive component that enables us to create as many dynamic filters as we need and to place them in a single filtering area for easy control of the data; the filtered objects aren't even required to appear in the table itself. Each filter in the filter bar is restricted to a single value; in order to filter several values, we will need to either use a different type of filter, such as an input control, or create a grouped values logic.

There's more...

When we drop several dimension objects onto the filter pane, they will be displayed accordingly; however, a cascading effect between filters (picking a specific country in the first filter and seeing only the values that belong to that country in the second filter) is supported only if hierarchies have been defined in the universe.

Using input controls

Input controls are another type of filter that enables us to interact with the report data.
An input control appears as an interactive panel on the left-hand side and can be created in various types, each with a different look and feel as well as different functionality. We can use an input control to address the following tasks:

  • Applying a different look and feel to the filter, making filters more intuitive and easy to operate (using radio buttons, comboboxes, and other filter types)
  • Applying multiple values
  • Applying dynamic filters to measure values using input control components, such as spinners and sliders
  • Enabling a default value option and a custom list of values

Getting ready

In this example, we will filter several components in the report area, a chart and a table, using the Region dimension object. We will be using the multiple value option to enhance the filter functionality. First, we will navigate to the Input Controls panel, located in the left-hand side area as the third option from the top, and click on the New option.

How to do it...

Perform the following steps:

1. We will choose the object that we need in order to filter the table and the chart.
2. After choosing the Region object, we will move on to the Define Input Control window. Input controls enable multiple-value functionality, and in the Choose Control Type window, we will choose the Check boxes input control type.
3. In the input control properties located on the right-hand side, we can also add a description to the input control, set a default value, and customize the list of values if we need specific values.
4. After choosing the Check boxes component, we will advance to the next window and choose the data elements we want to apply the control to. We will tick both components, the chart and the table, so that a single control affects all the data components.
5. By clicking on the Finish button, the input control will appear on the left-hand side.
6. We can easily change the selected values to all values (Select All), one value, or several values, filtering both the table and the chart.

How it works...

As we have seen, input controls act as special interactive filters; we pick from the input control templates the type that is most suitable for filtering the data in the report. Our main consideration when choosing an input control is the type of list we need: single value or multiple values. The second consideration is the interactive functionality that we need from the control: a simple value pick, or perhaps a comparison operator, such as greater than or less than, applied to a measure object.

There's more...

An input control can also be created using the Analysis toolbar and the Filter tab. In order to edit an existing input control, we can access the mini toolbar above it. Here, we will be able to edit the control, show its dependencies (the elements that are affected by it), or delete it.

We can also display a single informative cell describing which table and report a filter has been applied to.
This useful option can be applied by navigating to the Report Element toolbar, choosing Report Filter Summary from the Cell subtoolbar, and dragging it to the report area.

By clicking on the Map button, we will switch to a graphical tree view of the input control, showing the values that were picked in the filter as well as its dependencies.

If you need to display the values of the input control in the report area, simply drag the object that you used in the control to the report area, turn it into a horizontal table, and then edit the dependencies of the control so that it is applied to the new table as well.

Working with an element link

An element link is a feature designed to pass a value from an existing table or chart to another data component in the report area. Element links turn the values in a table or a chart into dynamic values that can filter other data elements. The main difference between element links and other types of filtering is that with an element link we keep working within a table, preserving its structure, and pass a parameter from it to another structure.

This feature works well when we have a detailed table and want to use its values to filter a chart that visualizes the summarized data, and vice versa.

How to do it...

Perform the following steps:

1. We will pass the Country value from the detailed table to the line quantity sales pie chart, enabling the business user to filter the pie dynamically while working with the detailed table. By clicking on the Country column, we will navigate in the speed menu to Linking | Add Element Link.
2. In the next window, we will choose the passing method and decide whether to pass the entire row values or a Single object value. In our example, we will use the Single object option.
3. In the next screen, we will be able to add a description of our choice to the element link.
4. Finally, we will define the dependencies, that is, the report elements that we want to pass the country value to.
5. By clicking on the Finish button, we will switch to the report view, and by selecting the Country column or any other column, we will be able to pass the Country value to the pie chart.
6. By clicking on a different Country value, such as Colombia, we will pass it to the pie chart and filter the results accordingly.

Notice that the pie chart results have changed and that the country value is marked in bold inside the tooltip box, showing the column that was actually used to pass the value.

How it works...

The element link links tables and other data components. It is effectively a type of input control designed to work directly from a table rather than from a filter panel or a bar. By clicking on any country value, we pass it to the dependent component, which uses the value as an input in order to present the relevant data.

There's more...

An element link can be edited and adjusted in much the same way as an input control. By right-clicking on the Element Link icon located on the header of the rightmost column, we will be able to edit it.

Another good way to view the element link status and edit it is to switch to the Input Controls panel, where it is listed as well.

Summary

In this article, we learned about the filtering techniques we can apply to report tables and charts.
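The filtering logic described in these recipes can be sketched outside the tool. The following Python snippet is only an illustration, not how Web Intelligence implements filters: the rows, the column values, and the helper names (equals, one_of, report_filter) are all assumptions made up for this example. It shows conditions combined with AND or OR, and an element link modeled as a clicked value passed to a dependent component.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Condition = Callable[[Row], bool]

# Hypothetical rows standing in for data that a query has already retrieved.
ROWS: List[Row] = [
    {"Year": 2008, "Line": "Sports", "Country": "Colombia", "Net Sales": 120},
    {"Year": 2008, "Line": "Camping", "Country": "Colombia", "Net Sales": 80},
    {"Year": 2007, "Line": "Sports", "Country": "France", "Net Sales": 95},
    {"Year": 2008, "Line": "Sports", "Country": "France", "Net Sales": 60},
]


def equals(column: str, value: object) -> Condition:
    """Single-value condition, similar in spirit to a filter bar pick."""
    return lambda row: row[column] == value


def one_of(column: str, values: Iterable[object]) -> Condition:
    """Multiple-value condition, similar in spirit to a check box control."""
    allowed = set(values)
    return lambda row: row[column] in allowed


def report_filter(rows: List[Row], conditions: List[Condition],
                  operator: str = "AND") -> List[Row]:
    """Keep the rows that satisfy all (AND) or any (OR) of the conditions."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    return [row for row in rows if combine(cond(row) for cond in conditions)]


# Simple filter: Year = 2008 AND Line = Sports.
simple = report_filter(ROWS, [equals("Year", 2008), equals("Line", "Sports")])

# The same two conditions with the operator switched to OR.
either = report_filter(ROWS, [equals("Year", 2008), equals("Line", "Sports")], "OR")

# Multiple-value pick, as a check box style control might apply it.
picked = report_filter(ROWS, [one_of("Country", ["Colombia", "France"])])

# Element link: the value clicked in one table filters a dependent component.
clicked_country = "Colombia"
linked = report_filter(ROWS, [equals("Country", clicked_country)])
```

In this sketch the AND version keeps two rows, while the OR version keeps all four, because every sample row is either from 2008 or belongs to the Sports line.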

Packt
16 Sep 2013
12 min read

Report Authoring

In this article, we will cover some fundamental techniques that will be used in your day-to-day life as a Report Studio author. In each recipe, we will take a real-life example and see how it can be accomplished. By the end of the article, you will have learned several concepts and ideas that you can mix and match to build complex reports.

Though this article is called Report Authoring Basic Concepts, it is not a beginner's guide or a manual. It expects the following:

  • You are familiar with the Report Studio environment, components, and terminologies
  • You know how to add items on the report page and open various explorers and panes
  • You can locate the properties window and know how to test run the report

Based on my personal experience, I recommend this article to new developers with two days to two months of experience.

In the rawest terms, a report is a bunch of rows and columns. The aim is to extract the right rows and columns from the database and present them to the users. The selection of columns drives what information is shown in the report, and the selection of rows narrows the report to a specific purpose and makes it meaningful. The selection of rows is controlled by filters. Report Studio provides three types of filtering: detail, summary, and slicer. (Slicers are used with dimensional models.) In the first recipe of this article, we will cover when and why to use the detail and summary filters.

Once we get the correct set of rows by applying the filters, the next step is to present the rows in the most business-friendly manner. Grouping and ordering play an important role in this. The second recipe will introduce you to the sorting technique for grouped reports.

With grouped reports, we often need to produce subtotals and totals. There are various types of aggregation possible, for example, average, total, count, and so on. Sometimes, the nature of the business demands complex aggregation as well.
In the third recipe, you will learn how to introduce aggregation without increasing the length of the query. You will also learn how to achieve different aggregation for subtotals and totals.

The fourth recipe will build upon the filtering concept you have learned earlier. It will talk about implementing the if-then-else logic in filters. Then we will see some techniques on data formatting, creating sections in a report, and hiding a column in a crosstab.

Finally, the eighth and last recipe of this article will show you how to use a prompt's Use Value and Display Value properties to achieve better performing queries.

The examples used in all the recipes are based on the GO Data Warehouse (query) package that is supplied with the IBM Cognos 10.1.1 installation. These recipe samples can be downloaded from the Packt Publishing website. They use the relational schema from the Sales and Marketing (query) / Sales (query) namespace.

The screenshots used throughout this article are taken from Cognos Version 10.1.1 and 10.2.

Summary filters and detail filters

Business owners need to see the sales quantity of their product lines to plan their strategy. They want to concentrate only on the highest selling product for each product line. They would also like to be able to select only those orders that were shipped in a particular month for this analysis.

In this recipe, we will create a list report with product line, product name, and quantity as columns. We will also create an optional filter on the Shipment Month Key, and we will apply the correct filtering to bring up only the top selling product per product line.

Getting ready

Create a new list report based on the GO Data Warehouse (query) package. From the Sales (query) namespace, bring in Products / Product line, Products / Product, and Sales fact / Quantity as columns.

How to do it...
Here we want to create a list report that shows product line, product name, and quantity, and we want to create an optional filter on Shipment Month. The report should also bring up only the top selling product per product line. In order to achieve this, perform the following steps:

1. We will start by adding the optional filter on Shipment Month. To do that, click anywhere on the list report on the Report page. Then, click on Filters from the toolbar.
2. In the Filters dialog box, add a new detail filter.
3. In the Create Filter screen, select Advanced and then click on OK. By selecting Advanced, we will be able to filter the data based on fields that are not part of our list, such as the Month Key in our example, as you will see in the next step.
4. Define the filter as follows: [Sales (query)].[Time (ship date)].[Month key (ship date)] = ?ShipMonth?
5. Validate the filter and then click on OK.
6. Set the usage to Optional.
7. Now we will add a filter to bring only the highest sold product per product line. To achieve this, select Product line and Product (press Ctrl and select the columns) and click on the group button from the toolbar. This will create a grouping.
8. Now select the list, click on the filter button again, and select Edit Filters. This time go to the Summary Filters tab and add a new filter.
9. In the Create Filter screen, select Advanced and then click on OK.
10. Define the filter as follows: [Quantity] = maximum([Quantity] for [Product line])
11. Set usage to Required and set the scope to Product.
12. Now run the report to test the functionality. You can enter 200401 as the Month Key, as that has data in the Cognos supplied sample.

How it works...

Report Studio allows you to define two types of filters. They work at different levels of granularity and hence have different applications.
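The two levels of granularity can be illustrated with a small Python sketch. This is an analogy only, not the Cognos query engine: the sales entries, the function names, and the assumption that Quantity is totaled per product are all hypothetical. The row-level filter runs on individual entries, while the summary-style filter runs on quantities that have first been aggregated at the product level.

```python
from collections import defaultdict

# Hypothetical sales entries: (product line, product, ship month key, quantity)
SALES = [
    ("Camping", "Tent", 200401, 10),
    ("Camping", "Tent", 200401, 5),
    ("Camping", "Stove", 200401, 12),
    ("Camping", "Stove", 200402, 30),
    ("Golf", "Putter", 200401, 7),
    ("Golf", "Irons", 200401, 4),
    ("Golf", "Irons", 200401, 4),
]


def detail_filter(entries, ship_month=None):
    """Row-level filter; it is optional, so None keeps every entry."""
    if ship_month is None:
        return list(entries)
    return [entry for entry in entries if entry[2] == ship_month]


def quantity_by_product(entries):
    """Total the quantity at the product grain (assumed aggregate: total)."""
    totals = defaultdict(int)
    for line, product, _month, quantity in entries:
        totals[(line, product)] += quantity
    return dict(totals)


def summary_filter_top_product(totals):
    """Keep products whose total equals the maximum within their product line."""
    best_per_line = {}
    for (line, _product), quantity in totals.items():
        best_per_line[line] = max(quantity, best_per_line.get(line, quantity))
    return {key: qty for key, qty in totals.items()
            if qty == best_per_line[key[0]]}


def entry_level_maximum(entries):
    """The same comparison applied to single entries instead of product totals."""
    best_per_line = {}
    for line, _product, _month, quantity in entries:
        best_per_line[line] = max(quantity, best_per_line.get(line, quantity))
    return [entry for entry in entries if entry[3] == best_per_line[entry[0]]]


january = detail_filter(SALES, 200401)
top_products = summary_filter_top_product(quantity_by_product(january))
largest_entries = entry_level_maximum(january)
```

With this sample data, the product-level comparison keeps Tent (10 + 5 = 15) for Camping and Irons (4 + 4 = 8) for Golf, whereas the entry-level comparison picks the single largest entries, Stove (12) and Putter (7), which are not the top selling products.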
The detail filter

The detail filter works at the lowest level of granularity in a selected cluster of objects. In our example, this grain is the sales entries stored in Sales fact. By putting a detail filter on Shipment Month, we are making sure that only those sales entries that fall within the selected month are pulled out.

The summary filter

In order to find the highest sold product per product line, we need to consider the aggregated sales quantity for the products. If we put a detail filter on quantity, it will work at the sales entry level. You can try putting a detail filter of [Quantity] = maximum([Quantity] for [Product line]) and you will see that it gives incorrect results. So, we need to put a summary filter here. In order to let the query engine know that we are interested in filtering sales aggregated at the product level, we need to set the Scope to Product. This makes the query engine calculate [Quantity] at the product level and then allow only those products where the value matches maximum([Quantity] for [Product line]).

There's more...

When you define multiple levels of grouping, you can easily change the scope of summary filters to decide the grain of filtering. For example, if you need to show only those products whose sales are more than 1000 and only those product lines whose sales are more than 25000, you can quickly put two summary filters with the correct Scope setting.

Before/after aggregation

The detail filter can also be set to apply after aggregation (by changing the Application property). However, I think this kills the logic of the detail filter. Also, there is no control over the grain at which the filter will apply. Hence, Cognos sets it to before aggregation by default, which is the most natural usage of the detail filter.

See also

  • The Implementing if-then-else in filtering recipe

Sorting grouped values

The output of the previous recipe brings the right information back on the screen.
It filters the rows correctly and shows the highest selling product per product line for the selected shipment month. For better presentation, and to highlight the best-selling product lines, we need to sort the product lines in descending order of quantity.

Getting ready

Open the report created in the previous recipe in Cognos Report Studio for further amendments.

How to do it...

In the report created in the previous recipe, we managed to show data filtered by the shipment month. To improve the report's look and feel, we will sort the output to highlight the best-selling products. To start this, perform the following steps:

1. Open the report in Cognos Report Studio.
2. Select the Quantity column.
3. Click on the Sort button from the toolbar and choose Sort Descending.
4. Run the report to check if sorting is working. You will notice that sorting is not working.
5. Now go back to Report Studio, select Quantity, and click on the Sort button again. This time choose Edit Layout Sorting under the Other Sort Options header.
6. Expand the tree for Product line. Drag Quantity from Detail Sort List to Sort List under Product line.
7. Click on the OK button and test the report. This time the rows are sorted in descending order of Quantity as required.

How it works...

The sort option by default works at the detail level. This means the non-grouped items are sorted by the specified criteria within their own groups. Here we want to sort the product lines that are grouped (not the detail items). In order to sort the groups, we need to define a more advanced sorting using the Edit Layout Sorting options shown in this recipe.

There's more...

You can also define sorting for the whole list report from the Edit Layout Sorting dialog box. You can use different items and ordering for different groups and details. You can also choose to sort certain groups by the data items that are not shown in the report.
You need to bring only those items from the source (model) to the query, and you will be able to pick them in the sorting dialog.

Aggregation and rollup aggregation

Business owners want to see the unit cost of every product. They also want the entries to be grouped by product line and to see the highest unit cost for each product line. At the end of the report, they want to see the average unit cost for the whole range.

Getting ready

Create a simple list report with Products / Product line, Products / Product, and Sales fact / Unit cost as columns.

How to do it...

In this recipe, we want to examine how to aggregate the data and what is meant by rollup aggregation. Using the new report that you have created, perform the following steps:

1. We will start by examining the Unit cost column. Click on this column and check the Aggregate Function property.
2. Set this property to Average.
3. Add grouping for Product line and Product by selecting those columns and then clicking on the Group button from the toolbar.
4. Click on the Unit cost column and then click on the Summarize button from the toolbar. Select the Total option from the list.
5. Now, again click on the Summarize button and choose the Average option.
6. The previous steps will create footers for Product line and for the overall Summary.
7. Now delete the line with the <Average (Unit cost)> measure from Product line. Similarly, delete the line with the <Unit cost> measure from Summary.
8. Click on the Unit cost column and change its Rollup Aggregate Function property to Maximum.
9. Run the report to test it.

How it works...

In this recipe, we have seen two properties of the data items related to aggregation of the values.

The aggregation property

We first examined the aggregation property of unit cost and ensured that it was set to average. Remember that the unit cost here comes from the sales table.
The grain of this table is sales entries or orders. This means there will be many entries for each product and their unit cost will repeat. We want to show only one entry for each product, and the unit cost needs to be rolled up correctly. The aggregation property determines what value is shown for unit cost when calculated at the product level. If it is set to Total, it will wrongly add up the unit costs for each sales entry. Hence, we set it to Average. It can be set to Minimum or Maximum depending on business requirements.

The rollup aggregation property

In order to show the maximum unit cost for each product line, we create an aggregate type of footer in step 4 and set the Rollup Aggregate Function to Maximum in step 8. Here we could have directly selected Maximum from the Summarize drop-down. But that creates a new data item called Maximum (Unit cost). Instead, we ask Cognos to aggregate the number in the footer and drive the type by the rollup aggregation property. This saves one data item in the query subject and the native SQL.

Multiple aggregation

We also need to show the overall average at the bottom. For this we have to create a new data item. Hence, we select unit cost and create an Average type of aggregation in step 5. This calculates Average (Unit cost) and places it in the product line footer and in the overall footer. We then deleted the aggregations that are not required in step 7.

There's more...

The rollup aggregation of any item matters only when you create an aggregation of the Aggregate type. When it is set to Automatic, Cognos decides the function based on the data type, which is not preferred. It is good practice to always set the aggregation and rollup aggregation to a meaningful function rather than leaving them as Automatic.
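The difference between the two properties can be sketched in Python. This is a simplified illustration under stated assumptions, not the SQL that Cognos generates: the entries and function names are hypothetical, and the overall average is assumed here to be taken over the product-level values, which may not match the grain the actual engine uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales entries: (product line, product, unit cost).
# The unit cost repeats on every sales entry of the same product.
ENTRIES = [
    ("Camping", "Tent", 100.0),
    ("Camping", "Tent", 100.0),
    ("Camping", "Stove", 40.0),
    ("Golf", "Putter", 60.0),
    ("Golf", "Irons", 200.0),
    ("Golf", "Irons", 200.0),
    ("Golf", "Irons", 200.0),
]


def aggregate_by_product(entries, func=mean):
    """Aggregate function: collapse the sales entries to one value per product."""
    costs = defaultdict(list)
    for line, product, unit_cost in entries:
        costs[(line, product)].append(unit_cost)
    return {key: func(values) for key, values in costs.items()}


def rollup_by_line(product_values, func=max):
    """Rollup aggregate function: summarize product values in the line footer."""
    per_line = defaultdict(list)
    for (line, _product), value in product_values.items():
        per_line[line].append(value)
    return {line: func(values) for line, values in per_line.items()}


# Average keeps the repeated unit cost intact at the product level.
per_product = aggregate_by_product(ENTRIES, mean)

# Total would add the repeated unit cost once per sales entry.
per_product_total = aggregate_by_product(ENTRIES, sum)

# Product line footer driven by a Maximum rollup.
line_footer = rollup_by_line(per_product, max)

# Overall footer (assumption: averaged over the product-level values).
overall_average = mean(per_product.values())
```

Here per_product shows 100.0 for Tent, while per_product_total shows 200.0 for the same product because the unit cost is added once for each of its two sales entries, which is the wrong rollup that the recipe warns about.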