Data | Tech News, Tutorials & Expert Insights

16 Jul 2013

18 min read

DPM Non-aware Windows Workload Protection

Protecting DFS with DPM

DFS stands for Distributed File System. It has shipped with Windows Server since Windows 2000 Server and is a set of services, available as a role on Windows Server operating systems, that lets you group file shares held in different locations (different servers) under one folder known as the DFS root. The actual locations of the file shares are transparent to the end user. DFS is also often used for redundancy of file shares.

For more information on DFS:

- Windows Server 2008: http://technet.microsoft.com/en-us/library/cc753479%28v=ws.10%29.aspx
- Windows Server 2008 R2 and Windows Server 2012: http://technet.microsoft.com/en-us/library/cc732006.aspx

Before DFS can be protected, it is important to know how it is structured. DFS consists of both data and configuration information:

- The configuration for DFS is stored in the registry of each server, and in either the DFS tree (standalone DFS deployments) or Active Directory (domain-based DFS deployments).
- DFS data is stored on each server in the DFS tree. The data consists of the multiple shares that make up the DFS root.

Protecting DFS with DPM is fairly straightforward. It is recommended to protect the actual file shares directly on each of the servers in the DFS root. For a standalone DFS deployment, also protect the system state on the servers in the DFS root. For a domain-based DFS deployment, protect Active Directory on the domain controller that hosts the DFS root. If you are using DFS replication, it is also recommended to protect the shadow copy components on the servers that host the replication data, in addition to the previously mentioned items.

These methods allow you to restore DFS by restoring the data and either the system state or Active Directory, depending on your deployment type.

Another option is to use the DfsUtil tool to export/import your DFS configuration.
This is a command-line utility that comes with Windows Server and can export the namespace configuration to a file. The configuration can then be imported back into a DFS server to restore a DFS namespace. DPM can be set up to protect the DFS export. You would still need to protect the actual data directly.

An example of using the DfsUtil tool would be to run the following command to export the DFS configuration to an XML file:

    DfsUtil root export \\domainname\rootname dfsrootname.xml

Then run DfsUtil root import to import the DFS configuration back in. For more information on the DfsUtil tool, visit the following URL: http://blogs.technet.com/b/josebda/archive/2009/05/01/using-the-windows-server-2008-dfsutil-exe-command-line-to-manage-dfs-namespaces.aspx

That covers the backing up of DFS with DPM.

Protecting Dynamics CRM with DPM

Microsoft Dynamics CRM is Microsoft's customer relationship management (CRM) software. Microsoft Dynamics CRM Version 1.0 was released in 2003. It then progressed to Version 4.0, and the latest version is 2011. CRM is a part of the Microsoft Dynamics product family. In this section we will cover protecting Versions 4.0 and 2011.

Note that when protecting Microsoft Dynamics CRM on either Version 4.0 or 2011, you should keep a note of your update-rollup level somewhere safe, so that you can install CRM back to that level in the event of a restore. You will need to restore the CRM database, and this could lead to an error if CRM is not at the correct update level.

To protect Microsoft Dynamics CRM 4.0, back up the following components:

- Microsoft CRM Server database: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are the following:
    - The configuration database: MSCRM_CONFIG
    - The organization database: OrganizationName_MSCRM
- Microsoft CRM Server program files: By default, these files are located at C:\Program Files\Microsoft CRM.
- Microsoft CRM website: By default, the CRM website files are located in the C:\Inetpub\wwwroot directory. The web.config file can be protected. It only needs protecting if it has been changed from the default settings.
- Microsoft CRM registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM key.
- Microsoft CRM customizations: To protect customizations or any third-party add-ons, you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM include the following:

- System state of your domain controller.
- Exchange server, if the CRM e-mail router is used.

To protect Microsoft Dynamics CRM 2011, back up the following components:

- Microsoft CRM 2011 databases: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are:
    - The configuration database: MSCRM_CONFIG
    - The organization database: OrganizationName_MSCRM
- Microsoft CRM 2011 program files: By default, these files are located at C:\Program Files\Microsoft CRM.
- Microsoft CRM 2011 website: By default, the CRM website files are located in the C:\Program Files\Microsoft CRM\CRMWeb directory. The web.config file can be protected. It only needs protecting if it has been changed from the default settings.
- Microsoft CRM 2011 registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM subkey.
- Microsoft CRM 2011 customizations: To protect customizations or any third-party add-ons, you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM 2011 include:

- System state of your domain controller.
- Exchange server, if the CRM e-mail router is used.
- SharePoint, if CRM and SharePoint integration is in use.

Note that for both CRM 4.0 and CRM 2011, you could have more than one OrganizationName_MSCRM database if you have more than one organization in CRM. Be sure to protect all of the OrganizationName_MSCRM databases that exist.
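
Because an extra organization database is easy to overlook, a quick check against the list of database names on the SQL Server instance can help. The following is only a sketch in Java and is not part of DPM or CRM: the class and method names are hypothetical, and the list of names is assumed to come from wherever you inventory the instance (for example, a query against sys.databases).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: picks the CRM databases out of a list of database names.
final class CrmDatabaseFilter {
    private CrmDatabaseFilter() {}

    /** Returns MSCRM_CONFIG plus every database whose name ends in _MSCRM. */
    static List<String> databasesToProtect(List<String> allDatabases) {
        List<String> selected = new ArrayList<>();
        for (String name : allDatabases) {
            String upper = name.toUpperCase(Locale.ROOT);
            if (upper.equals("MSCRM_CONFIG") || upper.endsWith("_MSCRM")) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Example names only; Contoso and Fabrikam stand in for real organizations.
        List<String> names = Arrays.asList(
                "master", "MSCRM_CONFIG", "Contoso_MSCRM", "ReportServer", "Fabrikam_MSCRM");
        System.out.println(databasesToProtect(names));
        // prints [MSCRM_CONFIG, Contoso_MSCRM, Fabrikam_MSCRM]
    }
}
```

Comparing this output with the databases listed in the DPM protection group is one way to spot an organization database that was added after protection was first configured.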
That wraps up the Microsoft Dynamics CRM protection for both 4.0 and 2011. You simply need to configure protection of the mentioned components with DPM. Now let's look at what it takes to protect another product from the Dynamics family.

Protecting Dynamics GP with DPM

Dynamics GP is Microsoft's ERP and accounting software package for mid-market businesses. GP has standard accounting functions, but it can do more, such as Sales Order Processing, Order Management, Inventory Management, and Demand Planner for forecasting, which makes it usable as a full-blown ERP. GP was known as Great Plains Software before its acquisition by Microsoft. The most recent versions of GP are Microsoft Dynamics GP 10.0 and Dynamics GP 2010 R2.

GP holds your organization's financial data. If you use it as an ERP solution, it holds even more critical data, and losing this data could be devastating to an organization. Yes, there is a built-in backup utility in GP, but it does not cover all bases in protecting your GP deployment. In fact, the built-in backup process only backs up the SQL database, and does not cover items like:

- Customized forms
- Reports
- Financial statement formats
- The sysdata folder

These are the GP components you should protect with DPM:

- SQL administrative databases: master and model (tempdb is re-created each time SQL Server starts and cannot be backed up)
- Microsoft Dynamics GP system database (DYNAMICS)
- Each of your company databases
- The msdb database, if you use SQL Server Agent to schedule automatic tasks
- forms.dic (for customized forms), found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010
- reports.dic (for reports), found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010

Backing up these components with DPM should be sufficient protection in the event a restore is needed.

Protecting TMG 2010 with DPM

Threat Management Gateway (TMG) is a part of the Forefront product family. The predecessor to TMG is Internet Security and Acceleration Server (ISA Server).
TMG is fundamentally a firewall, but a very powerful one with features such as VPN, web caching, reverse proxy, advanced stateful packet inspection, WAN failover, malware protection, routing, load balancing, and much more.

There have been several threads on the Microsoft DPM TechNet forums asking about DPM protecting TMG, which sparked the inclusion of this section in the book. TMG is a critical part of networks and should have high priority with regard to backup, right up there with your other critical business applications. In many environments, if TMG is down, a good number of users cannot access certain business applications, which causes downtime. Let's take a look at how and what to protect with regard to TMG.

The first step is to allow DPM traffic on TMG so that the agent can communicate with DPM. You will need to install the DPM agent on TMG and then start protecting it from there. Follow these steps to protect your TMG server:

1. On the TMG server, go to Start | All Programs | Microsoft TMG Server.
2. Open the TMG Server Management MMC.
3. Expand Arrays and then TMG Server computer, then click on Firewall Policy.
4. On the View menu, click on Show System Policy Rules.
5. Right-click on the Allow remote management from selected computers using MMC system policy rule. Select Edit System Policy.
6. In the System Policy Editor dialog box, click to clear the Enable this configuration group checkbox, and then click on OK.
7. Click on Apply to update the firewall configuration, and then click on OK.
8. Right-click on the Allow RPC from TMG server to trusted servers system policy rule. Select Edit System Policy.
9. In the System Policy Editor dialog box, click to clear the Enforce strict RPC compliance checkbox, and then click on OK.
10. Click on Apply to update the firewall configuration, and then click on OK.
11. On the View menu, click on Hide System Policy Rules.
12. Right-click on Firewall Policy. Select New and then Access Rule.
13. In the New Access Rule Wizard window, type a name in the Access rule name box. Click on Next.
14. Check the Allow checkbox and then click on Next.
15. In the This rule applies to list, select All outbound traffic from the drop-down menu and click on Next.
16. On the Access Rule Sources page, click on Add.
17. In the Add Network Entities dialog window, click on New and select Computer from the drop-down list.
18. Type the name of your DPM server and type the DPM server's IP address in the Computer IP Address field. Click on OK when you are done.
19. You will then see your DPM server listed under the Computers folder in the Add Network Entities window. Select it and click on Add. This will bring the DPM computer into your access rule wizard. Click on Next.
20. In the Add Rule Destinations window, click on Add. The Add Network Entities window will come up again. In this window, expand the Networks folder, select Local Host, and click on Add. Now click on Next.
21. Your rule should have both the DPM server and Local Host listed for both incoming and outgoing.
22. Click on Next, leave the default All Users entry in the This rule applies to requests from the following user sets box, and click on Next again. Click on Finish.
23. Right-click on the new rule (DPM2010 in this example), and then click on Move Up.
24. Right-click on the new rule, and select Properties.
25. In the rule name properties dialog box (DPM2010 Properties), click on the Protocols tab, then click on Filtering. Now select Configure RPC Protocol.
26. In the Configure RPC protocol policy dialog box, check the Enforce strict RPC compliance checkbox, and then click on OK twice.
27. Click on Apply to update the firewall policy, and then click on OK.

Now you will need to attach the DPM agent for the TMG server. Follow these steps to complete this task:

1. Open the DPM Administrator Console.
2. Click on the Management tab on the navigation bar.
3. Click on the Agents tab.
4. On the Actions pane, click on Install.
5. The Protection Agent Install Wizard window should now pop up. Choose the Attach agents checkbox. Choose Computer on trusted domain, and click on Next.
6. Select the TMG server from the list, click on Add, and then click on Next.
7. Enter credentials for the domain account. The account that is used here needs to have administrative rights on the computer you are going to protect. Click on Next to continue.
8. You will receive a warning that DPM cannot tell whether the TMG server is clustered or not. Click on OK.
9. On the next screen, click on Attach to continue.

Next you have to install the agent on the TMG firewall and point it to the correct DPM server. Follow these steps to complete this task:

1. From the TMG server that you will be protecting, access the DPM server over the network and copy the folder that contains the agent installer down to the local machine. Use this path: \\DPMSERVERNAME\%systemdrive%\Program Files\Microsoft DPM\DPM\ProtectionAgents\RA\3.0\3.0.7696.0\i386.
2. From the local folder on the protected computer, run dpmra.msi to install the agent.
3. Open a command prompt (make sure you have elevated privileges), change directory to C:\Program Files\Microsoft Data Protection Manager\DPM\bin, and then run the following:

    SetDpmServer.exe -dpmServerName <serverName> -userName <userName>

   The following is an example of the previous command:

    SetDpmServer.exe -dpmServerName buchdpm

4. Restart the TMG server. Once your TMG server comes back, check the Windows services to make sure that the DPMRA service is set to automatic, and then start it.

That is it for configuring DPM to start protecting TMG, but there are a few more things that we still need to cover on this topic.

With TMG backup you can choose to back up certain components of TMG, depending on your recovery needs. With DPM you can back up the TMG hard drive, the TMG logs that are stored in SQL, the TMG system state, or a bare metal recovery (BMR) image of TMG.
The following is the list of components you should back up, depending on your circumstances. What can be included in a TMG server backup:

- TMG configuration settings (exported through TMG)
- TMG firewall settings (exported through TMG)
- TMG logfiles (stored in SQL databases)
- TMG install directory (only needed if you have custom forms for things such as an Outlook Web Access login screen)
- TMG server system state
- TMG BMR

None of the previous components are required for protection of TMG. In fact, protecting the SQL logfiles tends to cause more issues than it solves. These SQL log databases change so often that DPM will raise an error when the old SQL databases are no longer present under protection. The logfiles are not required to restore your TMG.

For a standard TMG restore, you will need to reinstall TMG, reconfigure NIC settings, import any certificates, and restore the TMG configuration and firewall settings. For more information on backing up TMG 2010, visit the following page: http://technet.microsoft.com/en-us/library/cc984454.aspx.

DPM cannot back up the TMG configuration and firewall settings natively. The export needs to be scripted and scheduled through Windows Task Scheduler, with the output placed on the local hard drive. DPM can back up the exported .xml settings file from there. You can find the TMG server's export script at http://msdn.microsoft.com/en-us/library/ms812627.aspx. Place this script into a .vbs file, and then set up a scheduled task to run this file. This automates the export of your TMG server settings.

There is another way to back up the entire TMG server. This is a new type of protection, specific to TMG 2010. This protection is BMR, and it is available because TMG is now installed on top of Windows Server 2008 and Windows Server 2008 R2. Protecting the BMR of your TMG gives you the ability to restore your entire TMG in the event that it fails, configuration and firewall settings included.
BMR will also bring back certificates and NIC settings. Note that a BMR of TMG restored on a virtual machine can't use its NIC settings. This only works on the same hardware.

That covers how to protect TMG with DPM. As you can see, there are some improvements through BMR, and if you do not employ BMR protection you can still automate the process of protecting TMG.

How to protect IIS

Internet Information Services (IIS) is Microsoft's web server platform. It is included for free with Windows Server operating systems. Its modular nature makes it scalable for different organizations' web server needs. The latest version is IIS 8. It can be used for more than standard web hosting, for example as an FTP server or for media delivery.

Knowing what to protect when it comes to IIS will come in handy in almost any environment you may work in. Backing up IIS is one thing, but you also need to understand the websites or web applications you are running, so that you know how to back them up too. In this section, we are going to look at the protection of IIS.

To protect IIS, you should back up the following components:

- IIS configuration files
- Website or web application data
- SSL certificates
- Registry (only needed if the website or web application required modifications of the registry)
- Metabase

The IIS configuration files are located in the %systemdrive%\windows\system32\inetsrv\config directory (and subdirectories).

The website or web application files are typically found in C:\inetpub\wwwroot. This is the default location, but the website or web application files can be located anywhere on an IIS server.

To export SSL certificates directly from IIS, follow these steps:

1. Open the Microsoft IIS 7 console.
2. In the left-hand pane, select the server name.
3. In the center pane, click on the Server Certificates icon.
4. Right-click on the certificate you wish to export and select Export.
5. Enter a file path, name the certificate file, and give it a password.
6. Click on OK and your certificate will be exported as a .pfx file in the path you specified.

Metabase is an internal database that holds IIS configuration data. It is made up of two files: MBSchema.xml and MetaBase.xml. These can be found in %SystemRoot%\system32\inetsrv.

A good thing to know is that if you protect the system state of a server, the IIS configuration will be included in this backup. This does not include the website or web application files, so you will still need to protect these in addition to a system state backup.

That covers the items you will need to protect IIS with DPM backup.

Protecting Lync 2010 with DPM

Lync 2010 is Microsoft's Unified Communications platform, complete with IM, presence, conferencing, enterprise video and voice, and more. Lync was formerly known as Office Communicator. Lync is quickly becoming an integral part of business communications. With Lync being a critical application to organizations, it is important to ensure this platform is backed up.

Lync is a massive product with many moving parts. We are not going to cover all of Lync's architecture, as this would need its own book. We are going to focus on what should be backed up to ensure protection of your Lync deployment. Overall, we want to protect Lync's settings and configuration data. The majority of this data is stored in the Lync Central Management store.

The following are the components that need to be protected in order to back up Lync:

- Settings and configuration data
    - Topology configuration (Xds.mdf)
    - Location information (Lis.mdf)
    - Response group configuration (RgsConfig.mdf)
- Data stored in databases
    - User data (Rtc.mdf)
    - Archiving data (LcsLog.mdf)
    - Monitoring data (csCDR.mdf and QoeMetrics.mdf)
- File stores
    - Lync server file store
    - Archiving file store

These stores will be file shares on the Lync server, named in the format \\lyncservername\sharename.
To track down these file shares if you don't know where they are, go to the Lync Topology Builder and look in the File stores node. Note that the files named Meeting.Active should not be backed up. These files are in use and locked while a meeting takes place.

Other components are as follows:

- Active Directory (user SIP data, a pointer to the Central Management store, and objects for Response Group and Conferencing Attendant)
- Certification authority (CA) and certificates (if you use an internal CA)
- Microsoft Exchange and Exchange Unified Messaging (UM), if you are using UM with your Exchange
- Domain Name System (DNS) records and IP addresses
- IIS on Lync Server
- DHCP configuration
- Group Chat (if used)
- XMPP gateways, if you are using an XMPP gateway
- Public switched telephone network (PSTN) gateway configuration, if your Lync is connected to one
- Firewall and load balancer configurations (if used)

Summary

Now that we have had a chance to look at several Microsoft workloads that are used in organizations today and how to protect them with DPM, you should have a good understanding of what it takes to back them up. These workloads included Lync 2010, IIS, CRM, GP, DFS, and TMG. Note that there are many more Microsoft workloads that DPM cannot protect natively, which we were unable to cover in this article.

Packt

06 Sep 2013

7 min read

Out-of-process distributed caching

Getting ready

Out-of-process caching is a way of distributing your caching needs to a different JVM and/or infrastructure. Ehcache provides a convenient deployable WAR file that works on most web containers/standalone servers and whose mission is to provide an easy API to a distributed cache. At the time of writing, you can download it from http://sourceforge.net/projects/ehcache/files/ehcache-server/, or you can include it in your Maven POM and it will be delivered as a WAR file.

The cache server requires no special configuration on the Tomcat container. However, if you are running GlassFish, Jetty, WebLogic, or any other application server (or servlet container), there are minimal configuration changes to make. Please refer to the Ehcache cache server documentation for details on these.

When using the RESTful interface, it is important to note that you have three ways to set the MIME type for exchanging data back and forth with the cache server, namely text/xml, application/json, and application/x-java-serialized-object.

You can use any programming language to invoke the web service interface and cache your objects (except with application/x-java-serialized-object, since that format can only be produced and read by Java).

Refer to the recipe8 project directory within the source code bundle for a fully working sample of this recipe content and further information related to this topic.

How to do it...

1. Add the Ehcache and Ehcache cache server dependencies to your pom.xml file.

    <dependency>
        <groupId>net.sf.ehcache</groupId>
        <artifactId>ehcache-server</artifactId>
        <version>1.0.0</version>
        <type>war</type>
    </dependency>
    <dependency>
        <groupId>net.sf.ehcache</groupId>
        <artifactId>ehcache</artifactId>
        <version>2.6.0</version>
        <type>pom</type>
    </dependency>

2. Edit ehcache.xml in the cache server to hold your cache setup (the cache name is very important). You can find this file here: ${CACHE_SERVER}/WEB-INF/classes/ehcache.xml.
    <?xml version="1.0" encoding="UTF-8"?>
    <ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="ehcache.xsd"
             updateCheck="true" monitoring="autodetect" dynamicConfig="true">
        <cache name="remoteCache" maxElementsInMemory="10000" eternal="true"
               diskPersistent="true" overflowToDisk="true"/>
    ...

3. Disable the SOAP interface in the cache server web.xml file (since we are going to use RESTful). You can find this file here: ${CACHE_SERVER}/WEB-INF/web.xml.

    <?xml version="1.0" encoding="UTF-8"?>
    <web-app xmlns="http://java.sun.com/xml/ns/javaee"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
             version="2.5">
    ...
    ...

4. Make your objects-to-be-cached serializable:

    import java.io.Serializable;

    public final class Item implements Serializable {
        ...
    }

5. Invoke the RESTful (or SOAP) interface to save/retrieve/delete cached objects:

    ...
    public void saveItemInCache(String key, Serializable item) {
        // sample URL: http://localhost:8080/ehcache/rest/cacheName/{id}
        // here cacheName is the cache name you set up in the cache-server ehcache.xml
        String url = CACHE_SERVER_URL + "cacheName" + "/" + key;
        // initialize Apache HTTP Client
        client = new DefaultHttpClient();
        // create Cache Element to be sent
        Element element = new Element(key, item);
        // serialize object to be sent to EhCache Server
        byte[] itemToByteArray = SerializationUtils.serialize(element);
        // create PUT request
        HttpPut putRequest = new HttpPut(url);
        // set header to read java-serialized-object mime type
        putRequest.setHeader("Content-Type", "application/x-java-serialized-object");
        ...

How it works...

The Ehcache cache server utility is a versatile tool that lets us distribute cache engines in a very flexible way. It exposes a very simple API via RESTful or SOAP-based web services.

We start by editing the ehcache.xml configuration file within the cache server application, adding a cache that we would like to use for our cached objects:

    ...
    <cache name="remoteCache" maxElementsInMemory="10000" eternal="true"
           diskPersistent="true" overflowToDisk="true"/>
    ...

The cache name defined here is very important because it becomes the endpoint in the RESTful URL pattern that the cache server will identify and use.

Then, we need to edit the web.xml file within the cache server application (located in ${CACHE_SERVER}/WEB-INF/) in order to comment out (or completely remove) the service definitions that we are not going to use (that is, SOAP if you are using RESTful, or vice versa).

    <cache name="remoteCache" maxElementsInMemory="10000" ...

Now, your URL would be something like this:

    //sample URL: http://localhost:8080/ehcache/rest/remoteCache/{id}

Here, id is just the key value you assign to the cached object.

Finally, you just use any HTTP/SOAP client library (or the default Java networking API classes) to invoke the web service. In the case of RESTful services, you need to be aware that the HTTP method sent determines whether you are storing, updating, retrieving, or deleting a cached item. The methods are as follows:

- GET /{cache}/{element}: This retrieves an object by its key from the out-of-process (O-O-P) cache layer.
- PUT /{cache}/{element}: This stores an item in the O-O-P cache layer.
- DELETE /{cache}/{element}: This deletes an item from the O-O-P cache layer.
- HEAD /{cache}/{element}: This retrieves metadata (cache configuration values) from the O-O-P cache layer.
- OPTIONS /{cache}/{element}: This returns the WADL describing the operations.

To change the context, you can edit the file ${CACHE_SERVER}/META-INF/context.xml and enter your desired context name.

As for security, look for the file ${CACHE_SERVER}/WEB-INF/server_security_config.xml_rename_to_activate and open it to read the instructions.

Summary

This article provided details on implementing distributed caching using the Ehcache server, and also explained in brief what out-of-process caching is.
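
To make the URL pattern and HTTP verbs described in this recipe more concrete, here is a minimal sketch that uses only the JDK's HttpURLConnection instead of Apache HttpClient. It is an illustration under stated assumptions, not the recipe's own code: the class and method names are hypothetical, the base URL is copied from the sample URL above and assumes the cache server is deployed under the /ehcache context, and the caller chooses one of the three MIME types mentioned earlier.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper around the cache server's REST URL pattern.
final class CacheRestSketch {
    // Assumed base URL; matches the sample URL shown in the recipe.
    static final String BASE_URL = "http://localhost:8080/ehcache/rest";

    private CacheRestSketch() {}

    /** Builds {base}/{cache}/{key}, percent-encoding the key so it stays one path segment. */
    static String elementUrl(String baseUrl, String cacheName, String key) {
        try {
            // URLEncoder targets form data, so turn its "+" (space) into "%20" for a path.
            String encodedKey = URLEncoder.encode(key, "UTF-8").replace("+", "%20");
            return baseUrl + "/" + cacheName + "/" + encodedKey;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** PUT stores (or updates) an element; returns the HTTP status code. */
    static int put(String url, String contentType, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", contentType);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** DELETE removes an element; returns the HTTP status code. */
    static int delete(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("DELETE");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only builds the URL; nothing is sent unless put() or delete() is called.
        System.out.println(elementUrl(BASE_URL, "remoteCache", "item-42"));
    }
}
```

With a cache server running, a call such as put(elementUrl(BASE_URL, "remoteCache", "item-42"), "text/xml", bytes) would store an element in the remoteCache cache, and delete() on the same URL would remove it. Encoding the key matters because a key containing "/" would otherwise change the path the server sees.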

Packt

04 Oct 2013

7 min read

CQL for client applications

Using the Thrift API

The Thrift library is based on the Thrift RPC protocol. High-level clients built over it have been a standard way of building an application for a long time. In this section, we'll explain how to write a client application using CQL as the query language and Thrift as the Java API.

When we start Cassandra, by default it listens to Thrift clients (the start_rpc: true property in the CASSANDRA_HOME/conf/cassandra.yaml file enables this). Let's build a small program that connects to Cassandra using the Thrift API, and runs CQL 3 queries for reading/writing data in the UserProfiles table we created for the facebook application.

The program can be built by performing the following steps:

Downloading the Thrift library: You need to add apache-cassandra-thrift-1.2.x.jar (found in the CASSANDRA_HOME/lib folder) to your classpath. If your Java project is mavenized, you need to insert the following entry in pom.xml under the dependencies section (the version will vary depending upon your Cassandra server installation):

    <dependency>
        <groupId>org.apache.cassandra</groupId>
        <artifactId>cassandra-thrift</artifactId>
        <version>1.2.5</version>
    </dependency>

Connecting to Cassandra: To connect to the Cassandra server on a given host and port, you need to open an org.apache.thrift.transport.TTransport to the Cassandra node and create an instance of org.apache.cassandra.thrift.Cassandra.Client as follows:

    TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
    TProtocol protocol = new TBinaryProtocol(transport);
    Cassandra.Client client = new Cassandra.Client(protocol);
    transport.open();
    client.set_cql_version("3.0.0");

The default CQL version for Thrift is 2.0.0. You must set it to 3.0.0 if you are writing CQL 3 queries and don't want to see any version-related errors.
After you are done with the transport, close it gracefully (usually at the end of read/write operations) as follows:

    transport.close();

Creating a schema: The executeQuery() utility method accepts a CQL 3 query as a String and runs it:

    CqlResult executeQuery(String query) throws Exception {
        return client.execute_cql3_query(
            ByteBuffer.wrap(query.getBytes("UTF-8")),
            Compression.NONE,
            ConsistencyLevel.ONE);
    }

Now, create the keyspace and the table by directly executing CQL 3 queries:

    // Create keyspace
    executeQuery("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':3};");
    executeQuery("USE facebook;");
    // Create table
    executeQuery("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data: A couple of records can be inserted as follows:

    executeQuery("USE facebook;");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Executing the SELECT query returns a CqlResult, which we can iterate over easily to fetch records:

    CqlResult result = executeQuery("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (CqlRow row : result.getRows()) {
        System.out.println(row.getKey());
    }

Using the Datastax Java driver

The Datastax Java driver is based on the Cassandra binary protocol that was introduced in Cassandra 1.2, and works only with CQL 3. The Cassandra binary protocol is made specifically for Cassandra, in contrast to Thrift, which is a generic framework and has many limitations.
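
The INSERT statements in this article are assembled as plain strings, which works for fixed sample data but breaks as soon as a value contains a single quote. As a hedged illustration, the sketch below builds the same kind of statement from variables using only the JDK. The class and method names are hypothetical, and it relies on two CQL 3 rules: a single quote inside a string literal is escaped by doubling it, and a blob literal is written as 0x followed by hexadecimal digits.

```java
// Hypothetical helper for rendering Java values as CQL 3 literals.
final class CqlLiterals {
    private CqlLiterals() {}

    /** Wraps a value in single quotes, doubling any embedded single quote. */
    static String text(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Renders bytes as a CQL blob literal such as 0x8e37. */
    static String blob(byte[] bytes) {
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Builds an INSERT for the UserProfiles table used in this article. */
    static String insertUserProfile(String emailId, String password, String name,
                                    int age, byte[] picture) {
        return "INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES("
                + text(emailId) + "," + text(password) + "," + text(name) + ","
                + age + "," + blob(picture) + ");";
    }

    public static void main(String[] args) {
        System.out.println(insertUserProfile(
                "john.smith@example.com", "p4ssw0rd", "John Smith", 32,
                new byte[] {(byte) 0x8e, 0x37}));
    }
}
```

The resulting string can be passed to whichever query method the client exposes. Where the client supports prepared statements, binding values through them is usually the safer choice than building literals by hand.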
Now, we are going to write a Java program that uses the Datastax Java driver for reading/writing data into Cassandra, by performing the following steps:

Downloading the driver library: The driver library JAR file must be in your classpath in order to build an application using it. If you have a Maven-based Java project, you need to insert the following entry into the pom.xml file under the dependencies section:

    <dependency>
        <groupId>com.datastax.cassandra</groupId>
        <artifactId>cassandra-driver-core</artifactId>
        <version>1.0.1</version>
    </dependency>

This driver project is hosted on GitHub (https://github.com/datastax/java-driver). It makes sense to check for and download the latest version.

Configuring Cassandra to listen to native clients: In newer versions of Cassandra, this is enabled by default and Cassandra will listen to clients using the binary protocol. Earlier Cassandra installations may require enabling it. All you have to do is check and enable the start_native_transport property in the CASSANDRA_HOME/conf/cassandra.yaml file by inserting/uncommenting the following line:

    start_native_transport: true

The port that Cassandra will use for listening to native clients is determined by the native_transport_port property.

It is possible for Cassandra to listen to both Thrift and native clients simultaneously. If you want to disable Thrift, just set the start_rpc property to false in CASSANDRA_HOME/conf/cassandra.yaml.

Connecting to Cassandra: The com.datastax.driver.core.Cluster class is the entry point for clients to connect to the Cassandra cluster:

    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

After you are done using it (usually when the application shuts down), close it gracefully:

    cluster.shutdown();

Creating a session: An object of com.datastax.driver.core.Session allows you to execute a CQL 3 statement.
The following line creates a Session instance:

    Session session = cluster.connect();

Creating a schema: Before reading or writing data, let's create a keyspace and a table similar to UserProfiles in the facebook application we built earlier:

    // Create keyspace
    session.execute("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':1};");
    session.execute("USE facebook");
    // Create table
    session.execute("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data: We can insert a couple of records as follows:

    session.execute("USE facebook");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Finding and printing records: A SELECT query returns an instance of com.datastax.driver.core.ResultSet. You can fetch individual rows by iterating over it using the com.datastax.driver.core.Row object:

    ResultSet results = session.execute("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (Row row : results) {
        System.out.println("Email: " + row.getString("email_id")
            + "\tName: " + row.getString("name")
            + "\tAge: " + row.getInt("age"));
    }

Deleting records: We can delete a record as follows:

    session.execute("DELETE FROM facebook.UserProfiles WHERE email_id='john.smith@example.com';");

Using high-level clients

In addition to the libraries based on the Thrift and binary protocols, some high-level clients are built to ease development and provide additional services, such as connection pooling, load balancing, failover, secondary indexing, and so on.
Some of them are listed here:

Astyanax (https://github.com/Netflix/astyanax): Astyanax is a high-level Java client for Cassandra. It allows you to run both simple and prepared CQL queries.

Hector (https://github.com/hector-client/hector): Hector is a high-level client for Cassandra. At the time of writing this book, it supported CQL 2 only (not CQL 3).

Kundera (https://github.com/impetus-opensource/Kundera): Kundera is a JPA 2.0-based object-datastore mapping library for Cassandra and many other NoSQL datastores. CQL 3 queries are run with Kundera using native queries as described in the JPA specification.

Summary

In this article, we learned how to run CQL queries using the three methods described above.



Aaron Lazar

15 Feb 2018

7 min read

How to make efficient data-driven decisions


Note: This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.

Today we'll learn how to make data-driven decisions with the help of a few examples. The growing demand for data is a key challenge. Decision support teams, such as institutional research and business intelligence, often struggle to make the right decisions about how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving a decision, the real goal is making the right decision at the right time. In other words, the goal is decision support, not data support. This can be achieved through advanced use of data management and analytics.

Data value chain for making decisions

Figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data value chain that leads to actual decisions, which is the goal. The value chain starts with the data discovery stage, which consists of several steps such as data collection, annotation, and data preparation, followed by organizing the data in a logical order with the desired flow. Then comes data integration, which establishes a common representation of the data. Since the target is to make the right decision, it is important to record the provenance of the data (that is, where it comes from) for future reference.

Once your data is integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualizing it, before deciding which actions to take on the basis of the interpreted results. However, is this enough to make the right decision?
Probably not! The reason is that this process still lacks the analytics that turn data into actionable insight for the decision. Predictive analytics fills that gap. The following section shows how, using an example.

From disaster to decision – Titanic survival example

Here is the challenge, Titanic–Machine Learning from Disaster from Kaggle (https://www.kaggle.com/c/titanic):

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy"

Before going deeper, we need to know about the data on the passengers travelling on the Titanic during the disaster, so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. Table 1 shows the metadata about the Titanic survival dataset. A snapshot of the dataset can be seen as follows. The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is necessary first.
First, we need to import the necessary packages and libraries:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

Now read the dataset and create a pandas DataFrame:

    df = pd.read_csv('/home/asif/titanic_data.csv')

Before drawing the distribution of the dataset, let's specify the parameters for the graph:

    fig = plt.figure(figsize=(18,6), dpi=1600)
    alpha_scatterplot = 0.2
    alpha_bar_chart = 0.55
    fig = plt.figure()
    ax = fig.add_subplot(111)

Draw a bar diagram showing who survived versus who did not:

    ax1 = plt.subplot2grid((2,3),(0,0))
    ax1.set_xlim(-1, 2)
    df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    plt.title("Survival distribution: 1 = survived")

Plot a graph showing survival by age:

    plt.subplot2grid((2,3),(0,1))
    plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot)
    plt.ylabel("Age")
    plt.grid(True, which='major', axis='y')
    plt.title("Survival by Age: 1 = survived")

Plot a graph showing the distribution of the passenger classes:

    ax3 = plt.subplot2grid((2,3),(0,2))
    df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
    ax3.set_ylim(-1, len(df.Pclass.value_counts()))
    plt.title("Class dist. of the passengers")

Plot kernel density estimates of the passengers' age within each class:

    plt.subplot2grid((2,3),(1,0), colspan=2)
    df.Age[df.Pclass == 1].plot(kind='kde')
    df.Age[df.Pclass == 2].plot(kind='kde')
    df.Age[df.Pclass == 3].plot(kind='kde')
    plt.xlabel("Age")
    plt.title("Age dist. within class")
    plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')

Plot a graph showing passengers per boarding location:

    ax5 = plt.subplot2grid((2,3),(1,2))
    df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
    ax5.set_xlim(-1, len(df.Embarked.value_counts()))
    plt.title("Passengers per boarding location")

Finally, we show all the subplots together:

    plt.show()

The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location.

To execute the preceding code, you need to install several packages: pandas, matplotlib, and scipy.

Installing pandas: pandas is a Python package for data manipulation. It can be installed as follows:

    $ sudo pip3 install pandas   # for Python 3.x
    $ sudo pip install pandas    # for Python 2.7

Installing matplotlib: matplotlib is the plotting library used in the preceding code. It can be installed as follows:

    $ sudo apt-get install python-matplotlib    # for Python 2.7
    $ sudo apt-get install python3-matplotlib   # for Python 3.x

Installing scipy: scipy is a Python package for scientific computing. BLAS, LAPACK, and gfortran are prerequisites for it. Execute the following commands in your terminal:

    $ sudo apt-get install libblas-dev liblapack-dev
    $ sudo apt-get install gfortran
    $ sudo pip3 install scipy   # for Python 3.x
    $ sudo pip install scipy    # for Python 2.7

On a Mac, use the following commands to install the above modules. The libblas-dev, liblapack-dev, and gfortran packages are Debian/Ubuntu system packages and cannot be installed through pip; pip typically installs prebuilt packages on macOS, so they are usually not needed there:

    $ sudo easy_install pip
    $ sudo pip install pandas matplotlib scipy

On Windows, I am assuming that Python 2.7 is already installed at C:\Python27. Open the command prompt and type the following commands:

    C:\Users\admin-karim> cd C:\Python27
    C:\Python27> python -m pip install <package_name>   # provide the package name accordingly
For Python 3, issue the following commands:

    C:\Users\admin-karim> cd C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts
    C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts> python3 -m pip install <package_name>

Well, we have seen the data. Now it's your turn to do some analytics on top of it, say, predicting what kinds of people survived the disaster. We have enough information about the passengers, but how could we do the predictive modeling so that we can draw some fairly straightforward conclusions from this data? For example, being a woman, being in 1st class, and being a child were all factors that could boost a passenger's chances of survival during this disaster. In a brute-force approach, for example using if/else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, does writing such a program in Python make much sense? It would be tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in, so that you can build a program that learns from sample data to predict whether a given passenger would survive.

If you found this post useful and would like to explore more, head over to grab the book, Predictive Analytics with TensorFlow, written by Md. Rezaul Karim.
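To make the brute-force idea above concrete, here is a minimal sketch of a hand-written if/else scorer in Python. The weights, the threshold, and the child age cutoff are hypothetical values chosen for illustration; they are not taken from the article or derived from the dataset.

```python
# Hypothetical hand-tuned weights; none of these numbers come from the data.
WEIGHTS = {"female": 2.0, "first_class": 1.0, "child": 1.5}
THRESHOLD = 2.0        # assumed cutoff for predicting "survived"
CHILD_AGE_LIMIT = 16   # assumed definition of "child"


def survival_score(sex, pclass, age):
    """Add up the weights of the factors that apply to one passenger."""
    score = 0.0
    if sex == "female":
        score += WEIGHTS["female"]
    if pclass == 1:
        score += WEIGHTS["first_class"]
    # Age can be missing in the dataset, so treat None as "not a child".
    if age is not None and age < CHILD_AGE_LIMIT:
        score += WEIGHTS["child"]
    return score


def predict_survived(sex, pclass, age):
    """Predict survival when the score reaches the hand-picked threshold."""
    return survival_score(sex, pclass, age) >= THRESHOLD
```

Every new variable means another branch and another weight to tune by hand, which is the tedium the article describes. A learned model estimates such weights from the sample data instead.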


Packt

19 Jan 2017

7 min read

Clustering Model with Spark




Amarabha Banerjee

08 Feb 2018

5 min read

Explaining Data Exploration in under a minute


Note: This article is taken from the book Machine Learning with R, written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.

Today, we shall explore different data exploration techniques and a real-world example of using them.

Introduction

Data exploration is the process of finding insightful information in data. To find insights, various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, six steps are commonly involved. They are as follows:

Asking the right questions: Asking the right questions helps in understanding the objective and the target information sought from the data. Questions can be asked such as "What are my expected findings after the exploration is finished?" or "What kind of information can I extract through the exploration?"

Data collection: Once the right questions have been asked, the target of the exploration is clear. Data may come from various sources such as files, databases, the internet, and so on, and it arrives in unorganized and diverse formats. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most analysis and visualization tools expect data in a certain format to generate results, so the raw data is of no use to them as it is.

Data munging: The raw data collected needs to be converted into the format required by the tools to be used. In this phase, raw data is passed through various processes such as parsing, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform raw data into a format that the analysis and visualization tools understand. Once the data is compatible with the tools, they are used to generate the different results.
Basic exploratory data analysis: Once the data munging is done and the data is formatted for the tools, it can be used for data exploration and analysis. Tools provide various methods and techniques for this. Most analysis tools allow statistical functions to be performed on the data, and visualization tools help in viewing the data in different ways. Combining basic statistical operations with visualization makes the data easier to understand.

Advanced exploratory data analysis: Once the basic analysis is done, it's time to look at an advanced stage of analysis. In this stage, various prediction models are built according to the requirements. Machine learning algorithms are used to train the models and generate inferences. The models are also tuned to ensure their correctness and effectiveness.

Model assessment: When the models are made, they are evaluated to find the best one among them. The major factor in deciding the best model is how closely it can predict the values. Models are also tuned here to increase accuracy and effectiveness. Various plots and graphs are used to inspect the models' predictions.

Real world example - using Air Quality Dataset

The airquality dataset comes bundled with R. It contains daily air quality measurements for New York, recorded over five months from May to September 1973. To view all the available datasets, use the data() function; it displays all the datasets available with the R installation.

How to do it

Perform the following steps to see all the datasets in R and to inspect airquality:

    > data()
    > str(airquality)

Output

    'data.frame': 153 obs. of 6 variables:
     $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : int 5 5 5 5 5 5 5 5 5 5 ...
     $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...
    > head(airquality)

Output

      Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6

How it works

The str command displays the structure of the dataset. As you can see, it contains observations of the ozone, solar, wind, and temp attributes recorded each day for five months. Using the head function, you can see the first few lines of actual data. The dataset is very basic, and it is enough to start processing and analyzing data at a basic level.

More datasets can be found on the Kaggle website, which hosts a wide variety of datasets. Apart from datasets, it also holds many data science competitions aimed at solving real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www.kaggle.com/. Many competitions are organized by large corporations, government agencies, or academia, and many of them have prize money associated with them. The following screenshot shows competitions and prize money. You can simply create an account and start participating in competitions by submitting your code and output, which will then be assessed. The assessment or evaluation criteria are available on the detail page of each competition. By participating on https://www.kaggle.com/, one gains experience in solving real-world problems. It gives you a taste of what data scientists do. The jobs page lists various jobs for data scientists and analysts, and you can apply if a profile matches your interests.

If you liked our post, be sure to check out Machine Learning with R, which covers more useful machine learning techniques with R.
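As a small illustration of the "dealing with missing values" and "basic exploratory data analysis" steps, the sketch below summarizes one column of the six rows printed by head(airquality). It is written in Python rather than R and uses only the standard library; the column_summary helper is a hypothetical name introduced here, and None stands in for R's NA.

```python
from statistics import mean

# The six rows shown by head(airquality); None stands in for R's NA.
rows = [
    {"Ozone": 41, "Solar.R": 190, "Wind": 7.4, "Temp": 67, "Month": 5, "Day": 1},
    {"Ozone": 36, "Solar.R": 118, "Wind": 8.0, "Temp": 72, "Month": 5, "Day": 2},
    {"Ozone": 12, "Solar.R": 149, "Wind": 12.6, "Temp": 74, "Month": 5, "Day": 3},
    {"Ozone": 18, "Solar.R": 313, "Wind": 11.5, "Temp": 62, "Month": 5, "Day": 4},
    {"Ozone": None, "Solar.R": None, "Wind": 14.3, "Temp": 56, "Month": 5, "Day": 5},
    {"Ozone": 28, "Solar.R": None, "Wind": 14.9, "Temp": 66, "Month": 5, "Day": 6},
]


def column_summary(rows, column):
    """Summarize one column, skipping missing values instead of failing on them."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "n": len(values),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


ozone = column_summary(rows, "Ozone")
print(ozone)
```

Counting the missing values separately, rather than silently dropping them, keeps the summary honest about how much of the column it actually describes.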


Packt

28 Oct 2013

12 min read

Low-Level Index Control


Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all users of this full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 shipped with additional similarity models, which allow us to use a different scoring formula for our documents. In this section we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch.

Setting per-field similarity

Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mapping that we use in order to index blog posts (stored in the posts_no_similarity.json file):

    { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

We would like to use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the name of the chosen similarity as its value. Our changed mappings (stored in the posts_similarity.json file) would appear as shown in the following code:

    { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } }

And that's all, nothing more is needed.
After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields. In case of the Divergence from randomness and Information based similarity model, we need to configure some additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section. Default codec properties When using the default codec we are allowed to configure the following properties: min_block_size: It specifies the minimum block size Lucene term dictionary uses to encode blocks. It defaults to 25. max_block_size: It specifies the maximum block size Lucene term dictionary uses to encode blocks. It defaults to 48. Direct codec properties The direct codec allows us to configure the following properties: min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8. low_freq_cutoff: The codec will use a single array object to hold postings and positions that have document frequency lower than this value. It defaults to 32. Memory codec properties By using the memory codec we are allowed to alter the following properties: pack_fst: It is a Boolean option that defaults to false and specifies if the memory structure that holds the postings should be packed into the FST. Packing into FST will reduce the memory needed to hold the data. acceptable_overhead_ratio: It is a compression ratio of the internal structure specified as a float value which defaults to 0.2. When using the 0 value, there will be no additional memory overhead but the returned implementation may be slow. When using the 0.5 value, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead. 
Pulsing codec properties

When using the pulsing codec, we can use the same properties as with the default codec, plus one more property, which is described as follows:

freq_cut_off: It defaults to 1. It is the document frequency at or below which the postings list is written into the term dictionary.

Bloom filter-based codec properties

If we want to configure a bloom filter-based codec, we can use the bloom_filter type and set the following properties:

delegate: It specifies the name of the codec we want to wrap with the bloom filter.

fpp: It is a value between 0 and 1.0 that specifies the desired false positive probability. We can set multiple probabilities depending on the number of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that an fpp value of 0.01 will be used when the number of documents per segment is larger than 10,000, and a value of 0.03 will be used when the number of documents per segment is larger than one million.

For example, we could configure our custom bloom filter-based codec to wrap the direct postings format as shown in the following code (stored in the posts_bloom_custom.json file):

    { "settings" : { "index" : { "codec" : { "postings_format" : { "custom_bloom" : { "type" : "bloom_filter", "delegate" : "direct", "fpp" : "10k=0.03, 1m=0.05" } } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

NRT, flush, refresh, and transaction log

In an ideal search solution, newly indexed data is instantly available for searching. At first glance, this is exactly how ElasticSearch works, even in multiserver environments.
But this is not the truth (or at least not the whole truth), and we will show you why. Let's index an example document to the newly created index by using the following command:

    curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now, we will replace this document and immediately try to find it. In order to do this, we'll use the following command chain:

    curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty

The preceding command will probably result in a response very similar to the following:

    {"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "test", "_id" : "1", "_score" : 1.0, "_source" : { "title": "test" } } ] } }

The output starts with the response to the first command, the indexing request. As you can see, everything is correct, so the second command, the search query, should return the document with the title field set to test2. However, as you can see, it returned the first document. What happened? Before we answer that question, we should take a step back and discuss how the underlying Apache Lucene library makes newly indexed documents available for searching.

Updating index and committing changes

Segments are independent indices, which means that queries run in parallel with indexing should, from time to time, add newly created segments to the set of segments used for searching. Apache Lucene does that by creating subsequent (because of the write-once nature of the index) segments_N files, which list the segments in the index. This process is called committing. Lucene can do this in a secure way: we are sure that either all of the changes or none of them hit the index.
If a failure happens, we can be sure that the index will be in a consistent state. Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command in Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. The Lucene library uses an abstraction class called Searcher to access the index. After a commit operation, the Searcher object should be reopened in order to be able to see the newly created segments. This whole process is called refresh. For performance reasons, ElasticSearch tries to postpone costly refreshes: by default, a refresh is not performed after indexing a single document (or a batch of them), but the Searcher is refreshed every second. This happens quite often, but sometimes applications require the refresh operation to be performed more often than once every second. In that case, either verify the requirements or consider using another technology. If required, it is possible to force a refresh by using the ElasticSearch API. For example, in our case we can add the following command:

    curl -XGET localhost:9200/test/_refresh

If we add the preceding command before the search, ElasticSearch responds as we expected.

Changing the default refresh time

The time between automatic Searcher refreshes can be changed by using the index.refresh_interval parameter, either in the ElasticSearch configuration file or by using the update settings API. For example:

    curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }'

The preceding command changes the automatic refresh to be done every 5 minutes. Please remember that data indexed between refreshes won't be visible to queries. As we said, the refresh operation is costly when it comes to resources. The longer the refresh interval is, the faster your indexing will be.
If you are planning a very heavy indexing run and you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.

The transaction log

Apache Lucene can guarantee index consistency and all-or-nothing indexing, which is great. But this alone cannot ensure that there will be no data loss when a failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handles available to create new index files). Another problem is that frequent commits are costly in terms of performance (as you recall, a single commit triggers the creation of a new segment, and this can trigger segments to merge). ElasticSearch solves those issues by implementing a transaction log. The transaction log holds all uncommitted transactions, and from time to time ElasticSearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks happen automatically, so the user may not be aware that a commit was triggered at a particular moment. In ElasticSearch, the moment when the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing. Please note the difference between the flush and refresh operations. In most cases, refresh is exactly what you want: it is all about making new data available for searching. The flush operation, on the other hand, is used to make sure that all the data is correctly stored in the index and the transaction log can be cleared. In addition to automatic flushing, a flush can be forced manually using the flush API.
For example, we can flush all the data stored in the transaction log for all indices by running the following command:

    curl -XGET localhost:9200/_flush

Or we can run the flush command for a particular index, which in our case is the one called library:

    curl -XGET localhost:9200/library/_flush
    curl -XGET localhost:9200/library/_refresh

In the second example we used it together with the refresh, which opens a new searcher after the data is flushed.

The transaction log configuration

If the default behavior of the transaction log is not enough, ElasticSearch allows us to configure how it is handled. The following parameters can be set in the elasticsearch.yml file, as well as through the index settings update API, to control the transaction log behavior:

index.translog.flush_threshold_period: It defaults to 30 minutes (30m). It controls the time after which a flush will be forced automatically, even if no new data was written to the transaction log. In some cases this can cause a lot of I/O operations, so sometimes it's better to flush more often with less data stored in the log.

index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000.

index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than this parameter, the flush operation will be performed. It defaults to 200 MB.

index.translog.disable_flush: This option disables automatic flush. By default flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during the import of a large number of documents.

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of the index shards.
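The way these thresholds combine can be sketched as a small decision function. This is an illustrative Python model, not ElasticSearch code: the TranslogThresholds class and should_flush function are hypothetical names, the defaults mirror the values listed above, "200 MB" is assumed to mean 200 * 1024 * 1024 bytes, and treating every threshold as "equal to or greater" is an assumption (the text only states it for the size threshold).

```python
from dataclasses import dataclass


@dataclass
class TranslogThresholds:
    """Hypothetical container mirroring the index.translog.* defaults listed above."""
    flush_threshold_ops: int = 5000
    flush_threshold_size_bytes: int = 200 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes
    flush_threshold_period_seconds: int = 30 * 60
    disable_flush: bool = False


def should_flush(ops, size_bytes, seconds_since_flush, thresholds=None):
    """Return True when any enabled threshold has been reached (assumed semantics)."""
    t = thresholds if thresholds is not None else TranslogThresholds()
    if t.disable_flush:
        # Automatic flushing is switched off, for example during a bulk import.
        return False
    return (
        ops >= t.flush_threshold_ops
        or size_bytes >= t.flush_threshold_size_bytes
        or seconds_since_flush >= t.flush_threshold_period_seconds
    )
```

For instance, should_flush(10, 1024, 60) is False with the defaults, while reaching any single threshold is enough to trigger a flush in this model.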
Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using the settings update API. For example:

    curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }'

The preceding command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn flushing back on when the import is done.

Near Real Time GET

The transaction log gives us one more feature for free: the real-time GET operation, which makes it possible to return the latest version of a document, including non-committed versions. The real-time GET operation fetches data from the index, but first it checks whether a newer version of that document is available in the transaction log. If the newest version has not been flushed yet, the data from the index is ignored and the newer version of the document, the one from the transaction log, is returned. In order to see how it works, you can replace the search operation in our example with the following command:

    curl -XGET localhost:9200/test/test/1?pretty

ElasticSearch should return a result similar to the following:

    { "_index" : "test", "_type" : "test", "_id" : "1", "_version" : 2, "exists" : true, "_source" : { "title": "test2" } }

As you can see, the result is just what we expected, and no trick with refresh was required to obtain the newest version of the document.
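The difference between search visibility and real-time GET can be illustrated with a toy model. The Python sketch below is a deliberately simplified assumption, not how ElasticSearch or Lucene is implemented: ToyIndex is a hypothetical class in which one dictionary plays the role of the latest written documents (what the transaction log lets a real-time GET see) and another plays the role of the snapshot held by the current Searcher.

```python
class ToyIndex:
    """Toy model of refresh versus real-time GET; not ElasticSearch internals."""

    def __init__(self):
        self._latest = {}   # doc id -> newest source, visible to get() immediately
        self._visible = {}  # doc id -> source as seen by the current searcher

    def index(self, doc_id, source):
        # Writing a document does not change what searches can see.
        self._latest[doc_id] = dict(source)

    def refresh(self):
        # Reopening the searcher makes everything written so far searchable.
        self._visible = {doc_id: dict(src) for doc_id, src in self._latest.items()}

    def search_titles(self):
        return sorted(src["title"] for src in self._visible.values())

    def get(self, doc_id):
        # Real-time GET: always answer with the newest version.
        return self._latest.get(doc_id)


# Replay the scenario from the text.
toy = ToyIndex()
toy.index("1", {"title": "test"})
toy.refresh()                        # the periodic refresh has happened
toy.index("1", {"title": "test2"})   # replace the document
stale = toy.search_titles()          # search still sees the old version
fresh = toy.get("1")                 # real-time GET sees the new one
toy.refresh()
after = toy.search_titles()          # after a refresh, search catches up
```

In this model the search issued right after the second write still returns the old title, while the GET by ID returns the new one, which matches the behavior described in the example above.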


Packt

24 Sep 2013

6 min read

Introducing QlikView elements


People People are the only active element of data visualization, and as such, they are the most important. We briefly describe the roles of several people who participate in our project, but we mainly focus on the person who is going to analyze and visualize the data. After the meeting, we get together with our colleague, Samantha, who is the analyst that supports the sales and executive teams. She currently manages a series of highly personalized Excel files that she creates from standard reports generated within the customer invoice and project management system. Her audience ranges from the CEO down to sales managers. She is not a pushover, but she is open to trying new techniques, especially given that the sponsor of this project is the CEO of QDataViz, Inc. As a data discovery user, Samantha possesses the following traits: Ownership She has a stake in the project's success or failure. She, along with the company, stands to grow as a result of this project, and most importantly, she is aware of this opportunity. Driven She is focused on grasping what we teach her and is self-motivated to continue learning after the project is finished. The cause of her drive is unimportant as long as she remains honest. Honest She understands that data is a passive element that is open to diverse interpretations by different people. She resists basing her arguments on deceptive visualization techniques or data omission. Flexible She does not endanger her job and company results by following every technological fad or whimsical idea. However, she realizes that technology does change and that a new approach can foment breakthroughs. Analytical She loves finding anomalies in the data and being the reason that action is taken to improve QDataViz, Inc. As a means to achieve what she loves, she understands how to apply functions and methods to manipulate data. 
Knowledgeable She is familiar with the company's data, and she understands the indicators needed to analyze its performance. Additionally, she serves as a data source and gives context to analysis. Team player She respects the roles of her colleagues and holds them accountable. In turn, she demands respect and is also obliged to meet her responsibilities. Data Our next meeting involves Samantha and Ivan, our Information Technology (IT) Director. While Ivan explains the data available in the customer invoice and project management system's well-defined databases, Samantha adds that she has vital data in Microsoft Excel that is missing from those databases. One Excel file contains the sales budget and another contains an additional customer grouping; both files are necessary to present information to the CEO. We take advantage of this discussion to highlight the following characteristics that make data easy to analyze. Reliable Ivan is going to document the origin of the tables and fields, which increases Samantha's confidence in the data. He is also going to perform a basic data cleansing and eliminate duplicate records whose only difference is a period, two transposed letters, or an abbreviation. Once the system is operational, Ivan will consider the impact any change in the customer invoice and project management system may have on the data. He will also verify that the data is continually updated while Samantha helps confirm the data's validity. Detailed Ivan will preserve as much detail as possible. If he is unable to handle large volumes of data as a whole, he will segment the detailed data by month and reduce the detail of a year's data in a consistent fashion. Conversely, he will consider adding detail by prorating payments between the products of paid invoices in order to maintain a consistent level of detail between invoices and payments. Formal An Excel file as a data source is a short-term solution. 
While Ivan respects its temporary use to allow for a quick, first release of the data visualization project, he takes responsibility to find a more stable medium to long-term solution. In the span of a few months, he will consider modifying the invoice system, investing in additional software, or creating a simple portal to upload Excel files to a database. Flexible Ivan will not prevent progress solely for bureaucratic reasons. Samantha respects that Ivan's goal is to make data more standardized, secure, and recoverable. However, Ivan knows that if he does not move as quickly as business does, he will become irrelevant as Samantha and others create their own black market of company data. Referential Ivan is going to make available manifold perspectives of QDataViz, Inc. He will maintain history, budgets, and forecasts by customers, salespersons, divisions, states, and projects. Additionally, he will support segmenting these dimensions into multiple groups, subgroups, classes, and types. Tools We continue our meeting with Ivan and Samantha, but we now change our focus to what tool we will use to foster great data visualization and analysis. We create the following list of basic features we hope from this tool: Fast and easy implementation We should be able to learn the tool quickly and be able to deliver a first version of our data visualization project within a matter of weeks. In this fashion, we start receiving a return on our investment within a short period of time. Business empowerment Samantha should be able to continue her analysis with little help from us. Also, her audience should be able to easily perform their own lightweight analysis and follow up on the decisions made. Enterprise-ready Ivan should be able to maintain hundreds or thousands of users and data volumes that exceed 100 million rows. He should also be able to restrict access to certain data to certain users. 
Finally, he needs to have the confidence that the tools will remain available even if a server fails. Based on these expectations, we talk about data discovery tools, which are increasingly becoming part of the architecture of many organizations. Samantha can use these tools for self-service data analysis. In other words, she can create her own data visualizations without having to depend on pre-built graphs or reports. At the same time, Ivan can be reassured that the tool does not interfere with his goal of providing an enterprise solution that offers scalability, security, and high availability. The data discovery tool we are going to use is QlikView, and the following diagram shows the overall architecture we will build and where this article focuses its attention: Summary In this article, we learned about People, data, and tools which are an essential part of creating great data visualization and analysis. Resources for Article: Further resources on this subject: Meet QlikView [Article] Linking Section Access to multiple dimensions [Article] Creating sheet objects and starting new list using Qlikview 11 [Article]


Packt

29 Dec 2014

11 min read

Creating a Map


In this article by Thomas Newton and Oscar Villarreal, authors of the book Learning D3.js Mapping, we will cover the following topics through a series of experiments: Foundation – creating your basic map Experiment 1 – adjusting the bounding box Experiment 2 – creating choropleths Experiment 3 – adding click events to our visualization Foundation – creating your basic map In this section, we will walk through the basics of creating a standard map. Let's walk through the code to get a step-by-step explanation of how to create this map. The width and height can be anything you want. Depending on where your map will be visualized (cellphones, tablets, or desktops), you might want to consider providing a different width and height: var height = 600; var width = 900; The next variable defines a projection algorithm that allows you to go from a cartographic space (latitude and longitude) to a Cartesian space (x,y), basically a mapping of latitude and longitude to coordinates. You can think of a projection as a way to map the three-dimensional globe to a flat plane. There are many kinds of projections, but geo.mercator is normally the default value you will use: var projection = d3.geo.mercator(); var mexico = void 0; If you were making a map of the USA, you could use a better projection called albersUsa. This is to better position Alaska and Hawaii. With a geo.mercator projection, Alaska would be drawn at a size rivaling that of the entire US. The albersUsa projection grabs Alaska, makes it smaller, and puts it at the bottom of the visualization. The following screenshot is of geo.mercator: The following screenshot is of geo.albersUsa: The D3 library currently contains nine built-in projection algorithms. An overview of each one can be viewed at https://github.com/mbostock/d3/wiki/Geo-Projections. Next, we will assign the projection to our geo.path function. 
This is a special D3 function that will map the JSON-formatted geographic data into SVG paths. The data format that the geo.path function requires is named GeoJSON: var path = d3.geo.path().projection(projection); var svg = d3.select("#map") .append("svg") .attr("width", width) .attr("height", height); Including the dataset The necessary data has been provided for you within the data folder with the filename geo-data.json: d3.json('geo-data.json', function(data) { console.log('mexico', data); We get the data from an AJAX call. After the data has been collected, we want to draw only those parts of the data that we are interested in. In addition, we want to automatically scale the map to fit the defined height and width of our visualization. If you look at the console, you'll see that "mexico" has an objects property. Nested inside the objects property is MEX_adm1. This stands for the administrative areas of Mexico. It is important to understand the geographic data you are using, because other data sources might have different names for the administrative areas property. Notice that the MEX_adm1 property contains a geometries array with 32 elements. Each of these elements represents a state in Mexico. We will use this data to draw the D3 visualization: var states = topojson.feature(data, data.objects.MEX_adm1); Here, we pass all of the administrative areas to the topojson.feature function in order to extract and create an array of GeoJSON objects. The preceding states variable now contains the features property. This features array is a list of 32 GeoJSON elements, each representing the geographic boundaries of a state in Mexico. We will set an initial scale and translation to 1 and 0,0 respectively: // Setup the scale and translate projection.scale(1).translate([0, 0]); Next, we compute the bounding box of the data, which is quite useful. 
The bounding box is returned as a two-dimensional array of min/max coordinates that encloses the geographic data passed: var b = path.bounds(states); To quote the D3 documentation: "The bounding box is represented by a two-dimensional array: [[left, bottom], [right, top]], where left is the minimum longitude, bottom is the minimum latitude, right is maximum longitude, and top is the maximum latitude." Note that this quote describes geographic bounds in longitude and latitude; path.bounds returns the same kind of two-dimensional array, but in the projected (x, y) coordinates of the current projection. This is very helpful if you want to programmatically set the scale and translation of the map. In this case, we want the entire country to fit in our height and width, so we determine the bounding box of every state in the country of Mexico. The scale is calculated by taking the longest geographic edge of our bounding box and dividing it by the number of pixels of this edge in the visualization: var s = .95 / Math.max((b[1][0] - b[0][0]) / width, (b[1][1] - b[0][1]) / height); This can be calculated by first computing the scale of the width, then the scale of the height, and, finally, taking the larger of the two. All of the logic is compressed into the single line given earlier. The three steps are explained in the following image: The value .95 adjusts the scale, because we are giving the map a bit of a breather on the edges in order to not have the paths intersect the edges of the SVG container item, basically reducing the scale by 5 percent. Now, we have an accurate scale of our map, given our set width and height. var t = [(width - s * (b[1][0] + b[0][0])) / 2, (height - s * (b[1][1] + b[0][1])) / 2]; When we scale in SVG, it scales all the attributes (even x and y). In order to return the map to the center of the screen, we will use the translate function. The translate function receives an array with two parameters: the amount to translate in x, and the amount to translate in y. We will calculate x by adding the left and right edges of the bounding box and multiplying the sum by the scale. The result is then subtracted from the width of the SVG element, and the difference is divided by 2. 
Our y translation is calculated similarly, but using the top and bottom edges of the bounding box and the height of the SVG element. Finally, we will reset the projection to use our new scale and translation: projection.scale(s).translate(t); Here, we will create a map variable that will group all of the following SVG elements into a <g> SVG tag. This will allow us to apply styles and better contain all of the path elements that follow: var map = svg.append('g').attr('class', 'boundary'); Finally, we are back to the classic D3 enter, update, and exit pattern. We have our data, the list of Mexico states, and we will join this data to the path SVG element: mexico = map.selectAll('path').data(states.features); //Enter mexico.enter() .append('path') .attr('d', path); The enter section and the corresponding path functions are executed on every data element in the array. As a refresher, each element in the array represents a state in Mexico. The path function has been set up to correctly draw the outline of each state as well as scale and translate it to fit in our SVG container. Congratulations! You have created your first map! Experiment 1 – adjusting the bounding box Now that we have our foundation, let's start with our first experiment. For this experiment, we will manually zoom in to a state of Mexico using what we learned in the previous section. For this experiment, we will modify one line of code: var b = path.bounds(states.features[5]); Here, we are telling the calculation to create a boundary based on the sixth element of the features array instead of every state in the country of Mexico. 
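To make the scale and translation arithmetic concrete, the following Python sketch repeats the same formulas on a made-up bounding box. The function name and the sample bounds are hypothetical and are not taken from the Mexico dataset; the sketch only assumes that the final position of a point is its unit-scale coordinate multiplied by the scale plus the translation.

```python
def fit_to_viewport(bounds, width, height, margin=0.95):
    """Compute scale and translate so that `bounds` is centered in the viewport.

    `bounds` is [[x0, y0], [x1, y1]], measured with scale 1 and translate [0, 0].
    """
    (x0, y0), (x1, y1) = bounds
    # Scale: the tighter of the two axes wins, shrunk by the margin factor.
    scale = margin / max((x1 - x0) / width, (y1 - y0) / height)
    # Translate: move the scaled box so that its center lands mid-viewport.
    translate = ((width - scale * (x1 + x0)) / 2,
                 (height - scale * (y1 + y0)) / 2)
    return scale, translate


# Hypothetical unit-scale bounds for a single state.
bounds = [[-2.0, -0.6], [-1.5, -0.25]]
width, height = 900, 600

s, (tx, ty) = fit_to_viewport(bounds, width, height)

# Apply the transform to the box corners to see where they end up.
left, right = s * bounds[0][0] + tx, s * bounds[1][0] + tx
top, bottom = s * bounds[0][1] + ty, s * bounds[1][1] + ty
print("scale:", s)
print("x range:", left, right)
print("y range:", top, bottom)
```

With these sample numbers the height is the limiting dimension, so the box ends up filling 95 percent of the height while staying centered horizontally.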
The boundaries data will now run through the rest of the scaling and translation algorithms to adjust the map to the one shown in the following screenshot: We have basically reduced the min/max of the boundary box to include the geographic coordinates for one state in Mexico (see the next screenshot), and D3 has scaled and translated this information for us automatically: This can be very useful in situations where you might not have the data that you need in isolation from the surrounding areas. Hence, you can always zoom in to your geography of interest and isolate it from the rest. Experiment 2 – creating choropleths One of the most common uses of D3.js maps is to make choropleths. This visualization gives you the ability to discern between regions, giving them a different color. Normally, this color is associated with some other value, for instance, levels of influenza or a company's sales. Choropleths are very easy to make in D3.js. In this experiment, we will create a quick choropleth based on the index value of the state in the array of all the states. We will only need to modify two lines of code in the update section of our D3 code. Right after the enter section, add the following two lines: //Update var color = d3.scale.linear().domain([0,33]).range(['red', 'yellow']); mexico.attr('fill', function(d,i) {return color(i)}); The color variable uses another valuable D3 function named scale. Scales are extremely powerful when creating visualizations in D3; much more detail on scales can be found at https://github.com/mbostock/d3/wiki/Scales. For now, let's describe what this scale defines. Here, we created a new function called color. This color function looks for any number between 0 and 33 in an input domain. D3 linearly maps these input values to a color between red and yellow in the output range. D3 has included the capability to automatically map colors in a linear range to a gradient. 
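As a rough illustration of such a linear mapping from numbers to colors, here is a Python sketch. It assumes plain per-channel RGB interpolation with rounding and does not reproduce D3's implementation; the linear_color_scale helper is hypothetical.

```python
def linear_color_scale(domain, start_rgb, end_rgb):
    """Return a function mapping a number in `domain` to an interpolated RGB tuple."""
    d0, d1 = domain

    def scale(value):
        t = (value - d0) / (d1 - d0)  # position within the domain, 0..1
        return tuple(round(a + (b - a) * t) for a, b in zip(start_rgb, end_rgb))

    return scale


RED = (255, 0, 0)
YELLOW = (255, 255, 0)
color = linear_color_scale((0, 33), RED, YELLOW)

for index in (0, 15, 33):
    print(index, color(index))
```

Only the green channel changes between red and yellow, so intermediate indices produce shades of orange.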
This means that executing the new function, color, with 0 will return the color red, color(15) will return an orange color, and color(33) will return yellow. Now, in the update section, we will set the fill property of the path to the new color function. This will provide a linear scale of colors and use the index value i to determine what color should be returned. If the color was determined by a different value of the datum, for instance, d.sales, then you would have a choropleth where the colors actually represent sales. The preceding code should render something as follows: Experiment 3 – adding click events to our visualization We've seen how to make a map and set different colors to the different regions of this map. Next, we will add a little bit of interactivity. This will serve as a simple reference for binding click events to maps. First, we need a quick reference to each state in the country. To accomplish this, we will create a new function called geoID right below the mexico variable: var height = 600; var width = 900; var projection = d3.geo.mercator(); var mexico = void 0; var geoID = function(d) { return "c" + d.properties.ID_1; }; This function takes in a state data element and generates a new selectable ID based on the ID_1 property found in the data. The ID_1 property contains a unique numeric value for every state in the array. If we insert this as an id attribute into the DOM, then we would create a quick and easy way to select each state in the country. Following the geoID function, we create another function called click: var click = function(d) { mexico.attr('fill-opacity', 0.2); // Another update! d3.select('#' + geoID(d)).attr('fill-opacity', 1); }; This method makes it easy to separate what the click is doing. The click method receives the datum and changes the fill opacity value of all the states to 0.2. This is done so that when you click on one state and then on the other, the previous state does not maintain the clicked style. 
Notice that the function call is iterating through all the elements of the DOM, using the D3 update pattern. After making all the states transparent, we will set a fill-opacity of 1 for the given clicked item. This removes all the transparent styling from the selected state. Notice that we are reusing the geoID function that we created earlier to quickly find the state element in the DOM. Next, let's update the enter method to bind our new click method to every new DOM element that enter appends: //Enter mexico.enter() .append('path') .attr('d', path) .attr('id', geoID) .on("click", click); We also added an attribute called id; this inserts the results of the geoID function into the id attribute. Again, this makes it very easy to find the clicked state. The code should produce a map as follows. Check it out and make sure that you click on any of the states. You will see its color turn a little brighter than the surrounding states. Summary You learned how to build many different kinds of maps that cover different kinds of needs. Choropleths and data visualizations on maps are some of the most common geographic-based data representations that you will come across. Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]


Kristen Hardwick

30 Jun 2014

5 min read

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats


One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability—optimized file formats. Using a smarter file for your data: RCFile RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database. The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries. 
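The row-group-then-column layout described above can be pictured with a short Python sketch. It is a conceptual toy rather than the actual RCFile on-disk format: compression and block handling are left out, and the function names, sample rows, and group size are all hypothetical.

```python
def to_row_groups(rows, rows_per_group):
    """Partition rows horizontally, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # Transpose the chunk so that each column's values sit together.
        columns = [list(column) for column in zip(*chunk)]
        groups.append(columns)
    return groups


def read_column(groups, column_index):
    """Collect one column across all row groups without touching the others."""
    values = []
    for columns in groups:
        values.extend(columns[column_index])
    return values


rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
    (4, "d", 40.0),
    (5, "e", 50.0),
]

groups = to_row_groups(rows, rows_per_group=2)
print(groups[0])               # first row group, stored as three column lists
print(read_column(groups, 1))  # only the second column, across all groups
```

Every complete row lives inside exactly one group, which mirrors the guarantee that a row is never split across blocks.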
The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns. RCFiles can be used natively in both Hive and Pig with very little configuration. In Hive: CREATE TABLE … STORED AS RCFILE; ALTER TABLE … SET FILEFORMAT RCFILE; SET hive.default.fileformat=RCFile; In Pig: register …/piggybank.jar; a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…); The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorage class available through Twitter’s Elephant Bird open source library. Hortonworks’ ORCFile and Cloudera’s Parquet formats RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. Newer formats improve on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This extra metadata allows queries to take advantage of an improvement on the original RCFiles: predicate pushdown. 
That technique allows Hive to evaluate the where clause during the record gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular where clause is, the more potential there is for row groups and columns to be filtered as irrelevant to the final result. Cloudera’s Parquet format is typically used in conjunction with Impala, but just like with RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below: In Hive: CREATE TABLE … STORED AS ORC; ALTER TABLE … SET FILEFORMAT ORC; SET hive.default.fileformat=Orc; Conclusion This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information combined with the compression techniques described in the previous post (part 1) will provide some guidelines that can be used to ensure that users can make the most of the Hadoop Data Lake. About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.


Packt

29 May 2013

6 min read

Optimizing Programs


(For more resources related to this topic, see here.) Using transaction SAT to find problem areas In this recipe, we will see the steps required to analyze the execution of any report, transaction, or function module using the transaction SAT. Getting ready For this recipe, we will analyze the runtime of a standard program RIBELF00 (Display Document Flow Program). The program selection screen contains a number of fields. We will execute the program on the order number (aufnr) and see the behavior. How to do it... For carrying out runtime analysis using transaction SAT, proceed as follows: Call transaction SAT. The screen appears as shown: Enter a suitable name for the variant (in our case, YPERF_VARIANT) and click the Create button below it. This will take you to the Variant creation screen. On the Duration and Type tab, switch on Aggregation by choosing the Per Call Position radio-button. Then, click on the Statements tab. On the Statements tab, make sure Internal Tables, the Read Operations checkbox and the Change Operations checkbox, and the Open SQL checkbox under Database Access are checked. Save your variant. Come back to the main screen of SAT. Make sure that within Data Formatting on the initial screen of SAT, the checkbox for Determine Names of Internal Tables is selected. Next, enter the name of the program that is to be traced in the field provided (in our case, it is RIBELF00). Then click the button. The screen of the program appears as shown. We will enter an order number range and execute the program. Once the program output is generated, click on the Back key to come back to program selection screen. Click on the Back key once again to generate the evaluation results. How it works... We carried out the execution of the program through the transaction SAT and the evaluation results were generated. On the left are the Trace Results (in tree form) listing the statements/ events with the most runtime. 
These are like a summary report of the entire measurement of the program. They are listed in descending order of the Net time in microseconds and the percentage of the total time. For example, in our case, the OPEN CURSOR event takes 68 percent of the total runtime of the program. Selecting the Hit List tab will show the top time-consuming components of the program. In this example, access to the database tables AFRU and VBAK takes most of the time. Double-clicking any item in the Trace Results window on the left-hand side will display (in the Hit List area in the right-hand pane) details of the contained items along with the execution time of each item. From the Hit List window, double-clicking a particular item will take us to the relevant line in the program code. For example, when we double-click the Open Cursor VBAK line, it will take us to the corresponding program code. We have carried out the analysis with Aggregation switched on. Switching on Aggregation shows a single entry for multiple calls of a particular line of code. Because of this, the results are less detailed and easier to read, since the hit list and the call hierarchy in the results are much simpler. Also, by default, the names of the internal tables used are not shown in the results. For the internal table names to appear in the evaluation result, the Determine Names of Internal Tables checkbox must be selected. As a general recommendation, the runtime analysis should be carried out several times for best results, because the database measurement time can depend on a variety of factors, such as system load, network performance, and so on. Creation of secondary indexes in database tables Very often, the cause of a long-running report is a full scan of a database table specified within the code, mainly because no suitable index exists. 
In this recipe, we will see the steps required to create a new secondary index on a database table for performance improvement. Creating indexes lets you optimize standard reports as well as your own reports. In this recipe, we will create a secondary index on a test table ZST9_VBAK (that is simply a copy of VBAK). How to do it... For creating a secondary index, proceed as follows: Call transaction SE11. Enter the name of the table in the field provided, in our case, ZST9_VBAK. Then click the Display button. This will take you to the Display Table screen. Next, choose the menu path Goto | Indexes. This will display all indexes that currently exist for the table. Click the Create button and then choose the option Create Extension Index. A dialog box appears. Enter a three-character name for the index. Then, press Enter. This will take you to the extension index maintenance screen. On the top part, enter the short description in the Short Description field provided. We will create a non-unique index, so the Non-unique index radio button is selected (on the middle part of the screen). On the lower part of the screen, specify the field names to be used in the index. In our case, we use MANDT and AUFNR. Then, activate your index using the keys Ctrl + F3. The index will be created in the database, with an appropriate creation message shown below Status. How it works... This will create the index on the database. Since we created an extension index, the index will not be overwritten by SAP during an upgrade. Now any report that accesses the ZST9_VBAK table and specifies MANDT and AUFNR in the WHERE clause will take advantage of an index scan using our new secondary index. There's more... It is recommended by SAP that the index first be created in the development system and then transported to the quality system and to the production system. Secondary indexes are not automatically generated on target systems after being transported. 
We should check the status in the Activation Log on the target systems, and use the Database Utility to manually activate the index in question. A secondary index should preferably contain fields that are not shared (or are shared as little as possible) with other indexes. Too many redundant secondary indexes (that is, too many common fields across several indexes) on a table have a negative impact on performance. An example would be a table with 10 secondary indexes that share more than three fields. In addition, tables that are rarely modified (and very often read) are the ideal candidates for secondary indexes. See also http://help.sap.com/saphelp_erp2005/helpdata/EN/85/685a41cdbf8047e10000000a1550b0/content.htm http://help.sap.com/saphelp_nw04/helpdata/en/cf/21eb2d446011d189700000e8322d00/frameset.htm http://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_second.html http://forums.sdn.sap.com/thread.jspa?threadID=1469347
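The effect of a secondary index on a WHERE clause over MANDT and AUFNR can be tried out away from an SAP system. The following Python sketch swaps in SQLite as a stand-in database, so it shows the general idea rather than SAP's behavior; the table layout, the index name zst9_vbak_z01, and the generated rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the test table: client, document number, order number.
conn.execute("CREATE TABLE zst9_vbak (mandt TEXT, vbeln TEXT, aufnr TEXT)")
conn.executemany(
    "INSERT INTO zst9_vbak VALUES (?, ?, ?)",
    [("100", f"{n:010d}", f"{n % 50:012d}") for n in range(1000)],
)

query = "SELECT * FROM zst9_vbak WHERE mandt = ? AND aufnr = ?"
params = ("100", f"{7:012d}")


def plan(connection):
    """Return SQLite's query plan for the query as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn)  # no suitable index yet, so the table is scanned
conn.execute("CREATE INDEX zst9_vbak_z01 ON zst9_vbak (mandt, aufnr)")
plan_after = plan(conn)   # the new index can now serve the WHERE clause

print("before:", plan_before)
print("after: ", plan_after)
print("rows found:", len(conn.execute(query, params).fetchall()))
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of the table and the second one names the new index.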

Packt

11 Oct 2013

10 min read

Administrating Solr

Packt

13 May 2013

7 min read

Move Further with NumPy Modules

Linear algebra

Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things.

Time for action – inverting matrices

The inverse of a matrix A in linear algebra is the matrix A^-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as A * A^-1 = I.

The inv function in the numpy.linalg package can do this for us. Let's invert an example matrix. To invert matrices, perform the following steps:

We will create the example matrix with the mat function.

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A

The A matrix is printed as follows:

    A
    [[ 0  1  2]
     [ 1  0  3]
     [ 4 -3  8]]

Now, we can see the inv function in action, using it to invert the matrix.

    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse

The inverse matrix is shown as follows:

    inverse of A
    [[-4.5  7.  -1.5]
     [-2.   4.  -1. ]
     [ 1.5 -2.   0.5]]

If the matrix is singular or not square, a LinAlgError exception is raised. If you want, you can check the result manually. This is left as an exercise for the reader.

Let's check what we get when we multiply the original matrix with the result of the inv function:

    print "Check\n", A * inverse

The result is the identity matrix, as expected.

    Check
    [[ 1.  0.  0.]
     [ 0.  1.  0.]
     [ 0.  0.  1.]]

What just happened?

We calculated the inverse of a matrix with the inv function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix.

    import numpy as np

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A

    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse

    print "Check\n", A * inverse

Solving linear systems

A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations.
The numpy.linalg function, solve, solves systems of linear equations of the form Ax = b; here A is a matrix, b can be a 1D or 2D array, and x is the unknown vector. We will see the dot function in action. This function returns the dot product of two floating-point arrays.

Time for action – solving a linear system

Let's solve an example of a linear system. To solve a linear system, perform the following steps:

Let's create the matrices A and b.

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A
    b = np.array([0, 8, -9])
    print "b\n", b

The matrices A and b are shown as follows:

    A
    [[ 1 -2  1]
     [ 0  2 -8]
     [-4  5  9]]
    b
    [ 0  8 -9]

Solve this linear system by calling the solve function.

    x = np.linalg.solve(A, b)
    print "Solution", x

The following is the solution of the linear system:

    Solution [ 29.  16.   3.]

Check whether the solution is correct with the dot function.

    print "Check\n", np.dot(A , x)

The result is as expected:

    Check
    [[ 0.  8. -9.]]

What just happened?

We solved a linear system using the solve function from the NumPy linalg module and checked the solution with the dot function.

    import numpy as np

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A

    b = np.array([0, 8, -9])
    print "b\n", b

    x = np.linalg.solve(A, b)
    print "Solution", x

    print "Check\n", np.dot(A , x)

Finding eigenvalues and eigenvectors

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues. The eigvals function in the numpy.linalg package calculates eigenvalues. The eig function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors

Let's calculate the eigenvalues of a matrix. Perform the following steps to do so:

Create a matrix as follows:

    A = np.mat("3 -2;1 0")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 3 -2]
     [ 1  0]]

Calculate eigenvalues by calling the eigvals function.
    print "Eigenvalues", np.linalg.eigvals(A)

The eigenvalues of the matrix are as follows:

    Eigenvalues [ 2.  1.]

Determine eigenvalues and eigenvectors with the eig function. This function returns a tuple, where the first element contains eigenvalues and the second element contains the corresponding eigenvectors, arranged column-wise.

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

The eigenvalues and eigenvectors will be shown as follows:

    First tuple of eig [ 2.  1.]
    Second tuple of eig
    [[ 0.89442719  0.70710678]
     [ 0.4472136   0.70710678]]

Check the result with the dot function by calculating the right- and left-hand sides of the eigenvalues equation Ax = ax.

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

The output is as follows:

    Left [[ 1.78885438]
     [ 0.89442719]]
    Right [[ 1.78885438]
     [ 0.89442719]]

    Left [[ 0.70710678]
     [ 0.70710678]]
    Right [[ 0.70710678]
     [ 0.70710678]]

What just happened?

We found the eigenvalues and eigenvectors of a matrix with the eigvals and eig functions of the numpy.linalg module. We checked the result using the dot function.

    import numpy as np

    A = np.mat("3 -2;1 0")
    print "A\n", A

    print "Eigenvalues", np.linalg.eigvals(A)

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

Singular value decomposition

Singular value decomposition is a type of factorization that decomposes a matrix into a product of three matrices. The singular value decomposition is a generalization of the previously discussed eigenvalue decomposition. The svd function in the numpy.linalg package can perform this decomposition.
This function returns three matrices (U, Sigma, and V) such that U and V are orthogonal and Sigma contains the singular values of the input matrix. In the usual notation for the decomposition, M = U Sigma V*, the asterisk denotes the Hermitian conjugate or the conjugate transpose.

Time for action – decomposing a matrix

It's time to decompose a matrix with the singular value decomposition. In order to decompose a matrix, perform the following steps:

First, create a matrix as follows:

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 4 11 14]
     [ 8  7 -2]]

Decompose the matrix with the svd function.

    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V

The result is a tuple containing the two orthogonal matrices U and V on the left- and right-hand sides and the singular values of the middle matrix.

    U
    [[-0.9486833  -0.31622777]
     [-0.31622777  0.9486833 ]]
    Sigma
    [ 18.97366596   9.48683298]
    V
    [[-0.33333333 -0.66666667 -0.66666667]
     [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix; we only have the diagonal values. The other values are all 0. We can form the middle matrix with the diag function. Multiply the three matrices. The third matrix returned by svd is already the conjugate transpose, so it can be used in the product as it is. This is shown as follows:

    print "Product\n", U * np.diag(Sigma) * V

The product of the three matrices looks like the following:

    Product
    [[  4.  11.  14.]
     [  8.   7.  -2.]]

What just happened?

We decomposed a matrix and checked the result by matrix multiplication. We used the svd function from the NumPy linalg module.

    import numpy as np

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A

    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V

    print "Product\n", U * np.diag(Sigma) * V

Pseudoinverse

The Moore-Penrose pseudoinverse of a matrix can be computed with the pinv function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudoinverse is calculated using the singular value decomposition.
The inv function only accepts square matrices; the pinv function does not have this restriction.
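As a quick illustration of the difference between inv and pinv, the sketch below reuses the 2x3 matrix from the singular value decomposition example. It is written for Python 3 and plain ndarrays, which is an assumption on our part (the listings in this article use Python 2 print statements and np.mat), and the variable names pseudo and from_svd are made up for the example. It also rebuilds the pseudoinverse by hand from the SVD, which works here because both singular values are non-zero.

```python
import numpy as np

# Same 2x3 matrix as in the singular value decomposition example.
A = np.array([[4, 11, 14], [8, 7, -2]], dtype=float)

# pinv accepts a non-square matrix and returns a 3x2 result.
pseudo = np.linalg.pinv(A)
print("Pseudoinverse\n", pseudo)

# Rebuild the pseudoinverse from the SVD: A+ = V * diag(1/sigma) * U^T.
# svd returns the third factor already transposed, hence Vt.T below.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
from_svd = Vt.T @ np.diag(1.0 / sigma) @ U.T

print("Matches the SVD construction:", np.allclose(pseudo, from_svd))

# One of the defining Moore-Penrose properties: A * A+ * A equals A.
print("A * A+ * A equals A:", np.allclose(A @ pseudo @ A, A))

# inv, by contrast, refuses the same non-square input.
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as exc:
    print("inv raised LinAlgError:", exc)
```

For a square, non-singular matrix, pinv and inv give the same result up to rounding, so pinv is mainly useful for rectangular or singular input.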

Packt

25 Aug 2014

13 min read

Report Data Filtering

Packt

16 Sep 2013

12 min read

Report Authoring
