
How-To Tutorials - Data

1210 Articles

Backup and Restore Improvements

Packt
25 Apr 2014
11 min read
(For more resources related to this topic, see here.)

Database backups to a URL and Microsoft Azure Storage

The ability to back up to a URL was introduced in SQL Server 2012 Service Pack 1 cumulative update package 2. Prior to this, if you wanted to back up to a URL in SQL Server 2012, you needed to use Transact-SQL or PowerShell. SQL Server 2014 has integrated this option into Management Studio too.

The reason for allowing backups to a URL is to let you integrate your SQL Server backups with cloud-based storage and keep backups of your on-premise databases in Microsoft Azure. This makes your backups safer and protects them if your main site is lost to a disaster, because they are stored offsite. This can avoid the need for an actual disaster recovery site.

In order to create a backup to Microsoft Azure Storage, you need a storage account and a storage container. From a SQL Server perspective, you will require a URL, which specifies a Uniform Resource Identifier (URI) to a unique backup file in the Microsoft cloud. It is the URL that provides the location for the backup and the backup filename. The URL needs to point to a blob, not just a container. If the blob does not exist, it is created. However, if a backup file already exists, the backup will fail, unless the WITH FORMAT option is specified, which, as in older versions of SQL Server, allows the backup to overwrite the existing backup with the new one that you wish to create.

You will also need to create a SQL Server credential to allow SQL Server to authenticate with Microsoft Azure Storage. This credential stores the name of the storage account and the access key. The WITH CREDENTIAL statement must be used when issuing the backup or restore commands.

There are some limitations you need to consider when backing up your database to a URL and using Microsoft Azure Storage to store your database backups:

- Maximum backup size of 1 TB (terabyte).
- Backups cannot be combined with backup devices.
- You cannot append to existing backups—in SQL Server, you can have more than one backup stored in a file. When taking a backup to a URL, the ratio should be one backup to one file.
- You cannot back up to multiple blobs. In a normal SQL Server backup, you can stripe it across multiple files. You cannot do this with a backup to a URL on Microsoft Azure.

You can find more information on these limitations at http://msdn.microsoft.com/en-us/library/dn435916(v=sql.120).aspx#backuptaskssms.

For the purposes of this exercise, I have created a new container on my Microsoft Azure Storage account called sqlbackup. With the storage account container, you will now take the backup to a URL. As part of this process, you will create a credential using your Microsoft Azure publishing profile. This is slightly different from the process we just discussed, but you can download this profile from Microsoft Azure. Once you have your publishing profile, you can follow the steps explained in the following section.

Backing up a SQL Server database to a URL

You can use Management Studio's backup task to initiate the backup. In order to do this, you need to start Management Studio and connect to your local SQL Server instance.
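If you prefer to script this prerequisite rather than click through the dialog, the credential described earlier can also be created directly in Transact-SQL. The following is only a minimal sketch: the credential name matches the one used in the backup statement later in this article, but the storage account name and access key shown here are placeholders, not values from this exercise:

-- Hypothetical values: IDENTITY is your storage account name and
-- SECRET is that account's access key from the Azure portal.
CREATE CREDENTIAL AzureCredential
WITH IDENTITY = 'mystorageaccount',
     SECRET = '<storage account access key>';
GO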
You will notice that I have a database called T3, and it is this database that I will be backing up to the URL as follows:

1. Right-click on the database you want to back up and navigate to Tasks | Backup. This will start the backup task wizard for you.
2. On the General page, change the backup destination from Disk to URL. Making this change will enable all the other options needed for taking a backup to a URL.
3. Provide a filename for your backup, then create the SQL Server credential you want to use to authenticate on the Windows Azure Storage container. Click on the Create Credential button to open the Create credential dialog box.
4. There is an option to use your publishing profile, so click on the Browse button and select the publishing profile that you downloaded from the Microsoft Azure web portal. Once you have selected your publishing profile, it will prepopulate the credential name, management certificate, and subscription ID fields for you.
5. Choose the appropriate Storage Account for your backups, then click on Create to create the credential.
6. Specify the Windows Azure Storage container to use for the backup. In this case, I entered sqlbackup. When you have finished, your General page should look like what is shown in the following screenshot:
7. Following this, click on OK and the backup should run.

If you want to use Transact-SQL, instead of Management Studio, to take the backup, the code would look like this:

BACKUP DATABASE [T3]
TO URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak'
WITH CREDENTIAL = N'AzureCredential',
NOFORMAT, NOINIT, NAME = N'T3-Full Database Backup',
NOSKIP, NOREWIND, NOUNLOAD, STATS = 10
GO

This is a normal BACKUP DATABASE statement, as it has always been, but it specifies a URL and a credential to use to take the backup as well.

Restoring a backup stored on Windows Azure Storage

In this section, you will learn how to restore a database using the backup you have stored on Windows Azure Storage:

1. To carry out the restore, connect to your local instance of SQL Server in Management Studio, right-click on the Databases folder, and choose the Restore database option. This will open the database restore pages.
2. In the Source section of the General page, select the Device option, click on the dropdown and change the backup media type to URL, and click on Add.
3. In the next screen, you have to specify the Windows Azure Storage account connection information. You will need to choose the storage account to connect to and specify an access key to allow SQL Server to connect to Microsoft Azure. You can get this from the Storage section of the Microsoft Azure portal.
4. After this, you will need to specify a credential to use. In this case, I will use the credential that was created when I took the backup earlier. Click on Connect to connect to Microsoft Azure.
5. You will then need to choose the backup to restore from. In this case, I'll use the backup of the T3 database that was created in the preceding section.
6. You can then complete the restore options as you would do with a local backup. In this case, the database has been called T3_cloud, mainly for reference so that it can be easily identified. If you want to restore over the existing database, you need to use the WITH REPLACE option in the restore statement.
The restore statement would look like this:

RESTORE DATABASE T3
FROM URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak'
WITH CREDENTIAL = N'AzureCredential',
REPLACE,
STATS = 5

When the restore has completed, you will have a new copy of the database on the local SQL Server instance.

SQL Server Managed Backup to Microsoft Azure

Building on the ability to take a backup of a SQL Server database to a URL and Microsoft Azure Storage, you can now set up Managed Backups of your SQL Server databases to Microsoft Azure. This allows you to automate your database backups to Microsoft Azure Storage. All database administrators appreciate automation, as it frees their time to focus on other projects, so this feature will be useful to you. It's fully customizable, and you can build your backup strategy around the transaction workload of your database and set a retention policy.

Configuring SQL Server-managed backups to Microsoft Azure

In order to set up and configure Managed Backups in SQL Server 2014, a new stored procedure, smart_admin.sp_set_db_backup, has been introduced to configure Managed Backups on a specific database. The syntax for the stored procedure is as follows:

EXEC smart_admin.sp_set_db_backup
    [@database_name = ] 'database name'
    ,[@enable_backup = ] { 0 | 1}
    ,[@storage_url = ] 'storage url'
    ,[@retention_days = ] 'retention_period_in_days'
    ,[@credential_name = ] 'sql_credential_name'
    ,[@encryption_algorithm] 'name of the encryption algorithm'
    ,[@encryptor_type] {'CERTIFICATE' | 'ASYMMETRIC_KEY'}
    ,[@encryptor_name] 'name of the certificate or asymmetric key'

This stored procedure will be used to set up Managed Backups on the T3 database. The SQL Server Agent needs to be running for this to work. In my case, I executed the following code to enable Managed Backups on my T3 database:

USE msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name='T3'
    ,@enable_backup=1
    ,@storage_url = 'https://gresqlstorage.blob.core.windows.net/'
    ,@retention_days=5
    ,@credential_name='AzureCredential'
    ,@encryption_algorithm = NO_ENCRYPTION

To view the Managed Backup configuration, you can run the following query:

USE msdb
GO
SELECT * FROM smart_admin.fn_backup_db_config('T3')

The results should look like this:

To disable Managed Backups, you can use the same smart_admin.sp_set_db_backup procedure:

USE msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name='T3'
    ,@enable_backup=0

Encryption

For the first time in SQL Server, you can encrypt your backups using the native SQL Server backup tool. In SQL Server 2014, the backup tool supports several encryption algorithms, including AES 128, AES 192, AES 256, and Triple DES. You will need a certificate or an asymmetric key to take encrypted backups. There are obvious benefits to encrypting your SQL Server database backups, including securing the data in the database. This can also be very useful if you are using Transparent Data Encryption (TDE) to protect your database's data files. Encryption is also supported with SQL Server Managed Backup to Microsoft Azure.

Creating an encrypted backup

To create an encrypted SQL Server backup, there are a few prerequisites that you need to ensure are set up on the SQL Server.
Creating a database master key for the master database

Creating the database master key is important because it is used to protect the private keys of certificates and the asymmetric keys that are stored in the master database, which will be used to encrypt the SQL Server backup. The following Transact-SQL creates a database master key for the master database:

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@$$W0rd';
GO

In this example, a simple password has been used. In a production environment, it would be advisable to create a master key with a more secure password.

Creating a certificate or asymmetric key

The backup encryption process needs a certificate or asymmetric key to be able to take the backup. The following code creates a certificate that can be used to back up your databases using encryption:

USE master
GO
CREATE CERTIFICATE T3DBBackupCertificate
    WITH SUBJECT = 'T3 Backup Encryption Certificate';
GO

Now you can take an encrypted backup of the database.

Creating an encrypted database backup

You can now take an encrypted backup of your databases. The following Transact-SQL statements back up the T3 database using the certificate you created in the preceding section:

BACKUP DATABASE T3
TO DISK = N'C:\Backup\t3_enc.bak'
WITH COMPRESSION,
    ENCRYPTION
    (
        ALGORITHM = AES_256,
        SERVER CERTIFICATE = T3DBBackupCertificate
    ),
    STATS = 10
GO

This is a local backup; it's located in the C:\Backup folder, and the encryption algorithm used is AES_256.

Summary

This article has shown some of the new backup features of SQL Server 2014. The ability to back up to Microsoft Azure Storage means that you can implement a robust backup and restore strategy at a relatively low cost.

Resources for Article:

Further resources on this subject:
SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article]
Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article]
Manage SQL Azure Databases with the Web Interface 'Houston' [Article]


Oracle Business Intelligence: Getting Business Information from Data

Packt
13 Oct 2010
11 min read
Most businesses today use Business Intelligence (BI), the process of obtaining business information from available data, to control their affairs. If you're new to Business Intelligence, then this definition may leave you with the following questions:

- What is data?
- What is the information obtained from it?
- What is the difference between data and the information obtained from it?

You may be confused even more if you learn that data represents groups of information related to an object or a set of objects. Depending on your needs, though, such groups of information may or may not be immediately useful, and often require additional processing such as filtering, formatting, and/or calculating to take on a meaning.

For example, information about your customers may be organized in a way that is stored in several database tables related to each other. For security purposes, some pieces of information stored in this way may be encoded, or just represented in binary, and therefore not immediately readable. It's fairly obvious that some processing must be applied before you can make use of such information.

So, data can be thought of as the lowest level of abstraction from which meaningful information is derived. But what is information anyway? Well, a piece of information normally represents an answer to a certain question. For example, you want to know how many new customers have registered on your site this year. An answer to this question can be obtained with a certain query issued against the table containing customer registration dates, giving you the information you asked for.

Data, information, and Business Intelligence

Although the terms data and information refer to similar things, they aren't really interchangeable as there is some difference in their meaning and spirit. Talking about data, as a rule, involves its structure, format, storage, as well as ways in which you can access and manipulate it. In contrast, when talking about information, you mean food for your decision-making process. So, data can be viewed as low-level information structures, where the internal representation matters. Therefore, the ways in which you can extract useful information from data entirely depend on the structure and storage of that data. The following diagram gives a conceptual view of delivering information from different data sets:

As you can see from the figure, information can be derived from different data sources, and by different means. Once it's derived, though, it doesn't matter where it has come from, letting its consumers concentrate on the business aspects rather than on the specifics of the internal structure. For example, you might derive some pieces of data from the Web, using the Oracle Database's XQuery feature, and then process it as native database data.

To produce meaningful information from your data, you will most likely need to perform several processing steps, load new data, and summarize the data. This is why the Business Intelligence layer usually sits on top of many data sources, consolidating information from various business systems and heterogeneous platforms. The following figure gives a graphical depiction of a Business Intelligence system. In particular, it shows you that the Business Intelligence layer consumes information derived from various sources and heterogeneous platforms. It is intuitively clear that the ability to solve problems is greatly enhanced if you can effectively handle all the information you're getting.
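To make the earlier registration example concrete: once the relevant data sits in a single relational table, a question such as "how many new customers have registered this year?" reduces to one query. The following sketch is purely illustrative; the customers table and its registration_date column are hypothetical names rather than objects from any schema used later in this article:

-- Count the customers whose registration date falls in the current year
SELECT COUNT(*) AS new_customers
FROM customers
WHERE EXTRACT(YEAR FROM registration_date) = EXTRACT(YEAR FROM SYSDATE);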
On the other hand, extracting information from data coming in from different sources may become a nightmare if you try to do it on your own, with only the help of miscellaneous tools. Business Intelligence comes to the rescue here, ensuring that the extraction, transformation, and consolidation of data from disparate sources becomes totally transparent to you. For example, when using a Business Intelligence application for reporting, you may never figure out exactly what happens behind the scenes when you instruct the system to prepare another report. The information you need for such a report may be collected from many different sources, hiding the complexities associated with handling heterogeneous data. But, without Business Intelligence, that would be a whole different story, of course. Imagine for a moment that you have to issue several queries against different systems, using different tools, and you then have to consolidate the results somehow—all just to answer a single business question such as: what are the top three customers for the preceding quarter?

As you have no doubt realized, the software at the Business Intelligence layer is used to provide a business-centric view of data, eliminating as much of the technology-specific logic as possible. What this means in practice is that information consumers working at the Business Intelligence layer may not even know that, say, customer records are stored in a Lightweight Directory Access Protocol (LDAP) database, but purchase orders are kept in a relational database.

The kind of business questions you may need to answer

As you just learned, Business Intelligence is here to consolidate information from disparate sources so that you need not concern yourself with it. Okay, but why might you need to gather and process heterogeneous data? The answer is clear. You might need it in order to answer analytical questions that allow you to understand and run your business better. In the following two sections, you'll look at some common questions that Business Intelligence can help you answer. Then, you'll see how you can ask those questions with the help of Business Intelligence tools.

Answering basic business questions

The set of questions you may need your Business Intelligence system to answer will vary depending on your business and, of course, your corresponding functions. However, to give you a taste of what Business Intelligence can do for you, let's first look at some questions that are commonly brought up by business users:

- What is the average salary throughout the entire organization?
- Which customers produce the most revenue?
- What is the amount of revenue each salesman brought in over the preceding quarter?
- What is the profitability of each product?

If you run your business online, you may also be interested in hit counting and traffic analysis questions, such as the following:

- How much traffic does a certain account generate over a month?
- What pages in your site are most visited?
- What are the profits made online?

Looking at the business analysis requests presented here, a set of questions related to your own business may flash into your mind.

Answering probing analytical questions

In the preceding section, you looked at some common questions a business analyst is usually interested in asking. But bowing to the reality, you may have to answer more probing questions in your decision-making process, in order to determine changes in the business and find ways to improve it.
Here are some probing analytical questions you might need to find answers to:

- How do sales for this quarter compare to sales for the preceding quarter?
- What factors impact our sales?
- Which products are sold better together?
- What are the ten top-selling products in this region?
- What are the factors influencing the likelihood of purchase?

As you can see, each of these questions reflects a certain business problem. Looking through the previous list, though, you might notice that some of the questions shown here can be hard to formulate with the tools available in a computer application environment. There's nothing to be done here; computers like specific questions. Unlike humans, machines can give you exactly what you ask for, not what you actually mean. So, even an advanced Business Intelligence application will require you to be as specific as possible when it comes to putting a question to it.

It's fairly clear that the question about finding the factors impacting sales needs to be rephrased to become understandable for a Business Intelligence application. How you would rephrase it depends on the specifics of your business, of course. Often, it's good practice to break a problem apart into simpler questions. For example, the first question on the above list—the one about comparing quarterly sales—might be logically divided into the following two questions:

- What are the sales figures for this quarter?
- What are the sales figures for the last quarter?

Once you get these questions answered, you can compare the results, thus answering the original, more generically phrased question. This also provides one definition, or variation, of drill down. In the above example, it's fairly obvious what specific questions can be derived from the generic question. There may be probing questions, though, whose derived questions are not so obvious. For example, consider the following question: What motivates a customer to buy? This could perhaps be broken down into the following questions:

- Where did visitors come from?
- Which pages did they visit before reaching the product page?

Of course, the above list does not seem to be complete—some other questions might be added.

Asking business questions using data-access tools

As you might guess, although all these questions sound simple when formulated in plain English, they are more difficult to describe when using data-access tools. If you're somewhat familiar with SQL, you might notice that most of the analytical questions discussed here cannot be easily expressed with the help of SQL statements, even if the underlying data is relational. For example, the problem of finding the top three salespersons for a year may require you to write a multi-line SQL request including several subqueries. Here is what such a query might look like:

SELECT emp.ename salesperson, top_emp_orders.sales sales
FROM
 (SELECT all_orders.sales_empno empno, all_orders.total_sales sales
  FROM
   (SELECT sales_empno, SUM(ord_total) total_sales,
           RANK() OVER (ORDER BY SUM(ord_total) DESC) sal_rank
    FROM orders
    WHERE EXTRACT(YEAR FROM ord_dt) = 2009
    GROUP BY sales_empno
   ) all_orders
  WHERE all_orders.sal_rank <= 3
 ) top_emp_orders, employees emp
WHERE top_emp_orders.empno = emp.empno
ORDER BY sales DESC;

This might produce something like this:

If you're not an SQL guru, of course, writing the above query and then debugging it could easily take a couple of hours. Determining profitability by customer, for example, might take you another couple of hours to write a proper SQL query.
In other words, business questions are often somewhat tricky (if possible at all) to implement with SQL. All this does not mean that SQL is not used in the area of Business Intelligence. Quite the contrary, SQL is still indispensable here. In fact, SQL has a lot to offer when it comes to data analysis. As you just saw, though, composing complex queries assumes solid SQL skills. Thankfully, most Business Intelligence tools use SQL behind the scenes, totally transparently to users.

Now let's look at a simple example illustrating how you can get an analytical question answered with a Business Intelligence tool—Oracle BI Discoverer Plus in this particular example. Suppose you simply want to calculate the average salary across the organization. This example uses the records from the hr.employees demonstration table. Creating a worksheet representing the records of a database table in Discoverer Plus focuses on issues related to analyzing data and creating reports with the tools available through the Oracle Business Intelligence suite. For now, look at the following screenshot to see what such a worksheet might look like:

As you can see in the previous screenshot, a Discoverer Plus worksheet is similar to one in MS Excel. As in Excel, there are toolbars and menus offering a lot of options for manipulating and analyzing data presented on the worksheet. In addition, Discoverer Plus offers Item Navigator, which enables you to add data to (or remove it from) the worksheet. The data structure you can see in Item Navigator is retrieved from the database.

Returning to our example, answering the question "what is the average salary across the organization?" is, as in Excel, as simple as selecting the Salary SUM column on the worksheet, choosing an appropriate menu, and setting some parameters in the dialog shown next. After you click the OK button in this dialog box, the calculated average will be added to the worksheet in the position specified. So, the Total dialog shown in the following screenshot provides an efficient means of automating the process of creating a total on a specified data column:

As you can see, this approach doesn't require you to write an SQL query on your own. Instead, Discoverer Plus will do it for you implicitly, thus allowing you to concentrate on business issues rather than data-access issues. This example should have given you a taste of what Business Intelligence can do for you.
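For comparison, the hand-written equivalent of that worksheet total is a one-liner against the hr.employees demonstration table mentioned earlier. This sketch is shown only for illustration and is not the SQL that Discoverer Plus actually generates behind the scenes:

-- Average salary across the entire organization
SELECT AVG(salary) AS average_salary
FROM hr.employees;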


Interacting with your Visualization

Packt
22 Oct 2013
9 min read
(For more resources related to this topic, see here.)

The ultimate goal of visualization design is to optimize applications so that they help us perform cognitive work more efficiently.

Ware C. (2012)

The goal of data visualization is to help the audience gain information from a large quantity of raw data quickly and efficiently through metaphor, mental model alignment, and cognitive magnification. So far in this article we have introduced various techniques to leverage the D3 library to implement many types of visualization. However, we haven't touched a crucial aspect of visualization: human interaction. Various researchers have concluded that human interaction has unique value in information visualization.

Visualization combined with computational steering allows faster analyses of more sophisticated scenarios...This case study adequately demonstrate that the interaction of a complex model with steering and interactive visualization can extend the applicability of the modelling beyond research

Barrass I. & Leng J (2011)

In this article we will focus on D3's support for human-visualization interaction, or, as mentioned earlier, learn how to add computational steering capability to your visualization.

Interacting with mouse events

The mouse is the most common and popular human-computer interaction control found on most desktop and laptop computers. Even today, with multi-touch devices rising to dominance, touch events are typically still emulated as mouse events, therefore making applications designed to interact via the mouse usable through touch. In this recipe we will learn how to handle standard mouse events in D3.

Getting ready

Open your local copy of the following file in your web browser:

https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/mouse.html

How to do it...

In the following code example we will explore techniques of registering and handling mouse events in D3. Although in this particular example we are only handling click and mousemove, the techniques utilized here can be applied easily to all other standard mouse events supported by modern browsers:

<script type="text/javascript">
    var r = 400;

    var svg = d3.select("body")
        .append("svg");

    var positionLabel = svg.append("text")
        .attr("x", 10)
        .attr("y", 30);

    svg.on("mousemove", function () { //<-A
        printPosition();
    });

    function printPosition() { //<-B
        var position = d3.mouse(svg.node()); //<-C
        positionLabel.text(position);
    }

    svg.on("click", function () { //<-D
        for (var i = 1; i < 5; ++i) {
            var position = d3.mouse(svg.node());

            var circle = svg.append("circle")
                .attr("cx", position[0])
                .attr("cy", position[1])
                .attr("r", 0)
                .style("stroke-width", 5 / (i))
                .transition()
                    .delay(Math.pow(i, 2.5) * 50)
                    .duration(2000)
                    .ease('quad-in')
                .attr("r", r)
                .style("stroke-opacity", 0)
                .each("end", function () {
                    d3.select(this).remove();
                });
        }
    });
</script>

This recipe generates the following interactive visualization:

Mouse Interaction

How it works...

In D3, to register an event listener, we need to invoke the on function on a particular selection. The given event listener will be attached to all selected elements for the specified event (line A).
The following code in this recipe attaches a mousemove event listener which displays the current mouse position (line B):

svg.on("mousemove", function () { //<-A
    printPosition();
});

function printPosition() { //<-B
    var position = d3.mouse(svg.node()); //<-C
    positionLabel.text(position);
}

On line C we used the d3.mouse function to obtain the current mouse position relative to the given container element. This function returns a two-element array [x, y]. After this we also registered an event listener for the mouse click event on line D using the same on function:

svg.on("click", function () { //<-D
    for (var i = 1; i < 5; ++i) {
        var position = d3.mouse(svg.node());

        var circle = svg.append("circle")
            .attr("cx", position[0])
            .attr("cy", position[1])
            .attr("r", 0)
            .style("stroke-width", 5 / (i)) // <-E
            .transition()
                .delay(Math.pow(i, 2.5) * 50) // <-F
                .duration(2000)
                .ease('quad-in')
            .attr("r", r)
            .style("stroke-opacity", 0)
            .each("end", function () {
                d3.select(this).remove(); // <-G
            });
    }
});

Once again, we retrieved the current mouse position using the d3.mouse function and then generated five concentric expanding circles to simulate a ripple effect. The ripple effect was simulated using a geometrically increasing delay (line F) with a decreasing stroke-width (line E). Finally, when the transition effect is over, the circles are removed using a transition end listener (line G).

There's more...

Although we have only demonstrated listening for the click and mousemove events in this recipe, you can listen for any event that your browser supports through the on function. The following is a list of mouse events that are useful to know when building your interactive visualization:

- click: Dispatched when the user clicks a mouse button
- dblclick: Dispatched when a mouse button is clicked twice
- mousedown: Dispatched when a mouse button is pressed
- mouseenter: Dispatched when the mouse is moved onto the boundaries of an element or one of its descendent elements
- mouseleave: Dispatched when the mouse is moved off the boundaries of an element and all of its descendent elements
- mousemove: Dispatched when the mouse is moved over an element
- mouseout: Dispatched when the mouse is moved off the boundaries of an element
- mouseover: Dispatched when the mouse is moved onto the boundaries of an element
- mouseup: Dispatched when a mouse button is released over an element

Interacting with a multi-touch device

Today, with the proliferation of multi-touch devices, any visualization targeting mass consumption needs to worry about its interactability not only through the traditional pointing device, but through multi-touch and gestures as well. In this recipe we will explore the touch support offered by D3 to see how it can be leveraged to generate some pretty interesting interactions with multi-touch capable devices.

Getting ready

Open your local copy of the following file in your web browser:

https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/touch.html

How to do it...

In this recipe we will generate a progress circle around the user's touch and, once the progress is complete, a subsequent ripple effect will be triggered around the circle.
However, if the user prematurely ends his/her touch, then we shall stop the progress circle without generating the ripples:

<script type="text/javascript">
    var initR = 100,
        r = 400,
        thickness = 20;

    var svg = d3.select("body")
        .append("svg");

    d3.select("body")
        .on("touchstart", touch)
        .on("touchend", touch);

    function touch() {
        d3.event.preventDefault();

        var arc = d3.svg.arc()
            .outerRadius(initR)
            .innerRadius(initR - thickness);

        var g = svg.selectAll("g.touch")
            .data(d3.touches(svg.node()), function (d) {
                return d.identifier;
            });

        g.enter()
            .append("g")
            .attr("class", "touch")
            .attr("transform", function (d) {
                return "translate(" + d[0] + "," + d[1] + ")";
            })
            .append("path")
            .attr("class", "arc")
            .transition().duration(2000)
            .attrTween("d", function (d) {
                var interpolate = d3.interpolate(
                    {startAngle: 0, endAngle: 0},
                    {startAngle: 0, endAngle: 2 * Math.PI}
                );
                return function (t) {
                    return arc(interpolate(t));
                };
            })
            .each("end", function (d) {
                if (complete(g)) ripples(d);
                g.remove();
            });

        g.exit().remove().each(function () {
            this.__stopped__ = true;
        });
    }

    function complete(g) {
        return g.node().__stopped__ != true;
    }

    function ripples(position) {
        for (var i = 1; i < 5; ++i) {
            var circle = svg.append("circle")
                .attr("cx", position[0])
                .attr("cy", position[1])
                .attr("r", initR - (thickness / 2))
                .style("stroke-width", thickness / (i))
                .transition().delay(Math.pow(i, 2.5) * 50)
                .duration(2000).ease('quad-in')
                .attr("r", r)
                .style("stroke-opacity", 0)
                .each("end", function () {
                    d3.select(this).remove();
                });
        }
    }
</script>

This recipe generates the following interactive visualization on a touch-enabled device:

Touch Interaction

How it works...

Event listeners for touch events are registered through a D3 selection's on function, similar to what we did with mouse events in the previous recipe:

d3.select("body")
    .on("touchstart", touch)
    .on("touchend", touch);

One crucial difference here is that we have registered our touch event listener on the body element instead of the svg element, since many operating systems and browsers define default touch behaviors and we would like to override them with our custom implementation. This is done through the following function call:

d3.event.preventDefault();

Once the touch event is triggered, we retrieve multiple touch point data using the d3.touches function, as illustrated by the following code snippet:

var g = svg.selectAll("g.touch")
    .data(d3.touches(svg.node()), function (d) {
        return d.identifier;
    });

Instead of returning a two-element array as the d3.mouse function does, d3.touches returns an array of two-element arrays, since there could be multiple touch points for each touch event. Each touch position array has a data structure that looks like the following:

Touch Position Array

Other than the [x, y] position of the touch point, each position array also carries an identifier to help you differentiate each touch point. We used this identifier in this recipe to establish object constancy.
Once the touch data is bound to the selection, a progress circle is generated for each touch around the user's finger:

g.enter()
    .append("g")
    .attr("class", "touch")
    .attr("transform", function (d) {
        return "translate(" + d[0] + "," + d[1] + ")";
    })
    .append("path")
    .attr("class", "arc")
    .transition().duration(2000).ease('linear')
    .attrTween("d", function (d) { // <-A
        var interpolate = d3.interpolate(
            {startAngle: 0, endAngle: 0},
            {startAngle: 0, endAngle: 2 * Math.PI}
        );
        return function (t) {
            return arc(interpolate(t));
        };
    })
    .each("end", function (d) { // <-B
        if (complete(g)) ripples(d);
        g.remove();
    });

This is done through a standard arc transition with attribute tweening (line A). Once the transition is over, if the progress circle has not yet been canceled by the user, then a ripple effect similar to what we created in the previous recipe is generated on line B. Since we have registered the same touch event listener function on both the touchstart and touchend events, we can use the following lines to remove the progress circle and also set a flag indicating that this progress circle has been stopped prematurely:

g.exit().remove().each(function () {
    this.__stopped__ = true;
});

We need to set this stateful flag because there is no way to cancel a transition once it has started; hence, even after removing the progress circle element from the DOM tree, the transition will still complete and trigger line B.

There's more...

We have demonstrated touch interaction through the touchstart and touchend events; however, you can use the same pattern to handle any other touch events supported by your browser. The following list contains the touch event types proposed by the W3C:

- touchstart: Dispatched when the user places a touch point on the touch surface
- touchend: Dispatched when the user removes a touch point from the touch surface
- touchmove: Dispatched when the user moves a touch point along the touch surface
- touchcancel: Dispatched when a touch point has been disrupted in an implementation-specific manner


Predicting Hospital Readmission Expense Using Cascading

Packt
04 Jun 2015
10 min read
In this article by Michael Covert, author of the book Learning Cascading, we will look at a system that allows health care providers to create complex predictive models that can assess who is most at risk of readmission, using Cascading.

(For more resources related to this topic, see here.)

Overview

Hospital readmission is an event that health care providers are attempting to reduce, and it is the primary target of new regulations of the Affordable Care Act, passed by the US government. A readmission is defined as any reentry to a hospital 30 days or less from a prior discharge. The financial impact of this is that US Medicare and Medicaid will either not pay or will reduce the payment made to hospitals for expenses incurred. By the end of 2014, over 2,600 hospitals will incur these losses from a Medicare and Medicaid tab that is thought to exceed $24 billion annually.

Hospitals are seeking ways to predict when a patient is susceptible to readmission so that actions can be taken to fully treat the patient before discharge. Many of them are using big data and machine learning-based predictive analytics. One such predictive engine is MedPredict from Analytics Inside, a company based in Westerville, Ohio. MedPredict is the predictive modeling component of the MedMiner suite of health care products. These products use Concurrent's Cascading products to perform nightly rescoring of inpatients using a highly customizable calculation known as LACE, which stands for the following:

- Length of stay: This refers to the number of days a patient has been in hospital.
- Acute admission through the emergency department: This refers to whether a patient has arrived through the ER.
- Comorbidities: A comorbidity refers to the presence of two or more individual conditions in a patient. Each condition is designated by a diagnosis code; diagnosis codes refer to the International Classification of Disease codes, for which the version 9 (ICD-9) and now version 10 (ICD-10) standards are available. Diagnosis codes can also indicate complications and severity of a condition. In LACE, certain conditions are associated with the probability of readmission through statistical analysis. For instance, a diagnosis of AIDS, COPD, diabetes, and so on will each increase the probability of readmission. So, each diagnosis code is assigned points, with other points indicating the "seriousness" of the condition.
- Emergency visits: This refers to the number of emergency room visits the patient has made in a particular window of time.

The LACE engine looks at a patient's history and computes a score that is a predictor of readmission. In order to compute the comorbidity score, the Charlson Comorbidity Index (CCI) calculation is used. It is a statistical calculation that factors in the age and complexity of the patient's condition.

Using Cascading to control predictive modeling

The full data workflow to compute the probability of readmission is as follows:

1. Read all hospital records and reformat them into patient records, diagnosis records, and discharge records.
2. Read all data related to patient diagnosis and diagnosis records, that is, ICD-9/10, date of diagnosis, complications, and so on.
3. Read all tracked diagnosis records and join them with patient data to produce a diagnosis (comorbidity) score by summing up comorbidity "points".
4. Read all data related to patient admissions, that is, records associated with admission and discharge, length of stay, hospital, admittance location, stay type, and so on.
5. Read patient profile records, that is, age, race, gender, ethnicity, eye color, body mass indicator, and so on.
6. Compute all intermediate scores for age, emergency visits, and comorbidities.
7. Calculate the LACE score (refer to Figure 2) and assign a date and time to it.
8. Take all the patient information, as mentioned in the preceding points, and run it through MedPredict to produce a variety of metrics: expected length of stay, expected expense, expected outcome, and probability of readmission.

Figure 1 – The data workflow

The Cascading LACE engine

The calculational aspects of computing LACE scores make it ideal for Cascading as a series of reusable subassemblies. Firstly, the extraction, transformation, and loading (ETL) of patient data is complex and costly. Secondly, the calculations are data-intensive. The CCI alone has to examine a patient's medical history and must find all matching diagnosis codes (such as ICD-9 or ICD-10) to assign a score. This score must be augmented by the patient's age, and lastly, a patient's inpatient discharge records must be examined for admittance to the ER as well as emergency room visits.

Also, many hospitals desire to customize these calculations. The LACE engine supports and facilitates this since scores are adjustable at the diagnosis code level, and MedPredict automatically produces metrics about how significant an individual feature is to the resulting score.

Medical data is quite complex too. For instance, the particular diagnosis codes that represent cancer are many, and their meanings are quite nuanced. In some cases, metastasis (spreading of cancer to other locations in the body) may have occurred, and this is treated as a more severe situation. In other situations, measured values may be "bucketed"; for example, we track the number of emergency room visits over 1 year, 6 months, 90 days, and 30 days.

The Cascading LACE engine performs these calculations easily. It is customized through a set of hospital-supplied parameters, and it has the capability to perform full calculations nightly due to its usage of Hadoop. Using this capability, a patient's record can track the full history of the LACE index over time. Additionally, different sets of LACE indices can be computed simultaneously, maybe one used for diabetes, the other for Chronic Obstructive Pulmonary Disorder (COPD), and so on.

Figure 2 – The LACE subassembly

MedPredict tracking

The LACE engine metrics feed into MedPredict along with many other variables cited previously. These records are rescored nightly and the patient history is updated. This patient history is then used to analyze trends and generate alerts when the patient is showing an increased likelihood of variance from the desired metric values.

What Cascading does for us

We chose Cascading to help reduce the complexity of our development efforts. MapReduce provided us with the scalability that we desired, but we found that we were developing massive amounts of code to do so. Reusability was difficult, and the Java code library was becoming large. By shifting to Cascading, we found that we could encapsulate our code better and achieve significantly greater reusability. Additionally, we reduced complexity as well. The Cascading API provides simplification and understandability, which accelerates our development velocity metrics and also reduces bugs and maintenance cycles.

We allow Cascading to control the end-to-end workflow of these nightly calculations. It handles preprocessing and formatting of data.
Then, it handles running these calculations in parallel, allowing high-speed hash joins to be performed, and also for each leg of the calculation to be split into a parallel pipe. Next, all these calculations are merged and the final score is produced. The last step is to analyze the patient trends and generate alerts where potential problems are likely to occur.

Cascading has allowed us to produce a reusable assembly that is highly parameterized, thereby allowing hospitals to customize their usage. Not only can thresholds, scores, and bucket sizes be varied, but, if desired, additional information could be included for things such as medical procedures performed on the patient. The local mode of Cascading allows for easy testing, and it also provides a scaled-down version that can be run against a small number of patients. However, by using Cascading in the Hadoop mode, massive scalability can be achieved against very large patient populations and ICD-9/10 code sets.

Concurrent also provides an excellent framework for predictive modeling using machine learning through its Pattern component. MedPredict uses this to integrate its predictive engine, which is written using Cascading, MapReduce, and Mahout. Pattern provides an interface for the integration of other external analysis products through the exchange of Predictive Model Markup Language (PMML), an XML dialect that allows many of the MedPredict proprietary machine learning algorithms to be directly incorporated into the full Cascading LACE workflow. MedPredict then produces a variety of predictive metrics in a single pass of the data. The LACE scores (current and historical trends) are used as features for these predictions.

Additionally, Concurrent provides a product called Driven that greatly reduces the development cycle time for such large, complex applications. Their Lingual product provides seamless integration with relational databases, which is also key to enterprise integration.

Results

Numerous studies have now been performed using LACE risk estimates. Many hospitals have shown the ability to reduce readmission rates by 5-10 percent due to early intervention and specific guidance given to a patient as a result of an elevated LACE score. Other studies are examining the efficacy of additional metrics, and of segmentation of the patients into better identifying groups, such as heart failure, cancer, diabetes, and so on. Additional effort is being put into studying the ability to modify the values of the comorbidity scores, taking into account combinations and complications. In some cases, even more dramatic improvements have taken place using these techniques. For up-to-date information, search for LACE readmissions, which will provide current information about implementations and results.

Analytics Inside LLC

Analytics Inside is based in Westerville, Ohio. It was founded in 2005 and specializes in advanced analytical solutions and services. Analytics Inside produces the RelMiner family of relationship mining systems. These systems are based on machine learning, big data, graph theories, data visualizations, and Natural Language Processing (NLP). For further information, visit our website at http://www.AnalyticsInside.us, or e-mail us at info@AnalyticsInside.us.
MedMiner Advanced Analytics for Health Care is an integrated software system designed to help an organization or patient care team in the following ways:

- Predicting the outcomes of patient cases and tracking these predictions over time
- Generating alerts based on patient case trends that will help direct remediation
- Complying better with ARRA value-based purchasing and meaningful use guidelines
- Providing management dashboards that can be used to set guidelines and track performance
- Tracking performance of drug usage, interactions, potential for drug diversion, and pharmaceutical fraud
- Extracting medical information contained within text documents
- Treating data security as a key design point: PHI can be hidden through external linkages, so data exchange is not required; if PHI is required, it is kept safe through heavy encryption, virus scanning, and data isolation
- Using both cloud-based and on-premise capabilities to meet client needs

Concurrent Inc.

Concurrent Inc. is the leader in big data application infrastructure, delivering products that help enterprises create, deploy, run, and manage data applications at scale. The company's flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications, with more than 175,000 user downloads a month. Used by thousands of businesses, including eBay, Etsy, The Climate Corporation, and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and can be found online at http://concurrentinc.com.

Summary

Hospital readmission is an event that health care providers are attempting to reduce, and it is a primary target of new regulation from the Affordable Care Act, passed by the US government. This article described a system that allows health care providers to create complex predictive models that can assess who is most at risk of readmission, using Cascading.

Resources for Article:

Further resources on this subject:
Hadoop Monitoring and its aspects [article]
Introduction to Hadoop [article]
YARN and Hadoop [article]


Taming Big Data using HDInsight

Packt
22 Jan 2015
10 min read
(For more resources related to this topic, see here.)

Era of Big Data

In this article by Rajesh Nadipalli, the author of HDInsight Essentials Second Edition, we will take a look at the concept of Big Data and how to tame it using HDInsight.

We live in a digital era and are always connected with friends and family using social media and smartphones. In 2014, every second, about 5,700 tweets were sent and 800 links were shared using Facebook, and the digital universe was about 1.7 MB per minute for every person on earth (source: IDC 2014 report). This amount of data sharing and storing is unprecedented and is contributing to what is known as Big Data. The following infographic shows you the details of our current use of the top social media sites (source: https://leveragenewagemedia.com/).

Another contributor to Big Data is the smart, connected devices such as smartphones, appliances, cars, sensors, and pretty much everything that we use today that is connected to the Internet. These devices, which will soon be in the trillions, continuously collect data and communicate with each other about their environment to make intelligent decisions and help us live better. This digitization of the world has added to the exponential growth of Big Data. According to the 2014 IDC digital universe report, the growth trend will continue and double in size every two years. In 2013, about 4.4 zettabytes were created, and in 2020, the forecast is 44 zettabytes, which is 44 trillion gigabytes (source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm).

Business value of Big Data

While we generated 4.4 zettabytes of data in 2013, only 5 percent of it was actually analyzed, and this is the real opportunity of Big Data. The IDC report forecasts that by 2020, we will analyze over 35 percent of the generated data by making smarter sensors and devices. This data will drive new consumer and business behavior that will drive trillions of dollars in opportunity for IT vendors and organizations analyzing this data. Let's take a look at some real use cases that have benefited from Big Data.

IT systems in all major banks are constantly monitoring fraudulent activities and alerting customers within milliseconds. These systems apply complex business rules and analyze the historical data, geography, type of vendor, and other parameters based on the customer to get accurate results.

Commercial drones are transforming agriculture by analyzing real-time aerial images and identifying the problem areas. These drones are cheaper and more efficient than satellite imagery, as they fly under the clouds and can be used anytime. They identify irrigation issues related to water, pests, or fungal infections, thereby increasing crop productivity and quality. These drones are equipped with technology to capture high-quality images every second and transfer them to a cloud-hosted Big Data system for further processing (reference: http://www.technologyreview.com/featuredstory/526491/agricultural-drones/).

Developers of the blockbuster Halo 4 game were tasked to analyze player preferences and support an online tournament in the cloud. The game attracted over 4 million players in its first five days after launch. The development team also had to design a solution that kept track of a leaderboard for the global Halo 4 Infinity Challenge, which was open to all players. The development team chose the Azure HDInsight service to analyze the massive amounts of unstructured data in a distributed manner.
The results from HDInsight were reported using Microsoft SQL Server PowerPivot and SharePoint, and the business was extremely happy with the response times for their queries, which were a few hours or less (source: http://www.microsoft.com/casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102).

Hadoop Concepts

Apache Hadoop is the leading open source Big Data platform that can store and analyze massive amounts of structured and unstructured data efficiently and can be hosted on low-cost commodity hardware. There are other technologies that complement Hadoop under the Big Data umbrella, such as MongoDB (a NoSQL database), Cassandra (a document database), and VoltDB (an in-memory database). This section describes Apache Hadoop core concepts and its ecosystem.

A brief history of Hadoop

Doug Cutting created Hadoop and named it after his kid's stuffed yellow elephant; the name has no real meaning. In 2004, the initial version of Hadoop was launched as the Nutch Distributed Filesystem. In February 2006, the Apache Hadoop project was officially started as a standalone development for MapReduce and HDFS. By 2008, Yahoo had adopted Hadoop as the engine of its web search with a cluster size of around 10,000. In the same year, Hadoop graduated as a top-level Apache project, confirming its success. In 2012, Hadoop 2.x was launched with YARN, enabling Hadoop to take on various types of workloads. Today, Hadoop is known by just about every IT architect and business executive as an open source Big Data platform and is used across all industries and sizes of organizations.

Core components

In this section, we will explore what Hadoop is actually comprised of. At the basic level, Hadoop consists of four layers:

- Hadoop Common: A set of common libraries and utilities used by Hadoop modules.
- Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed filesystem for data in any form. HDFS can be installed on commodity hardware and replicates data three times (which is configurable) to make the filesystem robust and tolerant of partial hardware failures.
- Yet Another Resource Negotiator (YARN): From Hadoop 2.0, YARN is the cluster management layer that handles various workloads on the cluster.
- MapReduce: MapReduce is a framework that allows parallel processing of data in Hadoop. MapReduce breaks a job into smaller tasks and distributes the load to servers that have the relevant data. The design model is "move code and not data", making this framework efficient as it reduces the network and disk I/O required to move the data.

The following diagram shows you the high-level Hadoop 2.0 core components:

The preceding diagram shows you the components that form the basic Hadoop framework. In the past few years, a vast array of new components have emerged in the Hadoop ecosystem that take advantage of YARN, making Hadoop faster, better, and suitable for various types of workloads. The following diagram shows you the Hadoop framework with these new components:

Hadoop cluster layout

Each Hadoop cluster has two types of machines, which are as follows:

- Master nodes: These include the HDFS Name Node, the HDFS Secondary Name Node, and the YARN Resource Manager.
- Worker nodes: These include the HDFS Data Nodes and the YARN Node Managers. The data nodes and node managers are colocated for optimal data locality and performance.

A network switch interconnects the master and worker nodes.
Coming back to the cluster layout, it is recommended that you have separate servers for each of the master nodes; however, it is possible to deploy all the master nodes onto a single server for development or testing workloads. The following diagram shows you the typical cluster layout: Let's review the key functions of the master and worker nodes:
Name node: This is the master of the distributed filesystem and maintains the filesystem metadata. This metadata holds the listing of all the files and the location of each block of a file stored across the various slaves. Without a name node, HDFS is not accessible. From Hadoop 2.0 onwards, name node HA (High Availability) can be configured with active and standby servers.
Secondary name node: This is an assistant to the name node. It communicates only with the name node to take snapshots of the HDFS metadata at intervals that are configured at the cluster level.
YARN resource manager: This server is a scheduler that allocates the available resources in the cluster among the competing applications.
Worker nodes: The Hadoop cluster will have several worker nodes that handle two types of functions—HDFS Data Node and YARN Node Manager. It is typical that each worker node handles both functions for optimal data locality. This means processing happens on the data that is local to the node and follows the principle "move code and not data".

HDInsight Overview
HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service (PaaS). It is a 100 percent Apache Hadoop-based service in the cloud. HDInsight was developed through a partnership between Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service. The following are the key differentiators for the HDInsight distribution:
Enterprise-ready Hadoop: HDInsight is backed by Microsoft support and runs on standard Windows servers. IT teams can leverage Hadoop with the Platform as a Service (PaaS), reducing the operations overhead.
Analytics using Excel: With Excel integration, your business users can visualize and analyze Hadoop data in compelling new ways with an easy-to-use, familiar tool. The Excel add-ons PowerBI, PowerPivot, Power Query, and Power Map integrate with HDInsight.
Develop in your favorite language: HDInsight has powerful programming extensions for languages, including .NET, C#, Java, and more.
Scale using the cloud offering: The Azure HDInsight service enables customers to scale quickly as per the project needs and provides a seamless interface between HDFS and Azure Blob storage.
Connect an on-premises Hadoop cluster with the cloud: With HDInsight, you can move Hadoop data from an on-site data center to the Azure cloud for backup, dev/test, and cloud bursting scenarios.
Includes NoSQL transactional capabilities: HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of Hadoop and allows large online transaction processing (OLTP).
HDInsight Emulator: The HDInsight Emulator tool provides a local development environment for Azure HDInsight without the need for a cloud subscription. It can be installed using the Microsoft Web Platform Installer.

Summary
We live in a connected digital era and are witnessing unprecedented growth of data. Organizations that are able to analyze Big Data are demonstrating significant return on investment by detecting fraud, improving operations, and reducing the time to analyze with a scale-out architecture. Apache Hadoop is the leading open source Big Data platform with strong and diverse ecosystem projects that enable organizations to build a modern data architecture. At the core, Hadoop has two key components: the Hadoop Distributed File System, also known as HDFS, and a cluster resource manager known as YARN. YARN has enabled Hadoop to be a true multi-use data platform that can handle batch processing, real-time streaming, interactive SQL, and other workloads. Microsoft HDInsight is an enterprise-ready distribution of Hadoop on the cloud that has been developed through a partnership between Hortonworks and Microsoft. The key benefits of HDInsight include scaling up/down as required, analysis using Excel, connecting an on-premises Hadoop cluster with the cloud, and flexible programming and support for NoSQL transactional databases.

Resources for Article:
Further resources on this subject: Hadoop and HDInsight in a Heartbeat [article] Sizing and Configuring your Hadoop Cluster [article] Introducing Kafka [article]
Creating a pivot table

Packt
29 Aug 2013
8 min read
(For more resources related to this topic, see here.) A pivot table is the core business intelligence tool that helps to turn raw data from various sources into meaningful results. By using different ways of presenting data, we are able to identify relations between seemingly separate data and reach conclusions that help us identify our strengths and areas for improvement.

Getting ready
Prepare the two files entitled DatabaseData_v2.xlsx and GDPData_v2.xlsx. We will be using these results along with other data sources to create a meaningful PowerPivot table that will be used for intelligent business analysis.

How to do it...
For each of the two files, we will build upon the file and add a pivot table to it, gaining exposure to pivot tables using data we are already familiar with. The following are the steps to create a pivot table with the DatabaseData_v2.xlsx file, which results in the creation of a DatabaseData_v3.xlsx file:
Open the PowerPivot window of the DatabaseData_v2.xlsx file with its 13 tables.
Click on the PivotTable button near the middle of the top row and save as New Worksheet.
Select the checkboxes as shown in the following screenshot:
Select CountryRegion | Name and move it under Row Labels
Select Address | City and move it under Row Labels
Select Address | AddressLine1 as Count of AddressLine1 and move it under Values
Now, this shows the number of clients per city and per country. However, it is very difficult to navigate, as each country name has to be collapsed in order to see the next country. Let us move the CountryRegion | Name column to Slicers Vertical. Now, the PowerPivot Field List dashboard should appear as shown in the following screenshot: Now, the pivot table should display simple results: the number of clients in a region, filterable by country using slicers. Let us apply some formatting to allow for a better understanding of the data. Right-click on Name under the Slicers Vertical area of the PowerPivot Field List dashboard. Select Field Settings, then change the name to Country Name. We now see that the title of the slicer has changed from Name to Country Name, allowing anyone who views this data to understand better what the data represents. Similarly, right-click on Count of AddressLine1 under Values, select Edit Measure, and then change its name to Number of Clients. Also change the data title City under the Row Labels area to City Name. The result should appear as shown in the following screenshot: Let's see our results change as we click on different country names. We can filter for multiple countries by holding the Ctrl key while clicking, and can remove all filters by clicking the small button on the top-right of the slicer. This is definitely easier to navigate through and to understand compared to what we did at first without using slicers, which is how it would appear in Excel 2010 without PowerPivot. However, this table is still too big. Clicking on Canada gives too many cities whose names many of us have not heard of before. Let us break the data down further by including states/provinces. Select StateProvince | Name, move it under Slicers Horizontal, and change its title to State Name. It is a good thing that we are renaming these as we go along. Otherwise, there would have been two datasets called Name, and anyone would be confused as we moved along. Now, we should see the state names filter on the top, the country name filter on the left, and a list of cities with the number of clients in the middle part. This, however, is kind of awkward.
Let us rearrange the filters by having the largest filter (country) at the top and the sub-filter (state) on the left-hand side. This can be done simply by dragging the Country Name dataset to Slicers Horizontal and State Name to Slicers Vertical. After moving the slicers around a bit, the result should appear as shown in the following screenshot: Again, play with the results and try to understand the features: try filtering by a country—and by a state/province—now there are limited numbers of cities shown for each country and each state/province, making it easier to see the list of cities. However, for countries such as the United States, there are just too many states. Let us change the formatting of the vertical filter to display three states per line, so it is easier to find the state we are looking for. This can be done by right-clicking on the vertical filter, selecting Size and Properties | Position and Layout, and then changing the Number of Columns value. Repeat the same step for Country Name to display six columns and then change the sizes of the filters to look more selectable. Change the name of the sheet to PivotTable and then save the file as DatabaseData_v3.xlsx.

The following are the steps to create a pivot table with the GDPData_v2.xlsx file, which results in the creation of a GDPData_v3.xlsx file:
Open the PowerPivot window of the GDPData_v2.xlsx file with its two tables.
Click on the PivotTable button near the middle of the top row and save as New Worksheet.
Move the datasets for the years 2000 through 2010 to the Values field, move Country Name into the Row Labels field, and move Country Name again into the Slicers Horizontal field.
In the slicer, select five countries: Canada, China, Korea, Japan, and United States, as shown in the following screenshot:
Select all fields and reduce the number of decimal places. We can now clearly see that GDP in China has tripled over the decade, and that only China and Korea saw an increase in GDP from 2008 to 2009 while the GDP of other nations dropped due to the 2008 financial crisis. Knowing the relevant background information of world finance events, we can make intelligent analyses, such as deciding which markets to invest in if we are worried about another financial crisis taking place. As the data gets larger in size, looking at the raw GDP numbers becomes increasingly difficult. In such cases, we can switch the type of data displayed by using the Show Values As button available in the PivotTable Tools | Options menu. Play around with it and see how it works: % of Column Total shows each GDP as a percentage of the year, % Difference From allows the user to set one value as the standard and compare the rest to it, and the Rank Largest to Smallest option simply shows the ranking based on which country earns the most GDP. Change the name of the sheet to PivotTable and then save the file as GDPData_v3.xlsx.

How it works...
We looked at two different files and focused on two different fields. The first file was more qualitative and showed the relationship between regions and the number of clients, using various features of pivot tables such as slicers. We also looked at how to format various aspects of a pivot table for easier processing and for a better understanding of the represented data. Slicers embedded in the pivot table are a unique and very powerful feature of PowerPivot that allow us to sort through data simply by clicking the different criteria.
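Although the recipe itself is built entirely inside the PowerPivot window, the grouping logic behind the first pivot table is easy to prototype outside Excel as well. The short pandas sketch below reproduces the same aggregation (clients counted per country and city) on a small, made-up DataFrame; the column names and sample rows are hypothetical and do not come from the workbook.

```python
import pandas as pd

# Hypothetical stand-in for the CountryRegion/Address data used in the recipe.
clients = pd.DataFrame({
    "CountryName":  ["Canada", "Canada", "Canada", "Germany"],
    "CityName":     ["Toronto", "Toronto", "Ottawa", "Berlin"],
    "AddressLine1": ["1 Main St", "2 King St", "3 Bank St", "4 Hauptstr."],
})

# Roughly equivalent to putting CityName under Row Labels and
# counting AddressLine1 under Values.
pivot = clients.pivot_table(index=["CountryName", "CityName"],
                            values="AddressLine1",
                            aggfunc="count")
pivot = pivot.rename(columns={"AddressLine1": "Number of Clients"})
print(pivot)
```

Filtering by country or state in this setting is just a row selection on the DataFrame, which is conceptually what a slicer does for us in the PowerPivot report.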
The increasing numbers of slicers help to customize the data further, enabling the user to create all sorts of data imaginable. There are no differences in horizontal and vertical slicers aside from the fact that they are at different locations. From the second file, we focused more on the quantitative data and different ways of representing the data. By using slicers to limit the number of countries, we were able to focus more on the data presented, and manage to represent the GDP in various formats such as percentages and ranks, and were able to compare the difference between the numbers by selecting one as a standard. A similar method of representing data in a different format could be applied to the first file to show the percentage of clients per nation, and so on. There's more... We covered the very basic setup of creating a pivot table. We can also analyze creating relationships between data and creating custom fields, so that better results are created. So don't worry about why the pivot table looks so small! For those who do not like working with a pivot table, there is also a feature that will convert all cells into Excel formula. Under the PowerPivot Tools | OLAP Tools option, the Convert to Formula button does exactly that. However, be warned that it cannot be undone as the changes are permanent. Summary In this article, we learned how to use the raw data to make some pivot tables that can help us make smart business decisions! Resources for Article: Further resources on this subject: SAP HANA integration with Microsoft Excel [Article] Managing Core Microsoft SQL Server 2008 R2 Technologies [Article] Eloquent relationships [Article]
In the Cloud

Packt
22 Jan 2015
14 min read
This article by Rafał Kuć, the author of the book Solr Cookbook - Third Edition, covers the cloud side of Solr—SolrCloud: setting up collections, replica configuration, distributed indexing and searching, as well as aliasing and shard manipulation. We will also learn how to create a cluster. (For more resources related to this topic, see here.)

Creating a new SolrCloud cluster
Imagine a situation where one day you have to set up a distributed cluster with the use of Solr. The amount of data is just too much for a single server to handle. Of course, you can just set up a second server or go for another master server with another set of data. But before Solr 4.0, you would have to take care of the data distribution yourself. In addition to this, you would also have to take care of setting up replication, data duplication, and so on. With SolrCloud you don't have to do this—you can just set up a new cluster, and this article will show you how to do that.

Getting ready
A separate recipe shows you how to set up a ZooKeeper cluster in order to be ready for production use.

How to do it...
Let's assume that we want to create a cluster that will have four Solr servers. We would also like to have our data divided between the four Solr servers in such a way that we have the original data on two machines, and in addition to this, we would also have a copy of each shard available in case something happens with one of the Solr instances. I also assume that we already have our ZooKeeper cluster set up, ready, and available at the address 192.168.1.10 on port 9983. For this article, we will set up four SolrCloud nodes on the same physical machine:
We will start by running an empty Solr server (without any configuration) on port 8983. We do this by running the following command (for Solr 4.x):
java -DzkHost=192.168.1.10:9983 -jar start.jar
For Solr 5, we will run the following command:
bin/solr -c -z 192.168.1.10:9983
Now we start another three nodes, each on a different port (note that different Solr instances can run on the same port, but they should be installed on different machines). We do this by running one command for each installed Solr server (for Solr 4.x):
java -Djetty.port=6983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=4983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=2983 -DzkHost=192.168.1.10:9983 -jar start.jar
For Solr 5, the commands will be as follows:
bin/solr -c -p 6983 -z 192.168.1.10:9983
bin/solr -c -p 4983 -z 192.168.1.10:9983
bin/solr -c -p 2983 -z 192.168.1.10:9983
Now we need to upload our collection configuration to ZooKeeper. Assuming that we have our configuration in /home/conf/solrconfiguration/conf, we will run the following command from the home directory of the Solr server that runs first (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):
./zkcli.sh -cmd upconfig -zkhost 192.168.1.10:9983 -confdir /home/conf/solrconfiguration/conf/ -confname collection1
Now we can create our collection using the following command:
curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=collection1'
If we now go to http://localhost:8983/solr/#/~cloud, we will see the following cluster view: As we can see, Solr has created a new collection with a proper deployment. Let's now see how it works.

How it works...
We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collection, because we didn't create them. For Solr 4.x, we started by running Solr and telling it that we want it to run in SolrCloud mode. We did that by specifying the -DzkHost property and setting its value to the IP address of our ZooKeeper instance. Of course, in the production environment, you would point Solr to a cluster of ZooKeeper nodes—this is done using the same property, but the IP addresses are separated using the comma character. For Solr 5, we used the solr script provided in the bin directory. By adding the -c switch, we told Solr that we want it to run in the SolrCloud mode. The -z switch works exactly the same as the -DzkHost property for Solr 4.x—it allows you to specify the ZooKeeper host that should be used. Of course, the other three Solr nodes run exactly in the same manner. For Solr 4.x, we add the -DzkHost property that points Solr to our ZooKeeper. Because we are running all the four nodes on the same physical machine, we needed to specify the -Djetty.port property, because we can run only a single Solr server on a single port. For Solr 5, we use the -z property of the bin/solr script and we use the -p property to specify the port on which Solr should start. The next step is to upload the collection configuration to ZooKeeper. We do this because Solr will fetch this configuration from ZooKeeper when you will request the collection creation. To upload the configuration, we use the zkcli.sh script provided with the Solr distribution. We use the upconfig command (the -cmd switch), which means that we want to upload the configuration. We specify the ZooKeeper host using the -zkHost property. After that, we can say which directory our configuration is stored (the -confdir switch). The directory should contain all the needed configuration files such as schema.xml, solrconfig.xml, and so on. Finally, we specify the name under which we want to store our configuration using the -confname switch. After we have our configuration in ZooKeeper, we can create the collection. We do this by running a command to the Collections API that is available at the /admin/collections endpoint. First, we tell Solr that we want to create the collection (action=CREATE) and that we want our collection to be named firstCollection (name=firstCollection). Remember that the collection names are case sensitive, so firstCollection and firstcollection are two different collections. We specify that we want our collection to be built of two primary shards (numShards=2) and we want each shard to be present in two copies (replicationFactor=2). This means that we will have a primary shard and a single replica. Finally, we specify which configuration should be used to create the collection by specifying the collection.configName property. As we can see in the cloud, a view of our cluster has been created and spread across all the nodes. There's more... There are a few things that I would like to mention—the possibility of running a Zookeeper server embedded into Apache Solr and specifying the Solr server name. Starting an embedded ZooKeeper server You can also start an embedded Zookeeper server shipped with Solr for your test environment. In order to do this, you should pass the -DzkRun parameter instead of -DzkHost=192.168.0.10:9983, but only in the command that sends our configuration to the Zookeeper cluster. 
So the final command for Solr 4.x should look similar to this:
java -DzkRun -jar start.jar
In Solr 5.0, the same command will be as follows:
bin/solr start -c
By default, ZooKeeper will start on a port 1,000 higher than the one Solr is started on. So if you are running your Solr instance on 8983, ZooKeeper will be available at 9983. The thing to remember is that the embedded ZooKeeper should only be used for development purposes and only one node should start it.

Specifying the Solr server name
Solr needs each instance of SolrCloud to have a name. By default, that name is set using the IP address or the hostname, appended with the port the Solr instance is running on, and the _solr postfix. For example, if our node is running on 192.168.56.1 and port 8983, it will be called 192.168.56.1:8983_solr. Of course, Solr allows you to change that behavior by specifying the hostname. To do this, start Solr using the -Dhost property or add the host property to solr.xml. For example, if we would like one of our nodes to have the name server1, we can run the following command to start Solr:
java -DzkHost=192.168.1.10:9983 -Dhost=server1 -jar start.jar
In Solr 5.0, the same command would be:
bin/solr start -c -h server1

Setting up multiple collections on a single cluster
Having a single collection inside the cluster is nice, but there are multiple use cases when we want to have more than a single collection running on the same cluster. For example, we might want users and books in different collections, or logs from each day to be stored only inside a single collection. This article will show you how to create multiple collections on the same cluster.

Getting ready
This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181, and that we already have four SolrCloud nodes running as a cluster.

How to do it...
As we already have all the prerequisites, such as ZooKeeper and Solr up and running, we need to upload our configuration files to ZooKeeper to be able to create collections:
Assuming that we have our configurations in /home/conf/firstcollection/conf and /home/conf/secondcollection/conf, we will run the following commands from the home directory of the first run Solr server to upload the configuration to ZooKeeper (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):
./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/firstcollection/conf/ -confname firstcollection
./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/secondcollection/conf/ -confname secondcollection
We have pushed our configurations into ZooKeeper, so now we can create the collections we want. In order to do this, we use the following commands:
curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=firstcollection'
curl 'localhost:8983/solr/admin/collections?action=CREATE&name=secondcollection&numShards=4&replicationFactor=1&collection.configName=secondcollection'
Now, just to test whether everything went well, we will go to http://localhost:8983/solr/#/~cloud. As a result, we will see the following cluster topology: As we can see, both the collections were created the way we wanted. Now let's see how that happened.

How it works...
We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we didn't create them.
We also assumed that we have our SolrCloud cluster configured and started. We start by uploading two configurations to ZooKeeper, one called firstcollection and the other called secondcollection. After that we are ready to create our collections. We start by creating the collection named firstCollection that is built of two primary shards and one replica. The second collection, called secondcollection is built of four primary shards and it doesn't have any replicas. We can see that easily in the cloud view of the deployment. The firstCollection collection has two shards—shard1 and shard2. Each of the shard has two physical copies—one green (which means active) and one with a black dot, which is the primary shard. The secondcollection collection is built of four physical shards—each shard has a black dot near its name, which means that they are primary shards. Splitting shards Imagine a situation where you reach a limit of your current deployment—the number of shards is just not enough. For example, the indexing throughput is lower and lower, because the disks are not able to keep up. Of course, one of the possible solutions is to spread the index across more shards; however, you already have a collection and you want to keep the data and reindexing is not an option, because you don't have the original data. Solr can help you with such situations by allowing splitting shards of already created collections. This article will show you how to do it. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181 and that we already have four SolrCloud nodes running as a cluster. How to do it... Let's assume that we already have a SolrCloud cluster up and running and it has one collection called books. So our cloud view (which is available at http://localhost:8983/solr/#/~cloud) looks as follows: We have four nodes and we don't utilize them fully. We can say that these two nodes in which we have our shards are almost fully utilized. What we can do is create a new collection and reindex the data or we can split shards of the already created collection. Let's go with the second option: We start by splitting the first shard. It is as easy as running the following command: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1' After this, we can split the second shard by running a similar command to the one we just used: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard2' Let's take a look at the cluster cloud view now (which is available at http://localhost:8983/solr/#/~cloud): As we can see, both shards were split—shard1 was divided into shard1_0 and shard1_1 and shard2 was divided into shard2_0 and shard2_1. Of course, the data was copied as well, so everything is ready. However, the last step should be to delete the original shards. Solr doesn't delete them, because sometimes applications use shard names to connect to a given shard. However, in our case, we can delete them by running the following commands: curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard1' curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard2' Now if we would again look at the cloud view of the cluster, we will see the following: How it works... 
We start with a simple collection called books that is built of two primary shards and no replicas. This is the collection whose shards we will try to split without stopping Solr. Splitting shards is very easy. We just need to run a simple command in the Collections API (the /admin/collections endpoint) and specify that we want to split a shard (action=SPLITSHARD). We also need to provide additional information, such as which collection we are interested in (the collection parameter) and which shard we want to split (the shard parameter). You can see the name of the shard by looking at the cloud view or by reading the cluster state from ZooKeeper. After sending the command, Solr might force us to wait for a substantial amount of time—shard splitting takes time, especially on large collections. Of course, we can run the same command for the second shard as well. Finally, we end up with six shards—four new and two old ones. The original shard will still contain data, but it will start to re-route requests to the newly created shards. The data was split evenly between the new shards. The old shards were left in place, although they are marked as inactive and they won't have any more data indexed to them. Because we don't need them, we can just delete them using the action=DELETESHARD command sent to the same Collections API. Similar to the split shard command, we need to specify the collection name (the collection parameter) and the name of the shard we want to delete (the shard parameter). After we delete the initial shards, we now see that our cluster view shows only four shards, which is what we were aiming at. We can now spread the shards across the cluster.

Summary
In this article, we learned how to set up multiple collections on a single cluster and how to increase the number of collections in a cluster. We also learned how to split the shards of an existing collection.

Resources for Article:
Further resources on this subject: Tuning Solr JVM and Container [Article] Apache Solr PHP Integration [Article] Administrating Solr [Article]
Calculus

Packt
14 Aug 2013
8 min read
(For more resources related to this topic, see here.)

Derivatives
To compute the derivative of a function, create the corresponding expression and use diff(). Its first argument is the expression and the second is the variable with regard to which you want to differentiate. The result is the expression for the derivative:
>>> diff(exp(x**2), x)
2*x*exp(x**2)
>>> diff(x**2 * y**2, y)
2*x**2*y
Higher-order derivatives can also be computed with a single call to diff():
>>> diff(x**3, x, x)
6*x
>>> diff(x**3, x, 2)
6*x
>>> diff(x**2 * y**2, x, 2, y, 2)
4
Due to SymPy's focus on expressions rather than functions, the derivatives for symbolic functions can seem a little surprising, but LaTeX rendering in the notebook should make their meaning clear.
>>> f = Function('f')
>>> diff(f(x**2), x)
2*x*Subs(Derivative(f(_xi_1), _xi_1), (_xi_1,), (x**2,))
Let's take a look at the following screenshot:

Limits
Limits are obtained through limit(). The syntax for the limit of expr when x goes to some value x0 is limit(expr, x, x0). To specify a limit towards infinity, you need to use SymPy's infinity object, named oo. This object will also be returned for infinite limits:
>>> limit(exp(-x), x, oo)
0
>>> limit(1/x**2, x, 0)
oo
There is also a fourth optional parameter, to specify the direction of approach of the limit target. "+" (the default) gives the limit from above, and "-" is from below. Obviously, this parameter is ignored when the limit target is infinite:
>>> limit(1/x, x, 0, "-")
-oo
>>> limit(1/x, x, 0, "+")
oo
Let's take a look at the following screenshot:

Integrals
SymPy has powerful algorithms for integration and, in particular, can find most integrals of logarithmic and exponential functions expressible with special functions, and many more besides, thanks to Meijer G-functions. The main function for integration is integrate(). It can compute both antiderivatives (indefinite integrals) and definite integrals. Note that the value of an antiderivative is only defined up to an arbitrary constant, but the result does not include it.
>>> integrate(sin(x), x)
-cos(x)
>>> integrate(sin(x), (x, 0, pi))
2
Unevaluated symbolic integrals and antiderivatives are represented by the Integral class. integrate() may return these objects if it cannot compute the integral. It is also possible to create Integral objects directly, using the same syntax as integrate(). To evaluate them, call their .doit() method:
>>> integral = Integral(sin(x), (x, 0, pi))
>>> integral
Integral(sin(x), (x, 0, pi))
>>> integral.doit()
2
Let's take a look at the following screenshot:

Taylor series
A Taylor series approximation is an approximation of a function obtained by truncating its Taylor series. To compute it, use series(expr, x, x0, n), where x is the relevant variable, x0 is the point where the expansion is done (defaults to 0), and n is the order of expansion (defaults to 6):
>>> series(cos(x), x)
1 - x**2/2 + x**4/24 + O(x**6)
>>> series(cos(x), x, n=10)
1 - x**2/2 + x**4/24 - x**6/720 + x**8/40320 + O(x**10)
The O(x**6) part in the result is a "big-O" object. Intuitively, it represents all the terms of order equal to or higher than 6.
This object automatically absorbs or combines with powers of the variable, which makes simple arithmetic operations on expansions convenient:
>>> O(x**2) + 2*x**3
O(x**2)
>>> O(x**2) * 2*x**3
O(x**5)
>>> expand(series(sin(x), x, n=6) * series(cos(x), x, n=4))
x - 2*x**3/3 + O(x**5)
>>> series(sin(x)*cos(x), x, n=5)
x - 2*x**3/3 + O(x**5)
If you want to use the expansion as an approximation of the function, the O() term prevents it from behaving like an ordinary expression, so you need to remove it. You can do so by using the aptly named .removeO() method:
>>> series(cos(x), x).removeO()
x**4/24 - x**2/2 + 1
Taylor series look better in the notebook, as shown in the following screenshot:

Solving equations
This section will teach you how to solve the different types of equations that SymPy handles. The main function to use for solving equations is solve(). Its interface is somewhat complicated as it accepts many different kinds of inputs and can output results in various forms depending on the input. In the simplest case, univariate equations, use the syntax solve(expr, x) to solve the equation expr = 0 for the variable x. If you want to solve an equation of the form A = B, simply put it under the preceding form, using solve(A - B, x). This can solve algebraic and transcendental equations involving rational fractions, square roots, absolute values, exponentials, logarithms, trigonometric functions, and so on. The result is then a list of the values of the variables satisfying the equation. The following commands show a few examples of equations that can be solved:
>>> solve(x**2 - 1, x)
[-1, 1]
>>> solve(x*exp(x) - 1, x)
[LambertW(1)]
>>> solve(abs(x**2-4) - 3, x)
[-1, 1, -sqrt(7), sqrt(7)]
Note that the form of the result means that it can only return a finite set of solutions. In cases where the true solution is infinite, it can therefore be misleading. When the solution is an interval, solve() typically returns an empty list. For periodic functions, usually only one solution is returned:
>>> solve(0, x) # all x are solutions
[]
>>> solve(x - abs(x), x) # all positive x are solutions
[]
>>> solve(sin(x), x) # all k*pi with k integer are solutions
[0]
The domain over which the equation is solved depends on the assumptions on the variable. Hence, if the variable is a real Symbol object, only real solutions are returned, but if it is complex, then all solutions in the complex plane are returned (subject to the aforementioned restriction on returning infinite solution sets). This difference is readily apparent when solving polynomials, as the following example demonstrates:
>>> solve(x**2 + 1, x)
[]
>>> solve(z**2 + 1, z)
[-I, I]
There is no restriction on the number of variables appearing in the expression. Solving a multivariate expression for any of its variables allows it to be expressed as a function of the other variables, and to eliminate it from other expressions. The following example shows different ways of solving the same multivariate expression:
>>> solve(x**2 - exp(a), x)
[-exp(a/2), exp(a/2)]
>>> solve(x**2 - exp(a), a)
[log(x**2)]
>>> solve(x**2 - exp(a), x, a)
[{x: -exp(a/2)}, {x: exp(a/2)}]
>>> solve(x**2 - exp(a), x, b)
[{x: -exp(a/2)}, {x: exp(a/2)}]
To solve a system of equations, pass a list of expressions to solve(): each one will be interpreted, as in the univariate case, as an equation of the form expr = 0.
The result can be returned in one of two forms, depending on the mathematical structure of the input: either as a list of tuples, where each tuple contains the values for the variables in the order given to solve, or as a single dictionary, suitable for use in subs(), mapping variables to their values. As you can see in the following example, it can be hard to predict what form the result will take:
>>> solve([exp(x**2) - y, y - 3], x, y)
[(-sqrt(log(3)), 3), (sqrt(log(3)), 3)]
>>> solve([x**2 - y, y - 3], x, y)
[(-sqrt(3), 3), (sqrt(3), 3)]
>>> solve([x - y, y - 3], x, y)
{y: 3, x: 3}
This variability in return types is fine for interactive use, but for library code, more predictability is required. In this case, you should use the dict=True option. The output will then always be a list of mappings of variables to values. Compare the following example to the previous one:
>>> solve([x**2 - y, y - 3], x, y, dict=True)
[{y: 3, x: -sqrt(3)}, {y: 3, x: sqrt(3)}]
>>> solve([x - y, y - 3], x, y, dict=True)
[{y: 3, x: 3}]

Summary
In this article, we successfully performed various calculus operations, including derivatives, limits, integrals, Taylor series, and equation solving, using SymPy.

Resources for Article:
Further resources on this subject: Move Further with NumPy Modules [Article] Advanced Indexing and Array Concepts [Article] Running a simple game using Pygame [Article]
Advanced Data Operations

Packt
30 Oct 2013
11 min read
(For more resources related to this topic, see here.) Recipe 1 – handling multi-valued cells It is a common problem in many tables: what do you do if multiple values apply to a single cell? For instance, consider a Clients table with the usual name, address, and telephone fields. A typist is adding new contacts to this table, when he/she suddenly discovers that Mr. Thompson has provided two addresses with a different telephone number for each of them. There are essentially three possible reactions to this: Adding only one address to the table: This is the easiest thing to do, as it eliminates half of the typing work. Unfortunately, this implies that half of the information is lost as well, so the completeness of the table is in danger. Adding two rows to the table: While the table is now complete, we now have redundant data. Redundancy is also dangerous, because it leads to error: the two rows might accidentally be treated as two different Mr. Thompsons, which can quickly become problematic if Mr. Thompson is billed twice for his subscription. Furthermore, as the rows have no connection, information updated in one of them will not automatically propagate to the other. Adding all information to one row: In this case, two addresses and two telephone numbers are added to the respective fields. We say the field is overloaded with regard to its originally envisioned definition. At first sight, this is both complete yet not redundant, but a subtle problem arises. While humans can perfectly make sense of this information, automated processes cannot. Imagine an envelope labeler, which will now print two addresses on a single envelope, or an automated dialer, which will treat the combined digits of both numbers as a single telephone number. The field has indeed lost its precise semantics. Note that there are various technical solutions to deal with the problem of multiple values, such as table relations. However, if you are not in control of the data model you are working with, you'll have to choose any of the preceding solutions. Luckily, OpenRefine is able to offer the best of both worlds. Since it is also an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. In the Powerhouse Museum dataset, the Categories field is multi-valued, as each object in the collection can belong to different categories. Before we can perform meaningful operations on this field, we have to tell OpenRefine to somehow treat it a little different. Suppose we want to give the Categories field a closer look to check how many different categories are there and which categories are the most prominent. First, let's see what happens if we try to create a text facet on this field by clicking on the dropdown next to Categories and navigating to Facet| Text Facet as shown in the following screenshot. This doesn't work as expected because there are too many combinations of individual categories. OpenRefine simply gives up, saying that there are 14,805 choices in total, which is above the limit for display. While you can increase the maximum value by clicking on Set choice count limit, we strongly advise against this. First of all, it would make OpenRefine painfully slow as it would offer us a list of 14,805 possibilities, which is too large for an overview anyway. Second, it wouldn't help us at all because OpenRefine would only list the combined field values (such as Hen eggs | Sectional models | Animal Samples and Products). 
This does not allow us to inspect the individual categories, which is what we're interested in. To solve this, leave the facet open, but go to the Categories dropdown again and select Edit Cells| Split multi-valued cells…as shown in the following screenshot: OpenRefine now asks What separator currently separates the values?. As we can see in the first few records, the values are separated by a vertical bar or pipe character, as the horizontal line tokens are called. Therefore, enter a vertical bar |in the dialog. If you are not able to find the corresponding key on your keyboard, try selecting the character from one of the Categories cells and copying it so you can paste it in the dialog. Then, click on OK. After a few seconds, you will see that OpenRefine has split the cell values, and the Categories facet on the left now displays the individual categories. By default, it shows them in alphabetical order, but we will get more valuable insights if we sort them by the number of occurrences. This is done by changing the Sort by option from name to count, revealing the most popular categories. One thing we can do now, which we couldn't do when the field was still multi-valued is changing the name of a single category across all records. For instance, to change the name of Clothing and Dress, hover over its name in the created Categories facet and click on the edit link, as you can see in the following screenshot: Enter a new name such as Clothing and click on Apply. OpenRefine changes all occurrences of Clothing and Dress into Clothing, and the facet is updated to reflect this modification. Once you are done editing the separate values, it is time to merge them back together. Go to the Categories dropdown, navigate to Edit cells| Join multi-valued cells…, and enter the separator of your choice. This does not need to be the same separator as before, and multiple characters are also allowed. For instance, you could opt to separate the fields with a comma followed by a space. Recipe 3 – clustering similar cells Thanks to OpenRefine, you don't have to worry about inconsistencies that slipped in during the creation process of your data. If you have been investigating the various categories after splitting the multi-valued cells, you might have noticed that the same category labels do not always have the same spelling. For instance, there is Agricultural Equipment and Agricultural equipment(capitalization differences), Costumes and Costume(pluralization differences), and various other issues. The good news is that these can be resolved automatically; well, almost. But, OpenRefine definitely makes it a lot easier. The process of finding the same items with slightly different spelling is called clustering. After you have split multi-valued cells, you can click on the Categories dropdown and navigate to Edit cells| Cluster and edit…. OpenRefine presents you with a dialog box where you can choose between different clustering methods, each of which can use various similarity functions. When the dialog opens, key collision and fingerprint have been chosen as default settings. After some time (this can take a while, depending on the project size), OpenRefine will execute the clustering algorithm on the Categories field. It lists the found clusters in rows along with the spelling variations in each cluster and the proposed value for the whole cluster, as shown in the following screenshot: Note that OpenRefine does not automatically merge the values of the cluster. 
Instead, it wants you to confirm whether the values indeed point to the same concept. This avoids similar names, which still have a different meaning, accidentally ending up as the same. Before we start making decisions, let's first understand what all of the columns mean. The Cluster Size column indicates how many different spellings of a certain concept were thought to be found. The Row Count column indicates how many rows contain either of the found spellings. In Values in Cluster, you can see the different spellings and how many rows contain a particular spelling. Furthermore, these spellings are clickable, so you can indicate which one is correct. If you hover over the spellings, a Browse this cluster link appears, which you can use to inspect all items in the cluster in a separate browser tab. The Merge column contains a checkbox. If you check it, all values in that cluster will be changed to the value in the New Cell Value column when you click on one of the Merge Selected buttons. You can also manually choose a new cell value if the automatic value is not the best choice. So, let's perform our first clustering operation. I strongly advise you to scroll carefully through the list to avoid clustering values that don't belong together. In this case, however, the algorithm hasn't acted too aggressively: in fact, all suggested clusters are correct. Instead of manually ticking the Merge? checkbox on every single one of them, we can just click on Select All at the bottom. Then, click on the Merge Selected & Re-Cluster button, which will merge all the selected clusters but won't close the window yet, so we can try other clustering algorithms as well. OpenRefine immediately reclusters with the same algorithm, but no other clusters are found since we have merged all of them. Let's see what happens when we try a different similarity function. From the Keying Function menu, click on ngram fingerprint. Note that we get an additional parameter, Ngram Size, which we can experiment with to obtain less or more aggressive clustering. We see that OpenRefine has found several clusters again. It might be tempting to click on the Select All button again, but remember we warned to carefully inspect all rows in the list. Can you spot the mistake? Have a closer look at the following screenshot: Indeed, the clustering algorithm has decided that Shirts and T-shirts are similar enough to be merged. Unfortunately, this is not true. So, either manually select all correct suggestions, or deselect the ones that are not. Then, click on the Merge Selected & Re-Cluster button. Apart from trying different similarity functions, we can also try totally different clustering methods. From the Method menu, click on nearest neighbor. We again see new clustering parameters appear (Radius and Block Chars, but we will use their default settings for now). OpenRefine again finds several clusters, but now, it has been a little too aggressive. In fact, several suggestions are wrong, such as the Lockets / Pockets / Rockets cluster. Some other suggestions, such as "Photocopiers" and "Photocopier", are fine. In this situation, it might be best to manually pick the few correct ones among the many incorrect clusters. Assuming that all clusters have been identified, click on the Merge Selected & Close button, which will apply merging to the selected items and take you back into the main OpenRefine window. If you look at the data now or use a text facet on the Categories field, you will notice that the inconsistencies have disappeared. 
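To make the two clustering strategies used above more concrete, the following small Python sketch imitates them: a simplified fingerprint-style keying function (key collision) and a classic Levenshtein edit distance (nearest neighbor). This is an illustrative approximation only, not OpenRefine's actual implementation, and the sample values are made up.

```python
import re
from itertools import groupby

def fingerprint(value):
    # Simplified key collision: lowercase, strip punctuation, sort unique tokens.
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def levenshtein(a, b):
    # Classic edit distance: additions, deletions, and substitutions count as 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

values = ["Agricultural Equipment", "Agricultural equipment", "Costume", "Costumes"]

# Key collision: values that share a fingerprint end up in the same cluster.
clusters = {k: list(g) for k, g in groupby(sorted(values, key=fingerprint), key=fingerprint)}
print(clusters)  # the two 'Agricultural ...' spellings share one key

# Nearest neighbor: pairs within a small distance radius become cluster candidates.
print(levenshtein("Boot", "Bots"))        # 2 (one addition and one deletion)
print(levenshtein("Costume", "Costumes"))  # 1
```

The next section explains how OpenRefine applies these two ideas internally and how to choose between them.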
What are clustering methods? OpenRefine offers two different clustering methods, key collision and nearest neighbor, which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. For instance, suppose we have a keying function which removes all spaces; then, A B C, AB C, and ABC will be mapped to the same key: ABC. In practice, the keying functions are constructed in a more sophisticated and helpful way. Nearest neighbor, on the other hand, is a technique in which each unique value is compared to every other unique value using a distance function. For instance, if we count every modification as one unit, the distance between Boot and Bots is 2: one addition and one deletion. This corresponds to an actual distance function in OpenRefine, namely levenshtein. In practice, it is hard to predict which combination of method and function is the best for a given field. Therefore, it is best to try out the various options, each time carefully inspecting whether the clustered values actually belong together. The OpenRefine interface helps you by putting the various options in the order they are most likely to help: for instance, trying key collision before nearest neighbor. Summary In this article we learned about how to handle multi-valued cells and clustering of similar cells in OpenRefine. Multi-valued cells are a common problem in many tables. This article showed us what to do if multiple values apply to a single cell. Since OpenRefine is an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. This article also showed an example of how to go about it. It also shed light on clustering methods. OpenRefine offers two different clustering methods, key collision and nearest neighbor , which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. Resources for Article : Further resources on this subject: Business Intelligence and Data Warehouse Solution - Architecture and Design [Article] Self-service Business Intelligence, Creating Value from Data [Article] Oracle Business Intelligence : Getting Business Information from Data [Article]
Getting Started with InnoDB

Packt
19 Feb 2013
9 min read
(For more resources related to this topic, see here.) Basic features of InnoDB InnoDB is more than a fast disk-based relational database engine. It offers, at its core, the following features that separate it from other disk-based engines: MVCC ACID compliance Transaction support Row-level locking These features are responsible for providing what is known as Referential integrity; a core requirement for enterprise database applications. Referential integrity Referential integrity can be best thought of as the ability for the database application to store relational data in multiple tables with consistency. If a database lacks consistency between relational data, the data cannot be relied upon for applications. If, for example, an application stores financial transactions where monetary data is processed, referential integrity and consistency of transactional data is a key component. Financial data is not the only case where this is an important feature, as many applications store and process sensitive data that must be consistent Multiversion concurrency control A vital component is Multiversion concurrency control (MVCC), which is a control process used by databases to ensure that multiple concurrent connections can see and access consistent states of data over time. A common scenario relying on MVCC can be thought of as follows: data exists in a table and an application connection accesses that data, then a second connection accesses the same original data set while the first connection is making changes to it; since the first connection has not finalized its changes and committed its information we don't want the second connection to see the nonfinalized data. Thus two versions of the data exist at the same time—multiple versions—to allow the database to control the concurrent state of the data. MVCC also provides for the existence of point-in-time consistent views, where multiple versions of data are kept and are available for access based on their point-in-time existence. Transaction isolation Transaction support at the database level refers to the ability for units of work to be processed in separate units of execution from others. This isolation of data execution allows each database connection to manipulate, read, and write information at the same time without conflicting with each other. Transactions allow connections to operate on data on an all-or-nothing operation, so that if the transaction completes successfully it will be written to disk and recorded for upcoming transactions to then operate on. However, if the sequence of changes to the data in the transaction process do not complete then they can be rolled back, and no changes will be recorded to disk. This allows sequences of execution that contain multiple steps to fully succeed only if all of the changes complete, and to roll back any changed data to its original state if one or more of the sequence of changes in the transaction fail. This feature guarantees that the data remains consistent and referentially safe. ACID compliance An integral part of InnoDB is its ability to ensure that data is atomic, consistent, isolated, and durable; these features make up components of ACID compliance. Simply put, atomicity requires that if a transaction fails then the changes are rolled back and not committed. Consistency requires that each successfully executed transaction will move the database ahead in time from one state to the next in a consistent manner without errors or data integrity issues. 
Isolation defines that each transaction will see separate sets of data in time and not conflict with other transactional data access. Finally, the durability clause ensures that any data that has been committed in a successful transaction will be written to disk in its final state, without the risk of data loss from errors or system failure, and will then be available to transactions that come in the future. Locking characteristics Finally, InnoDB differs from other on-disk storage engines in that it offers row-level locking. This primarily differs, in the MySQL world, with the MyISAM storage engine which features table-level locking. Locking refers to an internal operation of the database that prohibits reading or writing of table data by connections if another is currently using that data. This prevents concurrent connections from causing data corruption or forcing data invalidation when data is in use. The primary difference between table- and row-level locking is that when a connection requests data from a table it can either lock the row of data being accessed or the whole table of data being accessed. For performance and concurrency benefits, row-level locking excels. System requirements and supported platforms InnoDB can be used on all platforms on which MySQL can be installed. These include: Linux: RPM, Deb, Tar BSDs: FreeBSD, OpenBSD, NetBSD Solaris and OpenSolaris / Illumos: SPARC + Intel IBM AIX HP-UX Mac OSX Windows 32 bit and 64 bit There are also custom ports of MySQL from the open source community for running MySQL on various embedded platforms and non-standard operating systems. Hardware-wise, MySQL and correspondingly InnoDB, will run on a wide variety of hardware, which at the time of this writing includes: Intel x86 32 bit AMD/Intel x 86_64 Intel Itanium IA-64 IBM Power architecture Apple's PPC PA-RISC 1.0 + 2.0 SPARC 32 + 64 bit Keep in mind when installing and configuring InnoDB, depending on the architecture in which it is installed, it will have certain options available and enabled that are not available on all platforms. In addition to the underlying hardware, the operating system will also determine whether certain configuration options are available and the range to which some variables can be set. One of the more decisively important differences to be considered while choosing an operating system for your database server is the manner in which the operating system and underlying filesystem handles write caching and write flushes to the disk storage subsystem. These operating system abilities can cause a dramatic difference in the performance of InnoDB, often to the order of 10 times the concurrency ability. When reading the MySQL documentation you may find that InnoDB has over fifty-eight configuration settings, more or less depending on the version, for tuning the performance and operational defaults. The majority of these default settings can be left alone for development and production server environments. However, there are several core settings that can affect great change, in either positive or negative directions depending on the application workload and hardware resource limits, with which every MySQL database administrator should be familiar and proficient. 
Keep in mind when setting values that some variables are dynamic while others are static; dynamic variables can be changed at runtime and do not require a process restart, while static variables can only be changed prior to process start, so any change made to a static variable at runtime only takes effect upon the next restart of the database server process. Dynamic variables can be changed on the MySQL command line via the following command:

mysql> SET GLOBAL [variable]=[value];

If a value is changed on the command line, it should also be updated in the global my.cnf configuration file so that the change is applied on each restart.

MySQL memory allocation equations
Before tuning any InnoDB configuration settings, memory buffers in particular, we need to understand how MySQL allocates RAM to the various areas of the application. There are two simple equations for referencing total memory usage, based on how memory is allocated to incoming client connections:

Per-thread buffers: Per-thread buffers, also called per-connection buffers since MySQL uses a separate thread for each connection, operate in contrast to global buffers in that per-thread buffers only allocate memory when a connection is made and, in some cases, only allocate as much memory as the connection's workload requires, thus not necessarily utilizing the entire size of the allowable buffer. This memory utilization method is described in the MySQL manual as follows: "Each client thread is associated with a connection buffer and a result buffer. Both begin with a size given by net_buffer_length but are dynamically enlarged up to max_allowed_packet bytes as needed. The result buffer shrinks to net_buffer_length after each SQL statement."

Global buffers: Global buffers are allocated memory resources regardless of the number of connections being handled. These buffers request their memory during the startup process and retain this reservation of resources until the server process has ended.

When allocating memory to MySQL buffers, we need to ensure that there is also enough RAM available for the operating system to perform its tasks and processes; in general, it is a best practice to limit MySQL to between 85 and 90 percent of total system RAM. The memory utilization equations for each type of buffer are given as follows:

Per-thread buffer memory utilization equation:
(read_buffer_size + read_rnd_buffer_size + sort_buffer_size + thread_stack + join_buffer_size + binlog_cache_size) * max_connections = total memory allocation for all connections, or MySQL Thread Buffers (MTB)

Global buffer memory utilization equation:
innodb_buffer_pool_size + innodb_additional_mem_pool_size + innodb_log_buffer_size + key_buffer_size + query_cache_size = total memory used by MySQL Global Buffers (MGB)

Total memory allocation equation:
MTB + MGB = total memory used by MySQL

If the total memory used by the combination of MTB and MGB is greater than 85 to 90 percent of the total system RAM, you may experience resource contention, a resource bottleneck, or, in the worst case, memory pages swapping to on-disk resources (virtual memory), which results in performance degradation and, in some cases, process failure or connection timeouts. It is therefore wise to check memory allocation via the preceding equations before changing the memory buffers or increasing the value of max_connections.
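The two equations can be evaluated straight from a client session using system variables. The following sketch assumes a MySQL 5.5-era server on which all of these variables still exist (innodb_additional_mem_pool_size and query_cache_size, for instance, were removed in later versions), so treat it as illustrative rather than universal.

-- Approximate worst-case per-thread allocation (MTB), in bytes.
SELECT (@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size
      + @@thread_stack + @@join_buffer_size + @@binlog_cache_size)
      * @@max_connections AS mtb_bytes;

-- Global buffer allocation (MGB), in bytes.
SELECT @@innodb_buffer_pool_size + @@innodb_additional_mem_pool_size
     + @@innodb_log_buffer_size + @@key_buffer_size
     + @@query_cache_size AS mgb_bytes;

Comparing the sum of mtb_bytes and mgb_bytes against 85 to 90 percent of the machine's RAM gives a quick sanity check before raising any buffer size or max_connections.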
More information about how MySQL manages memory and threads can be found in the following pages of the MySQL documentation:

- http://dev.mysql.com/doc/refman/5.5/en/connection-threads.html
- http://dev.mysql.com/doc/refman/5.5/en/memory-use.html

Summary
This article provided a quick overview of the core terminology and basic features, system requirements, and a few memory allocation equations.

Resources for Article:
Further resources on this subject:
- Configuring MySQL [Article]
- Optimizing your MySQL Servers' performance using Indexes [Article]
- Indexing in MySQL Admin [Article]

Set Up MariaDB

Packt
16 Jun 2015
8 min read
In this article, by Daniel Bartholomew, author of Getting Started with MariaDB - Second Edition, you will learn to set up MariaDB with a generic configuration suitable for general use. This is perfect for giving MariaDB a try, but might not be suitable for a production database application under heavy load. There are thousands of ways to tweak the settings to get MariaDB to perform just the way we need it to; many books have been written on this subject. In this article, we'll cover enough of the basics so that we can comfortably edit the MariaDB configuration files and know our way around.

The MariaDB filesystem layout
A MariaDB installation is not a single file or even a single directory, so the first stop on our tour is a high-level overview of the filesystem layout. We'll start with Windows and then move on to Linux.

The MariaDB filesystem layout on Windows
On Windows, MariaDB is installed under a directory named with the following pattern:

C:\Program Files\MariaDB <major>.<minor>

In the preceding path, <major> and <minor> refer to the first and second numbers in the MariaDB version string. So for MariaDB 10.1, the location would be:

C:\Program Files\MariaDB 10.1

The only alteration to this location, unless we change it during the installation, is when the 32-bit version of MariaDB is installed on a 64-bit version of Windows. In that case, the default MariaDB directory is at the following location:

C:\Program Files (x86)\MariaDB <major>.<minor>

Under the MariaDB directory on Windows, there are four primary directories: bin, data, lib, and include. There are also several configuration examples and other files under the MariaDB directory and a couple of additional directories (docs and Share), but we won't go into their details here. The bin directory is where the executable files of MariaDB are located. The data directory is where databases are stored; it is also where the primary MariaDB configuration file, my.ini, is stored. The lib directory contains various library and plugin files. Lastly, the include directory contains files that are useful for application developers. We don't generally need to worry about the bin, lib, and include directories; it's enough for us to be aware that they exist and know what they contain. The data directory is where we'll spend most of our time in this article and when using MariaDB.

The MariaDB filesystem layout on Linux
On Linux distributions, MariaDB follows the default filesystem layout. For example, the MariaDB binaries are placed under /usr/bin/, libraries are placed under /usr/lib/, manual pages are placed under /usr/share/man/, and so on. However, there are some key MariaDB-specific directories and file locations that we should know about. Two of them are locations that are the same across most Linux distributions. These locations are the /usr/share/mysql/ and /var/lib/mysql/ directories.

The /usr/share/mysql/ directory contains helper scripts that are used during the initial installation of MariaDB, translations (so we can have error and system messages in different languages), and character set information. We don't need to worry about these files and scripts; it's enough to know that this directory exists and contains important files.

The /var/lib/mysql/ directory is the default location for our actual database data and related files such as logs. There is not much need to worry about this directory as MariaDB will handle its contents automatically; for now it's enough to know that it exists.
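If you ever want to confirm where a particular running server actually keeps these files, rather than inferring it from the distribution defaults, you can ask the server itself. This is a small illustrative query, not part of the original article; @@plugin_dir points at the plugin directory discussed next.

-- Show where this server keeps its data files and its plugins.
SELECT @@datadir AS data_directory,
       @@plugin_dir AS plugin_directory;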
The next directory we should know about is where the MariaDB plugins are stored. Unlike the previous two, the location of this directory varies. On Debian and Ubuntu systems, the directory is at the following location:

/usr/lib/mysql/plugin/

In distributions such as Fedora, Red Hat, and CentOS, the location of the plugin directory varies depending on whether our system is 32 bit or 64 bit. If unsure, we can just look in both. The possible locations are:

/lib64/mysql/plugin/
/lib/mysql/plugin/

The basic rule of thumb is that if we don't have a /lib64/ directory, we have the 32-bit version of Fedora, Red Hat, or CentOS installed. As with /usr/share/mysql/, we don't need to worry about the contents of the MariaDB plugin directory. It's enough to know that it exists and contains important files. Also, if in the future we install a new MariaDB plugin, this directory is where it will go.

The last directory that we should know about is only found on Debian and the distributions based on Debian, such as Ubuntu. Its location is as follows:

/etc/mysql/

The /etc/mysql/ directory is where the configuration information for MariaDB is stored; specifically, in the following two locations:

/etc/mysql/my.cnf
/etc/mysql/conf.d/

Fedora, Red Hat, CentOS, and related systems don't have an /etc/mysql/ directory by default, but they do have a my.cnf file and a directory that serves the same purpose that the /etc/mysql/conf.d/ directory does on Debian and Ubuntu. They are at the following two locations:

/etc/my.cnf
/etc/my.cnf.d/

The my.cnf files, regardless of location, function the same on all Linux versions and on Windows, where the file is often named my.ini. The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories, as mentioned, serve the same purpose. We'll spend the next section going over these two directories.

Modular configuration on Linux
The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories are special locations for MariaDB configuration files. They are found in the MariaDB releases for Linux distributions such as Debian, Ubuntu, Fedora, Red Hat, and CentOS. We will only have one or the other of them, never both, and regardless of which one we have, their function is the same.

The basic idea behind these directories is to allow the package manager (APT or YUM) to install packages for MariaDB that include additions to MariaDB's configuration without needing to edit or change the main my.cnf configuration file. It's easy to imagine the harm that would be caused if we installed a new plugin package and it overwrote a carefully crafted and tuned configuration file. With these special directories, the package manager can simply add a file to the appropriate directory and be done.

When the MariaDB server and the clients and utilities included with MariaDB start up, they first read the main my.cnf file and then any files that they find under the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directories that have the extension .cnf, because of a line at the end of the default configuration files. For example, MariaDB includes a plugin called feedback whose sole purpose is to send back anonymous statistical information to the MariaDB developers. They use this information to help guide future development efforts. It is disabled by default but can easily be enabled by adding feedback=on to a [mysqld] group of the MariaDB configuration file (we'll talk about configuration groups in the following section).
We could add the required lines to our main my.cnf file or, better yet, we can create a file called feedback.cnf (MariaDB doesn't care what the actual filename is, apart from the .cnf extension) with the following content:

[mysqld]
feedback=on

All we have to do is put our feedback.cnf file in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory, and when we start or restart the server, the feedback.cnf file will be read and the plugin will be turned on. Doing this for a single plugin on a solitary MariaDB server may seem like too much work, but suppose we have 100 servers, and further assume that since the servers are doing different things, each of them has a slightly different my.cnf configuration file. Without using our small feedback.cnf file to turn on the feedback plugin on all of them, we would have to connect to each server in turn and manually add feedback=on to the [mysqld] group of the file. This would get tiresome, and there is also a chance that we might make a mistake with one, or several, of the files that we edit, even if we try to automate the editing in some way. Copying a single file to each server that does only one thing (turning on the feedback plugin in our example) is much faster, and much safer. And, if we have an automated deployment system in place, copying the file to every server can be almost instant.

Caution! Because the configuration settings in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory are read after the settings in the my.cnf file, they can override or change the settings in our main my.cnf file. This can be a good thing if that is what we want and expect. Conversely, it can be a bad thing if we are not expecting that behavior.

Summary
That's it for our configuration highlights tour! In this article, we've learned where the various bits and pieces of MariaDB are installed and about the different parts that make up a typical MariaDB configuration file.

Resources for Article:
- Building a Web Application with PHP and MariaDB – Introduction to caching
- Installing MariaDB on Windows and Mac OS X
- Questions & Answers with MariaDB's Michael "Monty" Widenius - Founder of MySQL AB

Text Recognition

Packt
04 Jan 2017
7 min read
In this article by Fábio M. Soares and Alan M.F. Souza, the authors of the book Neural Network Programming with Java - Second Edition, we will cover pattern recognition, neural networks in pattern recognition, and text recognition (OCR). We all know that humans can read and recognize images faster than any supercomputer; however, we have seen so far that neural networks show amazing capabilities of learning through data in both supervised and unsupervised ways. In this article we present an additional case of pattern recognition involving an example of Optical Character Recognition (OCR). Neural networks can be trained to strictly recognize digits written in an image file. The topics of this article are:

- Pattern recognition
- Defined classes
- Undefined classes
- Neural networks in pattern recognition
- MLP
- Text recognition (OCR)
- Preprocessing and classes definition

(For more resources related to this topic, see here.)

Pattern recognition
Patterns are groups of data and elements that look similar to each other, in such a way that they occur systematically and repeat from time to time. This task can be solved mainly by unsupervised learning through clustering; however, when there are labelled data or defined classes of data, it can also be solved by supervised methods. We as humans perform this task more often than we imagine. When we see objects and recognize them as belonging to a certain class, we are indeed recognizing a pattern. Also, when we analyze charts, discrete events, and time series, we might find evidence of some sequence of events that repeats systematically under certain conditions. In summary, patterns can be learned from data observations. Examples of pattern recognition tasks include, but are not limited to:

- Shape recognition
- Object classification
- Behavior clustering
- Voice recognition
- OCR
- Chemical reaction taxonomy

Defined classes
When a list of classes has been predefined for a specific domain, each class is considered to be a pattern; therefore, every data record or occurrence is assigned one of these predefined classes. The predefinition of classes can usually be performed by an expert or based on previous knowledge of the application domain. It is desirable to apply defined classes when we want the data to be classified strictly into one of the predefined classes. One illustrated example of pattern recognition using defined classes is animal recognition by image, shown in the next figure. The pattern recognizer, however, should be trained to catch all the characteristics that formally define the classes. In the example, eight figures of animals are shown, belonging to two classes: mammals and birds. Since this is a supervised mode of learning, the neural network should be provided with a sufficient number of images that allow it to properly classify new images.

Of course, sometimes the classification may fail, mainly due to similar hidden patterns in the images that neural networks may catch, and also due to small nuances present in the shapes. For example, the dolphin has flippers but it is still a mammal. Sometimes, in order to obtain a better classification, it is necessary to apply preprocessing and ensure that the neural network will receive the appropriate data that would allow for classification.

Undefined classes
When data are unlabeled and there is no predefined set of classes, it is a scenario for unsupervised learning.
Shape recognition is a good example, since shapes may be flexible and have an infinite number of edges, vertices, or bindings. In the previous figure, we can see several sorts of shapes and we want to arrange them so that similar ones can be grouped into the same cluster. Based on the shape information present in the images, the pattern recognizer is likely to classify the rectangle, the square, and the rectangular triangle into the same group. But if the information were presented to the pattern recognizer not as an image, but as a graph with edge and vertex coordinates, the classification might change a little. In summary, the pattern recognition task may use both supervised and unsupervised modes of learning, basically depending on the objective of recognition.

Neural networks in pattern recognition
For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into an X-Y dataset, where for every data record in X there is a corresponding class in Y. The output of the neural network for classification problems should have all of the possible classes, and this may require preprocessing of the output records. In the other case, unsupervised learning, there is no need to apply labels to the output; however, the input data should be properly structured as well. To remind the reader, the schemas of both neural networks are shown in the next figure.

Data pre-processing
We have to deal with all possible types of data, that is, numerical (continuous and discrete) and categorical (ordinal or unscaled). But here we also have the possibility of performing pattern recognition on multimedia content, such as images and videos. So how can multimedia be handled? The answer to this question lies in the way these contents are stored in files. Images, for example, are written as a representation of small colored points called pixels. Each color can be coded in an RGB notation where the intensities of red, green, and blue define every color the human eye is able to see. Therefore, an image of dimension 100x100 would have 10,000 pixels, each one having three values for red, green, and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods may reduce this huge number of dimensions. Afterwards, an image can be treated as a big matrix of numerical continuous values. For simplicity, in this article we are applying only gray-scaled images with small dimensions.

Text recognition (OCR)
Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text so that a computer can apply editing and text processing. However, this feature involves a number of challenges:

- Variety of text fonts
- Text size
- Image noise
- Manuscripts

In spite of that, humans can easily interpret and read texts written even in a poor-quality image. This can be explained by the fact that humans are already familiar with the text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on) in order to successfully recognize texts in images.

Digits recognition
Although there are a variety of tools available in the market for OCR, it remains a big challenge for an algorithm to properly recognize texts in images.
So we will restrict our application to a smaller domain, so that we face simpler problems. Therefore, in this article we are going to implement a neural network to recognize digits from 0 to 9 represented in images. Also, the images will have standardized and small dimensions, for simplicity purposes.

Summary
In this article we have covered pattern recognition, neural networks in pattern recognition, and text recognition (OCR).

Resources for Article:
Further resources on this subject:
- Training neural networks efficiently using Keras [article]
- Implementing Artificial Neural Networks with TensorFlow [article]
- Training and Visualizing a neural network with R [article]

Participating in a business process (Intermediate)

Packt
31 Jul 2013
5 min read
(For more resources related to this topic, see here.)

The hurdles and bottlenecks for financial services, from an IT point of view, are:

- Silos of data
- Outdated IT systems and many applications running on legacy and non-standards-based systems
- Business process and reporting systems not in sync with each other
- Lack of real-time data visibility
- Automated decision making
- Ability to change and manage business processes in accordance with changes in business dynamics
- Partner management
- Customer satisfaction

This is where BPM plays a key role in bridging the gap between key business requirements and technology or business hurdles. In a real-life scenario, a typical home loan use case would be tied to the Know Your Customer (KYC) regulatory requirement. In India, for example, the Reserve Bank of India (RBI) has passed guidelines that make it mandatory for banks to properly know their customers. RBI mandates that banks collect their customers' proof of identity, recent photographs, and Income Tax PAN. Proof of residence can be a voter card, a driving license, or a passport copy.

Getting ready
We start with the source code from the previous recipe. We will add a re-usable e-mail or SMS notification process. It is always a best practice to add a new process if it is called multiple times in the same process. This can be a subprocess within the main process itself, or it can be a part of the same composite outside the main process. We will add a new regulatory requirement that allows the customer to add KYC requirements such as a photo, proof of address, and Income Tax PAN copy as attachments that will be checked into the WebCenter Content repository. These checks become part of the customer verification stage before finance approval. We will make KYC a subprocess, with scope for expansion under a different scenario. We will also save the process data into a filesystem or a JMS messaging queue at the end of the loan process. In a banking scenario, this can also be the integration stage for other applications such as a CRM application or any other application.

How to do it…
Let's perform the following steps:

1. Launch JDeveloper and open the composite.xml of LoanApplicationProcess in the Design view.
2. Drag-and-drop a new BPMN Process component from the Component Palette. Create the Send Notifications process next to the existing LoanApplicationProcess, and edit the new process. The Send Notifications process will take To e-mail ID, From e-mail ID, Subject, and CC as input parameters, and will send an e-mail to the given e-mail ID.
3. Similarly, we will drag-and-drop a File Adapter component from the Component Palette that saves the customer data into a file. We place this component at the end of the LoanApplication process, just before the End activity.
4. We will use this notification service to notify Verification Officers about the arrival of a new eligible application that needs to be verified.
5. In the Application Verification Officer stage, we will add a subprocess, KYC, that will be assigned to the loan initiator (James Cooper in our case). This will be preceded by sending an e-mail notification to the applicant asking for KYC details such as the PAN number, a scanned photograph, and a voter ID, as requested by the Verification Officers.
6. Now, let us implement Save Loan Application by invoking the File Adapter service. The Email notification services are also available out of the box.
How it works…
The outputs of this recipe are re-usable services that can be used across multiple service calls, such as the notification services. This recipe also demonstrates how to use subprocesses and change the process to meet regulatory requirements. Let's understand the output by taking our use case scenario:

- When the process is initiated, the e-mail notification gets triggered at the appropriate stages of the process. Conan Doyle and John Steinbeck will get the e-mail requesting them to process the application, with the required information about the applicant, along with the link to BPM Workspace.
- The KYC task also sends an e-mail to James Cooper, requesting the documents required for the KYC check.
- James Cooper logs in to the James Bank WebCenter Portal and sees there is a task assigned to him to upload his KYC details.
- James Cooper clicks on the task link and submits the required soft copy documents, which get checked into the content repository once the form is submitted.

The start-to-end process flow now looks as follows:

Summary
BPM Process Spaces, which is an extension template of BPM, allows process and task views to be exposed to WebCenter Portal. The advantage of having Process Spaces made available within the Portal is that users can collaborate with others using out-of-the-box Portal features such as wikis, discussion forums, blogs, and content management. This improves productivity as the user need not log in to different applications for different purposes, as all the required data and information will be made available within the Portal environment. It is also possible to expose some of the WSRP-supported application portlets (for example, HR portlets from PeopleSoft) into a corporate portal environment. All of this sums up to provide higher visibility of the entire business process, and a way of working and collaborating together in an enterprise business environment.

Resources for Article:
Further resources on this subject:
- Managing Oracle Business Intelligence [Article]
- Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts [Article]
- Getting Started with Oracle Information Integration [Article]

The Hunt for Data

Packt
25 Jun 2014
10 min read
(For more resources related to this topic, see here.)

Examining a JSON file with the aeson package
JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627). In this recipe, we will parse a JSON description of a person. We often encounter JSON in APIs from web applications.

Getting ready
Install the aeson library from Hackage using Cabal. Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet:

$ cat input.json
{"name":"Gauss", "nationality":"German", "born":1777, "died":1855}

We will be parsing this JSON and representing it as a usable data type in Haskell.

How to do it...
Use the OverloadedStrings language extension to represent strings as ByteString, as shown in the following line of code:

{-# LANGUAGE OverloadedStrings #-}

Import aeson as well as some helper functions as follows:

import Data.Aeson
import Control.Applicative
import qualified Data.ByteString.Lazy as B

Create the data type corresponding to the JSON structure, as shown in the following code:

data Mathematician = Mathematician { name :: String
                                   , nationality :: String
                                   , born :: Int
                                   , died :: Maybe Int
                                   }

Provide an instance for the parseJSON function, as shown in the following code snippet:

instance FromJSON Mathematician where
  parseJSON (Object v) = Mathematician
    <$> (v .: "name")
    <*> (v .: "nationality")
    <*> (v .: "born")
    <*> (v .:? "died")

Define and implement main as follows:

main :: IO ()
main = do

Read the input and decode the JSON, as shown in the following code snippet:

  input <- B.readFile "input.json"
  let mm = decode input :: Maybe Mathematician
  case mm of
    Nothing -> print "error parsing JSON"
    Just m  -> (putStrLn.greet) m

Now we will do something interesting with the data as follows:

greet m = (show.name) m ++ " was born in the year " ++ (show.born) m

We can run the code to see the following output:

$ runhaskell Main.hs
"Gauss" was born in the year 1777

How it works...
Aeson takes care of the complications of representing JSON. It creates natively usable data out of structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module. As the Aeson package uses ByteStrings instead of Strings, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code, which invokes the OverloadedStrings language extension.

We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON is a little strange, but this is because we're leveraging applicative functions and lenses, which are more advanced Haskell topics.

The .: function has two arguments, Object and Text, and returns a Parser a data type. As per the documentation, it retrieves the value associated with the given key of an object. This function is used if the key and the value exist in the JSON document. The .:? function also retrieves the associated value from the given key of an object, but the existence of the key and value is not mandatory. So, we use .:? for optional key-value pairs in a JSON document.
There's more…
If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson
import qualified Data.ByteString.Lazy as B
import GHC.Generics

data Mathematician = Mathematician { name :: String
                                   , nationality :: String
                                   , born :: Int
                                   , died :: Maybe Int
                                   } deriving Generic

instance FromJSON Mathematician

main = do
  input <- B.readFile "input.json"
  let mm = decode input :: Maybe Mathematician
  case mm of
    Nothing -> print "error parsing JSON"
    Just m  -> (putStrLn.greet) m

greet m = (show.name) m ++ " was born in the year " ++ (show.born) m

Although Aeson is powerful and generalizable, it may be overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto.

Reading an XML file using the HXT package
Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/). In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.

Getting ready
We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess on December 18, 2014:

$ cat input.xml
<thread>
  <email>
    <to>Databender</to>
    <from>Princess</from>
    <date>Thu Dec 18 15:03:23 EST 2014</date>
    <subject>Joke</subject>
    <body>Why did you divide sin by tan?</body>
  </email>
  <email>
    <to>Princess</to>
    <from>Databender</from>
    <date>Fri Dec 19 3:12:00 EST 2014</date>
    <subject>RE: Joke</subject>
    <body>Just cos.</body>
  </email>
</thread>

Using Cabal, install the HXT library, which we use for manipulating XML documents:

$ cabal install hxt

How to do it...
We only need one import, which will be for parsing XML, using the following line of code:

import Text.XML.HXT.Core

Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code:

main :: IO ()
main = do
  input <- readFile "input.xml"

Apply the readString function to the input and extract all the date documents. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet:

  dates <- runX $ readString [withValidate no] input
    //> hasName "date"
    //> getText

We can now use the list of extracted dates as follows:

  print dates

By running the code, we print the following output:

$ runhaskell Main.hs
["Thu Dec 18 15:03:23 EST 2014", "Fri Dec 19 3:12:00 EST 2014"]

How it works...
The library function, runX, takes in an Arrow. Think of an Arrow as a more powerful version of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action of the String type. We generate this IOSArrow object using the readString function, which performs a series of operations on the XML data. For a deep insight into the XML document, //> should be used, whereas /> only looks at the current level. We use the //> function to look up the date attributes and display all the associated text.
As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:

- isText: This is used to test for text nodes
- isAttr: This is used to test for an attribute tree
- hasAttr: This is used to test whether an element node has an attribute node with a specific name
- getElemName: This is used to select the name of an element node

All the Arrow functions can be found in the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.

Capturing table rows from an HTML page
Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file may be useful, so we find ourselves focusing only on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data, whereas a paragraph in an article may be too unstructured and complicated to process. In this recipe, we will find a table on a web page and gather all its rows to be used in the program.

Getting ready
We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure. The HTML behind this table is as follows:

$ cat input.html
<!DOCTYPE html>
<html>
  <body>
    <h1>Course Listing</h1>
    <table>
      <tr>
        <th>Course</th>
        <th>Time</th>
        <th>Capacity</th>
      </tr>
      <tr>
        <td>CS 1501</td>
        <td>17:00</td>
        <td>60</td>
      </tr>
      <tr>
        <td>MATH 7600</td>
        <td>14:00</td>
        <td>25</td>
      </tr>
      <tr>
        <td>PHIL 1000</td>
        <td>9:30</td>
        <td>120</td>
      </tr>
    </table>
  </body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

$ cabal install hxt
$ cabal install split

How to do it...
We will need the hxt package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet:

import Text.XML.HXT.Core
import Data.List.Split (chunksOf)

Define and implement main to read the input.html file:

main :: IO ()
main = do
  input <- readFile "input.html"

Feed the HTML data into readString, thereby setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the remaining text, as shown in the following code:

  texts <- runX $ readString [withParseHTML yes, withWarnings no] input
    //> hasName "td"
    //> getText

The data is now usable as a list of strings. It can be converted into a list of lists, similar to how CSV was presented in the previous CSV recipe, as shown in the following code:

  let rows = chunksOf 3 texts
  print $ findBiggest rows

By folding through the data, identify the course with the largest capacity using the following code snippet:

findBiggest :: [[String]] -> [String]
findBiggest [] = []
findBiggest items = foldl1 (\a x -> if capacity x > capacity a then x else a) items

capacity [a,b,c] = toInt c
capacity _ = -1

toInt :: String -> Int
toInt = read

Running the code will display the class with the largest capacity as follows:

$ runhaskell Main.hs
{"PHIL 1000", "9:30", "120"}

How it works...
This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].

Building Personal Community in Liferay Portal 5.2

Packt
20 Oct 2009
7 min read
Besides public web sites, it would be nice if we could provide a personal community, that is, My Community, for each registered user. In My Community, users can have a set of public and private pages. Here you can save your favorite games, videos, and playlists, and the My Street theme (your background color).

As shown in the following screenshot, there is a page my_street with a portlet myStreet, where an end user can sign up, log in, or handle an issue such as Forgot Your Password? When you click on the SIGN UP button, this My Street portlet allows end users to set up their accounts by creating a nickname, password hint question, password hint answer, password, and password verification. Further, as the end user, you can set up favorite games, videos, and playlists, and a background color. You can have your own favorites: the number of favorites displayed in My Street (for example, 4, 20, and 35) and the My Street theme (for example, default, Abby Cadabby, Bert, Big Bird, Cookie Monster, and so on). A set of default My Street themes is predefined in /cms_services/images/my_street/ under the folder $CATALINA_HOME/webapps/, and you can choose any one of them at any time. At the same time, you can upload a photo to your own Book Street web page.

When logged in, the My Street theme will be applied on the Home page, Games landing page, Videos landing page, and Playlist landing page. For example, the current user's My Street theme could be applied on the Playlist landing page. You may play videos, games, and playlists when you visit the web site www.bookpubstreet.com. When you find your favorite videos, games, and playlists, you can add them into My Street. As shown in the following screenshot, you could be playing a playlist Tickle Time. If you are interested in this playlist, just click on the Add to My Street button. The playlist Tickle Time will be added into My Street as your favorite playlist. In this section, we will show how to implement these features.

Customizing user model
First, let's customize the user model in service.xml in order to support extended users and user preferences. To do so, use the following steps:

1. Create a package com.ext.portlet.user in the /ext/ext-impl/src folder.
2. Create an XML file service.xml in the package com.ext.portlet.comment and open it.
3. Add the following lines in this file and save it:

<?xml version="1.0"?>
<!DOCTYPE service-builder PUBLIC "-//Liferay//DTD Service Builder 5.2.0//EN" "http://www.liferay.com/dtd/liferay-service-builder_5_2_0.dtd">
<service-builder package-path="com.ext.portlet.user">
  <namespace>ExtUser</namespace>
  <entity name="ExtUser" uuid="false" local-service="true" remote-service="true"
    persistence-class="com.ext.portlet.user.service.persistence.ExtUserPersistenceImpl">
    <column name="userId" type="long" primary="true" />
    <column name="favorites" type="int" />
    <column name="theme" type="String" />
    <column name="printable" type="boolean" />
    <column name="plainPassword" type="String" />
    <column name="creator" type="String" />
    <column name="modifier" type="String" />
    <column name="created" type="Date" />
    <column name="modified" type="Date" />
  </entity>
  <entity name="ExtUserPreference" uuid="false" local-service="true" remote-service="true"
    persistence-class="com.ext.portlet.user.service.persistence.ExtUserPreferencePersistenceImpl">
    <column name="userPreferenceId" type="long" primary="true" />
    <column name="userId" type="long"/>
    <column name="favorite" type="String" />
    <column name="favoriteType" type="String" />
    <column name="date_" type="Date" />
  </entity>
  <exceptions>
    <exception>ExtUser</exception>
    <exception>ExtUserPreference</exception>
  </exceptions>
</service-builder>

The code above shows the customized user model, including userId (associated with the USER_ table), favorites, theme, printable, plainPassword, and so on. It also shows the user preferences model, including userPreferenceId, userId, favorite (for example, a game/video/playlist UID), favoriteType (for example, game/video/playlist), and the updated date date_. Of course, these models are extensible. You can extend them for your current needs or future requirements.

Enter the following tables into the database through the command prompt:

create table ExtUser (
  userId bigint not null primary key,
  creator varchar(125),
  modifier varchar(125),
  created datetime null,
  modified datetime null,
  favorites smallint,
  theme varchar(125),
  plainPassword varchar(125),
  printable boolean
);

create table ExtUserPreference (
  userPreferenceId bigint not null primary key,
  userId bigint not null,
  favorite varchar(125),
  favoriteType varchar(125),
  date_ datetime null
);

The preceding code shows the database SQL script for the customized user model and the user preference model. Similar to the XML model ExtUser, it shows the userId, favorites, theme, plainPassword, and printable table fields. Likewise, for the XML model ExtUserPreference, it shows the userPreferenceId, userId, favorite, favoriteType, and date_ table fields.

Afterwards, we need to build a service with ServiceBuilder. After preparing service.xml, you can build services. To do so, locate the XML file /ext/ext-impl/buildparent.xml, open it, add the following lines between </target> and <target name="build-service-portlet-reports">, and save it:

<target name="build-service-portlet-extUser">
  <antcall target="build-service">
    <param name="service.file" value="src/com/ext/portlet/user/service.xml" />
  </antcall>
</target>

When you are ready, just double-click on the Ant target build-service-portlet-extUser. ServiceBuilder will build the related models and services for extUser and extUserPreference.

Building the portlet My Street
Similar to how we built the portlet Ext Comment, we can build the portlet My Street as follows:

1. Configure the portlet My Street in both the portlet-ext.xml and liferay-portlet-ext.xml files.
2. Set the title mapping in the Language-ext.properties file.
3. Add the My Street portlet to the Book category in the liferay-display.xml file.
4. Finally, specify Struts actions and forward paths in the struts-config.xml and tiles-defs.xml files, respectively.

Then, we need to create Struts actions as follows:

1. Create a package com.ext.portlet.my_street.action in the folder /ext/ext-impl/src.
2. Add the Java files ViewAction.java, EditUserAction.java, and CreateAccountAction.java in this package.
3. Create a Java file AddUserLocalServiceUtil.java in this package and open it.
4. Add the following methods in AddUserLocalServiceUtil.java and save it:

public static ExtUser getUser(long userId){
  ExtUser user = null;
  try{
    user = ExtUserLocalServiceUtil.getExtUser(userId);
  } catch (Exception e){}
  if(user == null){
    user = ExtUserLocalServiceUtil.createExtUser(userId);
    try{
      ExtUserLocalServiceUtil.updateExtUser(user);
    } catch (Exception e) {}
  }
  return user;
}

public static void deleteUser(long userId) {
  try{
    ExtUserLocalServiceUtil.deleteExtUser(userId);
  } catch (Exception e){}
}

public static void updateUser(ActionRequest actionRequest, long userId) {
  /* ignore details */
}

public static List<ExtUserPreference> getUserPreferences(long userId, int limit){
  /* ignore details */
}

public static ExtUserPreference addUserPreference(long userId, String favorite, String favoriteType){
  /* ignore details */
}

As shown in the code above, these methods get ExtUser and ExtUserPreference, delete ExtUser, and add and update ExtUser and ExtUserPreference.

In addition, we need to provide default values for the private page and friendly URL in portal-ext.properties as follows:

ext.default_page.my_street.private_page=false
ext.default_page.my_street.friend_url=/my_street

The code above sets the default private page of my_street to false and the default friendly URL of my_street to /my_street. Therefore, you can use the VM service to generate a URL in order to add videos, games, and playlists into My Street.

Adding Struts view page
Now we need to build the Struts view pages view.jsp, forget_password.jsp, create_account.jsp, congratulation.jsp, and congrates_uid.jsp. The following are the main steps to do so:

1. Create a folder my_street in /ext/ext-web/docroot/html/portlet/ext/.
2. Create the JSP pages view.jsp, view_password.jsp, userQuestions.jsp, edit_account.jsp, create_account.jsp, congratulation.jsp, and congrates_uid.jsp in /ext/ext-web/docroot/html/portlet/ext/my_street/.

Note that congrates_uid.jsp is used for the pop-up congratulation of Add to My Street. When you click on the Add to My Street button, a window with congrates_uid.jsp will pop up. userQuestions.jsp is used when the user has forgotten the password of My Street, view_password.jsp is for the general view of My Street, congratulation.jsp is used to present success information after creating a user account, and edit_account.jsp is used to create/update the user account.