Data | Tech News, Tutorials & Expert Insights


11 Jan 2011

4 min read

Microsoft SQL Server 2008 High Availability: Installing Database Mirroring


Microsoft SQL Server 2008 High Availability

Minimize downtime, speed up recovery, and achieve the highest level of availability and reliability for SQL Server applications by mastering the concepts of database mirroring, log shipping, clustering, and replication.

- Install various SQL Server High Availability options in a step-by-step manner
- A guide to SQL Server High Availability for DBA aspirants, proficient developers, and system administrators
- Learn the pre- and post-installation concepts and common issues you come across while working on SQL Server High Availability
- Tips to enhance performance with SQL Server High Availability
- External references for further study

Introduction

First, let's briefly see what Database Mirroring is. Database Mirroring is an option that increases the availability of a SQL Server database by maintaining a standby copy, which can be used as an alternate production server in case of an emergency. As its name suggests, mirroring means making an exact copy of the data. Mirroring can be done onto a disk, website, or somewhere else. Now let's move on to the topic of this article: the installation of Database Mirroring.

Preparing for Database Mirroring

Before we move forward, we need to prepare the database for Database Mirroring. Here are the steps:

1. Ensure that the database uses the Full recovery model. You can switch to it with an ALTER DATABASE ... SET RECOVERY FULL statement.
2. Execute the backup command, followed by the transaction log backup command.
3. Run the RESTORE VERIFYONLY command after the backup completes. This command checks the validity of a backup file, and it is recommended to always verify the backup.
4. Now that we have a full database backup file and a log backup file, move them over to the server we have identified as the Mirror.
5. On the Mirror server, perform the database restoration, followed by the restore log command, both with NORECOVERY. The NORECOVERY option is necessary so that additional log backups or transactions can be applied.

Installing Database Mirroring

As the database that we want to participate in Database Mirroring is now ready, we can move on with the actual installation process.

1. Right-click on the database we want to mirror and select Tasks | Mirror.... This opens the mirroring configuration screen.
2. To start the actual setup, click on the Configure Security... button.
3. In this dialog box, select the No option, as we are not including the Witness Server at this stage and will perform this task later.
4. In the next dialog box, connect to the Principal Server. Specify the Listener Port and Endpoint name, and click Next.
5. We are now asked to configure the same properties for the Mirror Server: the Listener Port and Endpoint name.
6. In this step, the installation wizard asks us to specify the service account that will be used by the Database Mirroring operation. If the local system account is used as the service account, Certificates must be used for authentication.

   Generally, certificates are used by websites to assure their users that information is secured. Certificates are digital documents that store the digital signature or identity information of the holder for authentication purposes. They help protect information sent over the internet, an intranet, or a VPN. Certificates are installed on servers either by obtaining them from providers such as http://www.thwate.com, or by being self-issued by the Database Administrator or Chief Information Officer of the company (the httpcfg.exe utility is used to bind a certificate to an HTTP listener, not to issue one). The same is true for SQL Server: it uses certificates to secure information, and these certificates can be self-issued, for example with the CREATE CERTIFICATE statement, or obtained from an issuing authority.
7. In the next dialog box, make sure that the configuration details we have furnished are valid. Ensure that the names of the Principal and Mirror Servers, the Endpoints, and the port numbers are correct. Click Finish.
8. Ensure that the setup wizard returns a success report at the end.
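The preparation steps described earlier (Full recovery model, full and log backups, verification, and restores WITH NORECOVERY on the mirror) can be sketched as a small script generator. This is only a sketch under stated assumptions: the database name, the backup folder, the file names, and the helper `mirroring_prep_script` are hypothetical, and the backup files are assumed to be copied to the same path on the mirror server. The T-SQL it emits uses standard statements, but review it before running it against a real instance.

```python
def mirroring_prep_script(db, backup_dir):
    """Return the T-SQL statements for the preparation steps, in order.

    db and backup_dir are hypothetical inputs chosen for illustration.
    """
    full_backup = backup_dir + "\\" + db + "_full.bak"
    log_backup = backup_dir + "\\" + db + "_log.trn"

    # Statements to run on the principal server.
    principal = [
        "ALTER DATABASE [%s] SET RECOVERY FULL;" % db,
        "BACKUP DATABASE [%s] TO DISK = N'%s' WITH INIT;" % (db, full_backup),
        "BACKUP LOG [%s] TO DISK = N'%s' WITH INIT;" % (db, log_backup),
        # Check that the backup file is readable before copying it.
        "RESTORE VERIFYONLY FROM DISK = N'%s';" % full_backup,
    ]

    # Statements to run on the mirror server after copying the files.
    # NORECOVERY leaves the database in a restoring state so that
    # further log records can still be applied.
    mirror = [
        "RESTORE DATABASE [%s] FROM DISK = N'%s' WITH NORECOVERY;" % (db, full_backup),
        "RESTORE LOG [%s] FROM DISK = N'%s' WITH NORECOVERY;" % (db, log_backup),
    ]
    return {"principal": principal, "mirror": mirror}


if __name__ == "__main__":
    script = mirroring_prep_script("Sales", r"D:\Backup")
    for server, statements in script.items():
        print("-- run on the %s server" % server)
        for statement in statements:
            print(statement)
```

Generating the statements as text keeps the order of the steps explicit and lets you review the script before executing it in a query window.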


Packt

06 Feb 2015

12 min read

Qlik Sense's Vision


In this article by Christopher Ilacqua, Henric Cronström, and James Richardson, authors of the book Learning Qlik® Sense, we will look at the evolving requirements that compel organizations to readdress how they deliver business intelligence and support data-driven decision-making. This is important as it supplies some of the reasons why Qlik® Sense is relevant and important to their success. The purpose of covering these factors is so that you can consider and plan for them in your organization. Among other things, in this article, we will cover the following topics:

- The ongoing data explosion
- The rise of in-memory processing
- Barrierless BI through Human-Computer Interaction
- The consumerization of BI and the rise of self-service
- The use of information as an asset
- The changing role of IT

Evolving market factors

Technologies are developed and evolved in response to the needs of the environment they are created and used within. The most successful new technologies anticipate upcoming changes in order to help people take advantage of altered circumstances or reimagine how things are done. Any market is defined by both the suppliers (in this case, Qlik®) and the buyers, that is, the people who want to get more use and value from their information. Buyers' wants and needs are driven by a variety of macro and micro factors, and these are always in flux, in some markets more than others. This is clearly the case in the world of data, BI, and analytics, which has been changing at a great pace due to a number of factors discussed in the rest of this article. Qlik Sense has been designed to be the means through which organizations and the people that are a part of them thrive in a changed environment.

Big, big, and even bigger data

A key factor is that there's simply much more data in many forms to analyze than before. We're in the middle of an ongoing, accelerating data boom. 
According to Science Daily, 90 percent of the world's data was generated over the past two years. The fact is that with technologies such as Hadoop and NoSQL databases, we now have unprecedented access to cost-effective data storage. With vast amounts of data now storable and available for analysis, people need a way to sort the signal from the noise. People from a wider variety of roles, not all of them BI users or business analysts, are demanding better, greater access to data, regardless of where it comes from. Qlik Sense's fundamental design centers on bringing varied data together for exploration in an easy and powerful way.

The slow spinning down of the disk

At the same time, we are seeing a shift in how computation occurs and, potentially, how information is managed. Fundamentals of the computing architectures that we've used for decades, the spinning disk and moving read head, are becoming outmoded. This means of storing and accessing data has been around since Edison invented the cylinder phonograph in 1877. It's about time this changed. This technology has served us very well; it was elegant and reliable, but it has limitations, primarily speed limitations. Fundamentals that we take for granted today in BI, such as relational and multidimensional storage models, were built around these limitations. So were our IT skills, whether or not we realized it at the time. With the use of in-memory processing and 64-bit addressable memory spaces, these limitations are gone! This means a complete change in how we think about analysis. Processing data in memory means we can do analysis that was impractical or impossible with the old approach. With in-memory computing, analysis that would have taken days before now takes just seconds (or much less). Why does it matter? Because it allows us to use time more effectively; after all, time is the most finite resource of all. 
In-memory computing enables us to ask more questions, test more scenarios, do more experiments, debunk more hypotheses, explore more data, and run more simulations in the short window available to us. For IT, it means no longer trying to second-guess what users will do months or years in advance and trying to premodel it in order to achieve acceptable response times. People hate watching the hourglass spin. Qlik Sense's predecessor QlikView® was built on the exploitation of in-memory processing; Qlik Sense has it at its core too. Ubiquitous computing and the Internet of Things You may know that more than a billion people use Facebook, but did you know that the majority of those people do so from a mobile device? The growth in the number of devices connected to the Internet is absolutely astonishing. According to Cisco's Zettabyte Era report, Internet traffic from wireless devices will exceed traffic from wired devices in 2014. If we were writing this article even as recently as a year ago, we'd probably be talking about mobile BI as a separate thing from desktop or laptop delivered analytics. The fact of the matter is that we've quickly gone beyond that. For many people now, the most common way to use technology is on a mobile device, and they expect the kind of experience they've become used to on their iOS or Android device to be mirrored in complex software, such as the technology they use for visual discovery and analytics. From its inception, Qlik Sense has had mobile usage in the center of its design ethos. It's the first data discovery software to be built for mobiles, and that's evident in how it uses HTML5 to automatically render output for the device being used, whatever it is. Plug in a laptop running Qlik Sense to a 70-inch OLED TV and the visual output is resized and re-expressed to optimize the new form factor. So mobile is the new normal. This may be astonishing but it's just the beginning. 
Mobile technology isn't just a medium to deliver information to people, but an acceleration of data production for analysis too. By 2020, pretty much everyone and an increasing number of things will be connected to the Internet. There are 7 billion people on the planet today. Intel predicts that by 2020, more than 31 billion devices will be connected to the Internet. So, that's not just devices used by people directly to consume or share information. More and more things will be put online and communicate their state: cars, fridges, lampposts, shoes, rubbish bins, pets, plants, heating systems, you name it. These devices will generate a huge amount of data from sensors that monitor all kinds of measurable attributes: temperature, velocity, direction, orientation, and time. This means an increasing opportunity to understand a huge gamut of data, but without the right technology and approaches it will be complex to analyze what is going on. Old methods of analysis won't work, as they don't move quickly enough. The variety and volume of information that can be analyzed will explode at an exponential rate. The rise of this type of big data makes us redefine how we build, deliver, and even promote analytics. It is an opportunity for those organizations that can exploit it through analysis; this can sort the signals from the noise and make sense of the patterns in the data. Qlik Sense is designed as just such a signal booster; it lets users zoom and pan through information that would otherwise be too large for them to easily understand.

Unbound Human-Computer Interaction

We touched on the boundary between the computing power and the humans using it in the previous section. Increasingly, we're removing barriers between humans and technology. Take the rise of touch devices. Users don't want to just view data presented to them in a static form. Instead, they want to "feel" the data and interact with it. The same is increasingly true of BI. 
The adoption of BI tools has been too low because the technology has been hard to use: in the past, BI tools often required people to conform to the tool's way of working, rather than reflecting the user's way of thinking. The aspiration for Qlik Sense (when part of the QlikView.Next project) was that the software should be both "gorgeous and genius". The genius part obviously refers to the built-in intelligence, the smarts, the software will have. The gorgeous part is often misunderstood or at least oversimplified. Yes, it means cosmetically attractive (which is important) but much more importantly, it means enjoyable to use and experience. In other words, Qlik Sense should never be jarring to users but seamless, perhaps almost transparent to them, inducing a state of mental flow that encourages thinking about the question being considered rather than the tool used to answer it. The aim was to be of most value to people. Qlik Sense will empower users to explore their data and uncover hidden insights, naturally.

Evolving customer requirements

It is not only the external market drivers that impact how we use information. Our organizations and the people that work within them are also changing in their attitude towards technology, how they express ideas through data, and how increasingly they make use of data as a competitive weapon.

Consumerization of BI and the rise of self-service

The consumerization of any technology space is all about how enterprises are affected by, and can take advantage of, new technologies and models that originate and develop in the consumer market, rather than in the enterprise IT sector. The reality is that individuals react quicker than enterprises to changes in technology. As such, consumerization cannot be stopped, nor is it something to be adopted. It can, however, be embraced. While it's not viable to build a BI strategy around consumerization alone, its impact must be considered. 
Consumerization makes itself felt in three areas: Technology: Most investment in innovation occurs in the consumer space first, with enterprise vendors incorporating consumer-derived features after the fact. (Think about how vendors added the browser as a UI for business software applications.) Economics: Consumer offerings are often less expensive or free (to try) with a low barrier of entry. This drives prices down, including enterprise sectors, and alters selection behavior. People: Demographics, which is the flow of Millennial Generation into the workplace, and the blurring of home/work boundaries and roles, which may be seen from a traditional IT perspective as rogue users, with demands to BYOPC or device. In line with consumerization, BI users want to be able to pick up and just use the technology to create and share engaging solutions; they don't want to read the manual. This places a high degree of importance on the Human-Computer Interaction (HCI) aspects of a BI product (refer to the preceding list) and governed access to information and deployment design. Add mobility to this and you get a brand new sourcing and adoption dynamic in BI, one that Qlik engendered, and Qlik Sense is designed to take advantage of. Think about how Qlik Sense Desktop was made available as a freemium offer. Information as an asset and differentiator As times change, so do differentiators. For example, car manufacturers in the 1980s differentiated themselves based on reliability, making sure their cars started every single time. Today, we expect that our cars will start; reliability is now a commodity. The same is true for ERP systems. Originally, companies implemented ERPs to improve reliability, but in today's post-ERP world, companies are shifting to differentiating their businesses based on information. This means our focus changes from apps to analytics. 
And analytics apps, like those delivered by Qlik Sense, help companies access the data they need to set themselves apart from the competition. However, to get maximum return from information, the analysis must be delivered fast enough, and in sync with the operational tempo people need. Things are speeding up all the time. For example, take the fashion industry. Large mainstream fashion retailers used to work two seasons per year. Those that stuck to that were destroyed by fast fashion retailers. The same is true for old style, system-of-record BI tools; they just can't cope with today's demands for speed and agility. The rise of information activism A new, tech-savvy generation is entering the workforce, and their expectations are different than those of past generations. The Beloit College Mindset List for the entering class of 2017 gives the perspective of students entering college this year, how they see the world, and the reality they've known all their lives. For this year's freshman class, Java has never been just a cup of coffee and a tablet is no longer something you take in the morning. This new generation of workers grew up with the Internet and is less likely to be passive with data. They bring their own devices everywhere they go, and expect it to be easy to mash-up data, communicate, and collaborate with their peers. The evolution and elevation of the role of IT We've all read about how the role of IT is changing, and the question CIOs today must ask themselves is: "How do we drive innovation?". IT must transform from being gatekeepers (doers) to storekeepers (enablers), providing business users with self-service tools they need to be successful. However, to achieve this transformation, they need to stock helpful tools and provide consumable information products or apps. Qlik Sense is a key part of the armory that IT needs to provide to be successful in this transformation. 
Summary In this article, we looked at the factors that provide the wider context for the use of Qlik Sense. The factors covered arise out of both increasing technical capability and demands to compete in a globalized, information-centric world, where out-analyzing your competitors is a key success factor.


Packt

28 Sep 2011

4 min read

SAP Netweaver: Accessing the MDM System


Accessing an MDM server involves mounting and unmounting operations, which we discuss in the following section.

Mounting and unmounting an MDM server

MDM server installations are accessible in the console only after they have been mounted. Multiple servers can be mounted within a single console session, and we can choose to mount only those servers that need to be accessed. The server may or may not be in a running state when mounted in your console session. No password is required to mount a server in your console session, even if it is password protected.

The MDM Console provides the option of saving the list of currently mounted servers to an MDM Console Settings file. We can load this settings file in a console session and automatically get the previously saved list of servers mounted as a group.

An MDM server can be mounted by multiple MDM Consoles. Once an MDM server is started from any console, it runs on the machine where it is installed and is seen as running by all MDM Consoles that have mounted it.

We can mount an MDM server as follows:

1. Right-click on the root node (SAP MDM Servers) in the hierarchy pane tree and choose Mount MDM Server… from the context menu. Alternatively, you may select the root node (SAP MDM Servers) and choose MDM Servers | Mount MDM Server… from the main menu.
2. MDM opens the Mount MDM Server dialog, prompting for the MDM Server input.
3. In the drop-down list input region displaying the text Enter or select an MDM Server, type the name of the MDM server you want mounted (typically the name of the machine on which the server is running) or select it from the drop-down list. Alternatively (for non-Windows installations), type the name or IP address of any remote machine into the edit box in the Mount MDM Server dialog.
4. Click on the OK button.

The drop-down list of MDM Servers shows only those servers that you have previously mounted. If a specific server is not in the list, click on the … (Browse) button to open the Select MDM Server dialog and select the machine on which the MDM Server has been installed from the list of Windows machines visible on the local network.

On successful mounting of the MDM server, you will see a new server node added in the tree structure of the hierarchy pane. Depending on the state of the MDM server, the corresponding icon is displayed in the tree node. The server node can show one of three status icons, one for each of the following states:

- MDM server is stopped
- MDM server is running
- MDM server is in one of the following states*: Server Inaccessible, Communication Error, Start Server Failed, Invalid

If the MDM server is inaccessible via the console even after the server has been started, you can try unmounting and remounting the MDM server in the console to restore connectivity.

Next, we see how to unmount an already mounted MDM server:

1. In the hierarchy tree, right-click on the MDM server that you want to unmount and choose Unmount MDM Server from the context menu. Alternatively, you may unmount the server by first selecting its node in the tree and then choosing MDM Servers | Unmount MDM Server from the main menu.
2. Unmounting an MDM server is also possible from the MDM Servers pane (top-right) when the root node (SAP MDM Servers) is selected in the hierarchy tree. Right-click on the MDM Server in the objects pane and select Unmount MDM Server from the context menu.
3. The MDM server node disappears from the tree in the hierarchy pane.

Unmounting an MDM server while it is still running keeps its MDM repositories mounted and loaded, even though the unmounted server is disconnected from the console session.

Unmounting and then re-mounting an MDM server within the same MDM Console session requires the MDM server's password to be re-entered before any server-level operations (like starting and stopping the server) can be performed.


Packt

05 Feb 2010

4 min read

N-Way Replication in Oracle 11g Streams: Part 1


N-way replication refers to a Streams environment where there are multiple sources. In this article, we will still use the STRM1 and STRM2 databases but with a little twist: making both databases the source. By making both STRM1 and STRM2 sources, we first need to consider a couple of unique situations and do a little more pre-planning, specifically for N-way replication. The concepts and techniques used to configure a 2-way replication can then be used to scale to N-way replication. We all need to crawl before we run; the better you crawl (understand) this article, the easier it will be to scale up to N-way replication. Pay close attention and learn the technique so that you can implement it well.

We need to repeat this: Streams is not Failover.

We need to repeat this: Streams is not Failover.

No, that is not a typo. The authors are passionate about Streams and want to see you successfully implement it. To successfully implement Streams, you need to know not to step into the trap of using it for Failover. Both authors have done some work where Failover was the requirement. Streams is not a Failover solution. Failover is handled by Oracle Data Guard, NOT Oracle Streams. Streams is about distributing the data to multiple locations. On more than one occasion, Streams has been used as a Failover technology simply because it can distribute data to multiple locations. Do not fall into the trap of using the wrong tool for the job. Streams distributes (replicates) data. As such, there will always be some difference between the databases in a Streams environment. All replication technology has this problem. The only time all of the databases are in sync is when there is no activity and all replication has been applied to all target locations. If you need Failover, then use the proper tool. Oracle Data Guard is for Failover. 
It has the necessary processes to guarantee a different level of failover from a primary site to a secondary site, whereas Streams is a Replication tool that distributes data. Just remember the following when a discussion of Replication and Failover comes up:

- Streams distributes data; it is built for replication
- Data Guard is built for Failover

Pre-planning for N-way replication

When we set up N-way replication, we must consider the possibility of a collision of data. Since we have multiple sources of data, it is possible for the exact same data to be entered on any or all of the sources at the exact same time. When this happens, it is a conflict. This example is just one type of conflict that can happen in N-way replication environments. The types of conflict that can occur are as follows:

- Update conflict: When transactions from different databases try to update the same row at nearly the same time.
- Delete conflict: When one transaction deletes a row and the next transaction tries to update or delete the same row. The transactions originate from different databases.
- Unique conflict: When transactions from different databases violate a primary or unique constraint. The first transaction is accepted; the second transaction raises the conflict.
- Foreign key conflict: When a transaction from a source tries to insert a child record before the parent record exists.

The good news is that Oracle has provided built-in conflict resolution in Streams that solves the most common situations. The built-in solutions are as follows:

- OVERWRITE
- DISCARD
- MAXIMUM
- MINIMUM

We will provide an example of conflict resolution after we build our N-way replication. In our case, we will use MAXIMUM. As part of the pre-planning for N-way replication, we highly suggest creating a simple table such as the Setup Table.
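To make the four built-in methods listed above more concrete, here is a minimal Python sketch of how each one decides between the row already at the destination and the incoming change during an update conflict. It is a simplified model for illustration, not Oracle code: the function `resolve_conflict`, the row dictionaries, and the `last_modified` resolution column are assumptions. In a real Streams environment the handler is registered in the database (for example, through the DBMS_APPLY_ADM.SET_UPDATE_CONFLICT_HANDLER procedure) rather than written by hand.

```python
def resolve_conflict(method, current_row, incoming_row, resolution_column):
    """Return the row that survives an update conflict.

    current_row  -- the row as it exists at the destination (a dict)
    incoming_row -- the row carried by the replicated change (a dict)
    """
    if method == "OVERWRITE":
        # The incoming change always replaces the destination row.
        return incoming_row
    if method == "DISCARD":
        # The incoming change is ignored; the destination row is kept.
        return current_row
    if method == "MAXIMUM":
        # Keep whichever row has the greater value in the resolution column.
        if incoming_row[resolution_column] > current_row[resolution_column]:
            return incoming_row
        return current_row
    if method == "MINIMUM":
        # Keep whichever row has the smaller value in the resolution column.
        if incoming_row[resolution_column] < current_row[resolution_column]:
            return incoming_row
        return current_row
    raise ValueError("unknown conflict resolution method: " + method)


# Hypothetical rows: the same record updated on STRM1 and STRM2.
at_destination = {"id": 7, "name": "Widget", "last_modified": 1005}
from_source = {"id": 7, "name": "Widget v2", "last_modified": 1009}

winner = resolve_conflict("MAXIMUM", at_destination, from_source, "last_modified")
print(winner["name"])
```

With a time-based resolution column, MAXIMUM amounts to "the latest change wins", which is why a column such as a modification timestamp is a common choice for it.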
Avoiding Conflict

As conflict requires additional pre-planning and configuration, one begins to wonder, "Are there techniques so that we can configure N-way replication without the possibility of conflict?" The simple answer to the question is "Yes". The not-so-simple answer is that there is some configuration magic that needs to be done, and the devil is in the details. Limiting who and what can be updated is one method of avoiding conflict. Think of it this way: there is no conflict if we agree on who can update which specific data. User 1 can only update his specific data and no one else can do that. Similarly, user 2 can only update his specific data. So, user 1 and user 2 can never cause a conflict. This may be a little bit difficult, depending on the application. One way to implement it is with offset sequences: one sequence produces only odd values, and another produces only even values, so keys generated at different sites never collide. We could also use a combination of a sequence and some unique characteristic of the database.
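The offset-sequence idea can be illustrated with a short Python sketch. It is only a model of the numbering scheme, with the helper name `offset_sequence` chosen for this example; in the database itself, the equivalent would be sequences created with different START WITH values and a shared INCREMENT BY equal to the number of sites.

```python
from itertools import count, islice


def offset_sequence(site_index, site_count):
    """Return an endless key generator for one site.

    site_index -- 0 for the first site, 1 for the second, and so on
    site_count -- total number of sites generating keys
    Each site starts at a different offset and steps by the number of
    sites, so no two sites ever produce the same value.
    """
    return count(start=site_index + 1, step=site_count)


# Two sites, as with STRM1 and STRM2: one gets odd keys, the other even keys.
strm1_keys = list(islice(offset_sequence(0, 2), 5))
strm2_keys = list(islice(offset_sequence(1, 2), 5))

print(strm1_keys)  # [1, 3, 5, 7, 9]
print(strm2_keys)  # [2, 4, 6, 8, 10]
```

The same function scales beyond two sites: with three sites the step becomes 3 and the starting offsets are 1, 2, and 3. The trade-off is that the number of sites has to be planned up front, because changing the step later would require reworking the existing sequences.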


Packt

02 Mar 2010

4 min read

Building Your First iReport


So let's get on with it!

Creating a connection/data source

Before creating the connection, a database should be set up. The SQL script for the database used for creating reports can be downloaded from the Packt website. Now, we are going to create a connection/data source in iReport and build our first report in some easy-to-follow steps.

You need to create the connection/data source just once, before developing the first report. This connection will be reused for the following reports.

1. Start iReport.
2. Press the Report Datasources button in the toolbar. A dialog box listing the existing connections appears.
3. Press the New button. Another dialog box will appear for selecting the data source type. There are several types to choose from, according to your requirement. For now, choose Database JDBC connection, and press Next >.
4. Another dialog box will appear to set up the Database JDBC connection properties. Give a sensible name to the connection. In this case, it is inventory.
5. Choose the JDBC Driver from the list, according to your connection type and/or your database. In this case, it is MySQL (com.mysql.jdbc.Driver).
6. Write the JDBC URL, according to the driver you have chosen. For this tutorial, it is jdbc:mysql://localhost/inventory. In this URL, jdbc is the connection protocol, mysql is the subprotocol, localhost is the MySQL server (if it runs on the same computer), and inventory is the database name.
7. Enter the Username and Password. Generally, for a MySQL server, the username is root, and the password is the one you set during the installation of the MySQL server.
8. Press Test to confirm that you have set all the properties correctly. If all the settings are correct, then you will see a message that says Connection test successful!.
9. You can save the password by checking the Save Password checkbox, but be warned that iReport stores passwords in clear text. Storing passwords in clear text is a bad thing for us, isn't it? If you do not specify a password now, iReport will ask you for one only when required and will not save it.
10. Now save the connection. You will see that the newly created connection is listed in the Connections/Datasources window. If you have more than one connection, then you can set one as the default connection. In order to do this, select the connection and press Set as Default.

When we execute the report with an active connection, the report is filled with data from the database or other data sources. We can also see the report output with an empty data source, which has, by default, a single record with all fields set to null. An empty data source is used to print a static report. However, in order to choose the tables and columns from a database automatically using the Report Wizard, we need to connect to a database/data source first. To do this, we must create a connection/data source.

Building your first report

Having set up a connection, we are ready to build our first report. We will keep it very simple, just to become familiar with the steps required for building a report. We will create a report that lists all the products; that is, we will show all the rows of the product table of our database. Follow the steps listed and build your first report:

1. Go to the File menu and click New…. A dialog box listing the available report templates appears.
2. From the list of Report templates, select Simple Blue and press Launch Report Wizard.
3. Enter Report name as List of Products and press Next >.
4. Now you will specify the query to retrieve the report fields. Select your connection from the Connections / Data Sources drop-down list.
5. Write the SQL query for the report you want to develop. In our case, it is SELECT ProductCode, Name, Description FROM Product.
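The breakdown of the JDBC URL given in the connection steps (protocol, subprotocol, server, and database name) can be checked with a few lines of Python. This is just an illustrative sketch: the helper `parse_jdbc_url` is a name made up for this example, it only handles URLs shaped like the MySQL one used in this tutorial, and it is not part of iReport or of any JDBC driver.

```python
from urllib.parse import urlsplit


def parse_jdbc_url(url):
    """Split a URL such as jdbc:mysql://localhost/inventory into its parts."""
    # Everything before the first colon is the connection protocol ("jdbc").
    protocol, rest = url.split(":", 1)
    # The remainder looks like an ordinary URL: mysql://localhost/inventory
    parts = urlsplit(rest)
    return {
        "protocol": protocol,
        "subprotocol": parts.scheme,         # database type, e.g. mysql
        "host": parts.hostname,              # machine running the server
        "port": parts.port,                  # None when the URL omits it
        "database": parts.path.lstrip("/"),  # database (schema) name
    }


info = parse_jdbc_url("jdbc:mysql://localhost/inventory")
print(info)
```

For the tutorial URL, the host is localhost and the database is inventory; the port is reported as None because the URL does not specify one, in which case the driver falls back to its default.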


Packt

01 Oct 2009

6 min read

Designing the Target Structure in Oracle Warehouse Builder 11g


We have all our source structures defined in the Warehouse Builder. But before we can do anything with them, we need to design what our target data warehouse structure is going to look like. When we have that figured out, we can start mapping data from the source to the target. So, let's design our target structure. First, we're going to take a look at some design topics related to a data warehouse that are different from what we would use if we were designing a regular relational database.

Data Warehouse Design

When it comes to the design of a data warehouse, there is basically one option that makes the most sense for how we will structure our database and that is the dimensional model. This is a way of looking at the data from a business perspective that makes the data simple, understandable, and easy to query for the business end user. It doesn't require a database administrator to be able to retrieve data from it.

We know the normalized method of modelling a database. A normalized model removes redundancies in data by storing information in discrete tables, and then referencing those tables when needed. This has an advantage for a transactional system because information needs to be entered at only one place in the database, without duplicating any information already entered. For example, in the ACME Toys and Gizmos transactional database, each time a transaction is recorded for the sale of an item at a register, a record needs to be added only to the transactions table. The details that identify the register, the item, and the employee who processed the transaction do not need to be entered in that table because that information is already stored in separate tables. The main transaction record just needs to be entered with references to all that other information.

This works extremely well for a transactional type of system concerned with daily operational processing where the focus is on getting data into the system. However, it does not work well for a data warehouse whose focus is on getting data out of the system. Users do not want to navigate through the spider web of tables that compose a normalized database model to extract the information they need. Therefore, dimensional models were introduced to provide the end user with a flattened structure of easily queried tables that he or she can understand from a business perspective.

Dimensional Design

A dimensional model takes the business rules of our organization and represents them in the database in a more understandable way. A business manager looking at sales data is naturally going to think more along the lines of "how many gizmos did I sell last month in all stores in the south and how does that compare to how many I sold in the same month last year?" Managers just want to know what the result is, and don't want to worry about how many tables need to be joined in a complex query to get that result. A dimensional model removes the complexity and represents the data in a way that end users can relate to more easily from a business perspective.

Users can intuitively think of the data for the above question as a cube, with the edges (or dimensions) of the cube labeled as stores, products, and time frame. So let's take a look at this concept of a cube with dimensions, and how we can use that to represent our data.

Cube and Dimensions

The dimensions become the business characteristics about the sales, for example:

- A time dimension: users can look back in time and check various time periods
- A store dimension: information can be retrieved by store and location
- A product dimension: various products for sale can be broken out

Think of the dimensions as the edges of a cube, and the intersection of the dimensions as the measure we are interested in for that particular combination of time, store, and product. A picture is worth a thousand words, so let's look at what we're talking about in the following image:

Notice what this cube looks like. How about a Rubik's Cube? We're doing a data warehouse for a toy store company, so we ought to know what a Rubik's Cube is! If you have one, maybe you should go get it now because that will exactly model what we're talking about.

Think of the width of the cube, or a row going across, as the product dimension. Every piece of information or measure in the same row refers to the same product, so there are as many rows in the cube as there are products. Think of the height of the cube, or a column going up and down, as the store dimension. Every piece of information in a column represents one single store, so there are as many columns as there are stores. Finally, think of the depth of the cube as the time dimension, so any piece of information in the rows and columns at the same depth represents the same point in time. The intersection of each of these three dimensions locates a single individual cube in the big cube, and that represents the measure amount we're interested in. In this case, it's dollar sales for a single product in a single store at a single point in time.

But one might wonder if we are restricted to just three dimensions with this model. After all, a cube has only three dimensions: length, width, and depth. Well, the answer is no. We can have many more dimensions than just three. In our ACME example, we might want to know the sales each employee has accomplished for the day. This would mean we would need a fourth dimension for employees. But what about our visualization above using a cube? How is this fourth dimension going to be modelled? And no, the answer is not that we're entering the Twilight Zone here with that "dimension not only of sight and sound but of mind..." We can think of additional dimensions as being cubes within a cube. 
If we think of an individual intersection of the three dimensions of the cube as being another cube, we can see that we've just opened up another three dimensions to use—the three for that inner cube. The Rubik's Cube example used above is good because it is literally a cube of cubes and illustrates exactly what we're talking about. We do not need to model additional cubes. The concept of cubes within cubes was just to provide a way to visualize further dimensions. We just model our main cube, add as many dimensions as we need to describe the measures, and leave it for the implementation to handle. This is a very intuitive way for users to look at the design of the data warehouse. When it's implemented in a database, it becomes easy for users to query the information from it.
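To make the cube idea concrete, here is a minimal Python sketch, not taken from the Warehouse Builder itself. It stores each measure under a key made of its dimension values and sums over the dimensions we do not care about. The product, store, and month values, as well as the `roll_up` helper, are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample measures: (product, store, month) -> dollar sales.
# Each key is one small cube inside the big cube.
sales_cube = {
    ("gizmo", "south-1", "2008-01"): 120.0,
    ("gizmo", "south-2", "2008-01"): 80.0,
    ("gizmo", "north-1", "2008-01"): 50.0,
    ("widget", "south-1", "2008-01"): 30.0,
    ("gizmo", "south-1", "2008-02"): 100.0,
}

# Order of the dimensions inside each key (an assumption of this sketch).
DIMENSIONS = ("product", "store", "month")


def roll_up(cube, keep):
    """Sum the measure over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for key, amount in cube.items():
        totals[tuple(key[i] for i in positions)] += amount
    return dict(totals)


# Sales per product per month, across all stores.
by_product_month = roll_up(sales_cube, ("product", "month"))
print(by_product_month)
```

In this representation, a fourth dimension such as employee would simply be one more element in each key, which mirrors the point that we are not limited to three dimensions.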



Packt

14 Aug 2017

10 min read

Introduction to the Latest Social Media Landscape and Importance


In this article, Siddhartha Chatterjee and Michal Krystyanczuk, authors of the book Python Social Media Analytics, start with a question for you: have you seen the movie The Social Network? If you have not, it could be a good idea to see it before you read this. If you have, you may have seen the success story of Mark Zuckerberg and his company, Facebook. This was possible due to the power of the platform in connecting, enabling, sharing, and impacting the lives of almost two billion people on this planet. The earliest social networks existed as far back as 1995, such as Yahoo (Geocities), theglobe.com, and tripod.com. These platforms mainly facilitated interaction among people through chat rooms. It was only at the end of the 90s that user profiles became the in thing on social networking platforms, making information about people discoverable and therefore giving users the choice to make friends or not. Among those embracing this new approach were Makeoutclub, Friendster, SixDegrees.com, and so on. MySpace, LinkedIn, and Orkut were created thereafter, and social networks were on the verge of becoming mainstream. However, the biggest impact came with the creation of Facebook in 2004, a total game changer for people's lives, business, and the world. The sophistication and ease of use of the platform made it a mainstream medium for individuals and companies to advertise and sell their ideas and products. Hence, we are in the age of social media that has changed the way the world functions. In the last few years, there have been new entrants in social media whose interaction models differ essentially from those of Facebook, LinkedIn, or Twitter. These include Pinterest, Instagram, Tinder, and others. An interesting example is Pinterest, which, unlike Facebook, is centered not around people but around interests and topics. It is essentially able to structure people based on their interest in these topics.
The CEO of Pinterest describes it as a catalog of ideas. Forums, which are not considered regular social networks in the way that Facebook, Twitter, and others are, are also very important social platforms. Unlike on Twitter or Facebook, forum users are often anonymous, which enables them to have in-depth conversations with communities. Other non-typical social networks are video sharing platforms, such as YouTube and Dailymotion. They are non-typical because they are centered around user-generated content, and their social nature comes from the sharing of this content on various social networks and from the discussion it generates in user comments. Social media is gradually shifting from being platform-centric to being centered on experiences and features. In the future, we'll see more and more traditional content providers and services becoming social in nature through sharing and conversations. The term social media today includes not just social networks but every service that's social in nature and has a wide audience.

Delving into Social Data

The data acquired from social media is called social data. Social data exists in many forms. It can be information about the users of social networks, such as name, city, interests, and so on. Data of this type, which is numeric or quantifiable, is known as structured data. However, since social media are platforms for expression, a lot of the data is in the form of text, images, videos, and so on. These sources are rich in information, but not as direct to analyze as the structured data described earlier. Data of this type is known as unstructured data. The process of applying rigorous methods to make sense of social data is called social data analytics. We will go into great depth in social data analytics to demonstrate how we can extract valuable meaning and information from these really interesting sources of social data.
Since there are almost no restrictions on social media, there are a lot of meaningless accounts, content, and interactions. So, the data coming out of these streams is quite noisy and polluted. Hence, a lot of effort is required to separate the information from the noise. Once the data is cleaned and we are focused on the most important and interesting aspects, we then require various statistical and algorithmic methods to make sense of the filtered data and draw meaningful conclusions.

Understanding the process

Once you are familiar with the topic of social media data, let us proceed to the next phase. The first step is to understand the process involved in exploiting the data present on social networks. Proper execution of the process, with attention to small details, is the key to good results. In many computer science domains, a small error in code leads to a visible or at least correctable malfunction, but in data science it can silently produce entirely wrong results, which in turn lead to incorrect conclusions. The very first step of data analysis is always problem definition. Understanding the problem is crucial for choosing the right data sources and methods of analysis. It also helps us realize what kind of information and conclusions we can infer from the data and what is impossible to derive. This part is very often underestimated, although it is key to successful data analysis. Any question that we try to answer in a data science project has to be very precise. Some people tend to ask very generic questions, such as "I want to find trends on Twitter". This is not a correct problem definition, and an analysis based on such a statement can fail to find relevant trends. A naïve analysis may return little more than repeated Twitter ads and content generated by bots. Moreover, it raises more questions than it answers. In order to approach the problem correctly, we have to ask in the first step: what is a trend? what is an interesting trend for us?
and what is the time scope? Once we answer these questions, we can break the problem up into multiple sub-problems: I'm looking for the most frequent consumer reactions about my brand on Twitter in English over the last week, and I want to know whether they were positive or negative. Such a problem definition will lead to a relevant, valuable analysis with insightful conclusions. The next part of the process consists of getting the right data according to the defined problem. Many social media platforms allow users to collect a lot of information in an automated way via APIs (Application Programming Interfaces), which is the easiest way to complete the task. Once the data is stored in a database, we perform the cleaning. This step requires a precise understanding of the project's goals. In many cases, it involves very basic tasks such as duplicate removal (for example, retweets on Twitter), or more sophisticated ones such as spam detection to remove irrelevant comments, language detection to perform linguistic analysis, or other statistical or machine learning approaches that can help to produce a clean dataset. When the data is ready to be analyzed, we have to choose the kind of analysis and structure the data accordingly. If our goal is to understand the sense of the conversations, a simple list of verbatims (textual data) is enough, but if we aim to analyze different variables, such as number of likes, dates, number of shares, and so on, the data should be combined in a structure such as a data frame, where each row corresponds to an observation and each column to a variable. The choice of analysis method depends on the objectives of the study and the type of data. It may require a statistical or machine learning approach, or a specific approach to time series. Different approaches will be explained using the examples of Facebook, Twitter, YouTube, GitHub, Pinterest, and forum data. Once the analysis is done, it's time to infer conclusions.
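The cleaning and structuring steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, their field names (`id`, `text`, `likes`, `shares`), and the rule that a retweet starts with "RT @" are all hypothetical, and a real project would more likely load the rows into a data frame library.

```python
# Hypothetical records, shaped loosely like posts collected through an API.
raw_posts = [
    {"id": "1", "text": "Love the new phone!", "likes": 10, "shares": 2},
    {"id": "2", "text": "RT @brand: Love the new phone!", "likes": 0, "shares": 0},
    {"id": "3", "text": "Battery life is poor", "likes": 4, "shares": 1},
    {"id": "3", "text": "Battery life is poor", "likes": 4, "shares": 1},
]


def clean(posts):
    """Drop repeated ids and retweets (assumed here to start with 'RT @')."""
    seen = set()
    kept = []
    for post in posts:
        if post["id"] in seen or post["text"].startswith("RT @"):
            continue
        seen.add(post["id"])
        kept.append(post)
    return kept


cleaned = clean(raw_posts)

# For understanding the conversations: a simple list of verbatims.
verbatims = [post["text"] for post in cleaned]

# For analysis across variables: one row per observation, one column per variable.
columns = ("id", "likes", "shares")
table = [tuple(post[name] for name in columns) for post in cleaned]

print(verbatims)
print(table)
```

The two outputs correspond to the two structures mentioned in the text: a list of texts for linguistic analysis, and a row-and-column table for quantitative analysis.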
We can derive conclusions based on the outputs from the models, but one of the most useful tools is visualization. Data and output can be presented in many different ways, from simple charts, plots, and diagrams, through more complex 2D charts, to multidimensional visualizations.

Project planning

Analysis of content on social media can get very confusing due to the difficulty of working with a large amount of data while also trying to make sense of it. For this reason, it's extremely important to ask the right questions at the beginning to get the right answers. Even though this is an exploratory approach, and getting exact answers may be difficult, the right questions allow you to define the scope, the process, and the time. The main questions that we will be working on are the following: What does Google post on Facebook? How do people react to Google's posts (likes, shares, and comments)? What does Google's audience say about Google and its ecosystem? What are the emotions expressed by Google's audience? With the preceding questions in mind, we will proceed to the next steps.

Scope and process

The analysis will consist of analyzing the feed of posts and comments on the official Facebook page of Google. The process of information extraction is organized as a data flow. It starts with data extraction from the API, followed by data preprocessing and wrangling, and then by a series of different analyses. The analysis becomes actionable only after the last step, the interpretation of results. To retrieve the above information, we need to do the following: Extract all the posts of Google permitted by the Facebook API. Extract the metadata for each post: timestamp, number of likes, number of shares, and number of comments.
Extract the user comments under each post, along with their metadata. Process the posts to retrieve the most common keywords, bi-grams, and hashtags. Process the user comments using Alchemy API to retrieve the emotions. Analyse the above information to derive conclusions.

Data type

The main part of information extraction comes from an analysis of textual data (posts and comments). However, in order to add quantitative and temporal dimensions, we also process numbers (likes, shares) and dates (date of creation).

Summary

The avalanche of social network data is a result of the communication platforms that have been developed over the last two decades. These platforms evolved from chat rooms to personal information sharing and, finally, to social and professional networks. Among many, Facebook, Twitter, Instagram, Pinterest, and LinkedIn have emerged as the modern-day social media. Collectively, these platforms reach more than a billion individuals across the world, who share their activities and interact with each other. The sharing of their data by these media through APIs and other technologies has given rise to a new field called social media analytics. It has multiple applications, such as marketing, personalized recommendations, research, and societal studies. Modern data science techniques such as machine learning and text mining are widely used for these applications. Python is one of the most widely used programming languages for these techniques. However, manipulating unstructured data from social networks requires a lot of precise processing and preparation before getting to the most interesting bits.

Resources for Article:

Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media for Wordpress: VIP Memberships [article] Social Media Insight Using Naive Bayes [article]



Packt

21 Aug 2015

13 min read

Identifying Big Data Evidence in Hadoop


In this article by Joe Sremack, author of the book Big Data Forensics, we will cover the following topics: an overview of how to identify Big Data forensic evidence, and techniques for previewing Hadoop data. Hadoop and other Big Data systems pose unique challenges to forensic investigators. Hadoop clusters are distributed systems with voluminous data storage, complex data processing, and data that is split and made redundant at the data block level. Unlike in traditional forensics, performing forensics on Hadoop using traditional methods is not always feasible. Instead, forensic investigators, experts, and legal professionals (such as attorneys and court officials) need to understand how forensics is performed against these complex systems. The first step in a forensic investigation is to identify the evidence. In this article, several of the concepts involved in identifying forensic evidence from Hadoop and its applications are covered. Identifying forensic evidence is a complex process for any type of investigation. It involves surveying a set of possible sources of evidence and determining which sources warrant collection. Data in any organization's systems is rarely well organized or documented. Investigators will need to take a set of investigation requirements and determine which data need to be collected. This requires a few first steps: properly reviewing system and data documentation, interviewing staff, locating backup and non-centralized data repositories, and previewing data. The process of identifying Big Data evidence is made difficult by the large volume of data, distributed filesystems, the numerous types of data, and the potential for large-scale redundancy in evidence. Big Data solutions are also unique in that evidence can reside in different layers.
Within Hadoop, evidence can take on multiple forms, from files stored in the Hadoop Distributed File System (HDFS) to data extracted from an application. To properly identify the evidence in Hadoop, multiple layers are examined. While all the data may reside in HDFS, the form may differ in a Hadoop application (for example, HBase), or the data may be more easily extracted to a viable format through HDFS using an application (such as Pig or Sqoop). Identifying Big Data evidence can also be complicated by redundancies caused by systems that input to or receive output from Big Data systems, and by archived systems that may have previously stored the evidence in the Big Data system. A primary goal of identifying evidence is to capture all relevant evidence while minimizing redundant information. Outsiders looking at a company's data needs may assume that identifying information is as simple as asking several individuals where the data resides. In reality, the process is much more complicated for a number of possible reasons: The organization may be an adverse party and cannot be trusted to provide reliable information about the data. The organization is large and no single person knows where all data is stored and what the contents of the data are. The organization is divided into business units with no two business units knowing what data the other one stores. Data is stored with a third-party data hosting provider. IT staff may know where data and systems reside, but only the business users know the type of content the data stores. For example, one might assume a pharmaceutical sales company would have an internal system structured with the following attributes: a division where the data is collected from a sales database; an HR department database containing employee compensation, performance, and retention information; a database of customer demographic information; and an accounting department database to assess what costs are associated with each sale. In such a system, that data is
then cleanly unified and compelling analyses are created to drive sales. In reality, an investigator will probably find that the Big Data sales system is actually comprised of a larger set of data that originates inside and outside the organization. There may be a collection of spreadsheets on sales employees' desktops and laptops, along with some of the older versions on backup tapes and file server shared folders. There may be a new Salesforce database implemented two years ago that is incomplete and is actually the replacement for a previous database, which was custom-developed and used by 75% of employees. A Hadoop instance running HBase for analysis receives a filtered set of data from social media feeds, the Salesforce database, and sales reports. All of these data sources may be managed by different teams, so identifying how to collect this information requires a series of steps to isolate the relevant information. The problem for large—or even midsize—companies is much more difficult than our pharmaceutical sales company example. Simply creating a map of every data source and the contents of those systems could require weeks of in-depth interviews with key business owners and staff. Several departments may have their own databases and Big Data solutions that may or may not be housed in a centralized repository. Backups for these systems could be located anywhere. Data retention policies will vary by department—and most likely by system. Data warehouses and other aggregators may contain important information that will not show themselves through normal interviews with staff. These data warehouses and aggregators may have previously generated reports that could serve as valuable reference points for future analysis; however, all data may not be available online, and some data may be inaccessible. In such cases, the company's data will most likely reside in off-site servers maintained by an outsourcing vendor, or worse, in a cloud-based solution. 
Big Data evidence can be intertwined with non-Big Data evidence. Email, document files, and other evidence can be extremely valuable for performing an investigation. The process for identifying Big Data evidence is very similar to the process for identifying other evidence, so the identification process described in this book can be carried out in conjunction with identifying other evidence. An important consideration for investigators to keep in mind is whether Big Data evidence should be collected (that is, determining whether it is relevant or if the same evidence can be collected more easily from other non-Big Data systems). Investigators must also consider whether evidence needs to be collected to meet the requirements of an investigation. The following figure illustrates the process for identifying Big Data evidence: Initial Steps The process for identifying evidence is: Examining requirements Examining the organization's system architecture Determining the kinds of data in each system Assessing which systems to collect In the book, Big Data Forensics, the topics of examining requirements and examining the organization's system architecture are covered in detail. The purpose of these two steps is to take the requirements of the investigation and match those to known data sources. From this, the investigator can begin to document which data sources should be examined and what types of data may be relevant. Assessing data viability Assessing the viability of data serves several purposes. It can: Allow the investigator to identify which data sources are potentially relevant Yield information that can corroborate the interview and documentation review information Highlight data limitations or gaps Provide the investigator with information to create a better data collection plan Up until this point in the investigation, the investigator has only gathered information about the data. 
Previewing and assessing samples of the data gives the investigator the chance to actually see what information is contained in the data and determine which data sources can meet the requirements of the investigation. Assessing the viability and relevance of data in a Big Data forensic investigation is different from that of a traditional digital forensic investigation. In a traditional digital forensic investigation, the data is typically not previewed out of fear of altering the data or metadata. With Big Data, however, the data can be previewed in some situations where metadata is not relevant or available. This factor opens up the opportunity for a forensic investigator to preview data when identifying which data should be collected. The main considerations for each source of data include the following: Data quality Data completeness Supporting documentation Validating the collected data Previous systems where the data resided How the data enter and leave the system The available formats for extraction How well the data meet the data requirements There are several methods for previewing data. The first is to review a data extract or the results of a query—or collect sample text files that are stored in Hadoop. This method allows the investigator to determine the types of information available and how the information is represented in the data. In highly complex systems consisting of thousands of data sources, this may not be feasible or requires a significant investment of time and effort. The second method is to review reports or canned query output that were derived from the data. Some Big Data solutions are designed with reporting applications connected to the Big Data system. These reports are a powerful tool, enabling an investigator to quickly gain an understanding of the contents of the system without requiring much up-front effort to gain access to the systems. 
Data retention policies and data purge schedules should be reviewed and considered in this step as well. Given the large volume of data involved, many organizations routinely purge data after a certain period of time. Data purging can mean the archival of data to near-line or offline storage, or it can mean the destruction of old data without backup. When data is archived, the investigator should also determine whether any of the data in near-line or offline backup media needs to be collected or if the live system data is sufficient. Regardless, the investigator should determine what the next purge cycle is and whether that necessitates an expedited collection to prevent loss of critical information. Additionally, the investigator should determine whether the organization should implement a litigation hold, which halts data purging during the investigation. When data is purged without backup, the investigator must determine how the purge affects the investigation, when the data needs to be collected, and whether supplemental data sources must be collected to account for the lost data (for example, reports previously created from the purged data or other systems that created or received the purged data).

Identifying HDFS evidence

HDFS evidence can be identified in a number of ways. In some cases, the investigator does not want to preview the data, in order to retain the integrity of the metadata. In other cases, the investigator is only interested in collecting a subset of the data. Limiting the data can be necessary when the data volume prohibits a complete collection or when forensically imaging the entire cluster is impossible. The primary methods for identifying HDFS evidence are to generate directory listings from the cluster and to review the total data volume on each of the nodes. Generating directory listings from the cluster is a straightforward process of accessing the cluster from a client and running the Hadoop directory listing command.
The cluster is accessed from a client by either directly logging in through a cluster node or by logging in from a remote machine. The command to print a directory listing is as follows:

# hdfs dfs -lsr /

This generates a recursive directory listing of all HDFS files starting from the root directory. This command produces the filenames, directories, permissions, and file sizes of all files. The output can be redirected to an output file, which should be saved to an external storage device for review.

Identifying Hive evidence

Hive evidence can be identified through HiveQL commands. The following table lists the commands that can be used to get a full listing of all databases and tables as well as the tables' formats:

Command                                  Description
SHOW DATABASES;                          Lists all available databases
SHOW TABLES;                             Lists all tables in the current database
USE databaseName;                        Makes databaseName the current database
DESCRIBE (FORMATTED|EXTENDED) table;     Lists the formatting details about the table

Identifying all tables and their formats requires iterating through every database and generating a list of tables and each table's formats. This process can be performed either manually or through an automated HiveQL script file. These commands do not provide information about database and table metadata (such as number of records and last modified date), but they do give a full listing of all available, online Hive data. HiveQL can also be used to preview the data using subset queries. The following example shows how to retrieve ten rows from a Hive table:

SELECT * FROM Table1 LIMIT 10

Identifying HBase evidence

HBase evidence is stored in tables, and identifying the names of the tables and the properties of each is important for data collection. HBase stores metadata information in the -ROOT- and .META. tables. These tables can be queried using HBase shell commands to identify the information about all tables in the HBase cluster.
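Returning to the HDFS directory listing saved for review earlier, the following Python sketch shows one way an investigator might summarize it offline. It assumes each listing line has eight whitespace-separated fields (permissions, replication, owner, group, size, date, time, path), which is how such listings commonly appear; verify this against the actual output before relying on it. The sample lines and the `parse_listing` helper are hypothetical.

```python
# Hypothetical sample of a saved recursive directory listing.
sample_listing = [
    "drwxr-xr-x   - hdfs supergroup          0 2015-06-01 10:15 /data",
    "-rw-r--r--   3 hdfs supergroup       2048 2015-06-01 10:16 /data/sales.csv",
    "-rw-r--r--   3 hdfs supergroup       1024 2015-06-02 09:00 /data/hr records.csv",
]


def parse_listing(lines):
    """Parse listing lines into dictionaries, skipping lines that do not fit."""
    entries = []
    for line in lines:
        # Split into at most 8 fields so that paths containing spaces survive.
        parts = line.split(None, 7)
        if len(parts) != 8:
            continue  # for example, a summary or blank line
        perms, _replication, owner, group, size, date, time, path = parts
        entries.append({
            "is_dir": perms.startswith("d"),
            "permissions": perms,
            "owner": owner,
            "group": group,
            "size": int(size),
            "modified": date + " " + time,
            "path": path,
        })
    return entries


entries = parse_listing(sample_listing)
total_file_bytes = sum(e["size"] for e in entries if not e["is_dir"])
print(len(entries), "entries,", total_file_bytes, "bytes in files")
```

A summary like this works on the saved output file rather than on the cluster, so it does not touch the evidence itself.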
Information about the HBase cluster can be gathered using the status command from the HBase shell:

status

This produces the following output:

2 servers, 0 dead, 1.5000 average load

For additional information about the names and locations of the servers (as well as the total disk sizes for the memstores and HFiles), the status command can be given the detailed parameter. The list command outputs every HBase table. The one table created in HBase, testTable, is shown via the following command:

list

This produces the following output:

TABLE
testTable
1 row(s) in 0.0370 seconds
=> ["testTable"]

Information about each table can be generated using the describe command:

describe 'testTable'

The following output is generated:

'testTable', {NAME => 'account', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'address', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.0300 seconds

The describe command yields several useful pieces of information about each table. Each of the column families is listed (here, account and address), and for each family, the encoding, the maximum number of cell versions kept (VERSIONS), and whether deleted cells are retained are also listed. Security information about each table can be gathered using the user_permission command as follows:

user_permission 'testTable'

This command is useful for identifying the users who currently have access to the table.
As mentioned before, user accounts are not as meaningful in Hadoop because of the distributed nature of Hadoop configurations, but in some cases, knowing who had access to tables can be tied back to system logs to identify individuals who accessed the system and data. Summary Hadoop evidence comes in many forms. The methods for identifying the evidence require the forensic investigator to understand the Hadoop architecture and the options for identifying the evidence within HDFS and Hadoop applications. In Big Data Forensics, these topics are covered in more depth—from the internals of Hadoop to conducting a collection of a distributed server. Resources for Article: Further resources on this subject: Introduction to Hadoop[article] Hadoop Monitoring and its aspects[article] Understanding Hadoop Backup and Recovery Needs [article]

Packt

18 Sep 2014

18 min read

Caches

In this article by Federico Razzoli, author of the book Mastering MariaDB, we will see how MariaDB and its storage engines use several caches to avoid accessing disks, and what a DBA should know about them.

InnoDB caches

Since InnoDB is the recommended engine for most use cases, configuring it is very important. The InnoDB buffer pool is a cache that should speed up most read and write operations. Thus, every DBA should know how it works. The doublewrite buffer is an important mechanism that guarantees that a page is never half-written to a file. For write-heavy workloads, we may want to disable it to obtain more speed.

InnoDB pages

Tables, data, and indexes are organized in pages, both in the caches and in the files. A page is a package of data that contains one or more rows and usually some empty space. The ratio between the used space and the total size of a page is called the fill factor. Changing the page size inevitably changes the fill factor. InnoDB tries to keep the pages 15/16 full. If a page's fill factor is lower than 1/2, InnoDB merges it with another page. If the rows are written sequentially, the fill factor should be about 15/16. If the rows are written randomly, the fill factor is between 1/2 and 15/16. A low fill factor wastes memory. With a very high fill factor, pages often need to be reorganized when they are updated and their content grows, which negatively affects performance.

Columns with a variable-length type (TEXT, BLOB, VARCHAR, or VARBINARY) can be written into separate data structures called overflow pages. Such columns are called off-page columns. They are better handled by the DYNAMIC row format, which can be used for most tables when backward compatibility is not a concern.

A page never changes its size, and the size is the same for all pages. The page size, however, is configurable: it can be 4 KB, 8 KB, or 16 KB.
The default size is 16 KB, which is appropriate for many workloads and optimizes full table scans. However, smaller sizes can improve performance for some OLTP workloads that involve many small insertions, because less memory is allocated, and for storage devices with smaller blocks (such as old SSD devices). Another reason to change the page size is that it can greatly affect InnoDB compression. The page size can be changed by setting the innodb_page_size variable in the configuration file and restarting the server.

The InnoDB buffer pool

On servers that mainly use InnoDB tables (the most common case), the buffer pool is the most important cache to consider. Ideally, it should contain all the InnoDB data and indexes to allow MariaDB to execute queries without accessing the disks. Changes to data are written into the buffer pool first. They are flushed to the disks later to reduce the number of I/O operations. Of course, if the data does not fit in the server's memory, only a subset of it can be in the buffer pool. In this case, that subset should be the so-called working set: the most frequently accessed data.

The default size of the buffer pool is 128 MB and should always be changed. On production servers, this value is too low. On a developer's computer, there is usually no need to dedicate so much memory to InnoDB. The minimum size, 5 MB, is usually more than enough when developing a simple application.

Old and new pages

We can think of the buffer pool as a list of data pages that are sorted with a variation of the classic Least Recently Used (LRU) algorithm. The list is split into two sublists: the new list contains the most used pages, and the old list contains the less used pages. The first page in each sublist is called the head. The head of the old list is called the midpoint. When a page that is not in the buffer pool is accessed, it is inserted at the midpoint. The other pages in the old list shift by one position, and the last one is evicted.
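The midpoint insertion described above, together with the promotion rules covered in the next paragraph, can be illustrated with a toy model. This is a minimal Python sketch, not InnoDB's actual implementation: the class name MidpointLRU and its methods are hypothetical, the only value taken from the text is the default 37 percent share of the old list, and the minimum-age rule of innodb_old_blocks_time is ignored.

```python
class MidpointLRU:
    """Toy model of a buffer pool split into a new and an old sublist."""

    def __init__(self, capacity, old_pct=37):
        # old_pct plays the role of innodb_old_blocks_pct:
        # the share of slots reserved for the old list.
        self.old_size = max(1, capacity * old_pct // 100)
        self.new_size = capacity - self.old_size
        self.new = []  # index 0 is the head of the new list
        self.old = []  # index 0 is the midpoint (head of the old list)

    def access(self, page):
        if page in self.new:
            # Hit in the new list: move the page to the head of that list.
            self.new.remove(page)
            self.new.insert(0, page)
        elif page in self.old:
            # Hit in the old list: promote the page to the head of the new list.
            self.old.remove(page)
            self.new.insert(0, page)
        else:
            # Miss: the page read from disk enters at the midpoint.
            self.old.insert(0, page)
        self._rebalance()

    def _rebalance(self):
        # Pages pushed off the tail of the new list fall back to the midpoint.
        while len(self.new) > self.new_size:
            self.old.insert(0, self.new.pop())
        # Pages pushed off the tail of the old list are evicted.
        while len(self.old) > self.old_size:
            self.old.pop()


def demo():
    pool = MidpointLRU(capacity=10)
    for page in "abc":       # touch each hot page twice so it gets promoted
        pool.access(page)
        pool.access(page)
    for i in range(10):      # a one-off scan over ten cold pages
        pool.access(f"s{i}")
    return pool.new, pool.old


if __name__ == "__main__":
    print(demo())
```

In this toy run, the three pages that were accessed twice stay in the new list, while the ten-page scan only churns through the three slots of the old list. This is the point of inserting at the midpoint rather than at the head of the whole list.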
When a page from the old list is accessed, it is moved from the old list to the head of the new list. When a page in the new list is accessed, it goes to the head of the list.

The following variables affect the previously described algorithm:

- innodb_old_blocks_pct: This variable defines the percentage of the buffer pool reserved for the old list. The allowed range is 5 to 95, and it is 37 (about 3/8) by default.
- innodb_old_blocks_time: If this value is not 0, it represents the minimum age (in milliseconds) the old pages must reach before they can be moved into the new list. If an old page is accessed before it has reached this age, it goes to the head of the old list.
- innodb_max_dirty_pages_pct: This variable defines the maximum percentage of pages that were modified in memory. This mechanism will be discussed in the Dirty pages section later in this article. This value is not a hard limit, but InnoDB tries not to exceed it. The allowed range is 0 to 100, and the default is 75. Increasing this value can reduce the rate of writes, but the shutdown will take longer (because dirty pages need to be written to disk before the server can be stopped in a clean way).
- innodb_flush_neighbors: If set to 1, when a dirty page is flushed from memory to disk, the contiguous dirty pages are flushed as well. If set to 2, all dirty pages from the same extent (a portion of data whose size is 1 MB) are flushed. With 0, no neighbors are flushed: dirty pages are written only when their number exceeds innodb_max_dirty_pages_pct or when they are evicted from the buffer pool. The default is 1. This optimization is only useful for spinning disks.

Write-intensive workloads may need an aggressive flushing strategy; however, if the pages are written too often, performance degrades.

Buffer pool instances

On MariaDB versions older than 5.5, InnoDB creates only one instance of the buffer pool. With a single instance, concurrent threads are blocked by one mutex, and this may become a bottleneck.
This is particularly true if the concurrency level is high and the buffer pool is very big. Splitting the buffer pool into multiple instances can solve the problem.

Multiple instances represent an advantage only if the buffer pool size is at least 2 GB. Each instance should be at least 1 GB in size. InnoDB will ignore the configuration and will maintain only one instance if the buffer pool size is less than 1 GB. Furthermore, this feature is more useful on 64-bit systems.

The following variables control the instances and their size:

- innodb_buffer_pool_size: This variable defines the total size of the buffer pool (not the size of a single instance). Note that the real size will be about 10 percent bigger than this value. A percentage of this amount of memory is dedicated to the change buffer.
- innodb_buffer_pool_instances: This variable defines the number of instances. If the value is -1, InnoDB will automatically decide the number of instances. The maximum value is 64. The default value is 8 on Unix and depends on the innodb_buffer_pool_size variable on Windows.

Dirty pages

When a user executes a statement that modifies data in the buffer pool, InnoDB initially modifies only the in-memory copy of the data. The pages that have been modified only in the buffer pool are called dirty pages. Pages that have not been modified, or whose changes have been written to disk, are called clean pages.

Note that changes to data are also written to the redo log. If a crash occurs before those changes are applied to data files, InnoDB is usually able to recover the data, including the last modifications, by reading the redo log and the doublewrite buffer. The doublewrite buffer will be discussed later, in the Explaining the doublewrite buffer section.

At some point, the data needs to be flushed to the InnoDB data files (the .ibd files). In MariaDB 10.0, this is done by a dedicated thread called the page cleaner.
In older versions, this was done by the master thread, which executes several InnoDB maintenance operations. Flushing concerns not only the buffer pool, but also the InnoDB redo and undo logs.

The list of dirty pages is frequently updated when transactions write data at the physical level. It has its own mutex that does not lock the whole buffer pool.

The maximum number of dirty pages is determined by innodb_max_dirty_pages_pct as a percentage. When this maximum limit is reached, dirty pages are flushed.

The innodb_flush_neighbor_pages value determines how InnoDB selects the pages to flush. If it is set to none, only selected pages are written. If it is set to area, the neighboring dirty pages are written as well. If it is set to cont, all contiguous blocks of dirty pages are flushed.

On shutdown, a complete page flush is only done if innodb_fast_shutdown is 0. Normally, this method should be preferred, because it leaves data in a consistent state. However, if many changes have been requested but not yet written to disk, this process could be very slow. It is possible to speed up the shutdown by specifying a higher value for innodb_fast_shutdown. In this case, a crash recovery will be performed on the next restart.

The read ahead optimization

The read ahead feature is designed to reduce the number of read operations from the disks. It tries to guess which data will be needed in the near future and reads it with one operation.

Two algorithms are available to choose the pages to read in advance:

- linear read ahead
- random read ahead

The linear read ahead is used by default. It counts the pages in the buffer pool that are read sequentially. If their number is greater than or equal to innodb_read_ahead_threshold, InnoDB will read all data from the same extent (a portion of data whose size is always 1 MB). The innodb_read_ahead_threshold value must be a number from 0 to 64.
The value 0 disables the linear read ahead but does not enable the random read ahead. The default value is 56.

The random read ahead is only used if the innodb_random_read_ahead server variable is set to ON. By default, it is set to OFF. This algorithm checks whether at least 13 pages in the buffer pool have been read from the same extent. In this case, it does not matter whether they were read sequentially. When the check succeeds, the full extent is read. The 13-page threshold is not configurable.

If innodb_read_ahead_threshold is set to 0 and innodb_random_read_ahead is set to OFF, the read ahead optimization is completely turned off.

Diagnosing the buffer pool performance

MariaDB provides some tools to monitor the activities of the buffer pool and the InnoDB main thread. By inspecting these activities, a DBA can tune the relevant server variables to improve the performance.

In this section, we will discuss the SHOW ENGINE INNODB STATUS SQL statement and the INNODB_BUFFER_POOL_STATS table in the information_schema database. While the latter provides more information about the buffer pool, the SHOW ENGINE INNODB STATUS output is easier to read.

The INNODB_BUFFER_POOL_STATS table contains the following columns:

Column name: Description
POOL_ID: Each InnoDB buffer pool instance has a different ID.
POOL_SIZE: Size (in pages) of the instance.
FREE_BUFFERS: Number of free pages.
DATABASE_PAGES: Total number of data pages.
OLD_DATABASE_PAGES: Pages in the old list.
MODIFIED_DATABASE_PAGES: Dirty pages.
PENDING_DECOMPRESS: Number of pages that need to be decompressed.
PENDING_READS: Pending read operations.
PENDING_FLUSH_LRU: Pages in the old or new lists that need to be flushed.
PENDING_FLUSH_LIST: Pages in the flush list that need to be flushed.
PAGES_MADE_YOUNG: Number of pages moved into the new list.
PAGES_NOT_MADE_YOUNG: Old pages that did not become young.
PAGES_MADE_YOUNG_RATE: Pages made young per second. This value is reset each time it is shown.
PAGES_MADE_NOT_YOUNG_RATE: Pages read but not made young (this happens because they do not reach the minimum age) per second. This value is reset each time it is shown.
NUMBER_PAGES_READ: Number of pages read from disk.
NUMBER_PAGES_CREATED: Number of pages created in the buffer pool.
NUMBER_PAGES_WRITTEN: Number of pages written to disk.
PAGES_READ_RATE: Pages read from disk per second.
PAGES_CREATE_RATE: Pages created in the buffer pool per second.
PAGES_WRITTEN_RATE: Pages written to disk per second.
NUMBER_PAGES_GET: Number of logical page read requests (gets).
HIT_RATE: Rate of page hits.
YOUNG_MAKE_PER_THOUSAND_GETS: Pages made young per thousand gets.
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: Pages that remain in the old list per thousand gets.
NUMBER_PAGES_READ_AHEAD: Number of pages read with a read ahead operation.
NUMBER_READ_AHEAD_EVICTED: Number of pages read with a read ahead operation that were never used and then were evicted.
READ_AHEAD_RATE: Similar to NUMBER_PAGES_READ_AHEAD, but this is a per-second rate.
READ_AHEAD_EVICTED_RATE: Similar to NUMBER_READ_AHEAD_EVICTED, but this is a per-second rate.
LRU_IO_TOTAL: Total number of pages read or written to disk.
LRU_IO_CURRENT: Pages read or written to disk within the last second.
UNCOMPRESS_TOTAL: Pages that have been uncompressed.
UNCOMPRESS_CURRENT: Pages that have been uncompressed within the last second.

The per-second values are reset after they are shown.

The PAGES_MADE_YOUNG_RATE and PAGES_MADE_NOT_YOUNG_RATE values show us, respectively, how often old pages become new and how many old pages are not accessed again within a reasonable amount of time. If the former value is too high, the old list is probably not big enough, and vice versa.

Comparing READ_AHEAD_RATE and READ_AHEAD_EVICTED_RATE is useful to tune the read ahead feature. The READ_AHEAD_EVICTED_RATE value should be low, because it indicates how many of the pages read with read ahead operations were not useful.
If their ratio is good but READ_AHEAD_RATE is low, the read ahead should probably be used more often. In this case, if the linear read ahead is used, we can try to increase or decrease innodb_read_ahead_threshold. Alternatively, we can change the algorithm in use (linear or random read ahead).

The columns whose names end with _RATE better describe the current server activities. They should be examined several times a day, and during the whole week or month, perhaps with the help of one or more monitoring tools. Good, free software monitoring tools include Cacti and Nagios. The Percona Monitoring Tools package includes MariaDB (and MySQL) plugins that provide an interface to these tools.

Dumping and loading the buffer pool

In some cases, one may want to save the current contents of the buffer pool and reload them later. The most common case is when the server is stopped. Normally, on startup, the buffer pool is empty, and InnoDB needs to fill it with useful data. This process is called warm-up. Until the warm-up is complete, the InnoDB performance is lower than usual.

Two variables help avoid the warm-up phase: innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. If their value is ON, InnoDB automatically saves the buffer pool into a file at shutdown and restores it at startup. Their default value is OFF.

Turning them ON can be very useful, but remember the caveats:

- The startup and shutdown time might be longer. In some cases, we might prefer MariaDB to start more quickly even if it is slower during warm-up.
- We need the disk space necessary to store the buffer pool.

The user may also want to dump the buffer pool at any moment and restore it without restarting the server. This is advisable when the buffer pool is optimal and some statements are going to heavily change its contents. A common example is when a big InnoDB table is fully scanned. This happens, for example, during logical backups.
A full table scan will fill the old list with infrequently accessed data. A good way to solve the problem is to dump the buffer pool before the table scan and reload it later.

This operation can be performed by setting two special variables: innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. Reading the values of these variables always returns OFF. Setting the first variable to ON forces InnoDB to immediately dump the buffer pool into a file. Setting the latter variable to ON forces InnoDB to load the buffer pool from that file. In both cases, the progress of the dump or load operation is indicated by the Innodb_buffer_pool_dump_status and Innodb_buffer_pool_load_status status variables. If loading the buffer pool takes too long, it is possible to stop it by setting innodb_buffer_pool_load_abort to ON.

The name and path of the dump file are specified in the innodb_buffer_pool_filename server variable. Of course, we should be sure that the chosen directory can contain the file, but it is much smaller than the memory used by the buffer pool.

InnoDB change buffer

The change buffer is a cache that is a part of the buffer pool. It contains dirty pages related to secondary indexes (not primary keys) that are not stored in the main part of the buffer pool. If the modified data is read later, it will be merged into the buffer pool. In older versions, this buffer was called the insert buffer; it was renamed because it can now also handle deletions.

The change buffer speeds up the following write operations:

- insertions: When new rows are written.
- deletions: When existing rows are marked for deletion but not yet physically erased for performance reasons.
- purges: The physical elimination of previously marked rows and obsolete index values. This is periodically done by a dedicated thread.

In some cases, we may want to disable the change buffer. For example, we may have a working set that only fits the memory if the change buffer is discarded.
In this case, even after disabling it, we will still have all the frequently accessed secondary indexes in the buffer pool. Also, DML statements may be rare for our database, or we may have just a few secondary indexes: in these cases, the change buffer does not help.

The change buffer can be configured using the following variables:

- innodb_change_buffer_max_size: This is the maximum size of the change buffer, expressed as a percentage of the buffer pool. The allowed range is 0 to 50, and the default value is 25.
- innodb_change_buffering: This determines which types of operations are cached by the change buffer. The allowed values are none (to disable the buffer), all, inserts, deletes, purges, and changes (to cache inserts and deletes, but not purges). The all value is the default value.

Explaining the doublewrite buffer

When InnoDB writes a page to disk, at least two events can interrupt the operation after it is started: a hardware failure or an OS failure. In the case of an OS failure, a half-written page should not be possible if the pages are not bigger than the blocks written by the system.

When a page is half-written, the InnoDB redo and undo logs are not sufficient to recover it, because they only contain page IDs, not the page data. Logging only the IDs improves performance.

To avoid half-written pages, InnoDB uses the doublewrite buffer. This mechanism involves writing every page twice. A page is valid after the second write is complete. When the server restarts, if a recovery occurs, half-written pages are discarded. The doublewrite buffer has a small impact on performance, because the writes are sequential and are flushed to disk together. However, it is still possible to disable the doublewrite buffer by setting the innodb_doublewrite variable to OFF in the configuration file or by starting the server with the --skip-innodb-doublewrite parameter. This can be done if data correctness is not important.
If performance is very important and we use a fast storage device, we may notice the overhead caused by the additional disk writes. But if data correctness is also important to us, we do not want to simply disable the doublewrite buffer. MariaDB provides an alternative mechanism called atomic writes. These writes are like a transaction: they completely succeed or they completely fail. Half-written data is not possible. However, MariaDB does not directly implement this mechanism, so it can only be used on FusionIO storage devices using the DirectFS filesystem. FusionIO devices are very fast flash memories that can be used as block storage or DRAM memory. To enable this alternative mechanism, we can set innodb_use_atomic_writes to ON. This automatically disables the doublewrite buffer.

Summary

In this article, we discussed the main MariaDB buffers. The most important ones are the caches used by the storage engine. We dedicated much space to the InnoDB buffer pool, because it is more complex and, usually, InnoDB is the most used storage engine.

Resources for Article:

Further resources on this subject:
- Building a Web Application with PHP and MariaDB – Introduction to caching [article]
- Installing MariaDB on Windows and Mac OS X [article]
- Using SHOW EXPLAIN with running queries [article]

Packt

25 Oct 2010

12 min read

Introduction to PostgreSQL 9

PostgreSQL 9 Admin Cookbook

- Over 80 recipes to help you run an efficient PostgreSQL 9.0 database
- Administer and maintain a healthy database
- Monitor your database ensuring that it performs as quickly as possible
- Tips for backup and recovery of your database

Introduction

PostgreSQL is a feature-rich general purpose database management system. It's a complex piece of software, but every journey begins with the first step.

We start with your first connection. Many people fall at the first hurdle, so we try not to skip too swiftly past that. We move on quickly to enabling remote users, and from there to access through GUI administration tools.

We also introduce the psql query tool, which is the tool used for loading our sample database.

Introducing PostgreSQL 9

PostgreSQL is an advanced SQL database server, available on a wide range of platforms. One of the clearest benefits of PostgreSQL is that it is open source, meaning that you have a very permissive license to install, use, and distribute PostgreSQL without paying anyone fees or royalties. On top of that, PostgreSQL is well-known as a database that stays up for long periods, and requires little or no maintenance in many cases. Overall, PostgreSQL provides a very low total cost of ownership.

PostgreSQL is also noted for its huge range of advanced features, developed over the course of more than 20 years of continuous development and enhancement. Originally developed by the Database Research group at the University of California, Berkeley, PostgreSQL is now developed and maintained by a huge army of developers and contributors. Many of those contributors have full-time jobs related to PostgreSQL, working as designers, developers, database administrators, and trainers. Some, but not many, of those contributors work for companies that specialize in services for PostgreSQL, such as Hannu and me. No single company owns PostgreSQL, nor are you required, or even encouraged, to register your usage.
PostgreSQL has the following main features:

- Excellent SQL Standards compliance up to SQL 2008
- Client-server architecture
- Highly concurrent design where readers and writers don't block each other
- Highly configurable and extensible for many types of application
- Excellent scalability and performance with extensive tuning features

What makes PostgreSQL different?

The PostgreSQL project focuses on the following objectives:

- Robust, high-quality software with maintainable, well-commented code
- Low maintenance administration for both embedded and enterprise use
- Standards-compliant SQL, interoperability, and compatibility
- Performance, security, and high availability

What surprises many people is that PostgreSQL's feature set is more comparable with Oracle or SQL Server than it is with MySQL. The only connection between MySQL and PostgreSQL is that those two projects are open source; apart from that, the features and philosophies are almost totally different.

One of the key features of Oracle since Oracle 7 has been "snapshot isolation", where readers don't block writers, and writers don't block readers. You may be surprised to learn that PostgreSQL was the first database to be designed with this feature, and offers a full and complete implementation. PostgreSQL names this Multi-Version Concurrency Control (MVCC).

PostgreSQL is a general-purpose database management system. You define the database that you would like to manage with it. PostgreSQL offers you many ways to work. You can use a "normalized database model", you can utilize extensions such as arrays and record subtypes, or you can use a fully dynamic schema using an extension named hstore. PostgreSQL also allows you to create your own server-side functions in one of a dozen different languages.

PostgreSQL is highly extensible, so you can add your own datatypes, operators, index types, and functional languages.
For example, you can override different parts of the system using plugins to alter the execution of commands or add a new optimizer.

All of these features offer a huge range of implementation options to software architects. There are many ways out of trouble when building applications and maintaining them over long periods of time.

In the early days, when PostgreSQL was still a research database, the focus was solely on cool new features. Over the last 15 years, enormous amounts of code have been rewritten and improved, giving us one of the most stable, large, software servers available for operational use.

You may also read that PostgreSQL was, or is, slower than My Favorite DBMS, whichever one that is. It's been a personal mission of mine over the last six years to improve server performance and the team have been successful in making the server highly performant and very scalable. That gives PostgreSQL enormous headroom for growth.

Who is using PostgreSQL? Prominent users include Apple, BASF, Genentech, IMDB.com, Skype, NTT, Yahoo, and The National Weather Service. PostgreSQL receives well in excess of 1 million downloads per year, according to data submitted to the European Commission, who concluded "...PostgreSQL, is considered by many database users to be a credible alternative..."

We need to mention one last thing. When PostgreSQL was first developed, it was named Postgres, and so many aspects of the project still refer to the word "postgres". For example, the default database is named postgres, and the software is frequently installed using the postgres userid. As a result, people shorten the name PostgreSQL to simply Postgres, and in many cases people use the two names interchangeably.

PostgreSQL is pronounced as "post-grez-q-l". Postgres is pronounced as "post-grez".

Some people get confused, and refer to "Postgre", which is hard to say, and likely to confuse people. Two names are enough, so please don't use a third name!
Getting PostgreSQL

PostgreSQL is 100% open source software.

PostgreSQL is freely available to use, alter, or redistribute in any way you choose. PostgreSQL's license is an approved open source license very similar to the BSD (Berkeley Software Distribution) license, though just different enough that it is now known as TPL (The PostgreSQL License).

How to do it...

PostgreSQL is already in use by many different application packages, and so you may already find it installed on your servers. Many Linux distributions include PostgreSQL as part of the basic installation, or include it with the installation disk.

One thing to be wary of is that the version of PostgreSQL included may not be the latest release. It will typically be the latest major release that was available when that operating system release was published. There is usually no good reason to stick at that level: there is no increased stability implied there, and later production versions are just as well-supported by the various Linux distributions.

If you don't yet have a copy, or you don't have the latest version, you can download the source code or download binary packages for a wide variety of operating systems from the following URL:

http://www.postgresql.org/download/

Installation details vary significantly from platform to platform and there aren't any special tricks or recipes to mention. Please, just follow the installation guide, and away you go. We've consciously avoided describing the installation processes here to make sure we don't garble or override the information published to assist you.

If you would like to receive e-mail updates of the latest news, then you can subscribe to the PostgreSQL announce mailing list, which contains updates from all the vendors that support PostgreSQL. You'll get a few e-mails each month about new releases of core PostgreSQL and related software, conferences, and user group information. It's worth keeping in touch with developments.
For more information about the PostgreSQL announce mailing list, visit the following URL:

http://archives.postgresql.org/pgsql-announce/

How it works...

Many people ask questions, such as "How can this be free?", "Are you sure I don't have to pay someone?", or "Who gives this stuff away for nothing?"

Open source applications such as PostgreSQL work on a community basis, where many contributors perform tasks that make the whole process work. For many of those people, their involvement is professional, rather than a hobby, and they can do this because there is generally a great value for both contributors and their employers alike.

You might not believe it. You don't have to because It Just Works.

There's more...

Remember that PostgreSQL is more than just the core software. There is a huge range of websites offering add-ons, extensions, and tools for PostgreSQL. You'll also find an army of bloggers describing useful tricks and discoveries that will help you in your work.

And, there is a range of professional companies able to offer you help when you need it.

Connecting to PostgreSQL server

How do we access PostgreSQL?

Connecting to the database is most people's first experience of PostgreSQL, so we want to make it a good one. So, let's do it, and fix any problems we have along the way. Remember that a connection needs to be made securely, so there may be some hoops for us to jump through to ensure that the data we wish to access is secure.

Before we can execute commands against the database, we need to connect to the database server, giving us a session.

Sessions are designed to be long-lived, so you connect once, perform many requests, and then eventually disconnect. There is a small overhead during connection. That may become noticeable if you connect/disconnect repeatedly, so you may wish to investigate the use of connection pools. Connection pools allow pre-connected sessions to be served quickly to you when you wish to reconnect.
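To make the idea of a connection pool concrete, here is a minimal Python sketch. It is an illustration only: the SimplePool class, its session method, and the fake_connect stand-in are hypothetical names, and no real PostgreSQL driver is involved. A production setup would normally rely on an existing pooler rather than on code like this.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Keeps a fixed number of pre-connected sessions ready for reuse."""

    def __init__(self, connect, size):
        # connect is any zero-argument callable that opens one session.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def session(self, timeout=None):
        # Borrow an idle session, waiting if all of them are in use.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the session back instead of disconnecting.
            self._idle.put(conn)


opened = []


def fake_connect():
    # Stand-in for a real driver call; it only records that a session was opened.
    conn = object()
    opened.append(conn)
    return conn


pool = SimplePool(fake_connect, size=2)
for _ in range(5):
    with pool.session() as conn:
        pass  # run queries here

print(len(opened))
```

Five units of work are served by the two sessions opened up front, so the connection overhead is paid twice rather than five times.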
Getting ready

First, catch your database. If you don't know where it is, we'll probably have difficulty accessing it. There may be more than one, and you'll need to know the right database to access, and have the authority to connect to it.

How to do it...

You need to specify the following five parameters to connect to PostgreSQL:

- host or host address
- port
- database name
- user
- password (or other means of authentication, if any)

To connect, there must be a PostgreSQL server running on host, listening on port number port. On that server, a database named dbname and a user named user must also exist. The host must explicitly allow connections from your client (this is explained in the next recipe), and you must also pass authentication using the method the server specifies. For example, specifying a password won't work if the server has requested a different form of authentication.

Almost all PostgreSQL interfaces use the libpq interface library. When using libpq, most of the connection parameter handling is identical, so we can just discuss that once.

If you don't specify the preceding parameters, we look for values set through environment variables, which are as follows:

- PGHOST or PGHOSTADDR
- PGPORT (or set to 5432 if this is not set)
- PGDATABASE
- PGUSER
- PGPASSWORD (though this one is definitely not recommended)

If you specify the first four parameters somehow, but not the password, then we look for a password file.

Some PostgreSQL interfaces use the client-server protocol directly, so the way defaults are handled may differ. The information we need to supply won't vary significantly, so please check the exact syntax for that interface.

How it works...

The PostgreSQL server is a client-server database. The system it runs on is known as the host. We can access the PostgreSQL server remotely through the network. However, we must specify the host, which is a hostname, or a hostaddr, which is an IP address.
We can specify a host of "localhost" if we wish to make a TCP/IP connection to the same system. It is often better to use a Unix socket connection, which is attempted if the host begins with a slash (/), in which case the name is presumed to be a directory name (the default is /tmp).

On any system, there can be more than one database server. Each database server listens on exactly one "well-known" network port, which cannot be shared between servers on the same system. The default port number for PostgreSQL is 5432, which has been registered with IANA and is uniquely assigned to PostgreSQL. (You can see it used in the /etc/services file on most *nix servers.) The port number can be used to uniquely identify a specific database server if many exist.

A database server is also sometimes known as a "database cluster", because the PostgreSQL server allows you to define one or more databases on each server. Each connection request must identify exactly one database, identified by its dbname. When you connect, you will only be able to see database objects created within that database.

A database user is used to identify the connection. By default, there is no limit on the number of connections for a particular user. In more recent versions of PostgreSQL, users are referred to as login roles, though many clues remind us of the earlier naming, and it still makes sense in many ways. A login role is a role that has been given the LOGIN attribute.

Each connection will typically be authenticated in some way. This is defined at the server, so it is not optional at connection time if the administrator has configured the server to require authentication.

Once you've connected, each connection can have one active transaction at a time and one fully active statement at any time. The server will have a defined limit on the number of connections it can serve, so a connection request can be refused if the server is oversubscribed.
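The parameter handling described above can be sketched in a few lines of Python. This is a simplified illustration rather than libpq's actual code: the function names are hypothetical, only the environment variables and the 5432 default named in the text are modeled, and values containing spaces would need quoting in a real connection string.

```python
import os

# Environment variables consulted when a parameter is not given explicitly,
# as listed in the text above (PGPASSWORD is left out on purpose).
ENV_FALLBACKS = {
    "host": "PGHOST",
    "port": "PGPORT",
    "dbname": "PGDATABASE",
    "user": "PGUSER",
}


def resolve_connection_params(explicit=None, env=None):
    """Hypothetical helper: explicit values win, then environment variables."""
    explicit = explicit or {}
    env = os.environ if env is None else env
    params = {}
    for name, variable in ENV_FALLBACKS.items():
        value = explicit.get(name) or env.get(variable)
        if value:
            params[name] = str(value)
    # The port falls back to 5432 when nothing else supplies it.
    params.setdefault("port", "5432")
    return params


def transport_for(host):
    """A host starting with '/' names a Unix socket directory; otherwise TCP/IP."""
    return "unix socket" if host.startswith("/") else "tcp/ip"


def to_conninfo(params):
    """Render keyword=value pairs in the style of a libpq connection string."""
    return " ".join(f"{key}={params[key]}" for key in sorted(params))


if __name__ == "__main__":
    demo = resolve_connection_params(
        {"host": "localhost", "dbname": "sales"}, env={"PGUSER": "alice"}
    )
    print(to_conninfo(demo))
    print(transport_for(demo["host"]))
```

Passing the environment in as a plain dictionary keeps the sketch easy to try without touching the real shell environment.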
Inspecting your connection information

If you want to confirm you've connected to the right place and in the right way, you can execute some or all of the following commands:

SELECT inet_server_port();

This shows the port on which the server is listening.

SELECT current_database();

This shows the current database.

SELECT current_user;

This shows the current userid.

SELECT inet_server_addr();

This shows the IP address of the server that accepted the connection.

A user's password is not accessible using general SQL for obvious reasons. You may also need the following:

SELECT version();

See also

There are many other snippets of information required to understand connections. Some of those are mentioned in this article. For further details, please consult the PostgreSQL server documentation.


Packt

06 Apr 2017

15 min read

Synchronization – An Approach to Delivering Successful Machine Learning Projects


“In the midst of chaos, there is also opportunity” - Sun Tzu

In this article, Cory Lesmeister, the author of the book Mastering Machine Learning with R - Second Edition, provides insights on ensuring the success and value of your machine learning endeavors.

Framing the problem

Raise your hand if any of the following has happened or is currently happening to you:

- You’ve been part of a project team that failed to deliver anything of business value
- You attend numerous meetings, but they don’t seem productive; maybe they are even complete time wasters
- Different teams are not sharing information with each other; thus, you are struggling to understand what everyone else is doing, and they have no idea what you are doing or why you are doing it
- An unknown stakeholder, feeling threatened by your project, comes from out of nowhere and disparages you and/or your work
- The Executive Committee congratulates your team on their great effort, but decides not to implement it, or even worse, tells you to go back and do it all over again, only this time solve the real problem

OK, you can put your hand down now. If you didn’t raise your hand, please send me your contact information because you are about as rare as a unicorn.

All organizations, regardless of their size, struggle with integrating different functions, current operations, and other projects. In short, the real world is filled with chaos. It doesn’t matter how many advanced degrees people have, how experienced they are, how much money is thrown at the problem, what technology is used, or how brilliant and powerful the machine learning algorithm is; problems such as those listed above will happen. The bottom line is that implementing machine learning projects in the business world is complicated and prone to failure.
However, out of this chaos you have the opportunity to influence your organization by integrating disparate people and teams, fostering a collaborative environment that can adapt to unforeseen changes. But be warned, this is not easy. If it were easy, everyone would be doing it. However, it works, and it works well. By it, I’m talking about the methodology I developed about a dozen years ago, a method I refer to as the “Synchronization Process”.

If we ask ourselves, “what are the challenges to implementation”, it seems to me that the following blog post clearly and succinctly sums it up: https://www.capgemini.com/blog/capping-it-off/2012/04/four-key-challenges-for-business-analytics

It enumerates four challenges:

- Strategic alignment
- Agility
- Commitment
- Information maturity

This blog addresses business analytics, but it can be extended to machine learning projects. One could even say machine learning is becoming the analytics tool of choice in many organizations. As such, I will make the case below that the Synchronization Process can effectively deal with the first three challenges. Not only that, the process can provide additional benefits. By overcoming the challenges, you can deliver an effective project; by delivering an effective project, you can increase actionable insights; and by increasing actionable insights, you will improve decision-making, and that is where the real business value resides.

Defining the process

“In preparing for battle, I have always found that plans are useless, but planning is indispensable.” - Dwight D. Eisenhower

I adopted the term synchronization from the US Army’s operations manual, FM 3-0, where it is described as a battlefield tenet and force multiplier. The manual defines synchronization as, “…arranging activities in time, space and purpose to mass maximum relative combat power at a decisive place and time”.
If we overlay this military definition onto the context of a competitive marketplace, we come up with a definition I find more relevant. For our purpose, synchronization is defined as, “arranging business functions and/or tasks in time and purpose to produce the proper amount of focus on a critical event or events”.

These definitions put synchronization in the context of an “endstate” based on a plan and a vision. However, it is in the process of seeking to achieve that endstate that the true benefits come to fruition. So, we can look at synchronization as not only an endstate, but also as a process. The military’s solution to synchronizing operations before implementing a plan is the wargame. Like the military, businesses and corporations have utilized wargaming to facilitate decision-making and create integration of different business functions.

Following the synchronization process techniques explained below, you can take the concept of business wargaming to a new level. I will discuss and provide specific ideas, steps, and deliverables that you can implement immediately. Before we begin that discussion, I want to cover the benefits that the process will deliver.

Exploring the benefits of the process

When I created this methodology about a dozen years ago, I was part of a market research team struggling to commit our limited resources to numerous projects, all of which were someone’s top priority, in a highly uncertain environment. Or, as I like to refer to it, just another day at the office. I knew from my military experience that I had the tools and techniques to successfully tackle these challenges. It worked then and has been working for me ever since.
I have found that it delivers the following benefits to an organization:

- Integration of business partners and stakeholders
- Timely and accurate measurement of performance and effectiveness
- Anticipation of and planning for possible events
- Adaptation to unforeseen threats
- Exploitation of unforeseen opportunities
- Improvement in teamwork
- Fostering a collaborative environment
- Improving focus and prioritization

In market research, and I believe it applies to all analytical endeavors, including machine learning, we talked about focusing on three specific questions about what to measure:

- What are we measuring?
- When do we measure it?
- How will we measure it?

We found that successfully answering those questions facilitated improved decision-making by informing leadership what to STOP doing, what to START doing, and what to KEEP doing. I have found myself in many meetings going nowhere, where I would ask a question like, “what are you looking to stop doing?” Ask leadership what they want to stop, start, or continue to do and you will get to the core of the problem. Then, your job will be to configure the business decision as the measurement/analytical problem. The Synchronization Process can bring this all together in a coherent fashion.

I’ve often been asked what signals to me that a project requires going through the Synchronization Process. Here are some of the questions you should consider, and if you answer “yes” to any of them, it may be a good idea to implement the process:

- Are resources constrained to the point that several projects will suffer poor results or not be done at all?
- Do you face multiple, conflicting priorities?
- Could the external environment change and dramatically influence project(s)?
- Are numerous stakeholders involved or influenced by a project’s result?
- Is the project complex and facing a high level of uncertainty?
- Does the project involve new technology?
- Does the project face actual or potential organizational change?
You may be thinking, “Hey, we have a project manager for all this?” OK, how is that working out? Let me be crystal clear here, this is not just project management! This is about improving decision-making! A Gantt chart or task management software won’t do that. You must be the agent of change. With that, let’s turn our attention to the process itself.

Exploring the process

Any team can take the methods elaborated on below and incorporate them into their specific situation with their specific business partners. If executed properly, one can expect the initial investment in time and effort to provide substantial payoff within weeks of initiating the process.

There are just four steps to incorporate, each having several tasks for you and your team members to complete. The four steps are as follows:

- Project kick-off
- Project analysis
- Synchronization exercise
- Project execution

Let’s cover each of these in detail. I will provide what I like to refer to as a “Quad Chart” for each process step along with appropriate commentary.

Project kick-off

I recommend you lead the kick-off meeting to ensure all team members understand and agree to the upcoming process steps. You should place emphasis on the importance of completing the pre-work and understanding key definitions, particularly around facts and critical assumptions. The operational definitions are as follows:

- Facts: Data or information that will likely have an impact on the project
- Critical assumptions: Valid and necessary suppositions in the absence of facts that, if proven false, would adversely impact planning or execution

It is an excellent practice to link facts and assumptions. Here is an example of how that would work: It is a FACT that Information Technology is beta-testing cloud-based solutions. We must ASSUME, for planning purposes, that we can operate machine learning solutions on the cloud by the fourth quarter of this year.
See, we’ve linked a fact and an assumption together, and if this cloud-based solution is not available, let’s say it would negatively impact our ability to scale up our machine learning solutions. If so, then you may want to have a contingency plan of some sort already thought through and prepared for implementation. Don’t worry if you haven’t thought of all possible assumptions or if you end up with a list of dozens. The synchronization exercise will help in identifying and prioritizing them. In my experience, identifying and tracking 10 critical assumptions at the project level is adequate.

The following is the quad chart for this process step:

Figure 1: Project kick-off quad chart

Notice what is likely a new term, “Synchronization Matrix”. That is merely the tool used by the team to capture notes during the Synchronization Exercise. What you are doing is capturing time and events on the X-axis, and functions and terms on the Y-axis. Of course, this is highly customizable based on the specific circumstances, and we will discuss it further in process step number 3, the Synchronization exercise, but here is an abbreviated example:

Figure 2: Synchronization matrix example

You can see in the matrix that I’ve included a row to capture critical assumptions. I can’t overstate how important it is to articulate, capture, and track them. In fact, this is probably my favorite quote on the subject:

… flawed assumptions are the most common cause of flawed execution.
Harvard Business Review, The High Performance Organization, July-August 2005

OK, I think I’ve made my point, so let’s look at the next process step.

Project analysis

At this step, the participants prepare by analyzing the situation, collecting data, and making judgements as necessary. The goal is for each participant of the Synchronization Exercise to come to that meeting fully prepared. A good technique is to provide project participants with a worksheet template for them to use to complete the pre-work.
A team can complete this step individually, collectively, or both. Here is the quad chart for the process step:

Figure 3: Project analysis quad chart

Let me expand on a couple of points. The idea of a team member creating information requirements is quite important. These are often tied back to your critical assumptions. Take the example above of the assumption around fielding a cloud-based capability. Can you think of some information requirements that you might have as a potential end-user? Furthermore, can you prioritize them? OK, having done that, can you think of a plan to acquire that information and confirm or deny the underlying critical assumption? Notice also how that ties together with decision points you or others may have to make and how they may trigger contingency plans.

This may sound rather basic and simplistic, but unless people are asked to think like this, articulate their requirements, and share the information, don’t expect anything to change anytime soon. It will be business as usual, and let me ask again, “how is that working out for you?”. There is opportunity in all that chaos, so embrace it, and in the next step you will see the magic happen.

Synchronization exercise

The focus and discipline of the participants determine the success of this process step. This is a wargame-type exercise where team members portray their plan over time. Now, everyone gets to see how their plan relates to or even inhibits someone else’s plan and vice versa. I’ve done this step several different ways, including building the matrix in software, but the method that has consistently produced the best results is to build the matrix on large paper and put it along a conference room wall. Then, have the participants, one at a time, use post-it notes to portray their key events.
For example, the marketing manager gets up to the wall and posts “Marketing Campaign One” in the first time phase, “Marketing Campaign Two” in the final time phase, along with “Propensity Models” in the information requirements block. Iterating by participant and by time/event leads to coordination and cooperation like nothing you’ve ever seen. Another method to facilitate the success of the meeting is to have a disinterested and objective third party “referee” the meeting. This will help to ensure that any issues are captured or resolved and the process products updated accordingly. After the exercise, team members can incorporate the findings into their individual plans.

This is an example quad chart on the process step:

Figure 4: Synchronization exercise quad chart

I really like the idea of execution and performance metrics. Here is how to think about them:

- Execution metrics: are we doing things right?
- Performance metrics: are we doing the right things?

As you see, execution metrics are about plan implementation, while performance metrics are about determining if the plan is making a difference (yes, I know that can be quite a dangerous thing to measure). Finally, we come to the fourth step where everything comes together during the execution of the project plan.

Project execution

This is a continual step in the process where a team can utilize the synchronization products to maintain situational understanding of itself, key stakeholders, and the competitive environment. It can determine how plans are progressing and quickly react to opportunities and threats as necessary. I recommend you update and communicate changes to the documentation on a regular basis. When I was in pharmaceutical forecasting, it was imperative that I end the business week by updating the matrices on SharePoint, which were available to all pertinent team members.
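For teams that prefer software to wall paper, the matrix described earlier can be sketched as a small Python structure: time phases on one axis, functions and terms on the other, and post-it notes in the cells. This is only an illustration. The class, the phase names, and the choice to file “Propensity Models” under the first phase are assumptions made for the example; only the marketing manager's three notes come from the text.

```python
from collections import defaultdict


class SynchronizationMatrix:
    """Hypothetical sketch: rows are functions/terms, columns are time phases."""

    def __init__(self, time_phases, rows):
        self.time_phases = list(time_phases)
        self.rows = list(rows)
        self._cells = defaultdict(list)

    def post(self, row, phase, note):
        """Stick a note in one cell, rejecting rows or phases the team never defined."""
        if row not in self.rows:
            raise KeyError(f"unknown row: {row}")
        if phase not in self.time_phases:
            raise KeyError(f"unknown time phase: {phase}")
        self._cells[(row, phase)].append(note)

    def notes(self, row, phase):
        """Return a copy of the notes in one cell (empty list if none)."""
        return list(self._cells.get((row, phase), []))

    def phase_view(self, phase):
        """Everything every row has planned for a single time phase."""
        return {row: self.notes(row, phase) for row in self.rows}


def build_example():
    # Phase and row names are placeholders, not taken from the article's figure.
    matrix = SynchronizationMatrix(
        ["Phase 1", "Phase 2", "Phase 3"],
        ["Marketing", "Information requirements", "Critical assumptions"],
    )
    matrix.post("Marketing", "Phase 1", "Marketing Campaign One")
    matrix.post("Marketing", "Phase 3", "Marketing Campaign Two")
    # The article does not say which phase this note belongs to; Phase 1 is assumed.
    matrix.post("Information requirements", "Phase 1", "Propensity Models")
    return matrix


if __name__ == "__main__":
    print(build_example().phase_view("Phase 1"))
```

Reading the matrix one phase at a time mirrors the way participants iterate by time and event during the exercise.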
The following is the quad chart for this process step:

Figure 5: Project execution quad chart

Keeping up with the documentation is a quick and simple process for the most part, and by doing so you will keep people aligned and cooperating. Be aware that, like everything else that is new in the world, initial exuberance and enthusiasm will start to wane after several weeks. That is fine as long as you keep the documentation alive and maintain systematic communication. You will soon find that behavior is changing without anyone even taking heed, which is probably the best way to actually change behavior.

A couple of words of warning. Don’t expect everyone to embrace the process wholeheartedly, which is to say that office politics may create a few obstacles. Often, an individual or even an entire business function will withhold information because “information is power”, and by sharing information they may feel they are losing power. Another issue may arise where some people feel the process is needlessly complex or unnecessary. A solution to these problems is to scale back the number of core team members and utilize stakeholder analysis and a communication plan to bring the naysayers slowly into the fold. Change is never easy, but it is necessary nonetheless.

Summary

In this article, I’ve covered, at a high level, a successful and proven process to deliver machine learning projects that will drive business value. I developed it from my numerous years of planning and evaluating military operations, including a one-year stint as a strategic advisor to the Iraqi Oil Police, adapting it to the needs of any organization. Utilizing the Synchronization Process will help any team avoid the common pitfalls of projects and improve efficiency and decision-making. It will help you become an agent of change and create influence in an organization without positional power.



Packt

17 Jul 2015

7 min read

Getting Started with Apache Spark


In this article by Rishi Yadav, the author of Spark Cookbook, we will cover the following recipes:

- Installing Spark from binaries
- Building the Spark source code with Maven

Introduction

Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of the memory for storage. In MapReduce, memory is primarily used for actual computation. Spark uses memory both to compute and store objects.

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.

Though Spark is written in Scala, and this book only focuses on recipes in Scala, Spark also supports Java and Python.

Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements.

The following figure shows the Spark ecosystem:

[Figure: the Spark ecosystem]

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager called standalone mode.
Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop's compute framework that has a robust resource management feature that Spark can seamlessly use.

Installing Spark from binaries

Spark can either be built from the source code, or precompiled binaries can be downloaded from http://spark.apache.org. For a standard use case, binaries are good enough, and this recipe will focus on installing Spark using binaries.

Getting ready

All the recipes in this book are developed using Ubuntu Linux but should work fine on any POSIX environment. Spark expects Java to be installed and the JAVA_HOME environment variable to be set.

In Linux/Unix systems, there are certain standards for the location of files and directories, which we are going to follow in this book. The following is a quick cheat sheet:

Directory   Description
/bin        Essential command binaries
/etc        Host-specific system configuration
/opt        Add-on application software packages
/var        Variable data
/tmp        Temporary files
/home       User home directories

How to do it...

At the time of writing this, Spark's current version is 1.4. Please check the latest version from Spark's download page at http://spark.apache.org/downloads.html. The binaries are built against a recent, stable version of Hadoop. To use a specific version of Hadoop, the recommended approach is to build from sources, which will be covered in the next recipe.
The following are the installation steps:

1. Open the terminal and download binaries using the following command:
   $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz
2. Unpack binaries:
   $ tar -zxf spark-1.4.0-bin-hadoop2.4.tgz
3. Rename the folder containing binaries by stripping the version information:
   $ sudo mv spark-1.4.0-bin-hadoop2.4 spark
4. Move the configuration folder to /etc/spark so that it can be made a symbolic link later:
   $ sudo mv spark/conf /etc/spark
5. Create your company-specific installation directory under /opt. As the recipes in this book are tested on infoobjects sandbox, we are going to use infoobjects as directory name. Create the /opt/infoobjects directory:
   $ sudo mkdir -p /opt/infoobjects
6. Move the spark directory to /opt/infoobjects as it's an add-on software package:
   $ sudo mv spark /opt/infoobjects/
7. Change the ownership of the spark home directory to root:
   $ sudo chown -R root:root /opt/infoobjects/spark
8. Change permissions of the spark home directory, 0755 = user:read-write-execute group:read-execute world:read-execute:
   $ sudo chmod -R 755 /opt/infoobjects/spark
9. Move to the spark home directory:
   $ cd /opt/infoobjects/spark
10. Create the symbolic link:
   $ sudo ln -s /etc/spark conf
11. Append to PATH in .bashrc (single quotes keep $PATH from being expanded until .bashrc is read):
   $ echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc
12. Open a new terminal.
13. Create the log directory in /var:
   $ sudo mkdir -p /var/log/spark
14. Make hduser the owner of the Spark log directory:
   $ sudo chown -R hduser:hduser /var/log/spark
15. Create the Spark tmp directory:
   $ mkdir /tmp/spark
16. Configure Spark with the help of the following command lines:
   $ cd /etc/spark
   $ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
   $ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
   $ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
   $ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh

Building the Spark source code with Maven

Installing Spark using binaries works fine in most cases. For advanced cases, such as the following (but not limited to), compiling from the source code is a better option:

- Compiling for a specific Hadoop version
- Adding the Hive integration
- Adding the YARN integration

Getting ready

The following are the prerequisites for this recipe to work:

- Java 1.6 or a later version
- Maven 3.x

How to do it...

The following are the steps to build the Spark source code with Maven:

1. Increase MaxPermSize, the size of the JVM's permanent generation:
   $ echo 'export _JAVA_OPTIONS="-XX:MaxPermSize=1G"' >> /home/hduser/.bashrc
2. Open a new terminal window and download the Spark source code from GitHub:
   $ wget https://github.com/apache/spark/archive/branch-1.4.zip
3. Unpack the archive (it is a zip file, so use unzip rather than gunzip) and rename the extracted folder, which GitHub names spark-branch-1.4, to spark:
   $ unzip branch-1.4.zip
   $ mv spark-branch-1.4 spark
4. Move to the spark directory:
   $ cd spark
5. Compile the sources with these flags: Yarn enabled, Hadoop version 2.4, Hive enabled, and skipping tests for faster compilation:
   $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
6. Return to the parent directory and move the conf folder to /etc/spark so that it can be made a symbolic link:
   $ cd ..
   $ sudo mv spark/conf /etc/spark
7. Move the spark directory to /opt/infoobjects as it's an add-on software package:
   $ sudo mv spark /opt/infoobjects/spark
8. Change the ownership of the spark home directory to root:
   $ sudo chown -R root:root /opt/infoobjects/spark
9. Change the permissions of the spark home directory, 0755 = user:rwx group:r-x world:r-x:
   $ sudo chmod -R 755 /opt/infoobjects/spark
10. Move to the spark home directory:
   $ cd /opt/infoobjects/spark
11. Create a symbolic link:
   $ sudo ln -s /etc/spark conf
12. Put the Spark executable in the path by editing .bashrc:
   $ echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc
13. Create the log directory in /var:
   $ sudo mkdir -p /var/log/spark
14. Make hduser the owner of the Spark log directory:
   $ sudo chown -R hduser:hduser /var/log/spark
15. Create the Spark tmp directory:
   $ mkdir /tmp/spark
16. Configure Spark with the help of the following command lines:
   $ cd /etc/spark
   $ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
   $ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
   $ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
   $ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh

Summary

In this article, we learned what Apache Spark is, how we can install Spark from binaries, and how to build Spark source code with Maven.
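Both recipes gloss the chmod argument as 0755 = user:rwx group:r-x world:r-x. As a rough way to check that reading, the following Python sketch decodes an octal mode into the same wording. The helper names are made up for this illustration; stat.filemode and stat.S_IFDIR come from the Python standard library.

```python
import stat

CLASSES = ("user", "group", "world")
PERMISSIONS = ((4, "read"), (2, "write"), (1, "execute"))


def describe_mode(mode):
    """Spell out an octal permission mode, one three-bit digit per class."""
    parts = []
    for position, who in enumerate(CLASSES):
        # user is the highest digit, world the lowest
        shift = 3 * (len(CLASSES) - 1 - position)
        digit = (mode >> shift) & 0o7
        granted = [name for bit, name in PERMISSIONS if digit & bit]
        parts.append(f"{who}:{'-'.join(granted) or 'none'}")
    return " ".join(parts)


def ls_style(mode):
    """Show the mode the way 'ls -l' would print it for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


if __name__ == "__main__":
    print(describe_mode(0o755))
    print(ls_style(0o755))
```

The second helper shows the same mode in the notation that ls -l uses for a directory such as /opt/infoobjects/spark.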



Packt

05 Aug 2011

6 min read

Performing Common MDX-related Tasks


MDX with Microsoft SQL Server 2008 R2 Analysis Services Cookbook

More than 80 recipes for enriching your Business Intelligence solutions with high-performance MDX calculations and flexible MDX queries in this book and eBook

Skipping axis

There are situations when we want to display just a list of members and no data associated with them. Naturally, we expect to get that list on rows, so that we can scroll through them nicely. However, the rules of MDX say we can't skip axes. If we want something on rows (which is AXIS(1), by the way), we must use all previous axes as well (columns in this case, which is also known as AXIS(0)).

The reason why we want the list to appear on axis 1 and not axis 0 is that a horizontal list is not as easy to read as a vertical one.

Is there a way to display those members on rows and have nothing on columns? Sure! This recipe shows how.

Getting ready

Follow these steps to set up the environment for this recipe:

1. Start SQL Server Management Studio (SSMS) or any other application you use for writing and executing MDX queries and connect to your SQL Server Analysis Services (SSAS) 2008 R2 instance (localhost or servername\instancename).
2. Click on the New Query button and check that the target database is Adventure Works DW 2008R2.

How to do it...

Follow these steps to get a one-dimensional query result with members on rows:

1. Put an empty set on columns (AXIS(0)). The notation for an empty set is {}.
2. Put some hierarchy on rows (AXIS(1)). In this case we used the largest hierarchy available in this cube, the Customer hierarchy of the dimension of the same name.
3. Run the following query:

SELECT
   { } ON 0,
   { [Customer].[Customer].[Customer].MEMBERS } ON 1
FROM
   [Adventure Works]

How it works...

Although we can't skip axes, we are allowed to provide an empty set on them. This trick allows us to get what we need: nothing on columns and a set of members on rows.
There's more...

Notice that this type of query is very convenient for parameter selection of another query as well as for search. See how it can be modified to include only those customers whose name contains the phrase "John":

SELECT
   { } ON 0,
   { Filter(
        [Customer].[Customer].[Customer].MEMBERS,
        InStr( [Customer].[Customer].CurrentMember.Name, 'John' ) > 0
     ) } ON 1
FROM
   [Adventure Works]

In the final result, you will notice the "John" phrase in various positions in member names.

The idea behind

If you put a cube measure or a calculated measure with a non-constant expression on axis 0 instead, you'll slow down the query. Sometimes it won't be so obvious, sometimes it will. It will depend on the measure's definition and the number of members in the hierarchy being displayed. For example, if you put the Sales Amount measure on columns, that measure will have to be evaluated for each member in the rows. Do we need those values? No, we don't. The only thing we need is a list of members; hence we've used an empty set. That way, the SSAS engine doesn't have to go into cube space. It can reside in dimension space, which is much smaller, and the query is therefore more efficient.

Possible workarounds

In case of a third-party application or a control which has problems with this kind of MDX statement (that is, it expects something on columns and does not work with an empty set), we can define a constant measure (a measure returning null, 0, 1, or any other constant) and place it on columns instead of that empty set. For example, we can define a calculated measure in the MDX script whose definition is 1, or any other constant value, and use that measure on the columns axis. It might not be as efficient as an empty set, but it is a much better solution than the one with a regular (non-constant) cube measure like the Sales Amount measure.

Handling division by zero errors

Another common task is handling errors, especially division by zero type of errors.
This recipe offers a way to solve that problem. Not all versions of Adventure Works database have the same date range. If you're not using the recommended version of it, the one for the SSAS 2008 R2, you might have problems with queries. Older versions of Adventure Works database have dates up to the year 2006 or even 2004. If that's the case, make sure you adjust examples by offsetting years in the query with a fixed number. For example, the year 2006 should become 2002 and so on. Getting ready Start a new query in SQL Server Management Studio and check that you're working on Adventure Works database. Then write and execute this query: WITH MEMBER [Date].[Calendar Year].[CY 2006 vs 2005 Bad] AS [Date].[Calendar Year].[Calendar Year].&[2006] / [Date].[Calendar Year].[Calendar Year].&[2005], FORMAT_STRING = 'Percent' SELECT { [Date].[Calendar Year].[Calendar Year].&[2005], [Date].[Calendar Year].[Calendar Year].&[2006], [Date].[Calendar Year].[CY 2006 vs 2005 Bad] } * [Measures].[Reseller Sales Amount] ON 0, { [Sales Territory].[Sales Territory].[Country].MEMBERS } ON 1 FROM [Adventure Works] This query returns 6 rows with countries and 3 rows with years, the third row being the ratio of the previous two, as its definition says. The problem is that we get 1.#INFM on some cells. To be precise, that value (the formatted value of infinity), appears on rows where the CY 2005 is null. Here's a solution for that. How to do it... Follow these steps to handle division by zero errors: Copy the calculated member and paste it as another calculated member. During that, replace the term Bad with Good in its name, just to differentiate those two members. Copy the denominator. Wrap the expression in an outer IIF() statement. Paste the denominator in the condition part of the IIF() statement and compare it against 0. Provide null value for the True part. Your initial expression should be in the False part. 
Don't forget to include the new member on columns and execute the query: MEMBER [Date].[Calendar Year].[CY 2006 vs 2005 Good] AS IIF ([Date].[Calendar Year].[Calendar Year].&[2005] = 0, null, [Date].[Calendar Year].[Calendar Year].&[2006] / [Date].[Calendar Year].[Calendar Year].&[2005] ), FORMAT_STRING = 'Percent' The result shows that the new calculated measure corrects the problem – we don't get errors (the rightmost column, compared to the one on its left): How it works... A division by zero error occurs when the denominator is null or zero and the numerator is not null. In order to prevent this error, we must test the denominator before the division and handle the case when it is null or zero. That is done using an outer IIF() statement. It is enough to test just for zero because null = 0 returns True. There's more... SQLCAT's SQL Server 2008 Analysis Services Performance Guide has lots of interesting details regarding the IIF() function: http://tinyurl.com/PerfGuide2008 Additionally, you may find Jeffrey Wang's blog article useful in explaining the details of the IIF() function: http://tinyurl.com/IIFJeffrey Earlier versions of SSAS If you're using a version of SSAS prior to 2008 (that is, 2005), the performance will not be as good. See Mosha Pasumansky's article for more info: http://tinyurl.com/IIFMosha
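
The guard used in the division by zero recipe can also be pictured outside MDX. The following Python sketch is only an illustration under stated assumptions: safe_ratio is a hypothetical helper, None stands in for an MDX null, the handling of an empty numerator is a choice made for this sketch, and the sample figures are invented. It mirrors the IIF() logic: test the denominator first, return null if it is null or zero, and divide only otherwise.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Hypothetical stand-in for IIF(denominator = 0, null, numerator / denominator)."""
    # The recipe notes that null = 0 returns True in MDX,
    # so a null and a zero denominator take the same branch here.
    if denominator is None or denominator == 0:
        return None
    # Assumption of this sketch: an empty numerator yields an empty result.
    if numerator is None:
        return None
    return numerator / denominator


# Invented sample data: (country, CY 2005 amount, CY 2006 amount).
sample = [
    ("Australia", None, 1500.0),
    ("Canada", 2000.0, 3000.0),
]

for country, cy2005, cy2006 in sample:
    print(country, safe_ratio(cy2006, cy2005))
```

The design point is the same as in the MDX version: the check happens before the division, so the division is never attempted with an empty or zero denominator.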


Packt

08 May 2012

12 min read

Troubleshooting with Intuit Quickbooks


The following table is the Recipe Reference Card for the keyboard shortcuts included in this article:

Find: Ctrl+F
Delete the line: Ctrl+Del
Save and close: Alt+A
Advance to the next field: Tab
Go back to the previous field: Shift+Tab
Customize the report: Alt+M

Clearing stale undeposited funds

When the Undeposited Funds window includes customer payments that you know have already been deposited, recorded, and reconciled, the Income or Unearned Income and Undeposited Funds accounts are overstated. You can use this recipe to efficiently combine the cleared deposit with the undeposited funds.

Getting ready

Verify that the appropriate bank account is reconciled for the period containing the stale undeposited funds. If it is not, the problem can be resolved simply by deleting the recorded deposit and recording the deposit of the undeposited funds.

How to do it...

1. With the Find tool (Edit | Find or Ctrl+F), use the Amount filter, along with the Date filter if necessary, to bring up both the deposit and the customer payment already recorded.
2. Open up the deposit, and click on the Payments button.
3. Check off the appropriate transaction, and click on OK to add this to the deposit.
4. Click on the line item for the deposit originally on the screen, that is, the duplicate of the payment that you just added to this screen.
5. Click on Edit | Delete Line (Ctrl+Del), and then Save and Close (Alt+A).

How it works…

The only way to directly delete an item added to Undeposited Funds is to delete the underlying customer payment or sales receipt. However, this is not advisable, because these transactions are typically accurate representations of a real-world activity. Additionally, when the deposit was recorded, the related account duplicated the income or customer deposit from the original invoice or sales receipt. Therefore, the deposit itself needs to be modified to simultaneously remove the duplicate offset account and resolve the outstanding Undeposited Funds item.

There's more…

For a printable and memorizable list of all outstanding items in the Undeposited Funds account:

1. Open the Undeposited Funds ledger.
2. Click on Customize Report | Filters | Choose Filter | Cleared | No.
3. On the Header/Footer tab, in the Report Title field, enter Undeposited Funds, and click on OK.

Adjusting cash basis receivables or payables balances

Does your cash basis balance sheet show a balance in your receivables or payables accounts? This recipe will take you through the two-step process of resolving these items:

1. Locate them
2. Adjust them

Getting ready

To find out which customers and vendors are responsible for your cash basis accounts receivable or accounts payable balances, respectively, run the following report:

1. Go to Reports | Custom Reports | Summary.
2. Set Dates to All. If you desire a cut-off date, leave the From field blank, and enter your cut-off date in the To field.
3. Set Report Basis to Cash.
4. Set Display rows by to Customer or Vendor.
5. Go to Advanced, and set Display Rows to Non-zero.
6. Go to the Filters tab, and set Account to Accounts Receivable (or Accounts Payable).
7. Go to the Header/Footer tab, and set Report Title to Cash Basis A/R by Customer or Cash Basis A/P by Vendor.

The report total matches your balance sheet account total for the same cut-off date.

How to do it...

1. Double-click one of the account balances to reveal the detail, and remove the columns irrelevant to this effort.
2. Scan the activity for patterns, unusual items, or clues about the cash basis balance.
3. Gain an understanding of the transaction, and resolve the items by changing the accounts, changing a date, making a journal entry, noting that no adjustment is needed, or taking some other action.
4. Refresh the report, and confirm either a zero balance or an appropriate cash basis balance.
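
One way to picture the scan in the steps above is as a pass over a running balance. The sketch below is a rough illustration only: it assumes the detail report has been exported as a list of (date, type, amount) rows, and the function name, the row layout, and the figures are all hypothetical. Nothing here calls QuickBooks. It finds the last row at which the running balance returned to 0.00, so that the rows after it can be reviewed as likely sources of the remaining balance.

```python
from decimal import Decimal
from typing import List, Tuple

# Hypothetical row layout: (date, transaction type, signed amount).
Row = Tuple[str, str, Decimal]


def rows_after_last_zero(rows: List[Row]) -> List[Row]:
    """Return the rows recorded after the running balance last equaled zero."""
    balance = Decimal("0")
    last_zero_index = -1  # -1 means the balance never returned to zero
    for index, (_date, _type, amount) in enumerate(rows):
        balance += amount
        if balance == 0:
            last_zero_index = index
    return rows[last_zero_index + 1:]


# Invented sample activity for one customer.
activity = [
    ("2011-01-05", "Invoice", Decimal("500.00")),
    ("2011-01-20", "Payment", Decimal("-500.00")),
    ("2011-02-03", "Payment", Decimal("-250.00")),
    ("2011-02-10", "Invoice", Decimal("300.00")),
]

for row in rows_after_last_zero(activity):
    print(row)
```

Decimal is used rather than float so that the comparison against zero is exact for currency amounts.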
How it works…

Some of the most likely patterns to scan for include the following:

- A Balance column that keeps returning to 0.00, and then stops returning to 0.00
- An unusual transaction Type
- A recurring figure in the Balance column. This lets you know that at least one culprit occurred before the recurrence began
- A zero balance right before a transaction whose amount is also the aggregate balance sheet account balance

There's more…

The most common reasons for a cash basis receivables or payables balance are:

- The payment date precedes the bill or invoice date, and the report cut-off falls between the two dates
- The offset account is a balance sheet account, and the bill or the invoice is unpaid

Writing off stale receivables

Making a journal entry to write off stale A/R in bulk is easy, but it makes the writeoffs difficult to trace through the accounting records. The possible uses for more precise information include producing a trail for taxing authorities, internal or independent auditors, or banks. A separate spreadsheet may suffice, but it may be difficult to coordinate. This recipe focuses on straightforward ways to write off these balances in a detailed but efficient fashion.

Getting ready

Have your criteria ready for which invoices are to be written off. The A/R Aging Summary report may help (Reports | Customers & Receivables | A/R Aging Summary). To further analyze your oldest receivables:

1. Go to Customize Report | Age through how many days?, and type 360.
2. Go to Filters | Choose Filter | Aging | >=, and type 90.
3. Go to Header/Footer, add the text Older Than 90 days to the Report Title, and click on OK.

How to do it...

To write off a stale receivable:

1. Go to the Customers page or Home Page | Receive Payments.
2. From the Received From drop-down box, select the appropriate customer. If applicable, select the particular job instead.
3. In the Date field, enter the effective date of the writeoff.
4. Click on the Discount & Credits button.
5. In the Discount and Credits pop-up window, fill in the amount to be written off, the writeoff account (generally Bad Debt), and the same class from the original invoice, if class tracking is used in the file.
6. The completed screen should have 0.00 in the Amount and Payment fields. Include the amount written off in the Discount field.
7. If an allowance for doubtful accounts is used against bad debt for writeoffs, then set up the Allowance account as an Accounts Receivable account type, and select the Allowance account from the drop-down box at the top of the Customer Payments screen. A set of journal entries can be used later to remove the amounts from both Accounts Receivable and the Allowance account.

There's more…

This is the same procedure that can be used to record discounts, but the key is that an income or expense account must always be selected. The procedure is not appropriate when a balance sheet account is selected, such as debiting a liability account while crediting A/R, or debiting the Allowance account while crediting A/R. This will cause a cash basis balance sheet report to be out of balance. If that combination of debits and credits is essential, then use a journal entry instead. Then, apply the journal entry to the original invoice by opening the invoice and clicking the Apply Credits button.

When this recipe is used to write off receivables, Act. Revenue is reduced in the Job Profitability Summary report, and there is no effect on the Item Profitability Summary report. The same reporting results are attained if a journal entry is used to debit Bad Debt Expense and credit Accounts Receivable. If a Credit Memo is used instead, Act. Revenue is reduced in the Job Profitability Summary report as well as the Item Profitability Summary report.

In order to increase the Act. Cost column in the Job Profitability Summary report instead, use the Write Checks screen in an unusual fashion:

1. On the Items tab, use an Other Charge item called Bad Debt or Writeoffs. When you create this item, link it to the Bad Debt Expense account.
2. On the Write Checks screen, be sure to enter the Customer:Job name as well as the writeoff amount.
3. On the Expenses tab, select Accounts Receivable, and enter the writeoff amount as a negative number, so that the total amount of the check equals 0.
4. Be sure that the check bears no check number, and clear it in the next bank reconciliation.

This technique causes both the Job Profitability and Item Profitability reports to show the transaction as an expense, rather than as a reduction of revenue. It works because QuickBooks includes the Write Check transactions in the Act. Cost column of these reports.

Writing off stale payables

Making a journal entry to write off stale A/P in bulk is easy, but it makes the writeoffs difficult to trace through the accounting records. Possible uses for more precise information include producing a trail for taxing authorities, internal or independent auditors, or banks. A separate spreadsheet may suffice, but it may be difficult to coordinate. This recipe focuses on straightforward ways to write off these balances in a detailed but efficient fashion.

Getting ready

Have your criteria ready for which bills are to be written off. The A/P Aging Summary report may help (Reports | Vendors & Payables | A/P Aging Summary). To further analyze your oldest payables:

1. Go to Customize Report | Age through how many days?, and type 360.
2. Go to Filters | Choose Filter | Aging | >=, and type 90.
3. Go to Header/Footer, add the text Older Than 90 days to the Report Title, and click on OK.

How to do it...

1. Go to Vendors or Home Page | Pay Bills.
2. Consider using the Filter by drop-down list to only show bills from a particular vendor, and consider using the Sort by drop-down list to organize the payables list by Due Date.
3. For one single vendor, check off the first bill to be written off.
4. Click on the Set Discount button.
5. In the Discount and Credits pop-up window, fill in the amount to be written off, the writeoff account (generally the same expense account as the original bill), and the same class from the original bill, if class tracking is used in the file.
6. Click on Done, and proceed to the next bill for the same vendor.
7. Make sure the Payment Date field is the effective date of the writeoff.
8. The completed screen should have 0.00 in the Amt. to Pay field. Include the amount written off in the Disc. Used field.
9. When the writeoffs for that vendor are complete, click on Pay Selected Bills, followed by Pay More Bills for additional writeoffs.

There's more…

The advantage of this recipe is that the transaction is created and applied to the bill in a single step. However, the drawback is that it does not appear in the Job Profitability Summary or the Item Profitability Summary reports. For that to occur, create a vendor credit instead, by using the Enter Bills screen and clicking on the Credit button. Then, use the Items tab to record the credit, using the same item that was used in the original bill. Additionally, use the Customer:Job field to apply the credit to a particular job.

For a partial writeoff, after the Discount and Credits window is closed, be sure to manually input 0.00 into the Amt. to Pay field. The default is to include the remaining balance in that field, and this recipe assumes that the current action is only to record writeoffs, not payments to vendors.

Balancing the balance sheet

How can a balance sheet get out of balance in a software program? If you're reading this recipe, you may have already seen for yourself that the impossible can happen. The following is a procedure to root out the transaction that is causing this phenomenon.

Getting ready

A balance sheet prepared on the cash basis can be out of balance if certain transactions were saved, for example if the Discount feature was used with a balance sheet account.

How to do it...

1. Open a Balance Sheet Summary report.
2. Click on the Customize Report button and in the Dates drop-down box, select All.
3. If the report is on the accrual basis, change the Report Basis to Cash.
4. In the Display columns by drop-down box, change the selection to Year, and click on OK.
5. Look at the balance sheet, and identify the earliest year in which the balance sheet is out of balance.
6. Click on the Customize Report button. In the From and To fields, enter the beginning and ending dates of the year identified in the previous step.
7. In the Display columns by drop-down box, change the selection to Month, and click on OK.
8. Look at the balance sheet, and identify the earliest month in which the balance sheet is out of balance.
9. Click on the Customize Report button. In the From and To fields, enter the beginning and ending dates of the month identified in the previous step.
10. In the Display columns by drop-down box, change the selection to Week, and click on OK.
11. Look at the balance sheet, and identify the earliest week in which the balance sheet is out of balance.
12. Click on the Customize Report button. In the From and To fields, enter the beginning and ending dates of the week identified in the previous step.
13. In the Display columns by drop-down box, change the selection to Day, and click on OK.
14. Look at the balance sheet, and identify the earliest day on which the balance sheet is out of balance.
15. Run a transaction journal (Reports | Accountant & Taxes | Journal), limit the transactions to that day, and scan the report for the transaction responsible.
16. Delete the transaction that caused the imbalance, which usually comes from a Customer Payment or other A/R or A/P data entry screen, and make a journal entry instead to cover the appropriate debit and credit.
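
The drill-down above narrows the search one period at a time: year, then month, then week, then day. As a rough illustration only, the sketch below applies the same idea to a list of dated entries. The Entry layout, the function names, and the sample figures are hypothetical, and the code does not read QuickBooks data. Each entry carries the net amount by which it throws the balance sheet out of balance (0 for a well-formed transaction), and the search returns the earliest day on which the cumulative imbalance is non-zero.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal
from typing import Callable, Hashable, List, NamedTuple, Optional


class Entry(NamedTuple):
    day: date
    imbalance: Decimal  # hypothetical field: 0 when debits equal credits


def earliest_bad_bucket(entries: List[Entry],
                        key: Callable[[date], Hashable]) -> Optional[List[Entry]]:
    """Group entries by key(day); return the earliest group at which the
    cumulative imbalance is non-zero, or None if everything balances."""
    buckets = defaultdict(list)
    for entry in entries:
        buckets[key(entry.day)].append(entry)
    running = Decimal("0")
    for bucket_key in sorted(buckets):
        running += sum((e.imbalance for e in buckets[bucket_key]), Decimal("0"))
        if running != 0:
            return buckets[bucket_key]
    return None


def earliest_bad_day(entries: List[Entry]) -> Optional[date]:
    """Narrow by year, month, ISO week, then day, like the report drill-down."""
    keys = [
        lambda d: d.year,
        lambda d: (d.year, d.month),
        lambda d: tuple(d.isocalendar())[:2],  # (ISO year, ISO week)
        lambda d: d,
    ]
    current = entries
    for key in keys:
        current = earliest_bad_bucket(current, key)
        if current is None:
            return None
    return current[0].day


# Invented sample data: one bad entry in June 2011 and another in September 2011.
entries = [
    Entry(date(2010, 3, 15), Decimal("0")),
    Entry(date(2011, 6, 7), Decimal("125.00")),
    Entry(date(2011, 6, 21), Decimal("0")),
    Entry(date(2011, 9, 2), Decimal("-40.00")),
]

print(earliest_bad_day(entries))
```

Narrowing in stages keeps each report small enough to read by eye, which is the point of the manual procedure; in code, a single pass by day would give the same answer.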
There's more…

If the Discount feature was used to reclassify an Accounts Receivable balance to Retainage Receivable, make a journal entry to achieve the same General Ledger effect instead, and apply the transaction to the original invoice by opening the invoice and using the Apply Credits button.


Packt

23 May 2014

11 min read

Interacting with Data for Dashboards

