Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1210 Articles
article-image-lucenenet-optimizing-and-merging-index-segments
Packt
20 Aug 2013
3 min read
Save for later

Lucene.NET: Optimizing and merging index segments

Packt
20 Aug 2013
3 min read
(For more resources related to this topic, see here.) How to do it… Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe demonstrates the use of the Optimize method to clean up the storage of the index data on the physical disk. The general steps in the process to optimize and index segments are the following: Create/open an index. Add or delete documents from the index. Examine the MaxDoc and NumDocs properties of the IndexWriter class. If the index is deemed to be too dirty, call the Optimize method of the IndexWriter class. The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index. namespace Lucene.NET.HowTo._12_MergeAndOptimize {// ...// build facade and an initial index of 5 documentsvar facade = new LuceneDotNetHowToExamplesFacade().buildLexicographicalExampleIndex(maxDocs: 5).createIndexWriter();// report MaxDoc and NumDocsTrace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));Trace.WriteLine(string.Format("NumDocs=={0}",facade.IndexWriter.NumDocs()));// delete one documentfacade.IndexWriter.DeleteDocuments(new Term("filename", "0.txt"));facade.IndexWriter.Commit();// report MaxDoc and NumDocsTrace.WriteLine("After delete / commit");Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));// optimize the indexfacade.IndexWriter.Optimize();// report MaxDoc and NumDocsTrace.WriteLine("After Optimize");Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));Trace.Flush();// ...} How it works… When this program is run, you will see output similar to that in the following screenshot: This program first creates an index with five files. It then reports the values of the MaxDoc and NumDocs properties of the instance of IndexWriter. MaxDoc represents the maximum number of documents that have been stored in the index. It is possible to add more documents, but that may incur a performance penalty by needing to grow the index. NumDocs is the current number of documents stored in the index. At this point these values are 5 and 5, respectively. The next step deletes a single document named 0.txt from the index, and the changes are committed to disk. MaxDoc and NumDocs are written to the console again and now report 5 and 4 respectively. This makes sense as one file has been deleted and there is now "slop" in the index where space is being taken up from a previously deleted document. The reference to the document index information has been removed, but the space is still used on the disk. The final two steps are to call Optimize and to write MaxDoc and NumDocs values to the console, for the final time. These now are 4 and 4, respectively, as Lucene.NET has merged any index segments and removed any empty disk space formerly used by deleted document index information. Summary A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as index writer's flush-buffered content into the index's directory and file structure. Over time this fragmentation will cause searches to slow, requiring a merge/optimization to be performed to regain performance. Resources for Article : Further resources on this subject: Extending Your Structure and Search [Article] Advanced Performance Strategies [Article] Creating your first collection (Simple) [Article]
Read more
  • 0
  • 0
  • 7341

article-image-emr-architecture
Packt
27 Oct 2014
6 min read
Save for later

The EMR Architecture

Packt
27 Oct 2014
6 min read
This article is written by Amarkant Singh and Vijay Rayapati, the authors of Learning Big Data with Amazon Elastic MapReduce. The goal of this article is to introduce you to the EMR architecture and EMR use cases. (For more resources related to this topic, see here.) Traditionally, very few companies had access to large-scale infrastructure to build Big Data applications. However, cloud computing has democratized the access to infrastructure allowing developers and companies to quickly perform new experiments without worrying about the need for setting up or scaling infrastructure. A cloud provides an infrastructure as a service platform to allow businesses to build applications and host them reliably with scalable infrastructure. It includes a variety of application-level services to help developers to accelerate their development and deployment times. Amazon EMR is one of the hosted services provided by AWS and is built on top of a scalable AWS infrastructure to build Big Data applications. The EMR architecture Let's get familiar with the EMR. This section outlines the key concepts of EMR. Hadoop offers distributed processing by using the MapReduce framework for execution of tasks on a set of servers or compute nodes (also known as a cluster). One of the nodes in the Hadoop cluster will be controlling the distribution of tasks to other nodes and it's called the Master Node. The nodes executing the tasks using MapReduce are called Slave Nodes: Amazon EMR is designed to work with many other AWS services such as S3 for input/output data storage, DynamoDB, and Redshift for output data. EMR uses AWS CloudWatch metrics to monitor the cluster performance and raise notifications for user-specified alarms. We can create on-demand Hadoop clusters using EMR while storing the input and output data in S3 without worrying about managing a 24*7 cluster or HDFS for data storage. The Amazon EMR job flow is shown in the following diagram: Types of nodes Amazon EMR provides three different roles for the servers or nodes in the cluster and they map to the Hadoop roles of master and slave nodes. When you create an EMR cluster, then it's called a Job Flow, which has been created to execute a set of jobs or job steps one after the other: Master node: This node controls and manages the cluster. It distributes the MapReduce tasks to nodes in the cluster and monitors the status of task execution. Every EMR cluster will have only one master node in a master instance group. Core nodes: These nodes will execute MapReduce tasks and provide HDFS for storing the data related to task execution. The EMR cluster will have core nodes as part of it in a core instance group. The core node is related to the slave node in Hadoop. So, basically these nodes have two-fold responsibility: the first one is to execute the map and reduce tasks allocated by the master and the second is to hold the data blocks. Task nodes: These nodes are used for only MapReduce task execution and they are optional while launching the EMR cluster. The task node is related to the slave node in Hadoop and is part of a task instance group in EMR. When you scale down your clusters, you cannot remove any core nodes. This is because EMR doesn't want to let you lose your data blocks. You can remove nodes from a task group while scaling down your cluster. You should also be using only task instance groups to have spot instances, as spot instances can be taken away as per your bid price and you would not want to lose your data blocks. You can launch a cluster having just one node, that is, with just one master node and no other nodes. In that case, the same node will act as both master and core nodes. For simplicity, you can assume a node as EC2 server in EMR. EMR use cases Amazon EMR can be used to build a variety of applications such as recommendation engines, data analysis, log processing, event/click stream analysis, data transformations (ETL), fraud detection, scientific simulations, genomics, financial analysis, or data correlation in various industries. The following section outlines some of the use cases in detail. Web log processing We can use EMR to process logs to understand the usage of content such as video, file downloads, top web URLs accessed by end users, user consumption from different parts of the world, and many more. We can process any web or mobile application logs using EMR to understand specific business insights relevant for your business. We can move all our web access application or mobile logs to Amazon S3 for analysis using EMR even if we are not using AWS for running our production applications. Clickstream analysis By using clickstream analysis, we can segment users into different groups and understand their behaviors with respect to advertisements or application usage. Ad networks or advertisers can perform clickstream analysis on ad-impression logs to deliver more effective campaigns or advertisements to end users. Reports generated from this analysis can include various metrics such as source traffic distribution, purchase funnel, lead source ROI, and abandoned carts among others. Product recommendation engine Recommendation engines can be built using EMR for e-commerce, retail, or web businesses. Many of the e-commerce businesses have a large inventory of products across different categories while regularly adding new products or categories. It will be very difficult for end users to search and identify the products quickly. With recommendation engines, we can help end users to quickly find relevant products or suggest products based on what they are viewing and so on. We may also want to notify users via an e-mail based on their past purchase behavior. Scientific simulations When you need distributed processing with large-scale infrastructure for scientific or research simulations, EMR can be of great help. We can quickly launch large clusters in a matter of minutes and install specific MapReduce programs for analysis using EMR. AWS also offers genomics datasets for free on S3. Data transformations We can perform complex extract, transform, and load (ETL) processes using EMR for either data analysis or data warehousing needs. It can be as simple as transforming XML file data into JSON data for further usage or moving all financial transaction records of a bank into a common date-time format for archiving purposes. You can also use EMR to move data between different systems in AWS such as DynamoDB, Redshift, S3, and many more. Summary In this article, we learned about the EMR architecture. We understood the concepts related to EMR for various node types in detail. Resources for Article: Further resources on this subject: Introduction to MapReduce [Article] Understanding MapReduce [Article] HDFS and MapReduce [Article]
Read more
  • 0
  • 0
  • 7334

article-image-oracle-tools-and-products
Packt
01 Aug 2011
8 min read
Save for later

Oracle Tools and Products

Packt
01 Aug 2011
8 min read
Readers in a DBA or database development role will most likely be familiar with SQL Loader, Oracle database external tables, Oracle GoldenGate, and Oracle Warehouse Builder. Application developers and architects will mostly likely be familiar with Oracle BPEL and the Oracle Service Bus. Database migration products and tools Data migration is the first step when moving your mission critical data to an Oracle database. The initial data loading is traditionally done using Oracle SQL Loader. As data volumes have increased and data quality has become an issue, Oracle Data Warehouse and Oracle Data Integrator have become more important, because of their capabilities to connect directly to source data stores, provide data cleansing and profiling support, and graphical drag and drop development. Now, the base addition of Oracle Data Warehouse Builder is a free, built-in feature of the Oracle 11g database, and price is no longer an issue. Oracle Warehouse Builder and Oracle Data Integrator have gained adoption as they are repository based, have built-in transformation functions, are multi-user, and avoid a proliferation of scripts throughout the enterprise that do the same or simpler data movement activity. These platforms provide a more repeatable, scalable, reusable, and model-based enterprise data migration architecture. SQL Loader SQL Loader is the primary method for quickly populating Oracle tables with data from external files. It has a powerful data parsing engine that puts little limitation on the format of the data in the data file. The tool is invoked, when you specify the sqlldr command or use the Oracle Enterprise Manager interface. SQL Loader has been around as long as the Oracle Database logon "scott/tiger" and is an integral feature of the Oracle database. It works the same on any hardware or software platform that Oracle supports. Therefore, it has become the de facto data migration and information integration tool for most Oracle partners and customers. This also makes it an Oracle legacy data migration and integration solution with all the issues associated with legacy tools, such as: difficult to move away from as the solution is embedded in the enterprise. The current solution has a lot of duplicated code, because it was written by many different developers before the use of structured programming and shared modules. The current solution is not built to support object-orientated development, Service Orientated Architecture products, or other new technologies such as web services and XML. The current solution is difficult and costly to maintain because the code is not structured, the application is not well documented, the original developers are no longer with the company, and any changes to the code cause other pieces of the application to either stop working or fail. SQL Loader is typically used in 'fat file' mode. This means the data is exported into a command-delimited fat file from the source database or arrives in an ASCII fat file. With the growth of data volumes, using SQL Loader with named pipes has become common practice. Named pipes eliminate the need to have temporary data storage mechanisms—instead data is moved in memory. It is interesting that Oracle does not have an SQL unload facility, as Sybase and SQL Server have the Bulk Copy Program (BCP). There are C, Perl, PL/SQL, and other SQL-based scripts to do this, but nothing official from Oracle. The SQL Loader source and target data sources along with development languages and tools supported are as follows: Data source - Any data source that can produce flat files. XML files can also be loaded using the Oracle XMLtype data type Data target - Oracle Development languages and tools - Proprietary SQL Loader control files and SQL Loader Command Line Interface (CLI) The most likely instances or use cases when Oracle SQL Loader would be the Oracle product or tool selected are: Bulk loading data into Oracle from any data source from mainframe to distributed systems. Quick, easy, one-time data migration using a free tool. Oracle external tables The external tables feature is a complement to the existing SQL Loader functionality. It enables you to access data in external sources as if it were in a table in the database. Therefore, standard SQL or Oracle PL/SQL can be used to load the external file (defined as an external table) into an Oracle database table. Customer benchmarks and performance tests have determined that in some cases the external tables are faster than the SQL Loader direct path load. In addition, if you know SQL well, then it is easier to code the external table load SQL than SQL Loader control files and load scripts. The external table source and target data sources along with development languages and tools supported are: Data source - Any data source that can produce flat files Data target - Oracle Development languages and tools -SQL, PL/SQL, Command Line Interface (CLI) The most likely instances or use cases when Oracle external tables would be the Oracle product or tool selected are: Migration of data from non-Oracle databases to the Oracle database. Fast loading of data into Oracle using SQL. Oracle Warehouse Builder Oracle Warehouse Builder (OWB) allows users to extract data from both Oracle and non-Oracle data sources and transform/load into a Data Warehouse, Operational Data Store (ODS) or simply to be used to migrate data to an Oracle database. It is part of the Oracle Business Intelligence suite and is the embedded Oracle Extract- Load-Transform (ELT) tool in this BI suite. With the usage of platform/product specific adapters it can extract data from mainframe/legacy data sources as well. Starting with Oracle Database 11g, the core OWB product is a free feature of the database. In a way, this is an attempt to address the free Microsoft entry level ELT tools like Microsoft Data Transformation Services (DTS) and SQL Server Integration Services (SSIS) from becoming de facto ELT standards, because they are easy to use and are cheap (free). The Oracle Warehouse Builder source and target data sources along with development languages and tools supported are: Data source - Can be used with the Oracle Gateways, so any data source that the Gateway supports Data target - Oracle, ODBC compliant data stores, and any data source accessible through Oracle Gateways, flat files, XML Development languages and tools -OWB GUI development tool, PL/SQL, SQL, CLI The most likely instances or use cases when OWB would be the Oracle product or tool selected are: Bulk loading data on a continuous, daily, monthly or yearly basis. Direct connection to ODBC compliant databases for data migration, consolidation and physical federation, including data warehouses and operational data stores. Low cost (free) data migration that offers a graphical interface, scheduled data movement, data quality, and cleansing. SQL Developer Migration Workbench Oracle SQL Developer Migration Workbench is a tool that enables you to migrate a database, including the schema objects, data, triggers, and stored procedures, to an Oracle Database 11g using a simple point-and-click process. It also generates scripts necessary to perform the migration in batch mode. Its tight integration into SQL Developer (an Oracle database development tool) provides the user with a single- stop tool to explore third-party databases, carry out migrations, and to manipulate the generated schema objects and migrated data. Oracle SQL Developer is provided free of charge and is the first tool used by Oracle employees to migrate Sybase, DB2, MySQL and SQL Server databases to Oracle. SQL Developer Migration Workbench 3.0 was released 2011 and includes support for C application code migration from Sybase and SQL Server DB-Library and CT- Library, a Command Line Interface (CLI), a host of reports that can be used for fixing items that did not migrate, estimating and scoping, and database analysis, and a pluggable framework to support identification and changes to SQL in Java, Powerbuilder, Visual Basic, Perl, or any programming language. SQL Developer Migration Workbench actually started off as a set of Unix scripts and a crude database procedural language parser based on SED and AWK. This solution was first made an official Oracle product in 1996. Since then, the parser has been totally rewritten in Java and the user interface integrated with SQL Developer. SQL Developer Migration Workbench source and target data sources along with development languages and tools supported are: Data source - DB2 LUW, MySQL, Informix, SQL Server, Sybase Data target - Oracle Development languages and tools - SQL Developer GUI development tool, Command Line Interface (CLI) The most likely instances or use cases when SQL Developer Migration Workbench would be the Oracle product or tool selected are: Data migration from popular LUW RDBMS systems to Oracle using fat files or JDBC connectivity. RDBMS object (stored procedures, triggers, views) translation from popular LUW RDBMS to Oracle.
Read more
  • 0
  • 0
  • 7298

article-image-linking-data-shapes
Packt
01 Jun 2016
7 min read
Save for later

Linking Data to Shapes

Packt
01 Jun 2016
7 min read
In this article by David J Parker, the author of the book Mastering Data Visualization with Microsoft Visio Professional 2016, discusses about that Microsoft introduced the current data-linking feature in the Professional edition of Visio Professional 2007. This feature is better than the database add-on that has been around since Visio 4 because it has greater importing capabilities and is part of the core product with its own API. This provides the Visio user with a simple method of surfacing data from a variety of data sources, and it gives the power user (or developer) the ability to create productivity enhancements in code. (For more resources related to this topic, see here.) Once data is imported in Visio, the rows of data can be linked to shapes and then displayed visually, or be used to automatically create hyperlinks. Moreover, if the data is edited outside of Visio, then the data in the Visio shapes can be refreshed, so the shapes reflect the updated data. This can be done in the Visio client, but some data sources can also refresh the data in Visio documents that are displayed in SharePoint web pages. In this way, Visio documents truly become operational intelligence dashboards. Some VBA knowledge will be useful, and the sample data sources are introduced in each section. In this chapter, we shall cover the following topics: The new Quick Import feature Importing data from a variety of sources How to link shapes to rows of data Using code for more linking possibilities A very quick introduction to importing and linking data Visio Professional 2016 added more buttons to the Data ribbon tab, and some new Data Graphics, but the functionality has basically been the same since Visio 2007 Professional. The new additions, as seen in the following screenshot, can make this particular ribbon tab quite wide on the screen. Thank goodness that wide screens have become the norm: The process to create data-refreshable shapes in Visio is simply as follows: Import data as recordsets. Link rows of data to shapes. Make the shapes display the data. Use any hyperlinks that have been created automatically. The Quick Import tool introduced in Visio 2016 Professional attempts to merge the first three steps into one, but it rarely gets it perfectly, and it is only for simple Excel data sources. Therefore, it is necessary to learn how to use the Custom Import feature properly. Knowing when to use the Quick Import tool The Data | External Data | Quick Import button is new in Visio 2016 Professional. It is part of the Visio API, so it cannot be called in code. This is not a great problem because it is only a wrapper for some of the actions that can be done in code anyway. This feature can only use an Excel workbook, but fortunately Visio installs a sample OrgData.xls file in the Visio Content<LCID> folder. The LCID (Location Code Identifier) for US English is 1033, as shown in the following screenshot: The screenshot shows a Visio Professional 2016 32-bit installation is on a Windows 10 64-bit laptop. Therefore, the Office16 applications are installed in the Program Files (x86)root folder. It would just be Program Filesroot if the 64-bit version of Office was installed. It is not possible to install a different bit version of Visio than the rest of the Office applications. There is no root folder in previous versions of Office, but the rest of the path is the same. The full path on this laptop is C:Program Files (x86)Microsoft OfficerootOffice16Visio Content1033ORGDATA.XLS, but it is best to copy this file to a folder where it can be edited. It is surprising that the Excel workbook is in the old binary format, but it is a simple process to open and save it in the new Open Packaging Convention file format with an xlsx extension. Importing to shapes without existing Shape Data rows The following example contains three Person shapes from the Work Flow Objects stencil, and each one contains the names of a person’s name, spelt exactly the same as in the key column on the Excel worksheet. It is not case sensitive, and it does not matter whether there are leading or trailing spaces in the text. When the Quick Import button is pressed, a dialog opens up to show the progress of the stages that the wizard feature is going through, as shown in the following screenshot: If the workbook contains more than one table of data, the user is prompted to select the range of cells within the workbook. When the process is complete, each of the Person shapes contains all of the data from the row in the External Data recordset, where the text matches the Name column, as shown in the following screenshot: The linked rows in the External Data window also display a chain icon, and the right-click menu has many actions, such as selecting the Linked Shapes for a row. Conversely, each shape now contains a right-mouse menu action to select the linked row in an External Data recordset. The Quick Import feature also adds some default data graphics to each shape, which will be ignored in this chapter because it is explored in detail in chapter 4, Using the Built-in Data Graphics. Note that the recordset in the External Data window is named Sheet1$A1:H52. This is not perfect, but the user can rename it through the right mouse menu actions of the tab. The Properties dialog, as seen in the following screenshot: The user can also choose what to do if a data link is added to a shape that already has one. A shape can be linked to a single row in multiple recordsets, and a single row can be linked to multiple shapes in a document, or even on the same page. However, a shape cannot be linked to more than one row in the same recordset. Importing to shapes with existing Shape Data rows The Person shape from the Resources stencil has been used in the following example, and as earlier, each shape has the name text. However, in this case, there are some existing Shape Data rows: When the Quick Import feature is run, the data is linked to each shape where the text matches the Name column value. This feature has unfortunately created a problem this time because the Phone Number, E-mail Alias, and Manager Shape Data rows have remained empty, but the superfluous Telephone, E-mail, and Reports_To have been added. The solution is to edit the column headers in the worksheet to match the existing Shape Data row labels, as shown in the following screenshot: Then, when Quick Import is used again, the column headers will match the Shape Data row names, and the data will be automatically cached into the correct places, as shown in the following screenshot: Using the Custom Import feature The user has more control using the Custom Import button on the Data | External Data ribbon tab. This button was called Link Data to Shapes in the previous versions of Visio. In either case, the action opens the Data Selector dialog, as shown in the following screenshot: Each of these data sources will be explained in this chapter, along with the two data sources that are not available in the UI (namely XML files and SQL Server Stored Procedures). Summary This article has gone through the many different sources for importing data in Visio and has shown how each can be done. Resources for Article: Further resources on this subject: Overview of Process Management in Microsoft Visio 2013[article] Data Visualization[article] Data visualization[article]
Read more
  • 0
  • 0
  • 7293

article-image-overview-certificate-management
Packt
18 Jul 2016
24 min read
Save for later

Overview of Certificate Management

Packt
18 Jul 2016
24 min read
In this article by David Steadman and Jeff Ingalls, the authors of Microsoft Identity Manager 2016 Handbook, we will look at certificate management in brief. Microsoft Identity Management (MIM)—certificate management (CM)—is deemed the outcast in many discussions. We are here to tell you that this is not the case. We see many scenarios where CM makes the management of user-based certificates possible and improved. If you are currently using FIM certificate management or considering a new certificate management deployment with MIM, we think you will find that CM is a component to consider. CM is not a requirement for using smart cards, but it adds a lot of functionality and security to the process of managing the complete life cycle of your smart cards and software-based certificates in a single forest or multiforest scenario. In this article, we will look at the following topics: What is CM? Certificate management components Certificate management agents The certificate management permission model (For more resources related to this topic, see here.) What is certificate management? Certificate management extends MIM functionality by adding management policy to a driven workflow that enables the complete life cycle of initial enrollment, duplication, and the revocation of user-based certificates. Some smart card features include offline unblocking, duplicating cards, and recovering a certificate from a lost card. The concept of this policy is driven by a profile template within the CM application. Profile templates are stored in Active Directory, which means the application already has a built-in redundancy. CM is based on the idea that the product will proxy, or be the middle man, to make a request to and get one from CA. CM performs its functions with user agents that encrypt and decrypt its communications. When discussing PKI (Public Key Infrastructure) and smart cards, you usually need to have some discussion about the level of assurance you would like for the identities secured by your PKI. For basic insight on PKI and assurance, take a look at http://bit.ly/CorePKI. In typical scenarios, many PKI designers argue that you should use Hardware Security Module (HSM) to secure your PKI in order to get the assurance level to use smart cards. Our personal opinion is that HSMs are great if you need high assurance on your PKI, but smart cards increase your security even if your PKI has medium or low assurance. Using MIM CM with HSM will not be covered in this article, but if you take a look at http://bit.ly/CMandLunSA, you will find some guidelines on how to use MIM CM and HSM Luna SA. The Financial Company has a low-assurance PKI with only one enterprise root CA issuing the certificates. The Financial Company does not use a HSM with their PKI or their MIM CM. If you are running a medium- or high-assurance PKI within your company, policies on how to issue smart cards may differ from the example. More details on PKI design can be found at http://bit.ly/PKIDesign. Certificate management components Before we talk about certificate management, we need to understand the underlying components and architecture: As depicted before, we have several components at play. We will start from the left to the right. From a high level, we have the Enterprise CA. The Enterprise CA can be multiple CAs in the environment. Communication from the CM application server to the CA is over the DCOM/RPC channel. End user communication can be with the CM web page or with a new REST API via a modern client to enable the requesting of smart cards and the management of these cards. From the CM perspective, the two mandatory components are the CM server and the CA modules. Looking at the logical architecture, we have the CA, and underneath this, we have the modules. The policy and exit module, once installed, control the communication and behavior of the CA based on your CM's needs. Moving down the stack, we have Active Directory integration. AD integration is the nuts and bolts of the operation. Integration into AD can be very complex in some environments, so understanding this area and how CM interacts with it is very important. We will cover the permission model later in this article, but it is worth mentioning that most of the configuration is done and stored in AD along with the database. CM uses its own SQL database, and the default name is FIMCertificateManagement. The CM application uses its own dedicated IIS application pool account to gain access to the CM database in order to record transactions on behalf of users. By default, the application pool account is granted the clmApp role during the installation of the database, as shown in the following screenshot:   In CM, we have a concept called the profile template. The profile template is stored in the configuration partition of AD, and the security permissions on this container and its contents determine what a user is authorized to see. As depicted in the following screenshot, CM stores the data in the Public Key Services (1) and the Profile Templates container. CM then reads all the stored templates and the permissions to determine what a user has the right to do (2): Profile templates are at the core of the CM logic. The three components comprising profile templates are certificate templates, profile details, and management policies. The first area of the profile template is certificate templates. Certificate templates define the extensions and data point that can be included in the certificate being requested. The next item is profile details, which determines the type of request (either a smart card or a software user-based certificate), where we will generate the certificates (either on the server or on the client side of the operations), and which certificate templates will be included in the request. The final area of a profile template is known as management policies. Management policies are the workflow engine of the process and contain the manager, the subscriber functions, and any data collection items. The e-mail function is initiated here and commonly referred to as the One Time Password (OTP) activity. Note the word "One". A trigger will only happen once here; therefore, multiple alerts using e-mail would have to be engineered through alternate means, such as using the MIM service and expiration activities. The permission model is a bit complex, but you'll soon see the flexibility it provides. Keep in mind that Service Connection Point (SCP) also has permissions applied to it to determine who can log in to the portal and what rights the user has within the portal. SCP is created upon installation during the wizard configuration. You will want to be aware of the SCP location in case you run into configuration issues with administrators not being able to perform particular functions. The SCP location is in the System container, within Microsoft, and within Certificate Lifecycle Manager, as shown here: Typical location CN=Certificate Lifecycle Manager,CN=Microsoft,CN=System,DC=THEFINANCIALCOMPANY,DC=NET Certificate management agents We covered several key components of the profile templates and where some of the permission model is stored. We now need to understand how the separation of duties is defined within the agent role. The permission model provides granular control, which promotes the separation of duties. CM uses six agent accounts, and they can be named to fit your organization's requiremensts. We will walk through the initial setup again later in this article so that you can use our setup or alter it based on your need. The Financial Company only requires the typical setup. We precreated the following accounts for TFC, but the wizard will create them for you if you do not use them. During the installation and configuration of CM, we will use the following accounts: Besides the separation of duty, CM offers enrollment by proxy. Proxy enrollment of a request refers to providing a middle man to provide the end user with a fluid workflow during enrollment. Most of this proxy is accomplished via the agent accounts in one way or another. The first account is MIM CM Agent (MIMCMAgent), which is used by the CM server to encrypt data from the smart card admin PINs to the data collection stored in the database. So, the agent account has an important role to protect data and communication to and from the certificate authorities. The last user agent role CMAgent has is the capability to revoke certificates. The agent certificate thumbprint is very important, and you need to make sure the correct value is updated in the three areas: CM, web.config, and the certificate policy module under the Signing Certificates tab on the CA. We have identified these areas in the following. For web.config: <add key="Clm.SigningCertificate.Hash" value <add key="Clm.Encryption.Certificate.Hash" value <add key="Clm.SmartCard.ExchangeCertificate.Hash" value The Signing Certificates tab is as shown in the following screenshot:   Now, when you run through the configuration wizard, these items are already updated, but it is good to know which locations need to be updated if you need to troubleshoot agent issues or even update/renew this certificate. The second account we want to look at is Key Recovery Agent (MIMCMKRAgent); this agent account is needed for CM to recover any archived private keys certificates. Now, let's look at Enrollment Agent (MIMCMEnrollAgent); the main purpose of this agent account is to provide the enrollment of smart cards. Enrollment Agent, as we call it, is responsible for signing all smart card requests before they are submitted to the CA. Typical permission for this account on the CA is read and request. Authorization Agent (MIMCMAuthAgent)—or as some folks call this, the authentication agent—is responsible for determining access rights for all objects from a DACL perspective. When you log in to the CM site, it is the authorization account's job to determine what you have the right to do based on all the core components that ACL has applied. We will go over all the agents accounts and rights needed later in this article during our setup. CA Manager Agent (MIMCMManagerAgent) is used to perform core CA functions. More importantly, its job is to issue Certificate Revocation Lists (CRLs). This happens when a smart card or certificate is retired or revoked. It is up to this account to make sure the CRL is updated with this critical information. We saved the best for last: Web Pool Agent (MIMCMWebAgent). This agent is used to run the CM web application. The agent is the account that contacts the SQL server to record all user and admin transactions. The following is a good depiction of all the accounts together and the high-level functions:   The certificate management permission model In CM, we think this part is the most complex because with the implementation, you can be as granular as possible. For this reason, this area is the most difficult to understand. We will uncover the permission model so that we can begin to understand how the permission model works within CM. When looking at CM, you need to formulate the type of management model you will be deploying. What we mean by this is will you have a centralized or delegated model? This plays a key part in deployment planning for CM and the permission you will need to apply. In the centralized model, a specific set of managers are assigned all the rights for the management policy. This includes permissions on the users. Most environments use this method as it is less complex for environments. Now, within this model, we have manager-initiated permission, and this is where CM permissions are assigned to groups containing the subscribers. Subscribers are the actual users doing the enrollment or participating in the workflow. This is the model that The Financial Company will use in its configuration. The delegated model is created by updating two flags in web.config called clm.RequestSecurity.Flags and clm.RequestSecurity.Groups. These two flags work hand in hand as if you have UseGroups, then it will evaluate all the groups within the forests to include universal/global security. Now, if you use UseGroups and define clm.RequestSecurity.Groups, then it will only look for these specific groups and evaluate via the Authorization Agent . The user will tell the Authorization Agent to only read the permission on the user and ignore any group membership permissions:   When we continue to look at the permission, there are five locations that permissions can be applied in. In the preceding figure is an outline of these locations, but we will go in more depth in the subsections in a bit. The basis of the figure is to understand the location and what permission can be applied. The following are the areas and the permissions that can be set: Service Connection Point: Extended Permissions Users or Groups: Extended Permissions Profile Template Objects: Container: Read or Write Template Object: Read/Write or Enroll Certificate Template: Read or Enroll CM Management Policy within the Web application: We have multiple options based on the need, such as Initiate Request Now, let's begin to discuss the core areas to understand what they can do. So, The Financial Company can design the enrollment option they want. In the example, we will use the main scenario we encounter, such as the helpdesk, manager, and user-(subscriber) based scenarios. For example, certain functions are delegated to the helpdesk to allow them to assist the user base without giving them full control over the environment (delegated model). Remember this as we look at the five core permission areas. Creating service accounts So far, in our MIM deployment, we have created quite a few service accounts. MIM CM, however, requires that we create a few more. During the configuration wizard, we will get the option of having the wizard create them for us, but we always recommend creating them manually in FIM/MIM CM deployments. One reason is that a few of these need to be assigned some certificates. If we use an HSM, we have to create it manually in order to make sure the certificates are indeed using the HSM. The wizard will ask for six different service accounts (agents), but we actually need seven. In The Financial Company, we created the following seven accounts to be used by FIM/MIM CM: MIMCMAgent MIMCMAuthAgent MIMCMCAManagerAgent MIMCMEnrollAgent MIMCMKRAgent MIMCMWebAgent MIMCMService The last one, MIMCMService, will not be used during the configuration wizard, but it will be used to run the MIM CM Update service. We also created the following security groups to help us out in the scenarios we will go over: MIMCM-Helpdesk: This is the next step in OTP for subscribers MIMCM-Managers: These are the managers of the CM environment MIMCM-Subscribers: This is group of users that will enroll Service Connection Point Service Connection Point (SCP)is located under the Systems folder within Active Directory. This location, as discussed in the earlier parts of the article, defines who functions as the user as it relates to logging in to the web application. As an example, if we just wanted every user to only log in, we would give them read rights. Again, authenticated users, have this by default, but if you only wanted a subset of users to access, you should remove authenticated users and add your group. When you run the configuration wizard, SCP is decided, but the default is the one shown in the following screenshot:   If a user is assigned to any of the MIM CM permissions available on SCP, the administrative view of the MIM CM portal will be shown. The MIM CM permissions are defined in a Microsoft TechNet article at http://bit.ly/MIMCMPermission. For your convenience, we have copied parts of the information here: MIM CM Audit: This generates and displays MIM CM policy templates, defines management policies within a profile template, and generates MIM CM reports. MIM CM Enrollment Agent: This performs certificate requests for the user or group on behalf of another user. The issued certificate's subject contains the target user's name and not the requester's name. MIM CM Request Enroll: This initiates, executes, or completes an enrollment request. MIM CM Request Recover: This initiates encryption key recovery from the CA database. MIM CM Request Renew: This initiates, executes, or completes an enrollment request. The renewal request replaces a user's certificate that is near its expiration date with a new certificate that has a new validity period. MIM CM Request Revoke: This revokes a certificate before the expiration of the certificate's validity period. This may be necessary, for example, if a user's computer or smart card is stolen. MIM CM Request Unblock Smart Card: This resets a smart card's user Personal Identification Number (PIN) so that he/she can access the key material on a smart card. The Active Directory extended permissions So, even if you have the SCP defined, we still need to set up the permissions on the user or group of users that we want to manage. As in our helpdesk example, if we want to perform certain functions, the most common one is offline unblock. This would require the MIMCM-HelpDesk group. We will create this group later in this article. It would contain all help desk users then on SCP; we would give them CM Request Unblock Smart Card and CM Enrollment Agent. Then, you need to assign the permission to the extended permission on MIMCM-Subscribers, which contains all the users we plan to manage with the helpdesk and offline unblock:   So, as you can see, we are getting into redundant permissions, but depending on the location, it means what the user can do. So, planning of the model is very important. Also, it is important to document what you have as with some slight tweak, things can and will break. The certificate templates permission In order for any of this to be possible, we still need to give permission to the manager of the user to enroll or read the certificate template, as this will be added to the profile template. For anyone to manage this certificate, everyone will need read and enroll permissions. This is pretty basic, but that is it, as shown in the following screenshot:   The profile template permission The profile template determines what a user can read within the template. To get to the profile template, we need to use Active Directory sites and services to manage profile templates. We need to activate the services node as this is not shown by default, and to do this, we will click on View | Show Services Node:   As an example if you want a user to enroll in the cert, he/she would need CM Enroll on the profile template, as shown in the following screenshot:   Now, this is for users, but let's say you want to delegate the creation of profile templates. For this, all you need to do is give the MIMCM-Managers delegate the right to create all child items on the profile template container, as follows:   The management policy permission For the management policy, we will break it down into two sections: a software-based policy and a smart card management policy. As we have different capabilities within CM based on the type, by default, CM comes with two sample policies (take a look at the following screenshot), which we use for duplication to create a new one. When configuring, it is good to know that you cannot combine software and smart card-based certificates in a policy:   The software management policy The software-based certificate policy has the following policies available through the CM life cycle:   The Duplicate Policy panel creates a duplicate of all the certificates in the current profile. Now, if the first profile is created for the user, all the other profiles created afterwards will be considered duplicate, and the first generated policy will be primary. The Enroll Policy panel defines the initial enrollment steps for certificates such as initiate enroll request and data collection during enroll initiation. The Online Update Policy panel is part of the automatic policy function when key items in the policy change. This includes certificates about to expire, when a certificate is added to the existing profile template or even removed. The Recover Policy panel allows for the recovery of the profile in the event that the user was deleted. This includes the cases where certs are deleted by accident. One thing to point out is if the certificate was a signing cert, the recovery policy would issue a new replacement cert. However, if the cert was used for encryption, you can recover the original using this policy. The Recover On Behalf Policy panel allows managers or helpdesk operations to be recovered on behalf the user in the event that they need any of the certificates. The Renew Policy panel is the workflow that defines the renew setting, such as revocation and who can initiate a request. The Suspend and Reinstate Policy panel enables a temporary revocation of the profile and puts a "certificate hold" status. More information about the CRL status can be found at http://bit.ly/MIMCMCertificateStatus. The Revoke Policy panel maintains the revocation policy and setting around being able to set the revocation reason and delay. Also, it allows the system to push a delta CRL. You also can define the initiators for this policy workflow. The smart card management policy The smart card policy has some similarities to the software-based policy, but it also has a few new workflows to manage the full life cycle of the smart card:   The Profile Details panel is by far the most commonly used part in this section of the policy as it defines all the smart card certificates that will be loaded in the policy along with the type of provider. One key item is creating and destroying virtual smart cards. One final key part is diversifying the admin key. This is best practice as this secures the admin PIN using diversification. So, before we continue, we want to go over this setting as we think it is an important topic. Diversifying the admin key is important because each card or batch of cards comes with a default admin key. Smart cards may have several PINs, an admin PIN, a PINunlock key (PUK), and a user PIN. This admin key, as CM refers to it, is also known as the administrator PIN. This PIN differs from the user's PIN. When personalizing the smart card, you configure the admin key, the PUK, and the user's PIN. The admin key and the PUK are used to reset the virtual smart card's PIN. However, you cannot configure both. You must use the PUK to unlock the PIN if you assign one during the virtual smart card's creation. It is important to note that you must use the PUK to reset the PIN if you provide both a PUK and an admin key. During the configuration of the profile template, you will be asked to enter this key as follows:   The admin key is typically used by smart card management solutions that enable a challenge response approach to PIN unlocking. The card provides a set of random data that the user reads (after the verification of identity) to the deployment admin. The admin then encrypts the data with the admin key (obtained as mentioned before) and gives the encrypted data back to the user. If the encrypted data matches that produced by the card during verification, the card will allow PIN resetting. As the admin key is never in the hands of anyone other than the deployment administrator, it cannot be intercepted or recorded by any other party (including the employee) and thus has significant security benefits beyond those in using a PUK—an important consideration during the personalization process. When enabled, the admin key is set to a card-unique value when the card is assigned to the user. The option to diversify admin keys with the default initialization provider allows MIM CM to use an algorithm to uniquely generate a new key on the card. The key is encrypted and securely transmitted to the client. It is not stored in the database or anywhere else. MIM CM recalculates the key as needed to manage the card:   The CM profile template contains a thumbprint for the certificate to be used in admin key diversification. CM looks in the personal store of the CM agent service account for the private key of the certificate in the profile template. Once located, the private key is used to calculate the admin key for the smart card. The admin key allows CM to manage the smart card (issuing, revoking, retiring, renewing, and so on). Loss of the private key prevents the management of cards diversified using this certificate. More detail on the control can be found at http://bit.ly/MIMCMDiversifyAdminKey. Continuing on, the Disable Policy panel defines the termination of the smart card before expiration, you can define the reason if you choose. Once disabled, it cannot be reused in the environment. The Duplicate Policy panel, similarly to the software-based one, produces a duplicate of all the certificates that will be on the smart card. The Enroll Policy panel, similarly to the software policy, defines who can initiate the workflow and printing options. The Online Update Policy panel, similarly to the software-based cert, allows for the updating of certificates if the profile template is updated. The update is triggered when a renewal happens or, similarly to the software policy, a cert is added or removed. The Offline Unblock Policy panel is the configuration of a process to allow offline unblocking. This is used when a user is not connected to the network. This process only supports Microsoft-based smart cards with challenge questions and answers via, in most cases, the user calling the helpdesk. The Recovery On Behalf Policy panel allows the recovery of certificates for the management or the business to recover if the cert is needed to decrypt information from a user whose contract was terminated or who left the company. The Replace Policy panel is utilized by being able to replace a user's certificate in the event of a them losing their card. If the card they had had a signing cert, then a new signing cert would be issued on this new card. Like with software certs, if the certificate type is encryption, then it would need to be restored on the replace policy. The Renew Policy panel will be used when the profile/certificate is in the renewal period and defines revocation details and options and initiates permission. The Suspend and Reinstate Policy panel is the same as the software-based policy for putting the certificate on hold. The Retire Policy panel is similar to the disable policy, but a key difference is that this policy allows the card to be reused within the environment. The Unblock Policy panel defines the users that can perform an actual unblocking of a smart card. More in-depth detail of these policies can be found at http://bit.ly/MIMCMProfiletempates. Summary In this article, we uncovered the basics of certificate management and the management components that are required to successfully deploy a CM solution. Then, we discussed and outlined, agent accounts and the roles they play. Finally, we looked into the management permission model from the policy template to the permissions and the workflow. Resources for Article: Further resources on this subject: Managing Network Devices [article] Logging and Monitoring [article] Creating Horizon Desktop Pools [article]
Read more
  • 0
  • 0
  • 7293

article-image-implementing-persistence-redis-intermediate
Packt
06 Jun 2013
10 min read
Save for later

Implementing persistence in Redis (Intermediate)

Packt
06 Jun 2013
10 min read
(For more resources related to this topic, see here.) Getting ready Redis provides configuration settings for persistence and for enabling durability of data depending on the project statement. If durability of data is critical If durability of data is not important You can achieve persistence of data using the snapshotting mode, which is the simplest mode in Redis. Depending on the configuration, Redis saves a dump of all the data sets in its memory into a single RDB file. The interval in which Redis dumps the memory can be configured to happen every X seconds or after Y operations. Consider an example of a moderately busy server that receives 15,000 changes every minute over its 1 GB data set in memory. Based on the snapshotting rule, the data will be stored every 60 seconds or whenever there are at least 15,000 writes. So the snapshotting runs every minute and writes the entire data of 1 GB to the disk, which soon turns ugly and very inefficient. To solve this particular problem, Redis provides another way of persistence, Append-only file (AOF), which is the main persistence option in Redis. This is similar to journal files, where all the operations performed are recorded and replayed in the same order to rebuild the exact state. Redis's AOF persistence supports three different modes: No fsync: In this mode, we take a chance and let the operating system decide when to flush the data. This is the fastest of the three modes. fsync every second: This mode is a compromised middle point between performance and durability. Data will be flushed using fsync every second. If the disk is not able to match the write speed, the fsync can take more than a second, in which case Redis delays the write up to another second. So this mode guarantees a write to be committed to OS buffers and transferred to the disk within 2 seconds in the worstcase scenario. fsync always: This is the last and safest mode. This provides complete durability of data at a heavy cost to performance. In this mode, the data needs to be written to the file and synced with the disk using fsync before the client receives an acknowledgment. This is the slowest of all three modes. How to do it... First let us see how to configure snapshotting, followed by the Append-only file method: In Redis, we can configure when a new snapshot of the data set will be performed. For example, Redis can be configured to dump the memory if the last dump was created more than 30 seconds ago and there are at least 100 keys that are modified or created. Snapshotting should be configured in the /etc/redis/6379.conf file. The configuration can be as follows: save 900 1save 60 10000 The first line translates to take a snapshot of data after 900 seconds if at least one key has changed, while the second line translates to snapshotting every 60 seconds if 10,000 keys have been modified in the meantime. The configuration parameter rdbcompression defines whether the RDB file is to be compressed or not. There is a trade-off between the CPU and RDB dump file size. We are interested in changing the dump's filename using the dbfilename parameter. Redis uses the current folder to create the dump files. For convenience, it is advised to store the RDB file in a separate folder. dbfilename redis-snapshot.rdbdir /var/lib/redis/ Let us run a small test to make sure the RDB dump is working. Start the server again. Connect to the server using redis-cli, as we did already. To test whether our snapshotting is working, issue the following commands: SET Key ValueSAVE After the SAVE command, a file should be created in the folder /var/lib/redis with the name redis-snapshot.rdb. This confirms that our installation is able to take a snapshot of our data into a file. Now let us see how to configure persistence in Redis using the AOF method: The configuration for persistence through AOF also goes into the same file located in /etc/redis/6379.conf. By default, the Append-only mode is not enabled. Enable it using the appendonly parameter. appendonly yes Also, if you would like to specify a filename for the AOF log, uncomment the line and change the filename. appendfilename redis-aof.aof The appendfsync everysec command provides a good balance between performance and durability. appendfsync everysec Redis needs to know when it has to rewrite the AOF file. This will be decided based on two configuration parameters, as follows: auto-aof-rewrite-percentage 100auto-aof-rewrite-min-size 64mb Unless the minimum size is reached and the percentage of the increase in size when compared to the last rewrite is less than 100 percent, the AOF rewrite will not be performed. How it works... First let us see how snapshotting works. When one of the criteria is met, Redis forks the process. The child process starts writing the RDB file to the disk at the folder specified in our configuration file. Meanwhile, the parent process continues to serve the requests. The problem with this approach is that the parent process stores the keys, which change during this snapshotting by the child, in the extra memory. In the worst-case scenario, if all the keys are modified, the memory usage spikes to roughly double. Caution Be aware that the bigger the RDB file, the longer it takes Redis to restore the data on startup. Corruption of the RDB file is not possible as it is created by the append-only method from the data in Redis's memory, by the child process. The new RDB file is created as a temporary file and is then renamed to the destination file using the atomic rename system call once the dump is completed. AOF's working is simple. Every time a write operation is performed, the command operation gets logged into a logfile. The format used in the logfile is the same as the format used by clients to communicate to the server. This helps in easy parsing of AOF files, which brings in the possibility of replaying the operation in another Redis instance. Only the operations that change the data set are written to the log. This log will be used on startup to reconstruct the exact data. As we are continuously writing the operations into the log, the AOF file explodes in size as compared to the amount of operations performed. So, usually, the size of the AOF file is larger than the RDB dump. Redis manages the increasing size of the data log by compacting the file in a non-blocking manner periodically. For example, say a specific key, key1, has changed 100 times using the SET command. In order to recreate the final state in the last minute, only the last SET command is required. We do not need information about the previous 99 SET commands. This might look simple in theory, but it gets complex when dealing with complex data structures and operations such as union and intersection. Due to this complexity, it becomes very difficult to compress the existing file. To reduce the complexity of compacting the AOF, Redis starts with the data in the memory and rewrites the AOF file from scratch. This is more similar to the snapshotting method. Redis forks a child process that recreates the AOF file and performs an atomic rename to swap the old file with a new one. The same problem, of the requirement of extra memory for operations performed during the rewrite, is present here. So the memory required can spike up to two times based on the operations while writing an AOF file. There's more... Both snapshotting and AOF have their own advantages and limitations, which makes it ideal to use both at the same time. Let us now discuss the major advantages and limitations in the snapshotting method. Advantages of snapshotting The advantages of configuring snapshotting in Redis are as follows: RDB is a single compact file that cannot get corrupted due to the way it is created. It is very easy to implement. This dump file is perfect to take backups and for disaster recovery of remote servers. The RDB file can just be copied and saved for future recoveries. In comparison, this approach has little or no influence over performance as the only work the parent process needs to perform is forking a child process. The parent process will never perform any disk operations; they are all performed by the child process. As an RDB file can be compressed, it provides a faster restart when compared to the append-only file method. Limitations of snapshotting Snapshotting, in spite of the advantages mentioned, has a few limitations that you should be aware of: The periodic background save can result in significant loss of data in case of server or hardware failure. The fork() process used to save the data might take a moment, during which the server will stop serving clients. The larger the data set to be saved, the longer it takes the fork() process to complete. The memory needed for the data set might double in the worst-case scenario, when all the keys in the memory are modified while snapshotting is in progress. What should we use? Now that we have discussed both the modes of persistence Redis provides us with, the big question is what should we use? The answer to this question is entirely based on our application and requirements. In cases where we expect good durability, both snapshotting and AOF can be turned on and be made to work in unison, providing us with redundant persistence. Redis always restores the data from AOF wherever applicable, as it is supposed to have better durability with little loss of data. Both RDB and AOF files can be copied and stored for future use or for recovering another instance of Redis. In a few cases, where performance is very critical, memory usage is limited, and persistence is also paramount, persistence can be turned off completely. In these cases, replication can be used to get durability. Replication is a process in which two Redis instances, one master and one slave, are in sync with the same data. Clients are served by the master, and the master server syncs the data with a slave. Replication setup for persistence Consider a setup as shown in the preceding image; that is: Master instance with no persistence Slave instance with AOF enabled In this case, the master does not need to perform any background disk operations and is fully dedicated to serve client requests, except for a trivial slave connection. The slave server configured with AOF performs the disk operations. As mentioned before, this file can be used to restore the master in case of a disaster. Persistence in Redis is a matter of configuration, balancing the trade-off between performance, disk I/O, and data durability. If you are looking for more information on persistence in Redis, you will find the article by Salvatore Sanfilippo at http://oldblog.antirez.com/post/redis-persistence-demystified.html interesting. Summary This article helps you to understand the persistence option available in Redis, which could ease your efforts of adding Redis to your application stack. Resources for Article : Further resources on this subject: Using Execnet for Parallel and Distributed Processing with NLTK [Article] Parsing Specific Data in Python Text Processing [Article] Python Text Processing with NLTK: Storing Frequency Distributions in Redis [Article]
Read more
  • 0
  • 3
  • 7259
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-learn-computer-vision-applications-open-cv
Packt
07 Jun 2011
16 min read
Save for later

Learn computer vision applications in Open CV

Packt
07 Jun 2011
16 min read
  OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision         Read more about this book       OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In this article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we will cover: Calibrating a camera Computing the fundamental matrix of an image pair Matching images using random sample consensus Computing a homography between two images (For more resources related to the article, see here.) Introduction Images are generally produced using a digital camera that captures a scene by projecting light onto an image sensor going through its lens. The fact that an image is formed through the projection of a 3D scene onto a 2D plane imposes the existence of important relations between a scene and its image, and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, you will learn some of the fundamental projective relations that exist in multi-view imagery and how these can be used in computer vision programming. But before we start the recipes, let's explore the basic concepts related to scene projection and image formation. Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture and the captured light rays hit an image plane (or image sensor) located on the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: In computer vision, this camera model can be simplified in a number of ways. First, we can neglect the effect of the lens by considering a camera with an infinitesimal aperture since, in theory, this does not change the image. Only the central ray is therefore considered. Second, since most of the time we have do>>di, we can assume that the image plane is located at the focal distance. Finally, we can notice from the geometry of the system, that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation: The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera which is naturally true. This relation allows the position of the image of a 3D scene point to be predicted onto the image plane of a camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). Also, since we are dealing with digital images, the number of pixels on the image plane is another important characteristic of a camera. Finally, in order to be able to compute the position of an image's scene point in pixel coordinates, we need one additional piece of information. Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It could be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by few pixels depending at which precision the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process but made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is show to this camera a set of scene points for which their 3D position is known. You must then determine where on the image these points project. Obviously, for accurate results, we need to observe several of these points. One way to achieve this would be to take one picture of a scene with many known 3D points. A more convenient way would be to take several images from different viewpoints of a set of some 3D points. This approach is simpler but requires computing the position of each camera view, in addition to the computation of the internal camera parameters which fortunately is feasible. OpenCV proposes to use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0 with the X and Y axes well aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a calibration pattern image: The nice thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (number of vertical and horizontal inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false: // output vectors of image pointsstd::vector<cv::Point2f> imageCorners;// number of corners on the chessboardcv::Size boardSize(6,4);// Get the chessboard cornersbool found = cv::findChessboardCorners(image, boardSize, imageCorners); Note that this function accepts additional parameters if one needs to tune the algorithm, which are not discussed here. There is also a function that draws the detected corners on the chessboard image with lines connecting them in sequence: //Draw the cornerscv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The image obtained is seen here: The lines connecting the points shows the order in which the points are listed in the vector of detected points. Now to calibrate the camera, we need to input a set of such image points together with the coordinate of the corresponding 3D points. Let's encapsulate the calibration process in a CameraCalibrator class: class CameraCalibrator { // input points: // the points in world coordinates std::vector<std::vector<cv::Point3f>> objectPoints; // the point positions in pixels std::vector<std::vector<cv::Point2f>> imagePoints; // output Matrices cv::Mat cameraMatrix; cv::Mat distCoeffs; // flag to specify how calibration is done int flag; // used in image undistortion cv::Mat map1,map2; bool mustInitUndistort; public: CameraCalibrator() : flag(0), mustInitUndistort(true) {}; As mentioned previously, the 3D coordinates of the points on the chessboard pattern can be easily determined if we conveniently place the reference frame on the board. The method that accomplishes this takes a vector of the chessboard image filename as input: // Open chessboard images and extract corner pointsint CameraCalibrator::addChessboardPoints( const std::vector<std::string>& filelist, cv::Size & boardSize) { // the points on the chessboard std::vector<cv::Point2f> imageCorners; std::vector<cv::Point3f> objectCorners; // 3D Scene Points: // Initialize the chessboard corners // in the chessboard reference frame // The corners are at 3D location (X,Y,Z)= (i,j,0) for (int i=0; i<boardSize.height; i++) { for (int j=0; j<boardSize.width; j++) { objectCorners.push_back(cv::Point3f(i, j, 0.0f)); } } // 2D Image points: cv::Mat image; // to contain chessboard image int successes = 0; // for all viewpoints for (int i=0; i<filelist.size(); i++) { // Open the image image = cv::imread(filelist[i],0); // Get the chessboard corners bool found = cv::findChessboardCorners( image, boardSize, imageCorners); // Get subpixel accuracy on the corners cv::cornerSubPix(image, imageCorners, cv::Size(5,5), cv::Size(-1,-1), cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations 0.1)); // min accuracy //If we have a good board, add it to our data if (imageCorners.size() == boardSize.area()) { // Add image and scene points from one view addPoints(imageCorners, objectCorners); successes++; } } return successes;} The first loop inputs the 3D coordinates of the chessboard, which are specified in an arbitrary square size unit here. The corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all available viewpoints. Moreover, in order to obtain a more accurate image point location, the function cv::cornerSubPix can be used and as the name suggests, the image points will then be localized at sub-pixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines a maximum number of iterations and a minimum accuracy in sub-pixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to our vector of image and scene points: // Add scene points and corresponding image pointsvoid CameraCalibrator::addPoints(const std::vector<cv::Point2f>&imageCorners, const std::vector<cv::Point3f>& objectCorners) { // 2D image points from one view imagePoints.push_back(imageCorners); // corresponding 3D scene points objectPoints.push_back(objectCorners);} The vectors contains std::vector instances. Indeed, each vector element being a vector of points from one view. Once a sufficient number of chessboard images have been processed (and consequently a large number of 3D scene point/2D image point correspondences are available), we can initiate the computation of the calibration parameters: // Calibrate the camera// returns the re-projection errordouble CameraCalibrator::calibrate(cv::Size &imageSize){ // undistorter must be reinitialized mustInitUndistort= true; //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix,// output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs,// Rs, Ts flag); // set options} In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. The camera matrix will be described in the next section. For now, let's consider the distortion parameters. So far, we have mentioned that with the pin-hole camera model, we can neglect the effect of the lens. But this is only possible if the lens used to capture an image does not introduce too important optical distortions. Unfortunately, this is often the case with lenses of lower quality or with lenses having a very short focal length. You may have already noticed that in the image we used for our example, the chessboard pattern shown is clearly distorted. The edges of the rectangular board being curved in the image. It can also be noticed that this distortion becomes more important as we move far from the center of the image. This is a typical distortion observed with fish-eye lens and it is called radial distortion. The lenses that are used in common digital cameras do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions cannot certainly be ignored. It is possible to compensate for these deformations by introducing an appropriate model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameter during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted: // remove distortion in an image (after calibration)cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { // called once per calibration cv::initUndistortRectifyMap( cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted;} Which results in the following image: As you can see, once the image is undistorted, we obtain a regular perspective image. How it works... In order to explain the result of the calibration, we need to go back to the figure in the introduction which describes the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here: Note that the Y-axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by, respectively, the pixel width (px) and height (py). We notice that by dividing the focal length f given in world units (most often meters or millimeters) by px, then we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy =f/py is defined as the focal length expressed in vertical pixel unit. The complete projective equation is therefore: Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in matrix form through the introduction of homogeneous coordinates in which 2D points are represented by 3-vectors, and 3D points represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor that need to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the projective equation rewritten: The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given a calibration matrix. More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation (a 3x3 matrix) and a translation vector (3x1 matrix). These two matrices describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form: Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (rotation and translation) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all points specified during the calibration is called the re-projection error. To correct the distortion, OpenCV uses a polynomial function that is applied to the image point in order to move them at their undistorted position. By default, 5 coefficients are used; a model made of 8 coefficients is also available. Once these coefficients are obtained, it is possible to compute 2 mapping functions (one for the x coordinate and one for the y) that will give the new undistorted position of an image point on a distorted image. This is computed by the function cv::initUndistortRectifyMap and the function cv::remap remaps all of the points of an input image to a new image. Note that because of the non-linear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There's more... When a good estimate of the camera intrinsic parameters are known, it could be advantageous to input them to the cv::calibrateCamera function. They will then be used as initial values in the optimization process. To do so, you just need to add the flag CV_CALIB_USE_INTRINSIC_GUESS and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_ FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_RATIO) in which case you assume pixels of square shape.
Read more
  • 0
  • 1
  • 7175

article-image-aspects-data-manipulation-r
Packt
10 Jan 2014
6 min read
Save for later

Aspects of Data Manipulation in R

Packt
10 Jan 2014
6 min read
(For more resources related to this topic, see here.) Factor variables in R In any data analysis task, the majority of the time is dedicated to data cleaning and pre-processing. Sometimes, it is considered that about 80 percent of the effort is devoted for data cleaning before conducting actual analysis. Also, in real world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, this is known as a factor. Working with categorical variables in R is a bit technical, and in this article we have tried to demystify this process of dealing with categorical variables. During data analysis, sometimes the factor variable plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. Split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data it is always preferable to perform the operation within subgroup of a dataset to speed up the process. In R this type of data manipulation could be done with base functionality, but for large-scale data it requires considerable amount of coding and eventually it takes a longer time to process. In case of Big Data we could split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient and to overcome this limitation, Wickham developed an R package plyr where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we could compare this with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping and we need to implement some reorientation to perform certain types of analyses. Datasets layout could be long or wide. In long-layout, multiple rows represent a single subject's record, whereas in wide-layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset and a column represents the information for each characteristic for all subjects. This means rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier and others are measured characteristics. For the purpose of reshaping, we could group the variables into two groups: identifier variables and measured variables. Identifier variables: These help to identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. Measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mix of both. Now beyond this typical structure of dataset, we could think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Summary This article explains briefly about factor variables, the split-apply-combine strategy, and reshaping a dataset in R. Resources for Article: Further resources on this subject: SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article] Working with Data Application Components in SQL Server 2008 R2 [Article] Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]
Read more
  • 0
  • 0
  • 7172

article-image-quick-start-guide-flume
Packt
02 Mar 2015
15 min read
Save for later

A Quick Start Guide to Flume

Packt
02 Mar 2015
15 min read
In this article by Steve Hoffman, the author of the book, Apache Flume: Distributed Log Collection for Hadoop - Second Edition, we will learn about the basics that are required to be known before we start working with Apache Flume. This article will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume. (For more resources related to this topic, see here.) Downloading Flume Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell. The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution. Go ahead, download and verify the binary distribution to save us some time in getting started. Flume in Hadoop distributions Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different. There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, ease installation as well as provide startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle. There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, such as any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version). So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition. If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework. Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links: Cloudera: http://cloudera.com/ Hortonworks: http://hortonworks.com/ MapR: http://mapr.com/ An overview of the Flume configuration file Now that we've downloaded Flume, let's spend some time going over how to configure an agent. A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent. By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps). If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up. Each agent is configured, starting with three parameters: agent.sources=<list of sources>agent.channels=<list of channels>agent.sinks=<list of sinks> Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be: agent.channels.access.type=memory Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this: agent.channels.access.type=memoryagent.channels.access.capacity=200 Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it: agent.channels=access Let's look at a complete example using the canonical "Hello, World!" example. Starting up with "Hello, World!" No technical article would be complete without a "Hello, World!" example. Here is the configuration file we'll be using: agent.sources=s1agent.channels=c1agent.sinks=k agent.sources.s1.type=netcatagent.sources.s1.channels=c1agent.sources.s1.bind=0.0.0.0agent.sources.s1.port=1234 agent.channels.c1.type=memory agent.sinks.k1.type=loggeragent.sinks.k1.channel=c1 Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1. The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example. The channel named c1 is a memory channel with a default configuration. The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel. Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event. First, explode the .tar archive of the binary distribution we downloaded earlier: $ tar -zxf apache-flume-1.5.2-bin.tar.gz$ cd apache-flume-1.5.2-bin Next, let's briefly look at the help. Run the flume-ng command with the help command: $ ./bin/flume-ng helpUsage: ./bin/flume-ng <command> [options]... commands:help                 display this help textagent                run a Flume agentavro-client           run an avro Flume clientversion               show Flume version info global options:--conf,-c <conf>     use configs in <conf> directory--classpath,-C <cp>   append to the classpath--dryrun,-d          do not actually start Flume, just print the command--plugins-path <dirs> colon-separated list of plugins.d directories. See the                       plugins.d section in the user guide for more details.                       Default: $FLUME_HOME/plugins.d-Dproperty=value     sets a Java system property value-Xproperty=value     sets a Java -X option agent options:--conf-file,-f <file> specify a config file (required)--name,-n <name>     the name of this agent (required)--help,-h             display help text avro-client options:--rpcProps,-P <file>   RPC client properties file with server connection params--host,-H <host>       hostname to which events will be sent--port,-p <port>       port of the avro source--dirname <dir>       directory to stream to avro source--filename,-F <file>   text file to stream to avro source (default: std input)--headerFile,-R <file> File containing event headers as key/value pairs on each new line--help,-h             display help text Either --rpcProps or both --host and --port must be specified. Note that if <conf> directory is specified, then it is always included first in the classpath. As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later. The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents). Let's take our sample configuration and open an editor (vi in my case, but use whatever you like): $ vi conf/hw.conf Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell. Now you can start the agent: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender. If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the class path. If you left the -c parameter off the command, you'll see this error: Warning: No configuration directory set! Use --conf <dir> to override.log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).log4j:WARN Please initialize the log4j system properly.log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info But you didn't do that so you should see these key log lines: 2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration foragents: [agent] This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file. 2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloadingconfiguration file:conf/hw.conf This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file. 2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)]Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }}channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} } Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution. 2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] CreatedserverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345] Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it. Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event: % nc localhost 12345Hello WorldOK The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following: 2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64Hello World } This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message). If I send the following line and then press the Enter key, you'll get an OK message: The quick brown fox jumped over the lazy dog. You'll see this in the agent's log: 2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)]Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20The quick brown } The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem. Summary In this article, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out. Resources for Article: Further resources on this subject: About Cassandra [article] Introducing Kafka [article] Transformation [article]
Read more
  • 0
  • 0
  • 7160

article-image-implementing-artificial-neural-networks-tensorflow
Packt
08 Jul 2016
12 min read
Save for later

Implementing Artificial Neural Networks with TensorFlow

Packt
08 Jul 2016
12 min read
In this article by Giancarlo Zaccone, the author of Getting Started with TensorFlow, we will learn about artificial neural networks (ANNs), an information processing system whose operating mechanism is inspired by biological neural circuits. Thanks to their characteristics, neural networks are the protagonists of a real revolution in machine learning systems and more generally in the context of Artificial Intelligence. An artificial neural network possesses many simple processing units variously connected to each other, according to various architectures. If we look at the schema of an ANN report, it can be seen that the hidden units communicate with the external layer, both in input and output, while the input and output units communicate only with the hidden layer of the network Each unit or node simulates the role of the neuron in biological neural networks, a node, said artificial neuron, plays a very simple operation: becomes active if the total quantity of signal, which it receives exceeds its activation threshold, defined by the so-called activation function. If a node becomesactive, it emits a signal that is transmitted along the transmission channels up to the other unit to which it is connected. A connection point acts as a filter that converts the message into an inhibitory or excitatory signal increasing or decreasing the intensity, according to their individual characteristics. The connection points simulate the biological synapses and have the fundamental function to weigh the intensity of the transmitted signals, by multiplying them by the weights whose value depends on the connection itself. ANN schematic diagram Neural network architectures The way to connect the nodes, the total number of layers, that is, the levels of nodes between input and output, define the architecture of a neural network. For example, in a multilayer networks, one can identify the artificial neurons of layers such that: Each neuron is connected with all those of the next layer There are no connections between neurons belonging to the same layer The number of layers and of neurons per layer depends on the problem to be solved Now we start our exploration of neural network models, introducing the most simple neural network model: the Single Layer Perceptron or the so-called Rosenblatt's Perceptron. Single Layer Perceptron The Single Layer Perceptron was the first neural network model proposed in 1958 by Frank Rosenblatt. In this model, the content of the local memory of the neuron consists of a vector of weights, W = (w1, w2,……, wn). The computation is performed over the calculation of a sum of the input vector X =(x1, x2,……, xn), each of which is multiplied by the corresponding element of the vector of the weights; then the value provided in the output (that is, a weighted sum) will be the input of an activation function. This function returns 1 if the result is greater than a certain threshold, otherwise it returns -1. In the following figure, the activation function is the so-called sign function:         +1        x > 0 sign(x) =         −1        otherwise It is possible to use other activation functions, preferably non-linear (such as the sigmoid function that we will see in the next section). The learning procedure of the net is iterative: it slightly modifies for each learning cycle (called epoch) the synaptic weights by using a selected set called training set. At each cycle, the weights must be modified so as to minimize a cost function, which is specific to the problem under consideration. Finally, when the perceptron will be trained on the training set, it will be able to be tested on other inputs (the test set) in order to verify its capacity of generalization. Schema of a Rosemblatt's Perceptron Let's now see how to implement a single layer neural network for an image classification problem using TensorFlow. The logistic regression This algorithm has nothing to do with the canonical linear regression, but it is an algorithm that allows us to solve supervised classification problems. In fact to estimate the dependent variable, now we make use of the so-called logistic function or sigmoid. It is precisely because of this feature that we call this algorithm logistic regression.The sigmoid function has this pattern: As we can see, the dependent variable takes values strictly between 0 and 1 that is precisely what serves us. In the case of the logistic regression we want, then, that our function tell us what's the probability of belonging to a particular element of our class. We recall again that the supervised learning by the neural network is configured as an iterative process of optimization of the weights; these are then modified on the basis of the network's performance of the training set. Indeed the aim is to the loss function which indicates the degree to which the behavior of the network deviates from the desired one. The performance of the network is then verified on a test set, consisting of images other than those of train. The basic steps of training that we're going to implement are as follows: The weights are initialized with random values at the beginning of the training. For each element of the training set is calculated the error, that is, the difference between the desired output and the actual output. This error is used to adjust the weights The process is repeated resubmitting to the network, in a random order, all the examples of the training set until the error made on the entire training set is not less than a certain threshold or until the number of iterations are over. Let's now see in detail how to implement the logistic regression with TensorFlow. The problem we want to solve is yet to classify images from the MNIST dataset. The TensorFlow implementation First of all, we have to import all the necessary libraries: import input_data import tensorflow as tf import matplotlib.pyplot as plt We use the input_data.readfunction, to upload the images to our problem: mnist = input_data.read_data_sets("/tmp/data/", one_hot=True) Then we set the total number of epochs for the training phase: training_epochs = 25 Also we must define other parameters necessary for the model building: learning_rate = 0.01 batch_size = 100 display_step = 1 Now we move to the construction of the model Building the model Define x as the input tensor, it represents the MNIST data image of shape 28 x 28 = 784 pixels x = tf.placeholder("float", [None, 784]) We recall that our problem consists in assigning a probability value for each of the possible classes of membership (the numbers from 0 to 9). At the end of this calculation, we will use a probability distribution, which gives us the value of what is confident with our prediction. So the output we're going to get will be an output tensor with 10 probabilities each one corresponding to a digit (of course the sum of probabilities must be one): y = tf.placeholder("float", [None, 10]) To assign probabilities to each image, we will use the so-called softmax activation function. The softmax function is specified in two main steps: Calculate the evidence that a certain image belongs to a particular class. Convert the evidence into probabilities of belonging to each of the 10 possible classes. To evaluate the evidence, we first define the weights input tensor asW: W = tf.Variable(tf.zeros([784, 10])) For a given image, we could evaluate the evidence for each class isimply multiplying the tensorWwith the input tensorx. Using TensorFlow, we should have something like this: evidence = tf.matmul(x, W) In general, the models include an extra parameter representing the bias that indicates a certain degree of uncertainty; in our case, the final formula for the evidence is: evidence = tf.matmul(x, W) + b It means that for everyi(from 0 to 9) we have aWimatrix elements784  (28 × 28), where each elementjof the matrix is multiplied by the correspondingcomponentjof the input image (784 parts) that is added and the corresponding bias elementbi. So to define the evidence, we must define the following tensor of biases: b = tf.Variable(tf.zeros([10])) The second step is finally to use thesoftmaxfunction to obtain the output vector of probabilities, namelyactivation: activation = tf.nn.softmax(tf.matmul(x, W) + b) The TensorFlow's functiontf.nn.softmaxprovides a probability based output from the input evidence tensor. Once we implement the model, we can proceed to specify the necessary code to find the W weights and biases b network through the iterative training algorithm. In each iteration, the training algorithm takes the training data, applies the neural network, and compares the result with the expected. In order to train our model and to know when we have a good one, we must know how to define the accuracy of our model. Our goal is to try to get valuesof parameters W and b that minimize the value of the metric that indicates how bad the model is. Different metrics calculate the degree of error between the desired output and output of the training data. A common measure of error is the mean squared error or the Squared Euclidean Distance. However, there are some research findings that suggest to use other metrics to a neural network like this. In this example, we use the so-called cross-entropy error function, it is defined as follows: cross_entropy = y*tf.lg(activation) In order to minimize the cross_entropy, we could use the following combination of tf.reduce_mean and tf.reduce_sum to build the cost function: cost = tf.reduce_mean          (-tf.reduce_sum            (cross_entropy, reduction_indices=1)) Then we must minimize it using the gradient descent optimization algorithm: optimizer = tf.train.GradientDescentOptimizer  (learning_rate).minimize(cost) Few lines of code to build a neural network model! Launching the session It's the moment to build the session and launch our neural net model.We fix these lists to visualize the training session: avg_set = [] epoch_set=[] Then we initialize the TensorFlow variables: init = tf.initialize_all_variables() Start the session: with tf.Session() as sess: sess.run(init) As explained, each epoch is a training cycle:     for epoch in range(training_epochs): avg_cost = 0. total_batch = int(mnist.train.num_examples/batch_size) Then we loop over all batches:         for i in range(total_batch): batch_xs, batch_ys = mnist.train.next_batch(batch_size) Fit training using batch data: sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys}) Compute the average loss running the train_step function with the given image values (x) and the real output (y_): avg_cost += sess.run                         (cost, feed_dict={x: batch_xs,                                  y: batch_ys})/total_batch During the computation, we display a log per epoch step:         if epoch % display_step == 0:             print "Epoch:", '%04d' % (epoch+1), "cost=","{:.9f}".format(avg_cost)     print " Training phase finished" Let's get the accuracy of our mode.It is correct if the index with the highest y value is the same as in the real digit vector the mean of correct_prediction gives us the accuracy. We need to run the accuracy function with our test set (mnist.test). We use the keys images and labelsfor x and y_: correct_prediction = tf.equal                            (tf.argmax(activation, 1), tf.argmax(y, 1))       accuracy = tf.reduce_mean                        (tf.cast(correct_prediction, "float"))    print "MODEL accuracy:", accuracy.eval({x: mnist.test.images,                                       y: mnist.test.labels}) Test evaluation We have seen the training phase in the preceding sections; for each epoch we have printed the relative cost function: Python 2.7.10 (default, Oct 14 2015, 16:09:02) [GCC 5.2.1 20151010] on linux2 Type "copyright", "credits" or "license()" for more information. >>> ======================= RESTART ============================ >>> Extracting /tmp/data/train-images-idx3-ubyte.gz Extracting /tmp/data/train-labels-idx1-ubyte.gz Extracting /tmp/data/t10k-images-idx3-ubyte.gz Extracting /tmp/data/t10k-labels-idx1-ubyte.gz Epoch: 0001 cost= 1.174406662 Epoch: 0002 cost= 0.661956009 Epoch: 0003 cost= 0.550468774 Epoch: 0004 cost= 0.496588717 Epoch: 0005 cost= 0.463674555 Epoch: 0006 cost= 0.440907706 Epoch: 0007 cost= 0.423837747 Epoch: 0008 cost= 0.410590841 Epoch: 0009 cost= 0.399881751 Epoch: 0010 cost= 0.390916621 Epoch: 0011 cost= 0.383320325 Epoch: 0012 cost= 0.376767031 Epoch: 0013 cost= 0.371007620 Epoch: 0014 cost= 0.365922904 Epoch: 0015 cost= 0.361327561 Epoch: 0016 cost= 0.357258660 Epoch: 0017 cost= 0.353508228 Epoch: 0018 cost= 0.350164634 Epoch: 0019 cost= 0.347015593 Epoch: 0020 cost= 0.344140861 Epoch: 0021 cost= 0.341420144 Epoch: 0022 cost= 0.338980592 Epoch: 0023 cost= 0.336655581 Epoch: 0024 cost= 0.334488012 Epoch: 0025 cost= 0.332488823 Training phase finished As wesaw, during the training phase, the cost function is minimized.At the end of the test, we show how accurately the model is implemented: Model Accuracy: 0.9475 >>> Finally, using these lines of code, we could visualize the the training phase of the net: plt.plot(epoch_set,avg_set, 'o',  label='Logistic Regression Training phase') plt.ylabel('cost') plt.xlabel('epoch') plt.legend() plt.show() Training phase in logistic regression Summary In this article, we learned the implementation of artificial neural networks, Single Layer Perceptron, TensorFlow. We also learned how to build the model and launch the session.
Read more
  • 0
  • 0
  • 7135
article-image-introducing-sql-developer-data-modeler-part-2
Packt
07 Jan 2010
9 min read
Save for later

Introducing SQL Developer Data Modeler: Part 2

Packt
07 Jan 2010
9 min read
Working with diagrams and their components You can waste away many hours laying out the elements on a diagram. Therefore, this aspect of modeling can be time consuming. However, a model serves as documentation and a communication device. Therefore, taking the time to make sure it is well annotated and clearly designed is important. Most of the controls for the models are on the context menu, allowing you to modify individual aspects of the diagram. The context menu changes depending on whether you have an object or line selected, or you're just clicking in the open space. You can also set general defaults using Tools | General Options | Diagram. In this section, we'll look at the various options available when working with the diagrams. Formatting the elements Before moving a relationship line, entity, or table, you can dramatically change the impact and readability of a large diagram just by changing the colors. This is readily demonstrated when importing from two or more schemas. Using the previous example where we imported from two schemas, open one of the subviews and select all of the tables. With the objects selected, invoke the Format Object dialog using the context menu: If this is the first time you are adjusting the colors, the dialog does not display any colors as you open it. The colors used in the diagram are the default settings. Deselect Use Default Color and click on the Background and Border Color items to select and set the new color. When you are done, click on OK and note the changes applied to the subview. Switch to the main relational model to review the impact there. The color applied to the subview is also applied to the main model as shown. This is very useful when illustrating how tables in different schemas relate to each other. For example, take the HR and OE sample schema, all of the tables related to human resources are maintained in the HR schema, while those related to the order entry system are maintained in the OE schema. You may well have applications designed around the HR schema and others tied to the OE schema, but some may involve both. In the following relational model, the OE tables are now colored green, so we're able to identify them, but we can also see where the schemas link. We can see that a CUSTOMER may deal with one EMPLOYEE and has many ORDERS: Selecting all of the tables in a modelSelect any table and click on Select Neighbors from the context menu. Select All Zones, to select all of the tables. Use this instead of Ctrl+A, which selects all tables and all lines. Changing the default format settings Instead of changing individual items or even a set of items, you can change the default color for each of the element types displayed on a diagram. The Tools | General Options | Diagram | Format provides the ability to control the color of each of the elements displayed such as tables, views, and entities: To edit any of the elements in the dialog, double-click on the object, or select the object and the edit icon. This allows you to adjust the color of the item and to format the font. You can use the font color to highlight mandatory, Unique, or Foreign Keys. Setting general diagram properties Use the same Tools | General Options | Diagram to set model properties,which include: Displaying the grid Controlling the background color of the diagram Controlling the Auto Route feature which is on by default Set display properties for certain items on each of the models, including the control of: The diagram notation for the logical model, which supports the Barker and Bachman notations The display of the relationship names for either the logical or relational models The flow names for process models For example, to display the relationship names on an Entity Relationship Diagram (as seen below), check the display property on the Tools | General Options | Model | Logical, and ensure that the Relation Cardinality properties for the relationships are also set. Creating subviews and displays Adding subviews and displays offers you alternative ways of laying out elements on a diagram and for working with subsets of items. You can create multiple subviews and displays for either logical or relational models, and remove them as easily, without impacting the main models. Adding subviews to your design You have already encountered a subview by importing from a number of schemas in the data dictionary. Subviews are not only a reflection of the different schemas in a design, but they can also represent any subset of elements in the design, allowing you to work with a smaller, more manageable set of elements. You can create a subview from the object browser by selecting: The SubViews node and using the New SubView context menu. In this case, you have a new empty diagram that you can populate by dragging tables or entities (depending on the subview in question) onto the page. Any of the model tabs and then selecting the Create SubView menu. This creates a new and empty subview. An element or elements on an existing model and using the Create SubView from selected context menu on the diagram. In this case, the new subview will contain the tables or entities you selected: The layout of the subview is not linked to the main model in any way. What is linked is how you format the items on the subview and any structural changes you make to the objects. You can continue to add new items to the subview by dragging them onto the surface from the object browser. When deleting items from the subview, you should choose whether the item is deleted: From the view (Delete View) From the complete design (Delete Object) Adding displays A display is an alternative view of a diagram, whether a main model or a subview, and is linked directly to that model. If you delete the model, the display is also deleted. Any items that you add or remove from displays are also automatically added or removed from the main model they are linked to. To create a new display, select the tab of any model and select Create Display from the context menu. The new display created is, initially, a replica of the model you selected in both layout and items available. All additional displays are listed at the bottom of the model. In the following example, the HR subview has two displays created, as highlighted at the bottom of the screenshot, the main HR display and the new Display_1. The Create Display context menu is also illustrated: Use the new display to change the layout of the model and to adjust the level of detail displayed on the diagram. A second display of the same model is useful when you want to show more or less detail on a model. You can, for example, create a display which only displays the entity or table names. Right-click in the space on a diagram and select View Details | Names Only. We'll discuss how to layout the diagram elements later. Creating a composite view If you create a number of subviews, create a new diagram showing the composite models of each of these on a single layout. This serves as a useful reminder of the number of subviews or models you have by having a thumbnail of the various layouts. Alternatively, you can add a composite view of one subview and place it on another. To create a composite view, select the model in the browser and drag it onto the diagram surface. You can drag any model onto any other diagram surface, except its own: Once you have the composite folder displayed on the diagram, display the composite model within that folder by selecting Composite View from the context menu. If the model you want to view has a selection of displays, then you can also select the display you want to see within that composite. The following screenshot shows the, subview, displaying the composite models of the HR subview, the main logical model, and both displays of the logical model: Controlling the layout When working with a large number of items in a model, it's important to keep the layout organized. A variety of tools to help with the process are explained in the following sections. Adjusting the level of detail displayed Change the amount of detail displayed in a table (or entity) using the View Details menu. It is invoked with a right-click on the white space of any diagram. The View Details menu has options for displaying: All Details Names Only Columns Datatype Keys Adjusting the width and height across the model If you have a large diagram and want to see how tables or entities relate to each other, you can create a more compact model using a display, without impacting the main model. This can be done by setting the details to display the name only and then resizing and repositioning the objects. In the following screenshot, we have set the model to display only the name of the tables. Create a more compact diagram by resizing one of the tables to a more fitting set of dimensions, select the rest, and then resize them all to the same width and height: Controlling alignment Once you have positioned the items, align them to complete the model. Use the Edit menu with the required items for top and left alignment as shown in the following screenshot: Resizing and alignmentThe first item you select is the one that drives the position for left or top alignment, and the item that controls the width and height of all subsequent items selected.
Read more
  • 0
  • 0
  • 7108

article-image-article-creating-your-first-heat-map-r
Packt
26 Jun 2013
10 min read
Save for later

Creating your first heat map in R

Packt
26 Jun 2013
10 min read
(For more resources related to this topic, see here.) The following image shows one of the heat maps that we are going to create in this recipe from the total count of air passengers: Image Getting ready Download the script 5644_01_01.r from your account at http://www.packtpub.com and save it to your hard disk. The first section of the script, below the comment line starting with ### loading packages, will automatically check for the availability of the R packages gplots and lattice, which are required for this recipe. If those packages are not already installed, you will be prompted to select an official server from the Comprehensive R Archive Network (CRAN) to allow the automatic download and installation of the required packages. If you have already installed those two packages prior to executing the script, I recommend you to update them to the most recent version by calling the following function in the R command line: code Use the source() function in the R command-line to execute an external script from any location on your hard drive. If you start a new R session from the same directory as the location of the script, simply provide the name of the script as an argument in the function call as follows: code   You have to provide the absolute or relative path to the script on your hard drive if you started your R session from a different directory to the location of the script. Refer to the following example: code   You can view the current working directory of your current R session by executing the following command in the R command-line: code   How to do it... Run the 5644OS_01_01.r script in R to execute the following code, and take a look at the output printed on the screen as well as the PDF file, first_heatmaps.pdf that will be created by this script: code How it works... There are different functions for drawing heat maps in R, and each has its own advantages and disadvantages. In this recipe, we will take a look at the levelplot() function from the lattice package to draw our first heat map. Furthermore, we will use the advanced heatmap.2() function from gplots to apply a clustering algorithm to our data and add the resulting dendrograms to our heat maps. The following image shows an overview of the different plotting functions that we are using throughout this book: Image Now let us take a look at how we read in and process data from different data files and formats step-by-step: Loading packages: The first eight lines preceding the ### loading data section will make sure that R loads the lattice and gplots package, which we need for the two heat map functions in this recipe: levelplot() and heatmap.2(). Each time we start a new session in R, we have to load the required packages in order to use the levelplot() and heatmap.2() functions. To do so, enter the following function calls directly into the R command-line or include them at the beginning of your script: library(lattice) library(gplots)   Loading the data set: R includes a package called data, which contains a variety of different data sets for testing and exploration purposes. More information on the different data sets that are contained in the data package can be found at http:// stat.ethz.ch/ROmanual/ROpatched/library/datasets/. For this recipe, we are loading the AirPassenger data set, which is a collection of the total count of air passengers (in thousands) for international airlines from 1949- 1960 in a time-series format. code Converting the data set into a numeric matrix: Before we can use the heat map functions, we need to convert the AirPassenger time-series data into a numeric matrix first. Numeric matrices in R can have characters as row and column labels, but the content itself must consist of one single mode: numerical. We use the matrix() function to create a numeric matrix consisting of 12 columns to which we pass the AirPassenger time-series data row-by-row. Using the argument dimnames = rowcolNames, we provide row and column names that we assigned previously to the variable rowColNames, which is a list of two vectors: a series of 12 strings representing the years 1949 to 1960, and a series of strings for the 12 three-letter abbreviations of the months from January to December, respectively. code A simple heat map using levelplot(): Now that we have converted the AirPassenger data into a numeric matrix format and assigned it to the variable air_data, we can go ahead and construct our first heat map using the levelplot() function from the lattice package: code The levelplot() function creates a simple heat map with a color key to the righthand side of the map. We can use the argument col.regions = heat.colors to change the default color transition to yellow and red. X and y axis labels are specified by the xlab and ylab parameters, respectively, and the main parameter gives our heat map its caption. In contrast to most of the other plotting functions in R, the lattice package returns objects, so we have to use the print() function in our script if we want to save the plot to a data file. In an interactive R session, the print() call can be omitted. Typing the name of the variable will automatically display the referring object on the screen. Creating enhanced heat maps with heatmap.2(): Next, we will use the heatmap.2() function to apply a clustering algorithm to the AirPassenger data and to add row and column dendrograms to our heat map: code Hierarchical clustering is especially popular in gene expression analyses. It is a very powerful method for grouping data to reveal interesting trends and patterns in the data matrix. Another neat feature of heatmap.2() is that you can display a histogram of the count of the individual values inside the color key by including the argument density.info = NULL in the function call. Alternatively, you can set density. info = "density" for displaying a density plot inside the color key. By adding the argument keysize = 1.8, we are slightly increasing the size of the color key—the default value of keysize is 1.5: code Did you notice the missing row dendrogram in the resulting heat map? This is due to the argument dendrogram = "column" that we passed to the heat map function. Similarly, you can type row instead of column to suppress the column dendrogram, or use neither to draw no dendrogram at all. There's more... By default, levelplot() places the color key on the right-hand side of the heat map, but it can be easily moved to the top, bottom, or left-hand side of the map by modifying the space parameter of colorkey: Replacing top by left or bottom will place the color key on the left-hand side or on the bottom of the heat map, respectively. Moving around the color key for heatmap.2() can be a little bit more of a hassle. In this case we have to modify the parameters of the layout() function. By default, heatmap.2() passes a matrix, lmat, to layout(), which has the following content: code The numbers in the preceding matrix specify the locations of the different visual elements on the plot (1 implies heat map, 2 implies row dendrogram, 3 implies column dendrogram, and 4 implies key). If we want to change the position of the key, we have to modify and rearrange those values of lmat that heatmap.2() passes to layout(). For example, if we want to place the color key at the bottom left-hand corner of the heat map, we need to create a new matrix for lmat as follows: code We can construct such a matrix by using the rbind() function and assigning it to lmat: code Furthermore, we have to pass an argument for the column height parameter lhei to heatmap.2(), which will allow us to use our modified lmat matrix for rearranging the color key: code If you don't need a color key for your heat map, you could turn it off by using the argument key = FALSE for heatmap.2() and colorkey = FALSE for levelplot(), respectively. R also has a base function for creating heat maps that does not require you to install external packages and is most advantageous if you can go without a color key. The syntax is very similar to the heatmap.2() function, and all options for heatmap.2() that we have seen in this recipe also apply to heatmap(): code More information on dendrograms and clustering By default, the dendrograms of heatmap.2() are created by a hierarchical agglomerate clustering method, also known as bottom-up clustering. In this approach, all individual objects start as individual clusters and are successively merged until only one single cluster remains. The distance between a pair of clusters is calculated by the farthest neighbor method, also called the complete linkage method, which is based by default on the Euclidean distance of the two points from both clusters that are farthest apart from each other. The computed dendrograms are then reordered based on the row and column means. By modifying the default parameters of the dist() function, we can use another distance measure rather than the Euclidean distance. For example, if we want to use the Manhattan distance measure (based on a grid-like path rather than a direct connection between two objects), we would modify the method parameter of the dist() function and assign it to a variable distance first: code Other options for the method parameter are: euclidean (default), maximum, canberra, binary, or minkowski. To use other agglomeration methods than the complete linkage method, we modify the method parameter in the hclust() function and assign it to another variable cluster. Note the first argument distance that we pass to the hclust() function, which comes from our previous assignment: code By setting the method parameter to ward, R will use Joe H. Ward's minimum variance method for hierarchical clustering. Other options for the method parameter that we can pass as arguments to hclust() are: complete (default), single, average, mcquitty, median, or centroid. To use our modified clustering parameters, we simply call the as.dendrogram() function within heatmap.2() using the variable cluster that we assigned previously: code We can also draw the cluster dendrogram without the heat map by using the plot() function: code To turn off row and column reordering, we need to turn off the dendrograms and set the parameters Colv and Rowv to NA: code Summary This article has helped us create our first heat maps from a small data set provided in R. We have used different heat map functions in R to get a first impression of their functionalities. Resources for Article :   Further resources on this subject: Getting started with Leaflet [Article] Moodle 1.9: Working with Mind Maps [Article] Joomla! with Flash: Showing maps using YOS amMap [Article]
Read more
  • 0
  • 0
  • 7099

article-image-machine-learning-using-spark-mllib
Packt
01 Apr 2015
22 min read
Save for later

Machine Learning Using Spark MLlib

Packt
01 Apr 2015
22 min read
This Spark machine learning tutorial is by Krishna Sankar, the author of Fast Data Processing with Spark Second Edition. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning algorithms. But the caveat is that all machine learning algorithms cannot be effectively parallelized. Each algorithm has its own challenges for parallelization, whether it is task parallelism or data parallelism. Having said that, Spark is becoming the de-facto platform for building machine learning algorithms and applications. For example, Apache Mahout is moving away from Hadoop MapReduce and implementing the algorithms in Spark (see the first reference at the end of this article). The developers working on the Spark MLlib are implementing more and more machine algorithms in a scalable and concise manner in the Spark framework. For the latest information on this, you can refer to the Spark site at https://spark.apache.org/docs/latest/mllib-guide.html, which is the authoritative source. This article covers the following machine learning algorithms: Basic statistics Linear regression Classification Clustering Recommendations The Spark machine learning algorithm table The Spark machine learning algorithms implemented in Spark 1.1.0 org.apache.spark.mllib for Scala and Java, and in pyspark.mllib for Python is shown in the following table: Algorithm Feature Notes Basic statistics Summary statistics Mean, variance, count, max, min, and numNonZeros   Correlations Spearman and Pearson correlation   Stratified sampling sampleBykey, sampleByKeyExact—With and without replacement   Hypothesis testing Pearson's chi-squared goodness of fit test   Random data generation RandomRDDs Normal, Poisson, and so on Regression Linear models Linear regression—least square, Lasso, and ridge regression Classification Binary classification Logistic regression, SVM, decision trees, and naïve Bayes   Multi-class classification Decision trees, naïve Bayes, and so on Recommendation Collaborative filtering Alternating least squares Clustering k-means   Dimensionality reduction SVD PCA   Feature extraction TF-IDF Word2Vec StandardScaler Normalizer   Optimization SGD L-BFGS   Spark MLlib examples Now, let's look at how to use the algorithms. Naturally, we need interesting datasets to implement the algorithms; we will use appropriate datasets for the algorithms shown in the next section. The code and data files are available in the GitHub repository at https://github.com/xsankar/fdps-vii. We'll keep it updated with corrections. Basic statistics Let's read the car mileage data into an RDD and then compute some basic statistics. We will use a simple parse class to parse a line of data. This will work if you know the type and the structure of your CSV file. We will use this technique for the examples in this article: import org.apache.spark.SparkContext import org.apache.spark.mllib.stat. {MultivariateStatisticalSummary, Statistics} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.rdd.RDD   object MLlib01 { // def getCurrentDirectory = new java.io.File( "." ).getCanonicalPath // def parseCarData(inpLine : String) : Array[Double] = {    val values = inpLine.split(',')    val mpg = values(0).toDouble    val displacement = values(1).toDouble    val hp = values(2).toInt    val torque = values(3).toInt    val CRatio = values(4).toDouble    val RARatio = values(5).toDouble    val CarbBarrells = values(6).toInt    val NoOfSpeed = values(7).toInt    val length = values(8).toDouble    val width = values(9).toDouble    val weight = values(10).toDouble    val automatic = values(11).toInt    return Array(mpg,displacement,hp,    torque,CRatio,RARatio,CarbBarrells,    NoOfSpeed,length,width,weight,automatic) } // def main(args: Array[String]) {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 9")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/car-     milage-no-hdr.csv")    val carRDD = dataFile.map(line => parseCarData(line))    //    // Let us find summary statistics    //    val vectors: RDD[Vector] = carRDD.map(v => Vectors.dense(v))    val summary = Statistics.colStats(vectors)    carRDD.foreach(ln=> {ln.foreach(no => print("%6.2f | "     .format(no))); println()})    print("Max :");summary.max.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Min :");summary.min.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Mean :");summary.mean.toArray.foreach(m => print("%5.1f     | ".format(m)));println    } } This program will produce the following output: Let's also run some correlations, as shown here: // // correlations // val hp = vectors.map(x => x(2)) val weight = vectors.map(x => x(10)) var corP = Statistics.corr(hp,weight,"pearson") // default println("hp to weight : Pearson Correlation = %2.4f".format(corP)) var corS = Statistics.corr(hp,weight,"spearman") // Need to   specify println("hp to weight : Spearman Correlation = %2.4f" .format(corS)) // val raRatio = vectors.map(x => x(5)) val width = vectors.map(x => x(9)) corP = Statistics.corr(raRatio,width,"pearson") // default println("raRatio to width : Pearson Correlation = %2.4f" .format(corP)) corS = Statistics.corr(raRatio,width,"spearman") // Need to   specify println("raRatio to width : Spearman Correlation = %2.4f" .format(corS)) // This will produce interesting results as shown in the next screenshot: While this might seem too much work to calculate the correlation of a tiny dataset, remember that this will scale to datasets consisting of 1,000,000 rows or even a billion rows! Linear regression Linear regression takes a little more work than statistics. We need the LabeledPoint class as well as a few more parameters such as the learning rate, that is, the step size. We will also split the dataset into training and test, as shown here:    //    // def carDataToLP(inpArray : Array[Double]) : LabeledPoint = {    return new LabeledPoint( inpArray(0),Vectors.dense (       inpArray(1), inpArray(2), inpArray(3),       inpArray(4), inpArray(5), inpArray(6), inpArray(7),       inpArray(8), inpArray(9), inpArray(10), inpArray(11) ) )    } // Linear Regression    //    val carRDDLP = carRDD.map(x => carDataToLP(x)) // create a     labeled point RDD    println(carRDDLP.count())    println(carRDDLP.first().label)    println(carRDDLP.first().features)    //    // Let us split the data set into training & test set using a     very simple filter    //    val carRDDLPTrain = carRDDLP.filter( x => x.features(9) <=     4000)    val carRDDLPTest = carRDDLP.filter( x => x.features(9) > 4000)    println("Training Set : " + "%3d".format     (carRDDLPTrain.count()))    println("Training Set : " + "%3d".format(carRDDLPTest.count()))    //    // Train a Linear Regression Model    // numIterations = 100, stepsize = 0.000000001    // without such a small step size the algorithm will diverge    //    val mdlLR = LinearRegressionWithSGD.train     (carRDDLPTrain,100,0.000000001)    println(mdlLR.intercept) // Intercept is turned off when using     LinearRegressionSGD object, so intercept will always be 0 for     this code      println(mdlLR.weights)    //    // Now let us use the model to predict our test set    //    val valuesAndPreds = carRDDLPTest.map(p => (p.label,     mdlLR.predict(p.features)))    val mse = valuesAndPreds.map( vp => math.pow( (vp._1 - vp._2),2     ) ).        reduce(_+_) / valuesAndPreds.count()    println("Mean Squared Error     = " + "%6.3f".format(mse))    println("Root Mean Squared Error = " + "%6.3f"     .format(math.sqrt(mse)))    // Let us print what the model predicted    valuesAndPreds.take(20).foreach(m => println("%5.1f | %5.1f |"     .format(m._1,m._2))) The run result will be as expected, as shown in the next screenshot: The prediction is not that impressive. There are a couple of reasons for this. There might be quadratic effects; some of the variables might be correlated (for example, length, width, and weight, and so we might not need all three to predict the mpg value). Finally, we might not need all the 10 features anyways. I leave it to you to try with different combinations of features. (In the parseCarData function, take only a subset of the variables; for example take hp, weight, and number of speed and see which combination minimizes the mse value.) Classification Classification is very similar to linear regression. The algorithms take labeled points, and the train process has various parameters to tweak the algorithm to fit the needs of an application. The returned model can be used to predict the class of a labeled point. Here is a quick example using the titanic dataset: For our example, we will keep the same structure as the linear regression example. First, we will parse the full dataset line and then later keep it simple by creating a labeled point with a set of selected features, as shown in the following code: import org.apache.spark.SparkContext import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.tree.DecisionTree   object Chapter0802 { // def getCurrentDirectory = new java.io.File( "."     ).getCanonicalPath // // 0 pclass,1 survived,2 l.name,3.f.name, 4 sex,5 age,6 sibsp,7       parch,8 ticket,9 fare,10 cabin, // 11 embarked,12 boat,13 body,14 home.dest // def str2Double(x: String) : Double = {    try {      x.toDouble    } catch {      case e: Exception => 0.0    } } // def parsePassengerDataToLP(inpLine : String) : LabeledPoint = {    val values = inpLine.split(',')    //println(values)    //println(values.length)    //    val pclass = str2Double(values(0))    val survived = str2Double(values(1))    // skip last name, first name    var sex = 0    if (values(4) == "male") {      sex = 1    }    var age = 0.0 // a better choice would be the average of all       ages    age = str2Double(values(5))    //    var sibsp = 0.0    age = str2Double(values(6))    //    var parch = 0.0    age = str2Double(values(7))    //    var fare = 0.0    fare = str2Double(values(9))    return new LabeledPoint(survived,Vectors.dense     (pclass,sex,age,sibsp,parch,fare)) } Now that we have setup the routines to parse the data, let's dive into the main program: // def main(args: Array[String]): Unit = {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014     /titanic/titanic3_01.csv")    val titanicRDDLP = dataFile.map(_.trim).filter( _.length > 1).      map(line => parsePassengerDataToLP(line))    //    println(titanicRDDLP.count())    //titanicRDDLP.foreach(println)    //    println(titanicRDDLP.first().label)    println(titanicRDDLP.first().features)    //    val categoricalFeaturesInfo = Map[Int, Int]()    val mdlTree = DecisionTree.trainClassifier(titanicRDDLP, 2, //       numClasses        categoricalFeaturesInfo, // all features are continuous        "gini", // impurity        5, // Maxdepth        32) //maxBins    //    println(mdlTree.depth)    println(mdlTree) The tree is interesting to inspect. Check it out here:    //    // Let us predict on the dataset and see how well it works.    // In the real world, we should split the data to train & test       and then predict the test data:    //    val predictions = mdlTree.predict(titanicRDDLP.     map(x=>x.features))    val labelsAndPreds = titanicRDDLP.     map(x=>x.label).zip(predictions)    //    val mse = labelsAndPreds.map( vp => math.pow( (vp._1 -       vp._2),2 ) ).        reduce(_+_) / labelsAndPreds.count()    println("Mean Squared Error = " + "%6f".format(mse))    //    // labelsAndPreds.foreach(println)    //    val correctVals = labelsAndPreds.aggregate(0.0)((x, rec) => x       + (rec._1 == rec._2).compare(false), _ + _)    val accuracy = correctVals/labelsAndPreds.count()    println("Accuracy = " + "%3.2f%%".format(accuracy*100))    //    println("*** Done ***") } } The result obtained when you run the program is as expected. The printout of the tree is interesting, as shown here: Running Spark Version 1.1.1 14/11/28 18:41:27 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=2061647216 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: count at Chapter0802.scala:56, took 0.260993 s 1309 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:59 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:59, took 0.016479 s 1.0 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:60 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:60, took 0.014408 s [1.0,0.0,0.0,0.0,0.0,211.3375] 14/11/28 18:41:27 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:66 [..] 14/11/28 18:41:28 INFO DecisionTree: Internal timing for DecisionTree: 14/11/28 18:41:28 INFO DecisionTree:   init: 0.36408 total: 0.95518 extractNodeInfo: 7.3E-4 findSplitsBins: 0.249814 extractInfoForLowerLevels: 7.74E-4 findBestSplits: 0.565394 chooseSplits: 0.201012 aggregation: 0.362411 5 DecisionTreeModel classifier If (feature 1 <= 0.0)    If (feature 0 <= 2.0)    If (feature 5 <= 26.0)      If (feature 2 <= 1.0)      If (feature 0 <= 1.0)        Predict: 1.0      Else (feature 0 > 1.0)        Predict: 1.0      Else (feature 2 > 1.0)      Predict: 1.0    Else (feature 5 > 26.0)      If (feature 2 <= 1.0)      If (feature 5 <= 38.0021)        Predict: 1.0      Else (feature 5 > 38.0021)        Predict: 1.0      Else (feature 2 > 1.0)      If (feature 5 <= 79.42500000000001)        Predict: 1.0      Else (feature 5 > 79.42500000000001)        Predict: 1.0    Else (feature 0 > 2.0)    If (feature 5 <= 25.4667)      If (feature 5 <= 7.2292)      If (feature 5 <= 7.05)        Predict: 1.0      Else (feature 5 > 7.05)        Predict: 1.0      Else (feature 5 > 7.2292)      If (feature 5 <= 15.5646)        Predict: 0.0      Else (feature 5 > 15.5646)        Predict: 1.0    Else (feature 5 > 25.4667)      If (feature 5 <= 38.0021)      If (feature 5 <= 30.6958)        Predict: 0.0      Else (feature 5 > 30.6958)        Predict: 0.0      Else (feature 5 > 38.0021)      Predict: 0.0 Else (feature 1 > 0.0)    If (feature 0 <= 1.0)    If (feature 5 <= 26.0)      If (feature 5 <= 7.05)      If (feature 5 <= 0.0)        Predict: 0.0      Else (feature 5 > 0.0)        Predict: 0.0      Else (feature 5 > 7.05)      Predict: 0.0    Else (feature 5 > 26.0)      If (feature 5 <= 30.6958)      If (feature 2 <= 0.0)        Predict: 0.0      Else (feature 2 > 0.0)        Predict: 0.0      Else (feature 5 > 30.6958)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 1.0    Else (feature 0 > 1.0)    If (feature 2 <= 0.0)      If (feature 5 <= 38.0021)      If (feature 5 <= 14.4583)        Predict: 0.0      Else (feature 5 > 14.4583)        Predict: 0.0      Else (feature 5 > 38.0021)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 1.0    Else (feature 2 > 0.0)      If (feature 5 <= 26.0)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 0.0      Else (feature 5 > 26.0)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 0.0   14/11/28 18:41:28 INFO SparkContext: Starting job: reduce at Chapter0802.scala:79 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:79, took 0.077973 s Mean Squared Error = 0.200153 14/11/28 18:41:28 INFO SparkContext: Starting job: aggregate at Chapter0802.scala:84 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:85, took 0.042592 s Accuracy = 79.98% *** Done *** In the real world, one would create a training and a test dataset and train the model on the training dataset and then predict on the test dataset. Then we can calculate the mse and minimize it on various feature combinations, some of which could also be engineered features. Clustering Spark MLlib has implemented the k-means clustering algorithm. The model training and prediction interfaces are similar to other machine learning algorithms. Let's see how it works by going through an example. Let's use a sample data that has two dimensions x and y. The plot of the points would look like the following screenshot: From the preceding graph, we can see that four clusters form one solution. Let's try with k=2 and k=4. Let's see how the Spark clustering algorithm handles this dataset and the groupings: import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.{Vector,Vectors} import org.apache.spark.mllib.clustering.KMeans   object Chapter0803 { def parsePoints(inpLine : String) : Vector = {    val values = inpLine.split(',')    val x = values(0).toInt    val y = values(1).toInt    return Vectors.dense(x,y) } //   def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014/cluster-     points/cluster-points.csv")    val points = dataFile.map(_.trim).filter( _.length > 1).     map(line => parsePoints(line))    //  println(points.count())    //    var numClusters = 2    val numIterations = 20    var mdlKMeans = KMeans.train(points, numClusters,       numIterations)    //    println(mdlKMeans.clusterCenters)    //    var clusterPred = points.map(x=>mdlKMeans.predict(x))    var clusterMap = points.zip(clusterPred)    //    clusterMap.foreach(println)    //    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/2-cluster.csv")    //    // Now let us try 4 centers:    //    numClusters = 4    mdlKMeans = KMeans.train(points, numClusters, numIterations)    clusterPred = points.map(x=>mdlKMeans.predict(x))    clusterMap = points.zip(clusterPred)    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/4-cluster.csv")    clusterMap.foreach(println) } } The results of the run would be as shown in the next screenshot (your run could give slightly different results): The k=2 graph shown in the next screenshot looks as expected: With k=4 the results are as shown in the following screenshot: The plot shown in the following screenshot confirms that the clusters are obtained as expected. Spark does understand clustering! Bear in mind that the results could vary a little between runs because the clustering algorithm picks the centers randomly and grows from there. With k=4, the results are stable; but with k=2, there is room for partitioning the points in different ways. Try it out a few times and see the results. Recommendation The recommendation algorithms fall under five general mechanisms, namely, knowledge-based, demographic-based, content-based, collaborative filtering (item-based or user-based), and latent factor-based. Usually, the collaborative filtering is computationally intensive—Spark implements the Alternating Least Square (ALS) algorithm authored by Yehuda Koren, available at http://dl.acm.org/citation.cfm?id=1608614. It is user-based collaborative filtering using the method of learning latent factors, which can scale to a large dataset. Let's quickly use the movielens medium dataset to implement a recommendation using Spark. There are some interesting RDD transformations. Apart from that, the code is not that complex, as shown next: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ // for implicit   conversations import org.apache.spark.mllib.recommendation.Rating import org.apache.spark.mllib.recommendation.ALS   object Chapter0804 { def parseRating1(line : String) : (Int,Int,Double,Int) = {    //println(x)    val x = line.split("::")    val userId = x(0).toInt    val movieId = x(1).toInt    val rating = x(2).toDouble    val timeStamp = x(3).toInt/10    return (userId,movieId,rating,timeStamp) } // def parseRating(x : (Int,Int,Double,Int)) : Rating = {    val userId = x._1    val movieId = x._2    val rating = x._3    val timeStamp = x._4 // ignore    return new Rating(userId,movieId,rating) } // Now that we have the parsers in place, let's focus on the main program, as shown next: def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val moviesFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/movies.dat")    val moviesRDD = moviesFile.map(line => line.split("::"))    println(moviesRDD.count())    //    val ratingsFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/ratings.dat")    val ratingsRDD = ratingsFile.map(line => parseRating1(line))    println(ratingsRDD.count())    //    ratingsRDD.take(5).foreach(println) // always check the RDD    //    val numRatings = ratingsRDD.count()    val numUsers = ratingsRDD.map(r => r._1).distinct().count()    val numMovies = ratingsRDD.map(r => r._2).distinct().count()    println("Got %d ratings from %d users on %d movies.".          format(numRatings, numUsers, numMovies)) Split the dataset into training, validation, and test. We can use any random dataset. But here we will use the last digit of the timestamp: val trainSet = ratingsRDD.filter(x => (x._4 % 10) < 6) .map(x=>parseRating(x))    val validationSet = ratingsRDD.filter(x => (x._4 % 10) >= 6 &       (x._4 % 10) < 8).map(x=>parseRating(x))    val testSet = ratingsRDD.filter(x => (x._4 % 10) >= 8)     .map(x=>parseRating(x))    println("Training: "+ "%d".format(trainSet.count()) +      ", validation: " + "%d".format(validationSet.count()) + ",         test: " + "%d".format(testSet.count()) + ".")    //    // Now train the model using the training set:    val rank = 10    val numIterations = 20    val mdlALS = ALS.train(trainSet,rank,numIterations)    //    // prepare validation set for prediction    //    val userMovie = validationSet.map {      case Rating(user, movie, rate) =>(user, movie)    }    //    // Predict and convert to Key-Value PairRDD    val predictions = mdlALS.predict(userMovie).map {      case Rating(user, movie, rate) => ((user, movie), rate)    }    //    println(predictions.count())    predictions.take(5).foreach(println)    //    // Now convert the validation set to PairRDD:    //    val validationPairRDD = validationSet.map(r => ((r.user,       r.product), r.rating))    println(validationPairRDD.count())    validationPairRDD.take(5).foreach(println)    println(validationPairRDD.getClass())    println(predictions.getClass())    //    // Now join the validation set with predictions.    // Then we can figure out how good our recommendations are.    // Tip:    //   Need to import org.apache.spark.SparkContext._    //   Then MappedRDD would be converted implicitly to PairRDD    //    val ratingsAndPreds = validationPairRDD.join(predictions)    println(ratingsAndPreds.count())    ratingsAndPreds.take(3).foreach(println)    //    val mse = ratingsAndPreds.map(r => {      math.pow((r._2._1 - r._2._2),2)    }).reduce(_+_) / ratingsAndPreds.count()    val rmse = math.sqrt(mse)    println("MSE = %2.5f".format(mse) + " RMSE = %2.5f"     .format(rmse))    println("** Done **") } } The run results, as shown in the next screenshot, are obtained as expected: Check the following screenshot as well: Some more information is available at: The Goodby MapReduce article from Mahout News (https://mahout.apache.org/) https://spark.apache.org/docs/latest/mllib-guide.html A Collaborative Filtering ALS paper (http://dl.acm.org/citation.cfm?id=1608614) A good presentation on decision trees (http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf) A recommended hands-on exercise from Spark Summit 2014 (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) Summary In this article, we looked at the most common machine learning algorithms. Naturally, ML is a vast subject and requires lot more study, experimentation, and practical experience on interesting data science problems. Two books that are relevant to Spark Machine Learning are Packt's own books Machine Learning with Spark, Nick Pentreath, and O'Reilly's Advanced Analytics with Spark, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Both are excellent books that you can refer to. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] The Spark programming model [article] Using the Spark Shell [article]
Read more
  • 0
  • 0
  • 7085
article-image-understanding-hadoop-backup-and-recovery-needs
Packt
10 Aug 2015
25 min read
Save for later

Understanding Hadoop Backup and Recovery Needs

Packt
10 Aug 2015
25 min read
In this article by Gaurav Barot, Chintan Mehta, and Amij Patel, authors of the book Hadoop Backup and Recovery Solutions, we will discuss backup and recovery needs. In the present age of information explosion, data is the backbone of business organizations of all sizes. We need a complete data backup and recovery system and a strategy to ensure that critical data is available and accessible when the organizations need it. Data must be protected against loss, damage, theft, and unauthorized changes. If disaster strikes, data recovery must be swift and smooth so that business does not get impacted. Every organization has its own data backup and recovery needs, and priorities based on the applications and systems they are using. Today's IT organizations face the challenge of implementing reliable backup and recovery solutions in the most efficient, cost-effective manner. To meet this challenge, we need to carefully define our business requirements and recovery objectives before deciding on the right backup and recovery strategies or technologies to deploy. (For more resources related to this topic, see here.) Before jumping onto the implementation approach, we first need to know about the backup and recovery strategies and how to efficiently plan them. Understanding the backup and recovery philosophies Backup and recovery is becoming more challenging and complicated, especially with the explosion of data growth and increasing need for data security today. Imagine big players such as Facebook, Yahoo! (the first to implement Hadoop), eBay, and more; how challenging it will be for them to handle unprecedented volumes and velocities of unstructured data, something which traditional relational databases can't handle and deliver. To emphasize the importance of backup, let's take a look at a study conducted in 2009. This was the time when Hadoop was evolving and a handful of bugs still existed in Hadoop. Yahoo! had about 20,000 nodes running Apache Hadoop in 10 different clusters. HDFS lost only 650 blocks, out of 329 million total blocks. Now hold on a second. These blocks were lost due to the bugs found in the Hadoop package. So, imagine what the scenario would be now. I am sure you will bet on losing hardly a block. Being a backup manager, your utmost target is to think, make, strategize, and execute a foolproof backup strategy capable of retrieving data after any disaster. Solely speaking, the plan of the strategy is to protect the files in HDFS against disastrous situations and revamp the files back to their normal state, just like James Bond resurrects after so many blows and probably death-like situations. Coming back to the backup manager's role, the following are the activities of this role: Testing out various case scenarios to forestall any threats, if any, in the future Building a stable recovery point and setup for backup and recovery situations Preplanning and daily organization of the backup schedule Constantly supervising the backup and recovery process and avoiding threats, if any Repairing and constructing solutions for backup processes The ability to reheal, that is, recover from data threats, if they arise (the resurrection power) Data protection is one of the activities and it includes the tasks of maintaining data replicas for long-term storage Resettling data from one destination to another Basically, backup and recovery strategies should cover all the areas mentioned here. For any system data, application, or configuration, transaction logs are mission critical, though it depends on the datasets, configurations, and applications that are used to design the backup and recovery strategies. Hadoop is all about big data processing. After gathering some exabytes for data processing, the following are the obvious questions that we may come up with: What's the best way to back up data? Do we really need to take a backup of these large chunks of data? Where will we find more storage space if the current storage space runs out? Will we have to maintain distributed systems? What if our backup storage unit gets corrupted? The answer to the preceding questions depends on the situation you may be facing; let's see a few situations. One of the situations is where you may be dealing with a plethora of data. Hadoop is used for fact-finding semantics and data is in abundance. Here, the span of data is short; it is short lived and important sources of the data are already backed up. Such is the scenario wherein the policy of not backing up data at all is feasible, as there are already three copies (replicas) in our data nodes (HDFS). Moreover, since Hadoop is still vulnerable to human error, a backup of configuration files and NameNode metadata (dfs.name.dir) should be created. You may find yourself facing a situation where the data center on which Hadoop runs crashes and the data is not available as of now; this results in a failure to connect with mission-critical data. A possible solution here is to back up Hadoop, like any other cluster (the Hadoop command is Hadoop). Replication of data using DistCp To replicate data, the distcp command writes data to two different clusters. Let's look at the distcp command with a few examples or options. DistCp is a handy tool used for large inter/intra cluster copying. It basically expands a list of files to input in order to map tasks, each of which will copy files that are specified in the source list. Let's understand how to use distcp with some of the basic examples. The most common use case of distcp is intercluster copying. Let's see an example: bash$ hadoop distcp2 hdfs://ka-16:8020/parth/ghiya hdfs://ka-001:8020/knowarth/parth This command will expand the namespace under /parth/ghiya on the ka-16 NameNode into the temporary file, get its content, divide them among a set of map tasks, and start copying the process on each task tracker from ka-16 to ka-001. The command used for copying can be generalized as follows: hadoop distcp2 hftp://namenode-location:50070/basePath hdfs://namenode-location Here, hftp://namenode-location:50070/basePath is the source and hdfs://namenode-location is the destination. In the preceding command, namenode-location refers to the hostname and 50070 is the NameNode's HTTP server post. Updating and overwriting using DistCp The -update option is used when we want to copy files from the source that don't exist on the target or have some different contents, which we do not want to erase. The -overwrite option overwrites the target files even if they exist at the source. The files can be invoked by simply adding -update and -overwrite. In the example, we used distcp2, which is an advanced version of DistCp. The process will go smoothly even if we use the distcp command. Now, let's look at two versions of DistCp, the legacy DistCp or just DistCp and the new DistCp or the DistCp2: During the intercluster copy process, files that were skipped during the copy process have all their file attributes (permissions, owner group information, and so on) unchanged when we copy using legacy DistCp or just DistCp. This, however, is not the case in new DistCp. These values are now updated even if a file is skipped. Empty root directories among the source inputs were not created in the target folder in legacy DistCp, which is not the case anymore in the new DistCp. There is a common misconception that Hadoop protects data loss; therefore, we don't need to back up the data in the Hadoop cluster. Since Hadoop replicates data three times by default, this sounds like a safe statement; however, it is not 100 percent safe. While Hadoop protects from hardware failure on the data nodes—meaning that if one entire node goes down, you will not lose any data—there are other ways in which data loss may occur. Data loss may occur due to various reasons, such as Hadoop being highly susceptible to human errors, corrupted data writes, accidental deletions, rack failures, and many such instances. Any of these reasons are likely to cause data loss. Consider an example where a corrupt application can destroy all data replications. During the process, it will attempt to compute each replication and on not finding a possible match, it will delete the replica. User deletions are another example of how data can be lost, as Hadoop's trash mechanism is not enabled by default. Also, one of the most complicated and expensive-to-implement aspects of protecting data in Hadoop is the disaster recovery plan. There are many different approaches to this, and determining which approach is right requires a balance between cost, complexity, and recovery time. A real-life scenario can be Facebook. The data that Facebook holds increases exponentially from 15 TB to 30 PB, that is, 3,000 times the Library of Congress. With increasing data, the problem faced was physical movement of the machines to the new data center, which required man power. Plus, it also impacted services for a period of time. Data availability in a short period of time is a requirement for any service; that's when Facebook started exploring Hadoop. To conquer the problem while dealing with such large repositories of data is yet another headache. The reason why Hadoop was invented was to keep the data bound to neighborhoods on commodity servers and reasonable local storage, and to provide maximum availability to data within the neighborhood. So, a data plan is incomplete without data backup and recovery planning. A big data execution using Hadoop states a situation wherein the focus on the potential to recover from a crisis is mandatory. The backup philosophy We need to determine whether Hadoop, the processes and applications that run on top of it (Pig, Hive, HDFS, and more), and specifically the data stored in HDFS are mission critical. If the data center where Hadoop is running disappeared, will the business stop? Some of the key points that have to be taken into consideration have been explained in the sections that follow; by combining these points, we will arrive at the core of the backup philosophy. Changes since the last backup Considering the backup philosophy that we need to construct, the first thing we are going to look at are changes. We have a sound application running and then we add some changes. In case our system crashes and we need to go back to our last safe state, our backup strategy should have a clause of the changes that have been made. These changes can be either database changes or configuration changes. Our clause should include the following points in order to construct a sound backup strategy: Changes we made since our last backup The count of files changed Ensure that our changes are tracked The possibility of bugs in user applications since the last change implemented, which may cause hindrance and it may be necessary to go back to the last safe state After applying new changes to the last backup, if the application doesn't work as expected, then high priority should be given to the activity of taking the application back to its last safe state or backup. This ensures that the user is not interrupted while using the application or product. The rate of new data arrival The next thing we are going to look at is how many changes we are dealing with. Is our application being updated so much that we are not able to decide what the last stable version was? Data is produced at a surpassing rate. Consider Facebook, which alone produces 250 TB of data a day. Data production occurs at an exponential rate. Soon, terms such as zettabytes will come upon a common place. Our clause should include the following points in order to construct a sound backup: The rate at which new data is arriving The need for backing up each and every change The time factor involved in backup between two changes Policies to have a reserve backup storage The size of the cluster The size of a cluster is yet another important factor, wherein we will have to select cluster size such that it will allow us to optimize the environment for our purpose with exceptional results. Recalling the Yahoo! example, Yahoo! has 10 clusters all over the world, covering 20,000 nodes. Also, Yahoo! has the maximum number of nodes in its large clusters. Our clause should include the following points in order to construct a sound backup: Selecting the right resource, which will allow us to optimize our environment. The selection of the right resources will vary as per need. Say, for instance, users with I/O-intensive workloads will go for more spindles per core. A Hadoop cluster contains four types of roles, that is, NameNode, JobTracker, TaskTracker, and DataNode. Handling the complexities of optimizing a distributed data center. Priority of the datasets The next thing we are going to look at are the new datasets, which are arriving. With the increase in the rate of new data arrivals, we always face a dilemma of what to backup. Are we tracking all the changes in the backup? Now, if are we backing up all the changes, will our performance be compromised? Our clause should include the following points in order to construct a sound backup: Making the right backup of the dataset Taking backups at a rate that will not compromise performance Selecting the datasets or parts of datasets The next thing we are going to look at is what exactly is backed up. When we deal with large chunks of data, there's always a thought in our mind: Did we miss anything while selecting the datasets or parts of datasets that have not been backed up yet? Our clause should include the following points in order to construct a sound backup: Backup of necessary configuration files Backup of files and application changes The timeliness of data backups With such a huge amount of data collected daily (Facebook), the time interval between backups is yet another important factor. Do we back up our data daily? In two days? In three days? Should we backup small chunks of data daily, or should we back up larger chunks at a later period? Our clause should include the following points in order to construct a sound backup: Dealing with any impacts if the time interval between two backups is large Monitoring a timely backup strategy and going through it The frequency of data backups depends on various aspects. Firstly, it depends on the application and usage. If it is I/O intensive, we may need more backups, as each dataset is not worth losing. If it is not so I/O intensive, we may keep the frequency low. We can determine the timeliness of data backups from the following points: The amount of data that we need to backup The rate at which new updates are coming Determining the window of possible data loss and making it as low as possible Critical datasets that need to be backed up Configuration and permission files that need to be backed up Reducing the window of possible data loss The next thing we are going to look at is how to minimize the window of possible data loss. If our backup frequency is great then what are the chances of data loss? What's our chance of recovering the latest files? Our clause should include the following points in order to construct a sound backup: The potential to recover latest files in the case of a disaster Having a low data-loss probability Backup consistency The next thing we are going to look at is backup consistency. The probability of invalid backups should be less or even better zero. This is because if invalid backups are not tracked, then copies of invalid backups will be made further, which will again disrupt our backup process. Our clause should include the following points in order to construct a sound backup: Avoid copying data when it's being changed Possibly, construct a shell script, which takes timely backups Ensure that the shell script is bug-free Avoiding invalid backups We are going to continue the discussion on invalid backups. As you saw, HDFS makes three copies of our backup for the recovery process. What if the original backup was flawed with errors or bugs? The three copies will be corrupted copies; now, when we recover these flawed copies, the result indeed will be a catastrophe. Our clause should include the following points in order to construct a sound backup: Avoid having a long backup frequency Have the right backup process, and probably having an automated shell script Track unnecessary backups If our backup clause covers all the preceding mentioned points, we surely are on the way to making a good backup strategy. A good backup policy basically covers all these points; so, if a disaster occurs, it always aims to go to the last stable state. That's all about backups. Moving on, let's say a disaster occurs and we need to go to the last stable state. Let's have a look at the recovery philosophy and all the points that make a sound recovery strategy. The recovery philosophy After a deadly storm, we always try to recover from the after-effects of the storm. Similarly, after a disaster, we try to recover from the effects of the disaster. In just one moment, storage capacity which was a boon turns into a curse and just another expensive, useless thing. Starting off with the best question, what will be the best recovery philosophy? Well, it's obvious that the best philosophy will be one wherein we may never have to perform recovery at all. Also, there may be scenarios where we may need to do a manual recovery. Let's look at the possible levels of recovery before moving on to recovery in Hadoop: Recovery to the flawless state Recovery to the last supervised state Recovery to a possible past state Recovery to a sound state Recovery to a stable state So, obviously we want our recovery state to be flawless. But if it's not achieved, we are willing to compromise a little and allow the recovery to go to a possible past state we are aware of. Now, if that's not possible, again we are ready to compromise a little and allow it to go to the last possible sound state. That's how we deal with recovery: first aim for the best, and if not, then compromise a little. Just like the saying goes, "The bigger the storm, more is the work we have to do to recover," here also we can say "The bigger the disaster, more intense is the recovery plan we have to take." So, the recovery philosophy that we construct should cover the following points: An automation system setup that detects a crash and restores the system to the last working state, where the application runs as per expected behavior. The ability to track modified files and copy them. Track the sequences on files, just like an auditor trails his audits. Merge the files that are copied separately. Multiple version copies to maintain a version control. Should be able to treat the updates without impacting the application's security and protection. Delete the original copy only after carefully inspecting the changed copy. Treat new updates but first make sure they are fully functional and will not hinder anything else. If they hinder, then there should be a clause to go to the last safe state. Coming back to recovery in Hadoop, the first question we may think of is what happens when the NameNode goes down? When the NameNode goes down, so does the metadata file (the file that stores data about file owners and file permissions, where the file is stored on data nodes and more), and there will be no one present to route our read/write file request to the data node. Our goal will be to recover the metadata file. HDFS provides an efficient way to handle name node failures. There are basically two places where we can find metadata. First, fsimage and second, the edit logs. Our clause should include the following points: Maintain three copies of the name node. When we try to recover, we get four options, namely, continue, stop, quit, and always. Choose wisely. Give preference to save the safe part of the backups. If there is an ABORT! error, save the safe state. Hadoop provides four recovery modes based on the four options it provides (continue, stop, quit, and always): Continue: This allows you to continue over the bad parts. This option will let you cross over a few stray blocks and continue over to try to produce a full recovery mode. This can be the Prompt when found error mode. Stop: This allows you to stop the recovery process and make an image file of the copy. Now, the part that we stopped won't be recovered, because we are not allowing it to. In this case, we can say that we are having the safe-recovery mode. Quit: This exits the recovery process without making a backup at all. In this, we can say that we are having the no-recovery mode. Always: This is one step further than continue. Always selects continue by default and thus avoids stray blogs found further. This can be the prompt only once mode. We will look at these in further discussions. Now, you may think that the backup and recovery philosophy is cool, but wasn't Hadoop designed to handle these failures? Well, of course, it was invented for this purpose but there's always the possibility of a mashup at some level. Are we overconfident and not ready to take precaution, which can protect us, and are we just entrusting our data blindly with Hadoop? No, certainly we aren't. We are going to take every possible preventive step from our side. In the next topic, we look at the very same topic as to why we need preventive measures to back up Hadoop. Knowing the necessity of backing up Hadoop Change is the fundamental law of nature. There may come a time when Hadoop may be upgraded on the present cluster, as we see many system upgrades everywhere. As no upgrade is bug free, there is a probability that existing applications may not work the way they used to. There may be scenarios where we don't want to lose any data, let alone start HDFS from scratch. This is a scenario where backup is useful, so a user can go back to a point in time. Looking at the HDFS replication process, the NameNode handles the client request to write a file on a DataNode. The DataNode then replicates the block and writes the block to another DataNode. This DataNode repeats the same process. Thus, we have three copies of the same block. Now, how these DataNodes are selected for placing copies of blocks is another issue, which we are going to cover later in Rack awareness. You will see how to place these copies efficiently so as to handle situations such as hardware failure. But the bottom line is when our DataNode is down there's no need to panic; we still have a copy on a different DataNode. Now, this approach gives us various advantages such as: Security: This ensures that blocks are stored on two different DataNodes High write capacity: This writes only on a single DataNode; the replication factor is handled by the DataNode Read options: This denotes better options from where to read; the NameNode maintains records of all the locations of the copies and the distance from the NameNode Block circulation: The client writes only a single block; others are handled through the replication pipeline During the write operation on a DataNode, it receives data from the client as well as passes data to the next DataNode simultaneously; thus, our performance factor is not compromised. Data never passes through the NameNode. The NameNode takes the client's request to write data on a DataNode and processes the request by deciding on the division of files into blocks and the replication factor. The following figure shows the replication pipeline, wherein a block of the file is written and three different copies are made at different DataNode locations: After hearing such a foolproof plan and seeing so many advantages, we again arrive at the same question: is there a need for backup in Hadoop? Of course there is. There often exists a common mistaken belief that Hadoop shelters you against data loss, which gives you the freedom to not take backups in your Hadoop cluster. Hadoop, by convention, has a facility to replicate your data three times by default. Although reassuring, the statement is not safe and does not guarantee foolproof protection against data loss. Hadoop gives you the power to protect your data over hardware failures; the scenario wherein one disk, cluster, node, or region may go down, data will still be preserved for you. However, there are many scenarios where data loss may occur. Consider an example where a classic human-prone error can be the storage locations that the user provides during operations in Hive. If the user provides a location wherein data already exists and they perform a query on the same table, the entire existing data will be deleted, be it of size 1 GB or 1 TB. In the following figure, the client gives a read operation but we have a faulty program. Going through the process, the NameNode is going to see its metadata file for the location of the DataNode containing the block. But when it reads from the DataNode, it's not going to match the requirements, so the NameNode will classify that block as an under replicated block and move on to the next copy of the block. Oops, again we will have the same situation. This way, all the safe copies of the block will be transferred to under replicated blocks, thereby HDFS fails and we need some other backup strategy: When copies do not match the way NameNode explains, it discards the copy and replaces it with a fresh copy that it has. HDFS replicas are not your one-stop solution for protection against data loss. The needs for recovery Now, we need to decide up to what level we want to recover. Like you saw earlier, we have four modes available, which recover either to a safe copy, the last possible state, or no copy at all. Based on your needs decided in the disaster recovery plan we defined earlier, you need to take appropriate steps based on that. We need to look at the following factors: The performance impact (is it compromised?) How large is the data footprint that my recovery method leaves? What is the application downtime? Is there just one backup or are there incremental backups? Is it easy to implement? What is the average recovery time that the method provides? Based on the preceding aspects, we will decide which modes of recovery we need to implement. The following methods are available in Hadoop: Snapshots: Snapshots simply capture a moment in time and allow you to go back to the possible recovery state. Replication: This involves copying data from one cluster and moving it to another cluster, out of the vicinity of the first cluster, so that if one cluster is faulty, it doesn't have an impact on the other. Manual recovery: Probably, the most brutal one is moving data manually from one cluster to another. Clearly, its downsides are large footprints and large application downtime. API: There's always a custom development using the public API available. We will move on to the recovery areas in Hadoop. Understanding recovery areas Recovering data after some sort of disaster needs a well-defined business disaster recovery plan. So, the first step is to decide our business requirements, which will define the need for data availability, precision in data, and requirements for the uptime and downtime of the application. Any disaster recovery policy should basically cover areas as per requirements in the disaster recovery principal. Recovery areas define those portions without which an application won't be able to come back to its normal state. If you are armed and fed with proper information, you will be able to decide the priority of which areas need to be recovered. Recovery areas cover the following core components: Datasets NameNodes Applications Database sets in HBase Let's go back to the Facebook example. Facebook uses a customized version of MySQL for its home page and other interests. But when it comes to Facebook Messenger, Facebook uses the NoSQL database provided by Hadoop. Now, looking from that point of view, Facebook will have both those things in recovery areas and will need different steps to recover each of these areas. Summary In this article, we went through the backup and recovery philosophy and what all points a good backup philosophy should have. We went through what a recovery philosophy constitutes. We saw the modes available for recovery in Hadoop. Then, we looked at why backup is important even though HDFS provides the replication process. Lastly, we looked at the recovery needs and areas. Quite a journey, wasn't it? Well, hold on tight. These are just your first steps into Hadoop User Group (HUG). Resources for Article: Further resources on this subject: Cassandra Architecture [article] Oracle GoldenGate 12c — An Overview [article] Backup and Restore Improvements [article]
Read more
  • 0
  • 0
  • 7059

article-image-postgresqls-transaction-model
Packt
23 Oct 2009
7 min read
Save for later

PostgreSQL's Transaction Model

Packt
23 Oct 2009
7 min read
On Databases Databases come in many forms. The simplest definition of a database is any system of storing, organizing, and retrieving data. With this definition, things like memory, hard drives, file systems, files on those file systems (stored in plain text, tab-delimited, XML, JSON, or even BDB formats), and even applications like MySQL, PostgreSQL, and Oracle are considered databases. Databases allow users to: Store Data Organize Data Retrieve Data It is important to keep a broad perspective on what data and databases really are so that you can always choose the best solution for your particular problem. The SQL databases (MySQL, PostgreSQL, Oracle, and others) are remarkable because of the flexibility and performance they provide. In my work, I look to them first when developing an application, with an eye towards getting the data model right before optimization. Once the application is solid, and once I fully understand what parts of the data system are too slow or fast enough, then I can start building my own database on top of the file system or other existing technologies that will give me the kind of performance I need. PostgreSQL: Free, BSD -licensed popular database. http://postgresql.org/ MySQL: Free, GPL-licensed popular database. http://mysql.org Oracle: Commercial industrial database. http://oracle.com SQL Server: Microsoft's commercial database. http://www.microsoft.com/SQL/default.mspx Among the SQL databases, which one is best? There are many criteria I use to evaluate SQL databases, and the one I pay attention to most is how they comply (if at all) with the ACID model. And given the technical merits of the various SQL databases, I consistently choose PostgreSQL above all other SQL databases when given a choice. Allow me to explain why. The ACID Model ACID is an acronym, standing for the four words Atomicity, Consistency, Isolation, and Durability. These are fancy words for some very basic and essential concepts. Atomicity means that you either do all of the changes you want, or none of them, without leaving the database in some weird in-between state. When you take into account catastrophes like power failures or corruption, atomicity isn't as simple as it first seems. Consistency means that any state of the database will be internally consistent with the rules that constrain the data. That is, if you have a table with a primary key, then that table will not contain any violations of the primary key constraints after any transaction. Isolation means that you can be modifying many different parts of the database at the same time without affecting each other. (As a higher feature, there is Serialization, which requires that transactions occur one after the other, or at least the results of transactions.) Durability means that once a transaction completes, it is never lost, ever. Atomicity: All or nothing Consistency: Rules kept Isolation: No partials seen Durability: Doesn't disappear ACID compliance isn't rocket science, but it isn't trivial either. These requirements form a minimum standard absolutely necessary to provide a database for a reasonable application. That is, if you can't guarantee these things, then the users of your application are going to be frustrated since they assume, naturally, that the ACID model is followed. And if the users of the application get frustrated, then the developers of the application will get frustrated as they try to comply with the user's expectations. A lot of frustration can be avoided if the database simply complies with the principles of the ACID model. If the database gets it right, then the rest of the application will have no problem getting it right as well. Our users will be happy since their expectations of ACID compliance will be met. Remember: Users expect ACID! What Violating the ACID Model Looks Like To consider the importance of the ACID model, let's examine, briefly, what happens when the model is violated. When Atomicity isn't adhered to, users will see their data partially committed. For instance, they might find their online profile only partially modified, or their bank transfer partially transferred. This is, of course, devastating to the unwary user. When Consistency is violated, the rules that the data should follow aren't adhered to. Perhaps the number of friends shown doesn't match the friends they actually have in a social networking application. Or perhaps they see their bank balance doesn't match what the numbers add up to. Or worse, perhaps your order system is counting orders that don't even exist and not counting orders that do. When Isolation isn't guaranteed, they will either have to use a system where only one person can change something at a time, locking out all others, or they will see inconsistencies throughout the world of data, inconsistencies resulting from transactions that are in progress elsewhere. This will make the data unreliable just like violating Atomicity or Consistency. A bank user, for instance, will believe their transfer of funds was successful when in reality their money was simultaneously being withdrawn by another transaction. When Durability is lost, then users will never know if their transaction really went through, and won't mysteriously disappear down the road with all the trouble that entails. I am sure we have all had experiences dealing with data systems that didn't follow the ACID model. I remember the days when you had to save your files frequently, and even then you still weren't ensured that all of your data would be properly saved. I also recall applications that would make partial changes, or incomplete changes, and expose these inconsistent states to the user. In today's world, writing applications with faults like the above is simply inexcusable. There are too many tools out there that are readily available that make writing ACID compliant systems easy. One of those tools, probably the most popular of all, is the SQL database. Satisfying ACID with Transactions The principle way that databases comply with ACID requirements is through the concept of transactions. Ideally, each transaction would occur in an instant, updating the database according to the state of the database at that moment. In reality, this isn't possible. It takes time to accumulate the data and apply the changes. Typical transaction SQL commands: BEGIN: Start a new transaction COMMIT: Commit the transaction ROLLBACK: Roll back the transaction in progress Since multiple sessions can each be creating and applying a transaction simultaneously, special precautions have to be taken to ensure that the data that each transaction “sees” is consistent, and that the effects of each transaction appear all together or not at all. Special care is also taken to ensure that when a transaction is committed, the database will be put in a state where catastrophic events will not leave the transaction partially committed. Contrary to popular belief, there are a variety of ways that databases support transactions. It is well worth the time to read and understand PostgreSQL's two levels of transaction isolation and the four possible isolation levels in Section 12.2 of the PostgreSQL documentation. Note that some of the inferior levels of transaction isolation violate some extreme cases of ACID compliance for the sake of performance. These edge cases can be properly handled with appropriate use of row-locking techniques. Row-locking is an issue beyond this article. Keep in mind that the levels of transaction isolation are only what appear to users of the database. Inside the database, there is a remarkable variety of methods on actually implementing transactions. Consider that while you are in a transaction, making changes to the database, every other transaction has to see one version of the database while you see another. In effect, you have to have copies of some of the data lying around somewhere. Queries to that data have to know which version of the data to retrieve the copy, the original, or the modified version (and which modified version?) Changes to the data have to go somewhere the original, a copy, or some modified version (again, which?) Answering these questions leads to the various implementations of transactions in ACID compliant databases. For the purposes of this article, I will examine only two: Oracle's and PostgreSQL's implementations. If you are only familiar with Oracle, then hopefully you will learn something new and fascinating as you investigate PostgreSQL's method.
Read more
  • 0
  • 0
  • 7047
Modal Close icon
Modal Close icon